Platform-independent binary. How to?

Well, after reading the topic "Force stream << >> to interpret as binary" I am trying to develop a BinaryStream class (for fun, in my free time).
-> the topic was this one: http://www.cplusplus.com/forum/general/97851/

The basic idea is simple (and probably a little dumb): a class that manages an already open stream, so that the new class can have its own overloads of << and >> (the logic is similar to the Qt class QDataStream).

The work itself is probably not hard, but I soon realized that if I want the BinaryStream to be as platform independent as possible, I have to manage the binary data in a uniform way.

This means checking and managing endianness, and taking care of the sizes of the basic data types.

If I limit my work to "little" or "big" endian, the endianness check would be easy (a pre-configure program that checks the endianness and creates a sub-header file specifying the endianness in a #define macro).
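For instance, the pre-configure program could be something as small as this (an untested sketch of the idea; the generated header name is just an example):

// hypothetical configure-time helper: writes a one-line header that
// records the host byte order in a #define macro
#include <cstdint>
#include <cstdio>

int main()
{
    const std::uint16_t probe = 0x0102 ;
    // on a little-endian host the first byte in memory is 0x02
    const bool little = *reinterpret_cast<const unsigned char*>( &probe ) == 0x02 ;

    std::FILE* f = std::fopen( "binarystream_endian.h", "w" ) ;
    if ( !f ) return 1 ;
    std::fprintf( f, "#define BINARYSTREAM_LITTLE_ENDIAN %d\n", little ? 1 : 0 ) ;
    std::fclose( f ) ;
}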

The big problem is the data types, because I have read that you cannot be sure of the actual size of the basic types (it can happen, for example, that on some system "short" has the same size as "char").

But, thinking about it, when you write BINARY FILES (where endianness and data size really do have to be checked for the file to be read back correctly), I think a byte should always be 8 bits long (or so I hope).
Sure, the pre-configure program could check the actual size of all the basic data types (with sizeof), but then the management could be impossible to implement with my limited skills.

Looking at Qt, I see that Qt 4.6.x simply assumes int8 = char, int16 = short, int32 = long, int64 = long long, so I am probably worrying about a non-problem.

What do you think? Any hints?

----------------

Question 2: indirectly related to the first one.
I am also thinking about creating a sort of ByteArray class (similar to QByteArray in Qt)... is it possible to create a class derived from std::string, specialized to manage char arrays that are intended not as TEXT but as BINARY data (without '\0' termination)?
Thanks a lot
1. A platform-independent binary representation of objects used to be a good idea some years ago, around the time RPC, COM, Java, Qt etc. were being designed. Now, processors are much faster, and it is a terrible idea.

For portability, represent data as pure text, with a portable encoding like UTF-8.

See: http://www.faqs.org/docs/artu/ch05s01.html

As an academic exercise, you could consider using the integral types available in <cstdint>.
The question still remains: what do you do with floating point values?


> 2. I am also thinking about creating a sort of ByteArray class

How is that ByteArray class going to be different from:
using byte = unsigned char ;
using ByteArray = std::vector<byte> ;
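For instance, it can already be used like this (a made-up example; the file name is arbitrary):

#include <fstream>
#include <vector>

using byte = unsigned char ;
using ByteArray = std::vector<byte> ;

int main()
{
    ByteArray data { 0x89, 0x50, 0x4e, 0x47 } ;  // arbitrary binary content
    data.push_back( 0 ) ;                        // an embedded zero is not a problem

    std::ofstream out( "blob.bin", std::ios::binary ) ;
    out.write( reinterpret_cast<const char*>( data.data() ), data.size() ) ;
}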
hmm, as far as I read in the article, there are still good reasons why you may want to use a binary format (the PNG example), like when you need to read a particular file format.
For the same reason, you can expect to want a binary tool alongside the textual one ;)
> there are still good reasons why you may want to use a binary format (the PNG example)
> like when you need to read a particular file format.

Yes. For example, to read or write an mp3 file, a clear understanding of the file format, along with binary i/o, would be required. The good part is that these formats use a very small number of simple data types - e.g. "this field is a non-negative integer represented as a 16-bit binary value in TCP/IP network byte order" - and we can then use ntohs() and htons() for the conversions.
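For instance, reading and writing such a 16-bit field could look like this (a sketch; it assumes the <arpa/inet.h> header of a POSIX-like system - on Windows the same functions live in <winsock2.h>):

#include <arpa/inet.h>   // htons, ntohs
#include <cstdint>
#include <istream>
#include <ostream>

void write_u16( std::ostream& out, std::uint16_t value )
{
    const std::uint16_t wire = htons( value ) ;  // host -> network byte order
    out.write( reinterpret_cast<const char*>( &wire ), sizeof(wire) ) ;
}

std::uint16_t read_u16( std::istream& in )
{
    std::uint16_t wire = 0 ;
    in.read( reinterpret_cast<char*>( &wire ), sizeof(wire) ) ;
    return ntohs( wire ) ;                       // network -> host byte order
}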

When it comes to generalized binary i/o for any object, the types involved are far too many, and they can be far more complex.
Aren't ntohs() and htons() Windows-specific functions?
I am looking for an OS-independent approach (when I develop, I try to keep the code portable at least to Linux and Windows).

However, the endianness itself (as I said in my preface) is not a problem. It is not so hard to implement a byte order swapper and to check the endianness to see what the default endianness of the host OS is.
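Something like the following is all I have in mind (a rough sketch, not tested on a big-endian machine):

#include <cstdint>
#include <cstring>

// true if this host stores the least significant byte first
inline bool host_is_little_endian()
{
    const std::uint32_t probe = 1 ;
    unsigned char first_byte ;
    std::memcpy( &first_byte, &probe, 1 ) ;
    return first_byte == 1 ;
}

// reverse the byte order of a 32-bit value
inline std::uint32_t swap_bytes( std::uint32_t v )
{
    return ( v >> 24 ) |
           ( ( v >> 8 ) & 0x0000ff00u ) |
           ( ( v << 8 ) & 0x00ff0000u ) |
           ( v << 24 ) ;
}

// convert from host order to the order chosen for the stream (say, little-endian)
inline std::uint32_t to_stream_order( std::uint32_t v )
{
    return host_is_little_endian() ? v : swap_bytes( v ) ;
}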

My doubts mainly involve the problems related to the SIZE of the basic data types (like short, etc.).
> Aren't ntohs() and htons() Windows-specific functions?

No, they are 4.2BSD functions and are available everywhere.
http://www.freebsd.org/cgi/man.cgi?query=ntohs&sektion=3&manpath=FreeBSD+9.1-RELEASE&format=html


> It is not so hard to implement a byte order swapper and to check the endianness
> to see what the default endianness of the host OS is

The BSD byteorder functions are BSD-licensed (non-viral); you can freely rip off the source code and use it for your own purposes.
http://www.freebsd.org/cgi/man.cgi?query=byteorder&sektion=9&manpath=FreeBSD+9.1-RELEASE


> problems related to the SIZE of the basic data types (like short, etc.)

A simple solution would be to restrict the types to the exact fixed-width integral types in <cstdint>:
http://en.cppreference.com/w/cpp/types/integer

Somewhat more elaborate would be to place a header at the beginning of the binary file containing (textual) information about the basic data types - std::numeric_limits<unsigned char>::digits, sizeof(short) and so on. Of course, at the place where you are reading the file, you will have to jump through hoops if these don't match what is on the target platform.
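For instance, such a header could be written like this (just a sketch of the idea; the field names are invented):

#include <ostream>
#include <limits>

// write a small textual preamble describing the basic types of the
// machine that produced the file
void write_type_header( std::ostream& out )
{
    out << "bits_per_char " << std::numeric_limits<unsigned char>::digits << '\n'
        << "sizeof_short "  << sizeof(short)  << '\n'
        << "sizeof_int "    << sizeof(int)    << '\n'
        << "sizeof_long "   << sizeof(long)   << '\n'
        << "sizeof_double " << sizeof(double) << '\n'
        << "binary_data_follows" << '\n' ;
}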
Sorry if, after some time, I am still asking about this subject, but there is something I am missing.

Preface: for me "portability" means "the code can work at least on Windows, Linux and Mac".

But if a system has a byte > 8 bits (e.g. 16), how are text files written correctly?

After all, even text files are binary (I mean, each letter is an 8-bit byte holding an ASCII code that is interpreted as a letter).

So I am asking myself: will an unsigned char, if restricted to 0-255, be correctly represented in a file as an 8-bit byte? Can there be an "endianness" issue? (I assume not, because endianness, if I understood correctly, refers to byte order, not bit order.)

In that case a solution could be (even if slow) to interpret the basic data types as arrays of chars (unsigned, limited to 0-255). Bit-shifting could be unsafe, because the macro that defines the size of char in bits could be wrong if the compiler was not built on the exact system you are using... you can trust it on Linux - where the system is usually built with the same compiler provided by the distro - but not completely on Windows.
> But if a system has a byte > 8 bits (e.g. 16), how are text files written correctly?

We have to rely on the creator of the text and the reader/interpreter of the text agreeing to a set of minimal conventions. For example:

HTTP (RFC 2616):
OCTET = <any 8-bit sequence of data>

TEXT = <any OCTET except CTRLS, but including space, newlines, and tabs>
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO- 8859-1 only when encoded according to the rules of RFC 2047.


And then, using TEXT as defined in RFC2616, specify the character encoding of the contents of a document: http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding

The most commonly used portable encoding is UTF-8, which every platform supports.

You might find this an interesting read:
http://www.joelonsoftware.com/articles/Unicode.html



those links are all interesting :)
However, those links refer to files like HTML and XML that are "formatted" (they declare the Unicode encoding in a header).

But I am speaking about plain text files (for the "portability" problem), in order to investigate whether or not we can trust a "char" restricted to 0-255 when writing source code that portably reads/writes binary files.

The best examples to think about are bash scripts (for Linux / Mac OS X), Linux configuration files (for example /etc/fstab), .cpp files (in source code, 1 char of text is usually 1 OCTET), .bat files (for Windows), etc.

So... I don't know if those files are considered ASCII or UTF-8, but they strictly stay within the limitation 1 letter = 1 OCTET and they are "system" files... This is why I am asking myself whether a char limited to 0-255 can be trusted to be written as exactly 8 bits even if "char" itself is not 8 bits long... otherwise I cannot figure out how you could correctly write .cpp, .sh or .bat files (or similar)...
@Nobun thanks for starting this thread, I was interested in the same thing myself.

@ general: The mention of text formats seems like a good idea. I'm thinking a combination of JSON and NBT would be good - does one already exist?
> This is why I am asking myself whether a char limited to 0-255 can be trusted
> to be written as exactly 8 bits even if "char" itself is not 8 bits long

There is a difference between a 'char' in memory and the same 'char' stored in a file system. For instance, C provides the escape sequence '\n'. The C standard guarantees that:
1. The escape sequence '\n' maps to a unique implementation-defined number that can be stored in a single char.
2. When writing a file in text mode, '\n' is transparently translated to the native newline sequence of the system (which may be longer than one character; for instance, it is two bytes on Windows). When reading in text mode, the native newline sequence is translated back to the single char '\n'. (In binary i/o mode, no such translation is performed; see the sketch below.)
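A tiny demonstration of point 2 (the behaviour described in the comments is what one would expect on Windows; on a POSIX system the two files come out identical):

#include <fstream>

int main()
{
    std::ofstream text( "text.txt" ) ;                   // text mode
    text << "a\n" ;                                      // on Windows, stored as 'a' 0x0d 0x0a

    std::ofstream bin( "bin.txt", std::ios::binary ) ;   // binary mode
    bin << "a\n" ;                                       // stored as written: 'a' 0x0a
}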

Whenever text is to be written on one machine and read on another, these differences have to be accounted for; and de jure or de facto standards are always involved. For instance, text transmitted over HTTP uses a Windows-like two-character sequence to represent a new line; so the native single-character newline on POSIX would have to be translated if the text is sent or received over HTTP.

FTP specifies four modes - ASCII mode, EBCDIC mode, Image mode and Local mode - which specify how data is represented. The text modes require that, if needed, the sending host's character representation must be converted to an octet representation before transmission, and a similar conversion is performed at the receiving end. (Local mode, as the name suggests, can only be used to transmit data between two machines with identical data representations; no conversions are performed in local mode).

UDF (Universal Disk Format), the vendor-neutral file system used for data storage on portable media (like DVDs), specifies nine permitted character encodings. Microsoft's Joliet (supported everywhere now) adds UCS-2 encoding to these nine.


Coming back to the original question:
If a text file is sent from a Windows machine to a POSIX machine, can it be trusted to be printed exactly as the original?

Yes, if it is transmitted over FTP in text mode, or written to and read from a UDF file system.

No, if it is transmitted over FTP in binary mode, or sent over a socket as raw bytes.
Yes, you are right about the newline mark. However, for example, under Windows (32-bit XP) it is simply translated into 2 bytes (0x0d 0x0a), so it still fits the rule 1 letter = 1 octet.

I wonder why, if we transmit a file in binary mode, the result can differ from transmitting it in text mode. I don't know networking; I am mainly thinking about how a file is written (or transferred too, perhaps).

When we use std::cout << in text mode and we write, for example, std::cout << "help", it is expected to produce 4 bytes "h-e-l-p" of 8 bits each (octets), and every letter is a "char". Even if we are going "the text way", I assume that the library, internally, has to process it in a binary way in order to write the text file correctly.

But the "unit" of transmission still remains a byte (the minimal unit that can be allocated)... so we can end up in a situation where a system with char != 8 bits (probably char > 8 bits) must find a way to write exactly 8 bits for every letter in the file (I assume that files are always parsed as 1 byte = 8 bits, because - for example - an HD's capacity, as far as I know, is measured assuming 1 byte = 8 bits).

So... if the "textual" routine is able to take a char (that is larger than 8 bits) and write it to a text file fitting exactly 8 bits... how can it do that? Perhaps there is a "safe way" to put a char into 8 bits? If so, could the same hold for a char written in raw binary mode?

Sorry if I am insisting so much; I am trying to understand... A scientific approach usually requires that a common definition be found...
Also, I can't figure out how int8_t can fit exactly 8 bits if the minimal memory allocation is wider than 8 bits, on a system where a byte is > 8 bits.

(All this chaos happens because there is no proper definition of what a byte is... Science would be ruined if such a mess happened with the definition of the "meter", or if other basic measures weren't conventionally defined in a fixed way.)
> When we use std::cout << in text mode and we write, for example, std::cout << "help",
> it is expected to produce 4 bytes "h-e-l-p" of 8 bits each (octets),
> and every letter is a "char".

Not necessarily. A "letter" may consist of multiple "char"s. In other words, a character may be represented by more than one byte.
See http://www.cplusplus.com/forum/general/99880/#msg537226


> the "unit" of transmission still remains a byte

The standard unit of transmission is typically an octet - a sequence of 8 bits.


> If so, could the same hold for a char written in raw binary mode?

If data is transmitted in binary mode, it is treated as a stream of bits. Character encodings do not enter the picture.


> I can't figure out how int8_t can fit exactly 8 bits if the minimal memory allocation is wider than 8 bits

std::int8_t would not be available on such an implementation; there would be a std::int8_t if and only if the implementation directly supports an addressable 8-bit type.

In practice, every current implementation does support an addressable 8-bit integral type.
In theory there is no difference between theory and practice. In practice there is.


> all this chaos happens because there is no proper definition of what a byte is

In C++, there is:
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.


That is to say: A byte is a char (or unsigned char or signed char)

A byte is implementation-specific; in contrast, an octet (a sequence of 8 bits) means the same thing everywhere. Most data transmission standards are specified in terms of octets, or right at the outset state something like: "in this standard, the term byte stands for an octet".

Again, in practice, a byte is an octet except in some legacy systems.
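If you want to make that assumption explicit in code, a one-line safeguard is possible (a sketch):

#include <climits>

// refuse to compile on an implementation where a byte is not an octet
static_assert( CHAR_BIT == 8, "this code assumes that a byte is an octet" ) ;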
So, the final question, after all this interesting (and, I'm afraid, exhausting for you) discussion...

Is there a safe way to transmit OCTETS with a stream? (That way we can find a definition for translating the basic data types into octets.) How can I define an OCTET in C++ if char can be != an OCTET?

for example:
std::ofstream out( "file", std::ios::out | std::ios::binary ) ;
out.put( /* SOMETHING THAT IS SURE TO BE AN OCTET and can be defined as an UNSIGNED VALUE 0-255? */ ) ;

If a binary stream cannot do the task, is there a C or C++ standard function that can ensure that a stream of bits will be passed as a stream of octets?
> Is there a safe way to transmit OCTETS with a stream?
> SOMETHING THAT IS SURE TO BE AN OCTET and can be defined as an UNSIGNED VALUE 0-255?

In practice, just use std::ostream::write() and std::istream::read(). On every platform that you are even remotely interested in, a byte happens to be an octet.
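For example (a minimal sketch, with no error checking):

#include <cstdint>
#include <fstream>

int main()
{
    const std::uint32_t value = 0xdeadbeef ;

    {
        std::ofstream out( "data.bin", std::ios::binary ) ;
        out.write( reinterpret_cast<const char*>( &value ), sizeof(value) ) ;
    }

    std::uint32_t value_read = 0 ;
    std::ifstream in( "data.bin", std::ios::binary ) ;
    in.read( reinterpret_cast<char*>( &value_read ), sizeof(value_read) ) ;
    // value_read now holds the same bytes, in this machine's own byte order
}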
Yes... I finally got it... on Linux / Windows / Mac OS X it seems that the binary models always assume the unit == 8 bits :)

So the original idea (which was exactly the one you suggested... internally using std::ostream::write() and std::istream::read()) seems to be confirmed :)

I found those links:

http://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

http://www.unix.org/whitepapers/64bit.html

They seem to confirm that Linux, Mac OS X, BSD and Windows (both 32- and 64-bit) can differ on some types (especially the meaning of "long"), but the common basis is char == 8 bits... in that case I can develop my class without too much trouble :D
I don't know of C++ compilers for systems with non-8-bit bytes, but C compilers certainly exist, and are in use: see for example this comp.lang.c post: https://groups.google.com/d/msg/comp.lang.c/tTJzN9zuRiI/pGG4uKMvquYJ

And for something that's not a power of 2, there's a port of GCC to the PDP-10 floating around which has CHAR_BIT=9, but that's of course obsolete.