Object Serialization. Help!

I am currently working on learning how to store data and objects in binary files, so that they can be loaded and used every time you start up your program again. I think this is called serialization (please tell me if I am wrong).

However, I have looked all over the Internet and found many different lessons on how to do this in many different ways. My problem is that I do not want to learn the "dirty" or "bad" way.

I want to learn how to do this in a way that offers compatibility, so that my program and files avoid crashes because of these threats:
- System endianness differences.
- Compiler differences.
- Padding differences.
- That standard datatypes like int have different sizes on different machines.
- Making different versions might make it crash.

I am a beginner and I have just read about these threats and issues.

So, long story short, what do I want? I want help to find an article/lesson/tutorial/book, or advice from a professional, that can give me a good principle of how to do serialization in programs that are compatible between machines, compilers and versions and that avoid the threats mentioned above as much as possible.

So far I have encountered many tutorials that just hand out the simplest way, where you store entire objects at once without considering those issues. I do not want to learn that; I want to learn a "proper" way.

I hope that you know what I mean, and thanks for all help. I am just a beginner, so I do not know if it is too complicated to dive into this already, but I am willing to give it a try. The main problem has been finding a good source of information about what I have just mentioned. It has been like "walking in a jungle" so far.
You're looking to make a REALLY long program. First: A program can't possibly be compatible with every conceivable machine/compiler; it's just not possible without making it unnecessarily long. It would have to sense the OS and use conditional statements to operate under it. I would save that kind of thing for when you're actually making something a lot of people will use.

2nd: Writing data into binary files is no different than writing it into any other file type unless it's encrypted. You can write data into a binary file, read it with Notepad, and see everything, even though it was saved using binary output.

Now, there really is no "proper" way to store data. You store data based on its purpose and necessity. The best way to store data is by sorting it. This is more commonly referred to as a data structure. A data structure is ANY data that is sorted. A good example:

I wrote a program which allows the user to create a budget. To do this, we need the data:

Item, cost, Description(optional), time it was modified, and group.

We sort this so that it is easier to retrieve. I used a "line" type of structure. I basically put each piece of data on a line in the file in an order.

Item
cost
description
time
group


So, if I wanted to (let's just say) get the name of every item in that file, I would put all the data into a vector, close the file, get every 5th line in the vector (starting with the 0th), and put each one into a new vector so we can easily do whatever we want with those names (like display them in a menu).
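As a rough sketch of the "every 5th line" idea described above (the function names and the sample budget data are made up for illustration, not taken from the poster's program):

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Each item occupies 5 lines: item, cost, description, time, group.
const std::size_t kLinesPerItem = 5;

// Reads every line of the stream into a vector.
std::vector<std::string> ReadAllLines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Picks lines 0, 5, 10, ... which hold the item names.
std::vector<std::string> ItemNames(const std::vector<std::string>& lines)
{
    std::vector<std::string> names;
    for (std::size_t i = 0; i < lines.size(); i += kLinesPerItem)
        names.push_back(lines[i]);
    return names;
}

// Small self-check using an in-memory "file" with two invented items.
void ItemNamesDemo()
{
    std::istringstream file("Rent\n500\nMonthly rent\n2013-01\nHousing\n"
                            "Milk\n2\n\n2013-01\nFood\n");
    const std::vector<std::string> names = ItemNames(ReadAllLines(file));
    assert(names.size() == 2);
    assert(names[0] == "Rent" && names[1] == "Milk");
}
```

Note that this is a plain text format: it only works as long as no field contains a line break, and an optional field (like the description) still has to occupy an empty line.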

So, as you can see, it really doesn't matter, as long as the data is "formatted" (or sorted) in a way that's easily retrievable by your program. The reason we do this, btw, is to save space. Why make a file for each item in my budget when I can just put all that data into 1 file?

You really just have to start trying it out.

I guess a good example of a "serialization" I am implementing for my budget program is (sort of) like a loader. It will perform the following task(s):

1. Get the current month and year
2. Check all the months/years of our items
3. Store outdated item_names in a vector
4. If vector.size() > 0, prompt the user to save the data to an archive

This is so that we won't have to delete all the items when the next month arrives. I made it optional, so that if you want to look at or save some of that information you can. I'm also going to add the option of saving an expense report; since the program already displays it, that will be easy to do.

Really, all "serialization" is, is making your program retrieve data relevant to its operation on startup.

I also think you may be overreacting with all this "compiler differences" and "padding differences". You're going to write the program on only 1 compiler, so it doesn't matter if your program can't compile with another compiler. Binary is Binary, no matter what program translates your C++. And integers having different sizes? It's an INCREDIBLY small difference. I would only worry about that if you were going to include an array of 1000 longs in your program.

So, just go man. Make the program.
@IWishIKnew
What you are describing isn't object serialization: http://en.wikipedia.org/wiki/Serialization

I also think you may be overreacting with all this "compiler differences" and "padding differences". You're going to write the program on only 1 compiler, so it doesn't matter if your program can't compile with another compiler. Binary is Binary, no matter what program translates your C++.
He's not really overreacting. If you write portable code (or at least attempt to), you need to take these kinds of things into consideration.
And integers having different sizes? It's an INCREDIBLY small difference. I would only worry about that if you were going to include an array of 1000 longs in your program.
It wouldn't be a small difference: a long is often 32 bits on a 32-bit system and often 64 bits on a 64-bit system.
Disch gave me a long explanation of why you need to consider these things if you want to make portable programs, which I do. I do not just want to make programs that can run on my machine only; I want them to be portable, at least to the extent that they can be run on the same operating system. (When I talked about compatibility, I did not mean making programs that can run on every OS. Since I use Windows, I want to learn how to make programs that can run on different Windows systems, at least to begin with.)

I still need help finding a good source of information that gives tips on how to do this. Or should I start from scratch on the drawing board and find my own way? I have heard from other programmers that you should not "re-invent the wheel". Since I am a beginner, I do not even know if I would be able to figure all that out by myself.

I have one more question when it comes to storing objects in a binary file. How do you usually do this? I will give two examples:

1: You first create an object with default data, then you load the saved data from a binary file and then you copy the values of the file data to the already created object's data manually.

2: Or do you usually store an object literally into the file so you can do this:
Object obj = (returned object by reading from the file).

No matter how it works, I do not want to do it the simple way as I mentioned above, I want to learn how to make it somewhat portable within the same OS.

Thanks for your replies.
I am currently working on learning how to store data and objects in binary files, so that they can be loaded and used every time you start up your program again.

If you just want to store/read something, choose an existing format. If you want the format to be human-readable and easy to debug, choose something text-based, such as XML (Boost.Serialization will help you there).

If you think text wastes space and *really* want a binary format, the most popular choice today is Google protocol buffers ( http://developers.google.com/protocol-buffers/docs/overview )

There are other binary formats, too: I've used XDR from RogueWave and CDR from ACE.


For the rest of this post, let's assume your goal is not to actually store/read, but to invent a new binary format that can be, some day, used to store/read objects.

If that is the case, then you're thinking in the right direction. Here are a few notes:

- System endianness differences.

Your options here:
1) decide which endianness your binary files will have (I like "network byte order", i.e. big-endian, since then you can use the standard conversion functions such as htonl/ntohl)
2) invent a BOM (byte order mark) and put it in the header of your file.
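A minimal sketch of option 1, assuming the file format is defined as big-endian. The names WriteU32BE and ReadU32BE are hypothetical. It uses shifts instead of htonl/ntohl (which come from the sockets headers, not the C++ standard library), so the result does not depend on the byte order of the machine running the code:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Writes v as exactly 4 bytes, most significant byte first (big-endian).
void WriteU32BE(std::ostream& out, std::uint32_t v)
{
    const char bytes[4] = {
        static_cast<char>((v >> 24) & 0xFF),
        static_cast<char>((v >> 16) & 0xFF),
        static_cast<char>((v >> 8) & 0xFF),
        static_cast<char>(v & 0xFF)
    };
    out.write(bytes, 4);
}

// Reads 4 bytes and reassembles them in the same fixed order.
// No error handling: a real reader should check the stream state.
std::uint32_t ReadU32BE(std::istream& in)
{
    unsigned char bytes[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(bytes), 4);
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
}

// Round trip through an in-memory stream.
void EndianRoundTripDemo()
{
    std::stringstream ss;
    WriteU32BE(ss, 0x12345678u);
    assert(ss.str().size() == 4);
    assert(ReadU32BE(ss) == 0x12345678u);
}
```

Because the shifts operate on the value rather than on its in-memory representation, the same source produces the same file bytes on little-endian and big-endian machines.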

- Compiler differences.

Assume a standards-compliant compiler. If you decide to take advantage of some extensions in the future, you may, but start with the standard set.

- Padding differences.

So don't attempt to copy object representations wholesale. Serialize them member by member.
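To illustrate why, here is a small sketch with a hypothetical Record struct. On most compilers sizeof(Record) is larger than the 5 bytes its members need, because padding is inserted before the 4-byte member, so dumping the whole struct would put compiler-specific padding bytes into the file. Writing member by member produces exactly 5 bytes everywhere:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical record type, used only for this illustration.
struct Record
{
    std::uint8_t  kind;   // 1 byte
    std::uint32_t value;  // 4 bytes; compilers usually pad before this member
};

// Size of one record in the file: 1 + 4 bytes, with no padding.
const std::size_t kRecordFileSize = 5;

// Writes each member separately (value is stored most significant byte first).
void WriteRecord(std::ostream& out, const Record& r)
{
    out.put(static_cast<char>(r.kind));
    for (int shift = 24; shift >= 0; shift -= 8)
        out.put(static_cast<char>((r.value >> shift) & 0xFF));
}

// Reads the members back in the same order. No error handling in this sketch.
Record ReadRecord(std::istream& in)
{
    Record r;
    r.kind = static_cast<std::uint8_t>(in.get());
    r.value = 0;
    for (int i = 0; i < 4; ++i)
        r.value = (r.value << 8) | static_cast<std::uint8_t>(in.get());
    return r;
}

void PaddingDemo()
{
    const Record original = {7, 0xAABBCCDDu};
    std::stringstream ss;
    WriteRecord(ss, original);
    assert(ss.str().size() == kRecordFileSize);  // 5 bytes in the file...
    assert(sizeof(Record) >= kRecordFileSize);   // ...whatever sizeof says
    const Record copy = ReadRecord(ss);
    assert(copy.kind == 7 && copy.value == 0xAABBCCDDu);
}
```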

- That standard datatypes like int have different sizes on different machines.

Yes, so your data file has to have its own set of types. One database I use simply stores everything in 64-bit numbers internally.

- Making different versions might make it crash.

Either introduce a version number in your header, or design the format in a way that doesn't make assumptions, e.g. embed type information and size information in the stream.
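A minimal sketch of the version-number approach. The "ZDAT" signature, the single version byte, and the function names are all invented for this example; a real format would pick its own:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

const char kMagic[4] = {'Z', 'D', 'A', 'T'};  // hypothetical file signature
const std::uint8_t kCurrentVersion = 2;       // newest version this code understands

// Every file starts with the signature followed by one version byte.
void WriteHeader(std::ostream& out)
{
    out.write(kMagic, 4);
    out.put(static_cast<char>(kCurrentVersion));
}

// Returns the file's version (1..kCurrentVersion), or 0 if the file is
// truncated, has the wrong signature, or was written by a newer program.
std::uint8_t ReadHeader(std::istream& in)
{
    char magic[4] = {0, 0, 0, 0};
    in.read(magic, 4);
    if (!in || !std::equal(magic, magic + 4, kMagic))
        return 0;
    const int version = in.get();  // -1 at end of file
    if (version < 1 || version > kCurrentVersion)
        return 0;
    return static_cast<std::uint8_t>(version);
}

void HeaderDemo()
{
    std::stringstream file;
    WriteHeader(file);
    assert(ReadHeader(file) == kCurrentVersion);
}
```

The loader can then branch on the returned version: refuse the file when it gets 0, and read older layouts with the older rules instead of crashing on them.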

article/lesson/tutorial/book or from a professional who can give me a good principle of how to do serialization in programs that are compatible between machines, compilers and versions and that avoid the mentioned threats above as much as possible.

As a professional, I very strongly advise using an existing format/library.
I agree with everything Cubbi has stated.


@IWishIKnew:

First: A program can't possibly be compatible with every conceivable machine/compiler; it's just not possible without making it unnecessarily long.


A portable program can be run on many different machines (and to do this, you will have to compile it with many different compilers). So these are very valid concerns.

Additionally, the concern here is to make a file format that is consistent across platforms. The file must be defined in such a way that when read on other platforms it will be read properly. Before you say "that's a waste of time", look at virtually every existing popular file format (zip, png, bmp, jpg). They all do it.


And integers having different sizes? It's an INCREDIBLY small difference


It's a huge difference. So huge that it will utterly destroy file loading if not taken into consideration.

Consider this simple read/write method:

// write a long
file.write( (char*)&foo, sizeof(long) );

// read a long
file.read( (char*)&foo, sizeof(long) );


What happens when you save this file on your 32-bit machine, then give it to your friend to run on his machine (which is running a 64-bit version of the same program)?

foo is written as 4 bytes. But then possibly read as 8 bytes. Not only does this give you an incorrect value for 'foo', but now you have "desynced" your position in the file, so everything read from the file after this point will be incorrect.
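One way to avoid the desync, sketched under the assumption that the file format defines this field as a 64-bit big-endian integer (the function names are hypothetical): always write 8 bytes for the value, whatever sizeof(long) happens to be on the machine doing the writing.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>

// Always writes 8 bytes, most significant first, regardless of sizeof(long).
void WriteLongAsI64(std::ostream& out, long value)
{
    const std::uint64_t bits =
        static_cast<std::uint64_t>(static_cast<std::int64_t>(value));
    for (int shift = 56; shift >= 0; shift -= 8)
        out.put(static_cast<char>((bits >> shift) & 0xFF));
}

// Always reads 8 bytes, so the file position stays in sync on every platform.
// No error handling in this sketch.
std::int64_t ReadI64(std::istream& in)
{
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | static_cast<std::uint8_t>(in.get());
    return static_cast<std::int64_t>(bits);
}

// True if the stored value can be held in a long on this machine
// (it might not fit where long is only 32 bits wide).
bool FitsInLong(std::int64_t v)
{
    return v >= std::numeric_limits<long>::min() &&
           v <= std::numeric_limits<long>::max();
}

void FixedSizeDemo()
{
    std::stringstream file;
    WriteLongAsI64(file, -5L);
    WriteLongAsI64(file, 123456L);
    assert(file.str().size() == 16);  // 8 bytes per value on every platform
    assert(ReadI64(file) == -5);
    assert(ReadI64(file) == 123456);  // second value is still read correctly
}
```

The reader gets an int64_t back and can check FitsInLong before narrowing, which turns a silent corruption into a detectable error.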


You're going to write the program on only 1 compiler, so it doesn't matter if your program can't compile with another compiler. Binary is Binary, no matter what program translates your C++.


I see you have never written anything portable.

That's fine. But please don't discourage other people from writing portable code.


EDIT:

I'm sorry for being so harsh, IWishIKnew. =(

I know you are just trying to help. It's just that I've already invested a decent amount of time helping this guy in another thread and your post basically was spinning him in another direction and telling him to disregard everything I said. You probably didn't realize that because it was out of context though =(



Also, don't open binary files in a text editor like Notepad. Open them in a hex editor -- then you can actually read them and visualize how the information is stored.

@Zerpent:


If you haven't already, I advise you to find a free hex editor and download it. Write a few binary files, then open them in a hex editor so you can see how the data is actually stored. It will greatly help you visualize how all this works.
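If no hex editor is at hand, a few lines of C++ can print the same kind of view. This is only a rough stand-in for a real hex editor; HexDump is a hypothetical helper, not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Formats raw bytes as two-digit hex values, 16 per line, separated by spaces.
std::string HexDump(const std::string& bytes)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        if (i != 0)
            out << (i % 16 == 0 ? '\n' : ' ');
        // Go through unsigned char so bytes above 0x7F are not sign-extended.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
    }
    return out.str();
}

void HexDumpDemo()
{
    // The four bytes of 0x12345678 stored most significant byte first.
    assert(HexDump(std::string("\x12\x34\x56\x78", 4)) == "12 34 56 78");
}
```

To inspect a file, read it with an std::ifstream opened in std::ios::binary mode and pass its contents to HexDump.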
Thanks for the replies and for your help. I will probably try to make my own format first. But I still have this question in my mind:

How do you load and write objects into binary files in practice?

1: You first create an object with default data, then you load the saved data from a binary file and then you copy the values of the file data to the already created object's data manually.

2: Or do you usually store an object literally (with constructors, functions, private data) into the file so you can do this: Object obj = (returned object by reading from the file). I know that I should serialize objects and not store them whole at once, but when you say that you serialize an object, do you only save its attributes, or do you somehow store its operations as well by using serialization?

How do the libs that you mentioned above work? Do they just store the attributes of the class, or an actual entire class with all its functions and constructors? As you can see, I am very confused about how this is done in C++.

Disch, I will download a hex editor and find out how it works. Thanks a lot, you are great.
You'd write individual members one at a time. Functions/constructors would not get written, as those are not data... but private members would be:

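#include <cstdint>
#include <istream>
#include <ostream>
#include <string>
using namespace std;

// ReadString, ReadU32, WriteString and WriteU32 are helper functions
// (not shown here) that read/write one value in the file's own format.

class Example
{
private:
  string somedata;
  uint32_t moredata;  // fixed-size type, so it matches ReadU32/WriteU32

public:
  // Load each member, in the same order Write() stored them.
  void Read(istream& file)
  {
    somedata = ReadString( file );
    moredata = ReadU32( file );
  }

  // Store each member individually; functions are never written.
  void Write(ostream& file) const
  {
    WriteString( file, somedata );
    WriteU32( file, moredata );
  }
};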


It's possible to combine both the read and write function with clever use of polymorphism, but I'm too lazy (and late for work!) to explain it. Besides it might end up just confusing you because it seems a little like magic.
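The ReadU32/WriteU32 and ReadString/WriteString helpers used by the Example class are not shown in the thread. One possible sketch of them, assuming a big-endian 32-bit integer and a string stored as a 32-bit length followed by its raw bytes (both are format choices made up for this example):

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes v as 4 bytes, most significant byte first.
void WriteU32(std::ostream& file, std::uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        file.put(static_cast<char>((v >> shift) & 0xFF));
}

// Reads 4 bytes written by WriteU32. No error handling in this sketch.
std::uint32_t ReadU32(std::istream& file)
{
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v = (v << 8) | static_cast<unsigned char>(file.get());
    return v;
}

// Writes the length first so the reader knows how many bytes follow.
void WriteString(std::ostream& file, const std::string& s)
{
    WriteU32(file, static_cast<std::uint32_t>(s.size()));
    file.write(s.data(), static_cast<std::streamsize>(s.size()));
}

// Reads a length-prefixed string. A real reader should sanity-check the
// length before allocating, since a corrupt file could claim gigabytes.
std::string ReadString(std::istream& file)
{
    const std::uint32_t length = ReadU32(file);
    std::string s(length, '\0');
    if (length > 0)
        file.read(&s[0], static_cast<std::streamsize>(length));
    return s;
}

void HelpersDemo()
{
    std::stringstream file;
    WriteString(file, "hello");
    WriteU32(file, 42u);
    assert(file.str().size() == 4 + 5 + 4);  // length, text, integer
    assert(ReadString(file) == "hello");
    assert(ReadU32(file) == 42u);
}
```

Storing the length in front of the string is what lets the reader stay in sync: it never has to guess where the text ends and the next member begins.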
Ok, thank you. I think I am quite close now to understanding how to do this. I only need more info about how to make a file format. I posted more questions about that in the other topic, Disch, where we talked about it before.

I can not thank you enough.
I got a link that explained serialization, and I have read about it at more sites.
I just want to know if I got this right about serialization and deserialization:

Serialization means that you convert an object or a variable into a stream/sequence of raw bytes, so that it can be stored in memory or in a file.

Deserialization means the opposite: you read the stream of bytes, then "put" the bytes together to create an object or variable.

Am I right?