Parsing binary data from file

Oct 21, 2011 at 7:59am

Hello all, and thank you in advance for your help!

I am in the process of learning C++. My first project is to write a parser for a binary-file format we use at my lab. I was able to get a parser working fairly easily in Matlab using "fread", and it looks like that may work for what I am trying to do in C++. But from what I've read, it seems that using an ifstream is the recommended way.

My question is two-fold. First, what, exactly, are the advantages of using ifstream over fread?

Second, how can I use ifstream to solve my problem? Here's what I'm trying to do. I have a binary file containing a structured set of ints, floats, and 64-bit ints. There are 8 data fields all told, and I'd like to read each into its own array.

The structure of the data is as follows, in repeated 288-byte blocks:
Bytes 0-3: int
Bytes 4-7: int
Bytes 8-11: float
Bytes 12-15: float
Bytes 16-19: float
Bytes 20-23: float
Bytes 24-31: int64
Bytes 32-287: 48x float

I am able to read the file into memory as a char * array, with the fstream read command:

char * buffer;
ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes

So, from what I understand, I now have an array of pointers. If I were to call buffer[0], I should get a 1-byte memory address, right? (Instead, I'm getting a seg fault.)

How can I access the data stored in the "buffer" array as an ordered set of ints and floats? I would want to have an array dedicated to each field above. I.e., if the binary file contained 100 blocks of data, I'd want the first and second extracted data arrays to contain 100 ints apiece.

Since I have the binary data in memory, I basically just want to read from buffer, one 32-bit number at a time, and place the resulting value in the appropriate array.

Lastly - can I access multiple array positions at a time, a la Matlab? (e.g. array(3:5) -> [1,2,1] for array = [3,4,1,2,1])

Thank you!
Robert

Last edited on Oct 21, 2011 at 7:25pm

Oct 21, 2011 at 8:32am

Gaminic (1621)

Hey Robert,

First off, I'm a bit confused on the whole "repeated 288-byte blocks" and I'm not really experienced with anything beyond the regular long int/double, so I'm going to politely ignore that part.

Anyway, char * buffer is not an array of pointers. It's a single pointer to char, which can point at the first element of an array of characters. buffer[0] would access the first character if the pointer had been initialized (i.e. made to point at allocated memory). As it is not yet initialized, no memory is assigned to it, so you're accessing something you can't/shouldn't touch.
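A minimal sketch of giving the buffer real memory before reading into it, assuming a std::vector<char> is an acceptable buffer. The helper name read_whole_file is hypothetical and not from the thread. Note the seekg back to the start: a stream opened with ios::ate is positioned at the end of the file, so a read issued right away would return nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads an entire file into a buffer that owns its memory.
// Returns an empty vector if the file cannot be opened or is empty.
std::vector<char> read_whole_file(const std::string& path) {
    assert(!path.empty());
    // ios::ate opens the file positioned at its end, so tellg() reports the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize size = in.tellg();
    if (size <= 0) return {};
    in.seekg(0, std::ios::beg);  // rewind before reading
    // The vector allocates 'size' bytes here; a bare char* would point nowhere.
    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.read(buffer.data(), size);
    // Keep only the bytes that were actually read.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

After a call such as read_whole_file("events.bin") (file name made up), buffer[0] is the first byte of the file and buffer.size() / 288 is the number of blocks.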

Pointers (and arrays of pointers) are still type-dependent. A pointer to an int is of type int*, a pointer to a float is of type float*. To store your data in an array, you'd have to initialize it first. If you know the size of the array (i.e. number of elements) up front, you can use static declaration:
int myints[50];
If you don't know them at code-time, but it can be determined at run-time, you can use dynamic declaration:

int size = someFunctionToDetermineSize();
int *myints = new int[size];

If you can't (efficiently) determine the size at the start of the runtime, you can use a 'dynamic array' (i.e. an array that can dynamically adjust its size), called 'vector'. There is a pre-made one available in the standard library (often called the STL). You can include the vector class by using #include <vector> .

A vector works just like an array, with the exception that it provides some flexibility:

std::vector<int> myvec; 
myvec.resize(size); // Resizes the vector to 'size'. If size > old size, new spots are created. If size < old size, the (old size-size) last elements are dropped.
myvec.push_back(an_int); // Adds 'an_int' to the end of the vector and increases vector size by 1.
myvec.pop_back(); // Removes the last element and decreases vector size by 1. Takes no argument.

There are many more functions available, but these are the basics [and also the main things vectors are supposed to be used for, efficiently].

For accessing array/vector values, you'll always have to loop, either by using index notation (myarray[i]; i++; ..) or by using an iterator. Iterators are a bit complex if you're new to C++ (they're pointerlicious), and aren't really that necessary if you're just using vectors. You 'can' access multiple positions, in the sense that you can make a temporary copy (which is probably what Matlab does behind the scenes):

std::vector<int> my3ints;
my3ints.push_back(5);
my3ints.push_back(6);
my3ints.push_back(3);
std::vector<int> my2ints;
my2ints.resize(2);
std::copy(my3ints.begin()+1, my3ints.begin()+3, my2ints.begin()); // needs #include <algorithm>
// 'my3ints' contains {5, 6, 3}, 'my2ints' contains {6, 3}
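The same temporary-copy idea can be wrapped in a small function. This is only a sketch: slice is a made-up helper name, and it uses 0-based, end-exclusive indices, unlike Matlab's 1-based, inclusive ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies elements [first, last) into a new vector.
// Indices are 0-based and 'last' is one past the final element copied.
std::vector<int> slice(const std::vector<int>& v, std::size_t first, std::size_t last) {
    assert(first <= last && last <= v.size());
    // The iterator-range constructor allocates and copies in one step.
    return std::vector<int>(v.begin() + static_cast<std::ptrdiff_t>(first),
                            v.begin() + static_cast<std::ptrdiff_t>(last));
}
```

Matlab's array(3:5) on [3,4,1,2,1] corresponds to slice(v, 2, 5) here, which yields {1, 2, 1}.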

I couldn't really tell you the difference between fread() and istreams, except that fread() is C, and istreams are C++. They'll both work, but generally you'd want to use the C++ version. They're generally safer and more user-friendly. And by 'user' I mean newbies like myself. C scares me.
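For a rough side-by-side, here is a sketch of reading the same bytes both ways. The function names are made up for illustration. The practical difference visible here is cleanup: the FILE* must be closed by hand, while the ifstream closes itself when it goes out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

// C style: fread on a FILE*. The caller is responsible for fclose.
std::vector<char> read_with_fread(const char* path, std::size_t count) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::vector<char> buf(count);
    const std::size_t got = std::fread(buf.data(), 1, count, f);
    std::fclose(f);
    buf.resize(got);  // fewer bytes than requested may have been available
    return buf;
}

// C++ style: ifstream releases the file automatically at end of scope.
std::vector<char> read_with_ifstream(const char* path, std::size_t count) {
    assert(path != nullptr);
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    std::vector<char> buf(count);
    in.read(buf.data(), static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

Both return the same bytes for the same file and count, so the choice is mostly about style and resource handling rather than capability.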

Oct 21, 2011 at 7:22pm

Robert Crabbs (4)

Hello Gaminic, and thank you for the quick reply!

By 288-byte blocks, I mean that the binary file is organized into consecutive chunks, corresponding to individual events in a detector array we have. Each block contains 11 general quantities of interest (like position, deposited energy, etc.), which add up to 288 bytes.

So...I am now a little confused about the char array. When initializing "buffer" via

char * buffer

What does the asterisk correspond to? I thought that defined "buffer" to be full of pointers. I'd expect

char buffer[N]

to be an array containing N chars...? (Also, I do calculate the required length of buffer beforehand; it equals the file size in bytes, if buffer is a char array.)

As far as I know, once I initialize "buffer" and assign "filename", the commands

ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes

should populate the buffer array with the binary data from my file, split into 1-byte blocks. Right?

Now, the issue I am having is that I do not want my data as raw 1-byte chunks. I need to reinterpret the raw char data in "buffer" into a series of ints, floats, and int64s. (The first 4 chars in buffer should be read as 32-bit int data, the next 4 chars should be read as float, etc.)
What is the cleanest way to do this?

Lastly, I have seen people refer to "safe" coding practices a few times now, but I do not know exactly what it means to code safely...

Last edited on Oct 21, 2011 at 7:24pm

Oct 22, 2011 at 12:04am

mzimmers (578)

Robert –

The asterisk tells the compiler that you're defining a pointer to the data type, not an instance of the data type itself. So:

char * buffer;

Creates a pointer. (Note that you don't have anything to point *to* yet.)

I've read the C++ tutorial on this:

http://www.cplusplus.com/doc/tutorial/files/

And, while I'm sure the author knows infinitely more about C++ than I do, I don't think I'd want to read binary data into a character array.

I get the feeling that the project is trying to get you to define a struct or class (probably class, since it's C++) that mirrors the structure of your 288-byte block. So:

class InputBlock {
public:
    int32_t int1;   // bytes 0-3
    int32_t int2;   // bytes 4-7
    float float1;   // bytes 8-11
    float float2;   // bytes 12-15
    // etc. for the remaining fields
};

That way, when you execute your read, you should be able to access the elements of the class by their member names.
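A sketch of that idea, under some assumptions: the field names are invented, the file's byte order matches the machine reading it, and the compiler inserts no padding (the static_assert checks the last point). Bytes 32-287 span 256 bytes, which is 64 four-byte floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record type mirroring the 288-byte block layout from the thread.
struct InputBlock {
    std::int32_t int1;       // bytes 0-3
    std::int32_t int2;       // bytes 4-7
    float        floats[4];  // bytes 8-23
    std::int64_t int64;      // bytes 24-31
    float        tail[64];   // bytes 32-287
};

// Fails to compile if padding or type sizes make the struct differ from the file layout.
static_assert(sizeof(InputBlock) == 288, "InputBlock must match the 288-byte block");
static_assert(offsetof(InputBlock, int64) == 24, "int64 must start at byte 24");

// Copies one block out of a raw byte buffer into a struct.
// memcpy avoids the alignment and aliasing problems of casting the char pointer.
InputBlock decode_block(const char* bytes) {
    assert(bytes != nullptr);
    InputBlock block;
    std::memcpy(&block, bytes, sizeof block);
    return block;
}
```

Calling decode_block(buffer + i * 288) would give block number i, after which members such as block.int1 are available by name.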

Let me know if I've misunderstood your assignment.

Oct 22, 2011 at 7:27am

Robert Crabbs (4)

Thanks mzimmers,

I've been doing a lot of reading myself...still very confused.

I'm using a char* array to store the binary data, because the ifstream documentation told me to do so. The code

ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes

expects buffer to be a char* array...
http://www.cplusplus.com/reference/iostream/istream/read/

I did experiment with classes to get a change of pace from my ifstream struggles, and did as you suggested. I have a class which has 9 associated arrays which are allocated on construction. (I do not know their sizes at compile time.) So this, at least, was a success for me - much needed in an otherwise fruitless and frustrating day. :-(

What I now need to do really ought to be very simple. After executing the above ifstream code, I should have a fairly long buffer populated with a number of 1's and 0's. I just want to be able to read this stuff from memory, 32 bits at a time, as integers or floats, depending on which 4-byte block I'm currently working on.

I do not want to use get() or something that reads the file itself in 32-bit increments. Since the files I'm reading are GB+ in size, that could mean millions of random access reads, which of course would be very very slow. Hence, I'm trying to read the file in one fell swoop into memory, and then parse that raw binary stream that supposedly ifstream saves in memory.

Oct 22, 2011 at 3:03pm

mzimmers (578)

Well, you could use a union. I agree the example isn't very good at explaining how to access the binary data once you've read it into the character array. I'm sure there's a way to do it, but...I'll have to think about it a bit, and get back to you.

Oct 22, 2011 at 11:35pm

mzimmers (578)

This is an example of what I was talking about. It seems to work, but may not be the most sensible implementation.

#include <cstdint>	// int32_t, int64_t
#include <fstream>

using namespace std;

int main ()
{
	int length;

	ifstream is;
	ofstream os;
	struct nBuffer {
		int32_t	int1;
		int32_t int2;
		float	float1;
		float	float2;
		float	float3;
		float	float4;
		int64_t	int64;
		float	aFloat[64];
	} ;

	union {
		char*		cBuffer;
		nBuffer*	myBuffer;
	} myUnion;

	is.open ("/Volumes/1_TB_HD/Users/mzimmers/Desktop/in.txt", ios::binary );
	os.open ("/Volumes/1_TB_HD/Users/mzimmers/Desktop/out.txt", ios::binary);

	// get length of file:
	is.seekg (0, ios::end);
	length = is.tellg();
	is.seekg (0, ios::beg);

	// allocate memory:
	myUnion.cBuffer = new char [288];

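	// read a single 288-byte block (reading 'length' bytes would overflow
	// the 288-byte buffer whenever the file is larger than one block):
	is.read (myUnion.cBuffer, 288);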
	is.close();

	myUnion.myBuffer->int1 = 1;
	myUnion.myBuffer->int2 = 2;
	myUnion.myBuffer->float1 = 1.0;
	myUnion.myBuffer->float2 = 2.0;
	myUnion.myBuffer->float3 = 3.0;
	myUnion.myBuffer->float4 = 4.0;
	myUnion.myBuffer->int64 = -1;
	for (int i = 0; i < 64; i++) {
		myUnion.myBuffer->aFloat[i] = i * 2.0;
	}

	os.write (myUnion.cBuffer,288);

	delete[] myUnion.cBuffer;
	return 0;
}

Using the union allows you to treat the buffer numerically, and still gives you a char * for passing to the write() function.

Note that if your floats are 4-byte, you have 64, not 48, of them at the end of your struct.
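The count can be checked at compile time. A small sketch, with made-up constant names and the byte offsets taken from the layout earlier in the thread:

```cpp
#include <cstddef>

constexpr std::size_t kBlockBytes  = 288;  // total size of one block
constexpr std::size_t kHeaderBytes = 32;   // two ints, four floats, one int64

// Number of floats that fit in the bytes after the header.
constexpr std::size_t trailing_floats(std::size_t float_size) {
    return (kBlockBytes - kHeaderBytes) / float_size;
}

// 256 remaining bytes divided by 4 bytes per float.
static_assert(trailing_floats(4) == 64, "expected 64 trailing floats");
```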

Oct 24, 2011 at 4:01am

Robert Crabbs (4)

Hello again mzimmers,

You're right, I do have 64 numbers at the end of the struct. Thanks for pointing it out - that no doubt saved me from some confusion in the future.

As far as the union goes, I am a little unclear how the buffer is being used above. I understand that we are reading stuff into a char * buffer from fstream. Why is the buffer only defined to be 288 bytes long, when the filesize will be 288 * N bytes, for some large number N?

I think I understand the data structure you've created, called nBuffer. That is what I hope to extract from the binary stream in the end, except that I want its fields to be arrays, not single numbers. (Again, there will be many 288-byte blocks per file.) But, where are we taking data from the binary stream and saving it to the nBuffer struct? It seems to me like we're simply defining the fields to be constants, like float1 = 1.0... I'd like to do something more like float1 = [first 4 bytes of cbuffer].

Robert

Oct 24, 2011 at 6:07am

mzimmers (578)

Hey, Robert -

Regarding your first question: I was just trying to demonstrate proof of concept by reading in a single 288-byte block into the buffer. In your program, your myUnion will be an array (or vector) of many, many of these blocks.

Regarding your second question: are you familiar with unions? (If not, they're described in K&R.) In simplest terms, a union allows you to overlay two or more dissimilar data types (or composites). In our example (using pointers), the character buffer and the numeric buffer occupy the same space in memory.

With a union of the character array, and your data structure of ints/floats, we can use the character array (the pointer, actually) for the binary read, and then use the nBuffer to access the data numerically. And, we're not defining the fields as constants; we're doing what you want (except that the first four bytes of the buffer are int1, not float1).

It would probably be easier to explain this with a picture, but since I can't do that, I suggest you brush up on unions, see if it becomes clearer, and report back.
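As an alternative to the union, each field can be copied out of the byte buffer with std::memcpy and appended to its own array. This is a sketch under assumptions, not the thread's method: the names Fields, load, and split_fields are invented, only three of the fields are shown, and the file's byte order is assumed to match the machine's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 288;  // bytes per block, from the thread

// Hypothetical container: one array per field (three fields shown).
struct Fields {
    std::vector<std::int32_t> int1;    // bytes 0-3 of each block
    std::vector<float>        float1;  // bytes 8-11 of each block
    std::vector<std::int64_t> int64;   // bytes 24-31 of each block
};

// Copies sizeof(T) bytes starting at p into a T.
template <typename T>
T load(const char* p) {
    T value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Walks the buffer one block at a time and appends each field to its array.
Fields split_fields(const std::vector<char>& buffer) {
    assert(buffer.size() % kBlockSize == 0);
    const std::size_t count = buffer.size() / kBlockSize;
    Fields out;
    out.int1.reserve(count);
    out.float1.reserve(count);
    out.int64.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const char* block = buffer.data() + i * kBlockSize;
        out.int1.push_back(load<std::int32_t>(block + 0));
        out.float1.push_back(load<float>(block + 8));
        out.int64.push_back(load<std::int64_t>(block + 24));
    }
    return out;
}
```

With 100 blocks in the buffer, each of the three vectors ends up with 100 elements, which is the per-field arrangement asked for at the start of the thread.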

Oct 25, 2011 at 3:12am

bennyp (2)

Hi all,
I've been following your thread. I am not a great C++ coder, but I have recently worked on some similar problems. There may well be a better way, but it seems like this is what you want to do:

#include <cstdint>	// uintptr_t
#include <iostream>
#include <fstream>

using namespace std;

int main ()
{
// I think this is how you were going to use the file...
//	char* filebuffer;
//	ifstream inputfile;
//	inputfile.open("filename.xxx", ios::binary|ios::ate);
//	unsigned int filelength= inputfile.tellg(); inputfile.seekg(0);
//	filebuffer= new char[filelength];
//	inputfile.read(filebuffer, filelength);
//	inputfile.close();
//	int numberofrecords= filelength/288;

	//demonstration array.  ifstream read function will put the file in an array, much larger, like this
	char inputfilebuff[100]= {"abcdefghijklmnopqrstuvwxyz0123456789"};
	
	//creating a special pointer
	uintptr_t manualpointer=0;//an unsigned integer wide enough to hold a pointer (4 bytes on 32-bit systems, 8 on 64-bit)
								//a plain unsigned int would truncate addresses on 64-bit systems
	//demonstration values
	int val1=1000000;
	float fval1=12345.67;

	//dumping buffer in ascii
	for (int x=0; x<100; x++) cout << inputfilebuff[x] ;
	cout<< endl;

	//stores the address of inputfilebuff in the integer
	manualpointer = (uintptr_t)inputfilebuff;

	//writing to the char array
	*(int*)manualpointer= val1;
	manualpointer += 4;//changes the pointer position in the array
	*(float*)manualpointer= fval1;
	
	//another ascii dump. looks different...
	for (int x=0; x<100; x++) cout << inputfilebuff[x];
	cout<< endl;

	//reading from the char array
	manualpointer -= 4;
	int val2=0;
	val2= *(int*)manualpointer;
	manualpointer += 4;
	float fval2=0.0;
	fval2= *(float*)manualpointer;

	//in and back out... no problems
	cout<<val2<<endl;
	cout<<fval2<<endl;

	cin.get();
//dont forget!
//	delete[] filebuffer;
	return 0;
}

Just fill in your data structure of choice...
Don't forget that manualpointer uses normal integer math and not pointer math.
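To make the difference concrete, here is a small sketch comparing the two kinds of arithmetic. The function names are invented, and it assumes std::uintptr_t is available, which is the case on mainstream platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pointer math: adding 1 to an int* moves by sizeof(int) bytes.
std::size_t pointer_step() {
    int values[2] = {0, 0};
    int* p = values;
    // Measure the distance between p and p + 1 in bytes.
    const std::ptrdiff_t bytes =
        reinterpret_cast<char*>(p + 1) - reinterpret_cast<char*>(p);
    assert(bytes > 0);
    return static_cast<std::size_t>(bytes);
}

// Integer math: adding 1 to an address stored in an integer moves by exactly 1.
std::size_t integer_step() {
    int values[2] = {0, 0};
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(values);
    return static_cast<std::size_t>((address + 1) - address);
}
```

So stepping over one int takes "+= 1" on an int* but "+= sizeof(int)" on an integer that holds the address, which is why the code above adds 4.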
