I need to parse 100k+ binary data files (1-3KB each); any advice?

I'm embarking on a project to extract data from many binary sensor data files. There are a lot of them (~27 million total, ~1-3KB each), but they're well organized (by date on the file system, stored on 3 separate external USB drives). "Extracting" means using the format's binary specification to read out specific byte-field values. I need to make all fields accessible because this could be the basis for GUI or database applications. I've done this in C# but am now trying to do it in C (making it available for C++ as well, for GUI/DB apps).

I'm using this project to get into multithreading: to learn how to use pthreads and how different strategies for reading these files affect execution time. There are two approaches to bulk parsing I can think of: (1) each thread reads just selected fields across all of its assigned files, so multiple threads are needed to cover multiple fields; (2) each thread reads all fields from all of its assigned files.

Some questions I have:

1. I don't do any GUI programming (looking into Qt atm), but once I parse data values, how do I make them available "as soon as they're parsed"/in real time to other applications (say, a Qt program with a basic GUI that displays plots of the data or a progress bar for the parsing)? I could have threads write to a file or a database table, but file I/O is more expensive than working in memory. Should I use TCP/IP sockets, named pipes, or some other interprocess communication mechanism to communicate parsed results?

2. Should I have each thread use mmap (memory-mapped files) or read the entire file content into memory? I read on Wikipedia that memory-mapping is a good technique for processing manageable chunks of large files (i.e., when the file size is a significant proportion of total memory), but is it a good idea for many small files?

[2] shows that mmap'd data can be shared across processes, but how do I get a Qt C++ application (which might just handle the display) to share the mmap'd region of a C program (which just handles the parsing)? This, I think, I'll investigate separately with a simpler example/exercise.

3. How do I measure performance? A simple approach is just to measure and record the elapsed time around a snippet of code, but is a profiling tool more appropriate? I'm certainly interested in how long it takes a thread to parse the fields from a single file, but what should I measure to figure out whether per-thread performance deteriorates as more threads run concurrently?

Sources
----------
[1] The binary format can be found here (Section 5; Section 7 gives a high-level overview of how to parse one of these binary files):

https://www.glerl.noaa.gov/res/recon/data/misc/adcpbin2txt/WorkHorse_ADCP_Output_Data_Formats.pdf

[2] Lec30 Memory Mapped Files (Arif Butt @ PUCIT) at 11:20
https://www.youtube.com/watch?v=z0I1TlqDi50
If I read that right, you have 27 million * 3000 bytes, roughly 1GB? That can all go into memory, either as one chunk that is fed by all the threads or as a bunch of chunks.
One big memory-mapped file seems most efficient to me, but it depends somewhat on your needs too, which I don't know enough about.
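For the many-small-files case, here is a minimal sketch of mapping one file with POSIX mmap (assuming Linux/macOS) and pulling a little-endian field out of it. The helper names are mine, and at 1-3KB per file a plain read() may well be just as fast; mmap mostly saves a copy and plays nicely with multiple reader threads:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Map a whole file read-only and copy its bytes out.
std::vector<uint8_t> map_and_copy(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return {}; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    if (p == MAP_FAILED) return {};
    std::vector<uint8_t> bytes(static_cast<uint8_t*>(p),
                               static_cast<uint8_t*>(p) + st.st_size);
    munmap(p, st.st_size);
    return bytes;
}

// Read a little-endian uint16 field at a byte offset (multi-byte
// fields in this kind of spec are typically little-endian).
uint16_t le16_at(const std::vector<uint8_t>& b, size_t off) {
    return static_cast<uint16_t>(b[off] | (b[off + 1] << 8));
}
```

In a real parser you would walk the format's header to find each field's offset rather than hard-coding it.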

Do both in C++. There isn't any advantage to using C here, and I think your shared memory may be easier to code that way (dunno, I haven't done pure C in a long time, but it usually makes things a little harder).
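On the shared-memory point: one route that works from both the parser and a separate Qt viewer process is a file-backed MAP_SHARED mapping, since every process that maps the same file shares the same pages. This is only a sketch (the path and function name are made up), and a real parser/viewer pair would also need some "data ready" signalling, e.g. a counter field or a pipe:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file read-write and shared. Writes made through one mapping
// become visible to every other process (or mapping) of the same file.
void* map_shared_file(const char* path, size_t size) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, size) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? nullptr : p;
}
```

The parser process would create and fill the region; the Qt side opens the same path and maps it the same way.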

I glanced at the file format and it appears to use variable-length records, which is a shame. It would be done in seconds (or possibly sub-second!) if it were fixed-width. Bloody government programmers (I used to be one, so I can say that!).

Performance is best measured by what you want to accomplish. Often, 'wall clock' time is superior info to 'CPU clock' time for multithreading. No one cares that it took 100 hours of CPU time; they want to know that your 1000 CPUs cut it down to 5 seconds total, usually. If you need to know how much hardware you are burning, then the other question becomes more relevant. Using a lot of hardware for a small time period isn't usually a problem, though. And this is a small problem, if it's 1 GB of total info. It seems like a lot, but I parse text single-threaded in the 3-4 GB realm (XML files, a wretched bloated format) and it takes only a few moments on a fairly crappy laptop.
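A minimal sketch of the wall-clock vs CPU-clock distinction, using std::chrono for wall time and std::clock for process CPU time (the helper name is mine). On POSIX, std::clock accumulates CPU time across all threads of the process, so cpu/wall roughly tells you how many cores you actually kept busy:

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <utility>

// Run a callable once and return {wall seconds, CPU seconds}.
// With N busy threads, CPU time is roughly N * wall time; if that
// ratio stays flat as you add threads, the threads aren't helping.
template <typename F>
std::pair<double, double> time_it(F&& work) {
    auto w0 = std::chrono::steady_clock::now();
    std::clock_t c0 = std::clock();
    std::forward<F>(work)();
    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();
    double wall = std::chrono::duration<double>(w1 - w0).count();
    double cpu  = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    return {wall, cpu};
}
```

Note that std::chrono::steady_clock is the right wall clock for intervals; system_clock can jump if the machine's time is adjusted.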

If you did it in C#, you should be able to get it working without rewriting it, and as much as I hate C#, it's a decent language and its programs are not painfully slow. I would revisit making them work together; a rewrite may save you a few seconds tops, probably less, if your C# is efficient.
you have 27 million * 3000 bytes, roughly 1GB?


I may have remembered wrong, but I'm 95% sure I actually have 27 GB of data and many more binary files.

Do both in C++. There isn't any advantage to using C here ...

Ok

I glanced at the file format and it appears to use variable-length records, which is a shame. It would be done in seconds (or possibly sub-second!) if it were fixed-width. Bloody government programmers (I used to be one, so I can say that!).


The format was actually developed by one instrument manufacturer (TRDI), whose dominant presence in the ocean-current-measurement industry made their data format a de facto standard. Compared with how other companies designed and documented their data formats, this spec is "one of the best".

Performance is best measured by what you want to accomplish. Often, 'wall clock' time is superior info to 'CPU clock' time for multithreading ... If you need to know how much hardware you are burning, then the other question becomes more relevant.


In this case, I'm more interested in learning how to evaluate thread performance given any task (not just this particular file-parsing task, which is really trivial). For example, I'm going to play with a signal processing library later on, computing FFTs and convolutions, so it'd be nice to apply the same measurement techniques in C++ to evaluate performance.

Will keep your remark about 'wall clock time' in mind. It makes sense and I see it in other programs too.

If you did it in C#, you should be able to get it working without rewriting it, and as much as I hate C#, it's a decent language and its programs are not painfully slow. I would revisit making them work together; a rewrite may save you a few seconds tops, probably less, if your C# is efficient.


Nothing against C#. This is mostly to learn about multithreading in C/C++.
OK. Threading it will certainly go faster; having each thread handle one file at a time is an easy way to break it up.

27 GB still fits in RAM on most boxes, but one single file/block may be asking a lot depending on the hardware.

So: it breaks up well for threading, but it's too much to dump into memory at once unless you're on a higher-end machine, which raises the question of how to share it. A few memory-mapped files, or a bunch; both work. I've not tried a large number of mapped files before; maybe someone else can comment on whether that is good, bad, or ugly.

Threading is threading; the syntax will be a bit different, but it's surely pretty close to C#'s take on it. This will be good practice, as I feel you can most likely avoid race conditions if you set up the target (the output) in a friendly way that doesn't need locking.
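One way to sketch the "each thread handles one file at a time" split without any locks: workers claim the next file index through an atomic counter, and each file gets its own result slot, so no two threads ever write the same memory (parse_file here is a stand-in for the real parser; all names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Lock-free work splitting: each worker claims the next unprocessed
// file index via fetch_add, so files are never handled twice, and each
// thread writes only its own slot in the results vector.
void parse_all(const std::vector<std::string>& paths,
               unsigned nthreads,
               std::vector<size_t>& results,
               size_t (*parse_file)(const std::string&)) {
    results.assign(paths.size(), 0);
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&] {
            for (size_t i = next.fetch_add(1); i < paths.size();
                 i = next.fetch_add(1)) {
                results[i] = parse_file(paths[i]);  // distinct slot per file
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

The atomic cursor also load-balances for free: fast threads simply grab more files, which matters when record counts vary per file.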