I'm embarking on a project to extract data from a large collection of binary sensor data files. There are many of them (~27 million total, ~1-3 KB each), but they're well organized (by date on the file system, stored on 3 separate external USB drives). "Extracting" means using the binary specification [1] to read out specific byte-field values. I need to make all fields accessible because this could become the basis for GUI or database applications. I've done this in C# but am now attempting it in C (and making it available to C++ as well, for the GUI/DB apps).
I'm using this project to get into multithreading: to learn pthreads and to learn how different strategies for reading these files affect execution time. I can think of two approaches to bulk parsing: (1) each thread reads only selected fields across all of its assigned files, so multiple threads are needed to cover multiple fields; (2) each thread reads all fields from all of its assigned files.
Some questions I have:
1. I don't do any GUI programming (looking into Qt atm), but once I parse the data values, how do I make them available "as soon as they're parsed" / in real time to other applications (say, a Qt program with a basic GUI that displays plots of the data or a progress bar for the parsing)? I could have threads write to a file or a database table, but file I/O is more expensive than working in memory. Should I use TCP/IP sockets, Unix domain sockets, named pipes, or some other interprocess communication mechanism to communicate parsed results?
2. Should each thread use mmap (memory-mapped files) or read the entire file content into memory? I read on Wikipedia that memory-mapping is a good technique for processing manageable chunks of large files (i.e., when the file size is a significant proportion of total memory), but is it a good idea for many small files?
[2] shows that mmap'd data can be shared across processes, but how do I get a Qt C++ application (which might handle just the display) to share the mmap'd region of a C program (which handles just the parsing)? This, I think, I'll investigate separately with a simpler example/exercise.
3. How do I measure performance? A simple approach is to measure and record the elapsed execution time around a snippet of code, but would a profiling tool be more appropriate? I'm certainly interested in how long it takes a thread to parse the fields from a single file, but what should I measure to figure out whether per-thread performance deteriorates as more threads run concurrently?
Sources
----------
[1] The binary format specification (Section 5; Section 7 gives a high-level overview of how to parse one of these binary files):
https://www.glerl.noaa.gov/res/recon/data/misc/adcpbin2txt/WorkHorse_ADCP_Output_Data_Formats.pdf
[2] Lec30 Memory Mapped Files (Arif Butt @ PUCIT) at 11:20
https://www.youtube.com/watch?v=z0I1TlqDi50