Thanks for the responses. I did some simple benchmarking and the biggest time consumer is the read and rewrite operation (which makes sense).
One of the most common mistakes people make when working with large vectors is letting the vector constantly reallocate itself as it grows. Every time that happens, new memory has to be allocated for the larger vector, the old contents have to be copied over, and the old space has to be released. This is very time consuming.
If you have an idea of how large the vector needs to be, use vector::reserve up front to allocate sufficient capacity.
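For illustration, a minimal sketch of that pattern, assuming roughly 10 million samples (a figure taken from later in the thread; substitute your own estimate):

#include <cstddef>
#include <vector>

int main() {
    // Assumed figure: ~10 million samples, as mentioned later in the thread.
    const std::size_t expectedPoints = 10000000;

    std::vector<double> samples;
    samples.reserve(expectedPoints);     // one up-front allocation

    // push_back no longer triggers a reallocation until the capacity is exceeded
    for (std::size_t i = 0; i < expectedPoints; ++i)
        samples.push_back(0.0);          // stand-in for a parsed value
}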
http://www.cplusplus.com/reference/vector/vector/reserve/?kw=vector%3A%3Areserve |
I'll try this; putting the data into the variables still takes an appreciable amount of time, and this might speed it up.
I/O is slow. Unless you get rid of the write, I'm not sure you can speed things up appreciably. Can't you just skip writing the temporary files, and if you need the backup data just read the original file again, but grab the other columns? |
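To illustrate the suggestion in that reply, here is a rough sketch of re-reading the original file and pulling out a single column. The readColumn name, the whitespace-separated layout, and the zero-based column index are all assumptions, not code from the thread:

#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Pull one whitespace-separated column out of a text data file.
std::vector<double> readColumn(const std::string& path, std::size_t column) {
    std::ifstream in(path);
    std::vector<double> values;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        double v = 0.0;
        bool ok = true;
        // read and discard the columns before the one we want; keep the last value read
        for (std::size_t c = 0; c <= column; ++c)
            if (!(fields >> v)) { ok = false; break; }
        if (ok)
            values.push_back(v);
    }
    return values;
}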
This is how I originally had things. The problem is that the data is not serialized very well... each column in the data file may consist of a different number of digits. For example, there is a time column and some of its members may look like this:
.00001
1.005
1093.29031
This means that I can't easily seek to a specific line in the data file, since the lines have a variable number of characters. To get to a certain line, I have to call a readLine() function once per line, and that is very expensive time-wise. So, say I have a 100 second file recorded at a 100 kHz sample rate; that file has 10 million data points. If I want to access the 9 millionth data point, for example, I have to call readLine() nine million times. Following that example, if I partition the large file into, say, 5 files of 2 million points each, then to get the 9 millionth data point I only need to open the last of the 5 files and call readLine() about one million times.
That was my rationale, at least.
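As a rough sketch of the cost being described, the helper below reaches a given line by discarding whole lines one at a time; readLineAt is a hypothetical name, with std::istream::ignore standing in for the readLine() call mentioned above:

#include <fstream>
#include <limits>
#include <string>

// Reach a given 1-based line number by discarding whole lines one at a time.
// The cost is linear in the target line, which is exactly the problem described above.
std::string readLineAt(std::ifstream& in, long long target) {
    for (long long i = 1; i < target; ++i)
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    std::string line;
    std::getline(in, line);
    return line;
}

With the 5-way split, each partition holds about 2 million lines, so roughly (point - 1) / 2000000 gives the partition index and (point - 1) % 2000000 the offset within it. The 9 millionth point then lands in the last partition, one million lines in, so the skip drops from about nine million lines to about one million.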