I am trying to work out the quickest way to run a set of permutations. I have a file of results from another analysis that looks like this:
1234567 0.4837
98765 0.0029
This file is 86 million rows long. I have to look up each integer (the first value in each row) and see whether it is in a list of about 2000 integers; if it is, I do some processing of the float (the second value). After doing this to every row, I then have to permute the look-up list and do it all again - 10 000 times in total.
My problem is the size of the input file. I can't read the whole thing into arrays of integers and floats as I would like - I get an 'out of memory' failure. I know how to read one line at a time, but I fear this would be extremely slow: reading 86m rows, processing each one, and repeating 10 000 times. I already read/processed/output a slightly different 86m-row input file and it took about 3 days - without the 10 000 permutations!
Another option could be to divide the file into large but manageable chunks and read/process/output each chunk. However, I believe the methods to read a 'chunk' would give me a very long array of unformatted characters. How could I then extract one 'row' at a time in order to look up the integer (first value) and process the float (second value)? I have seen classes such as istringstream, but those work on strings - my input chunk is just a very big array of characters, isn't it? How can I separate it into lines and process them with functions like string.substr() to extract the different values?
I'd be very grateful if someone could point me in the right direction - I'd be happy to receive sample code, or just a point in the right direction to some class or function names that would help me. I have searched the help archives but can't find anything quite like what I have.
Instead of reading the whole file, read and process one line at a time. The performance penalty will be unnoticeable, since a lot more time is spent processing the data than reading it.
Use std::getline().
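To make that concrete, here is a minimal sketch of the line-at-a-time approach, assuming each row is two whitespace-separated columns (parse_row, scan_file, and the process comment are made-up names, not anything from your code). Since only membership in the 2000-integer list matters, holding it in a std::unordered_set makes each lookup O(1) on average:

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Parse one "id value" row; returns false on a malformed line.
bool parse_row(const std::string& line, long long& id, double& value) {
    std::istringstream iss(line);
    return static_cast<bool>(iss >> id >> value);
}

// Stream the file once, touching only rows whose id is in the lookup set.
// Memory use stays constant no matter how long the file is.
void scan_file(const std::string& path,
               const std::unordered_set<long long>& lookup) {
    std::ifstream in(path);
    std::string line;
    long long id;
    double value;
    while (std::getline(in, line)) {            // one row at a time
        if (parse_row(line, id, value) && lookup.count(id)) {
            // process(value);  // whatever your analysis does with the float
        }
    }
}
```

The istringstream does exactly what you were asking about: it wraps one line (a std::string) and lets you extract the integer and the float with >>, so you never need substr() arithmetic at all.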
I have a question: why are you permuting the list? From what you said, you're just using the list to find a number in it. The order of the elements doesn't matter for that.
Throw it in a database, then use queries and cursors to return subsets to work with. You could also cluster something like Postgres across multiple systems to spread the load and make it faster.
Thanks to both of you. After having said I didn't think it would be fast enough, last night I tried running a program that did nothing but get each line (while (!getline(infile, line).eof()) with an empty loop body). This only took about 20 seconds, so I think that might be good enough. If not, I will investigate using a database.
I didn't explain the permutation step quite right - the list I am permuting is not the look-up list, but the position of each value in this list will help determine the values for the look-up list for the next permutation. It's all to do with the fact that I have a lot of data that is non-independent, so I can't assume any type of standard null distribution for my test statistic. I have to create a null distribution by permuting my results and re-calculating test statistics under the hypothesis of no association (which should be true when the data is randomly permuted).
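For what it's worth, one permutation step in that kind of test is often just a shuffle of position labels, which each permuted ordering then maps onto the look-up list. A minimal sketch with std::shuffle (permuted_labels is an illustrative name; the real mapping from positions to the look-up list is specific to your analysis):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// One permutation step: shuffle the position labels 0..n-1 so any real
// association is destroyed, then recompute the test statistic under this
// null ordering. Repeating this builds the empirical null distribution.
std::vector<int> permuted_labels(std::size_t n, std::mt19937& rng) {
    std::vector<int> labels(n);
    std::iota(labels.begin(), labels.end(), 0);   // 0, 1, ..., n-1
    std::shuffle(labels.begin(), labels.end(), rng);
    return labels;
}
```

Seeding one std::mt19937 up front and reusing it across all 10 000 permutations keeps the draws independent and the run reproducible.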
Many thanks for helping a newbie (and I hope my statistics lecture wasn't too boring),
Jen