I am trying to work out the quickest way to run a set of permutations. I have a file of results from another analysis that looks like this:
1234567 0.4837
98765 0.0029
This file is 86 million rows long. I have to look up each integer (the first value in each row) and see whether it is in a list of about 2000 integers; if it is, I do some processing of the float (the second value). After doing this to every row, I then have to permute the look-up list and do it all again - 10 000 times in total.
My problem is the size of the input file. I can't read the whole thing into arrays of integers and floats as I would like - I get an 'out of memory' failure. I know how to read one line at a time, but I fear this would be extremely slow: reading 86m rows, processing each one, and repeating 10 000 times. I already read/processed/output a slightly different 86m-row input file and it took about 3 days - without the 10 000 permutations!
Another option could be to divide the file into large but manageable chunks and read/process/output each chunk. However, I believe the methods to read a 'chunk' would give me a very long array of unformatted characters. How could I then extract one 'row' at a time in order to look up the integer (first value) and process the float (second value)? I have seen classes such as istringstream, but those work on strings - my input chunk is just a very big array of characters, isn't it? How can I separate it into lines and process them with functions like string.substr() to extract the different values?
I'd be very grateful if someone could point me in the right direction - I'd be happy to receive sample code, or just a point in the right direction to some class or function names that would help me. I have searched the help archives but can't find anything quite like what I have.
Instead of reading the whole file, read and process one line at a time. The performance penalty will be unnoticeable, since a lot more time is spent processing the data than reading it.
Use std::getline().
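To make that concrete, here is a minimal sketch of the line-at-a-time approach, assuming each row is two whitespace-separated columns (parse_row, scan_file, and the process comment are made-up names, not anything from your code). Since only membership in the 2000-integer list matters, holding it in a std::unordered_set makes each lookup O(1) on average:

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Parse one "id value" row; returns false on a malformed line.
bool parse_row(const std::string& line, long long& id, double& value) {
    std::istringstream iss(line);
    return static_cast<bool>(iss >> id >> value);
}

// Stream the file once, touching only rows whose id is in the lookup set.
// Memory use stays constant no matter how long the file is.
void scan_file(const std::string& path,
               const std::unordered_set<long long>& lookup) {
    std::ifstream in(path);
    std::string line;
    long long id;
    double value;
    while (std::getline(in, line)) {            // one row at a time
        if (parse_row(line, id, value) && lookup.count(id)) {
            // process(value);  // whatever your analysis does with the float
        }
    }
}
```

The istringstream does exactly what you were asking about: it wraps one line (a std::string) and lets you extract the integer and the float with >>, so you never need substr() arithmetic at all.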
I have a question: why are you permuting the list? From what you said, you're just using the list to find a number in it. The order of the elements doesn't matter for that.
Throw it in a database, then use queries and cursors to return subsets to work with. You could also cluster something like Postgres across multiple systems to spread the load and make it faster.
Thanks to both of you. After having said I didn't think it would be fast enough, last night I tried running a program that did nothing but get each line (while (!getline(infile, line).eof()) with an empty loop body). This only took about 20 seconds, so I think that might be good enough. If not, I will investigate using a database.
I didn't explain the permutation step quite right - the list I am permuting is not the look-up list, but the position of each value in this list will help determine the values for the look-up list for the next permutation. It's all to do with the fact that I have a lot of data that is non-independent, so I can't assume any type of standard null distribution for my test statistic. I have to create a null distribution by permuting my results and re-calculating test statistics under the hypothesis of no association (which should be true when the data is randomly permuted).
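For what it's worth, one permutation step in that kind of test is often just a shuffle of position labels, which each permuted ordering then maps onto the look-up list. A minimal sketch with std::shuffle (permuted_labels is an illustrative name; the real mapping from positions to the look-up list is specific to your analysis):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// One permutation step: shuffle the position labels 0..n-1 so any real
// association is destroyed, then recompute the test statistic under this
// null ordering. Repeating this builds the empirical null distribution.
std::vector<int> permuted_labels(std::size_t n, std::mt19937& rng) {
    std::vector<int> labels(n);
    std::iota(labels.begin(), labels.end(), 0);   // 0, 1, ..., n-1
    std::shuffle(labels.begin(), labels.end(), rng);
    return labels;
}
```

Seeding one std::mt19937 up front and reusing it across all 10 000 permutations keeps the draws independent and the run reproducible.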
Many thanks for helping a newbie (and I hope my statistics lecture wasn't too boring),
Jen