Parsing through large data sets

May 30, 2013 at 10:21pm
Hi everyone,

So I'm attempting to write a program that will parse through a large file (genome sequences), and I'm wondering which of these options I should consider:

a) store the entire genome in memory and then parse through it
b) parse through a file in small portions

If I go with "a", should I just read the file into a vector and then parse through it? And if I go with "b", would I just use an input stream and read it piece by piece?

Thanks in advance for your help.
May 30, 2013 at 10:48pm
Will it fit in memory? If it will, then a vector might do; it depends on how you want to access the data. If you do use a vector, remember to reserve its capacity (or set its size) before filling it, so it doesn't get reallocated repeatedly as it grows.
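
For instance, something along these lines (just a sketch; the read_all name, the binary-mode open and the vector<char> buffer are my own choices for illustration, not something the original poster specified):

    #include <fstream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Read the whole file into one contiguous buffer, sized up front
    // so the vector is allocated exactly once.
    std::vector<char> read_all(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary);
        if (!in)
            throw std::runtime_error("could not open " + path);

        in.seekg(0, std::ios::end);
        std::vector<char> buf(static_cast<std::size_t>(in.tellg()));
        in.seekg(0, std::ios::beg);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return buf;
    }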
May 30, 2013 at 10:54pm
The human genome is roughly 3 billion base pairs; packed at two bits per pair (with no further compression), that comes to around 700 MB. If you can accept using that amount of RAM, plus some overhead, go with a). If not, go with b).
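
If it turns out not to fit, option b) could look roughly like this (just a sketch; the 1 MB block size and the process_in_chunks name are arbitrary choices of mine):

    #include <fstream>
    #include <vector>

    // Walk the file one block at a time instead of loading it all.
    void process_in_chunks(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> block(1 << 20);   // read 1 MB per iteration
        for (;;) {
            in.read(block.data(), static_cast<std::streamsize>(block.size()));
            std::streamsize got = in.gcount();
            if (got == 0)
                break;                      // nothing left (or the stream failed)
            // ... parse block[0 .. got-1] here ...
        }
    }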
May 31, 2013 at 1:18am
Yes, it would fit in memory. Does it make sense, then, to use the iostream library to read all that data into a vector? I have far more theoretical knowledge of C++ than practical knowledge.

Thanks again!
May 31, 2013 at 1:22am
std::vector<bool> is a specialization of the usual std::vector template that optimizes for memory use; I would recommend using it. Just remember that one AT or GC pair is two elements, not one: one bit for the type (AT or GC) and the other for the orientation (GC vs CG, AT vs TA).
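
For example, one way that encoding could look (a sketch only; the push_pair name and the exact bit assignments are mine, the thread doesn't fix them):

    #include <stdexcept>
    #include <vector>

    // Two elements of vector<bool> per base pair:
    // first bit  = type        (0 -> A/T pair, 1 -> G/C pair)
    // second bit = orientation (0 -> A or G on this strand, 1 -> T or C)
    void push_pair(std::vector<bool>& genome, char base)
    {
        switch (base) {
            case 'A': genome.push_back(false); genome.push_back(false); break;
            case 'T': genome.push_back(false); genome.push_back(true);  break;
            case 'G': genome.push_back(true);  genome.push_back(false); break;
            case 'C': genome.push_back(true);  genome.push_back(true);  break;
            default:  throw std::invalid_argument("unexpected character in sequence");
        }
    }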
May 31, 2013 at 11:22am
std::vector<bool> optimises space but compromises access time. 1 GB of RAM is no big deal these days, so I wouldn't use it here.
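
For instance, the straightforward alternative (a sketch; keeping one character per base in a std::string is my choice here, nothing the thread mandates):

    #include <fstream>
    #include <string>

    // One byte per base: costs more memory than bit-packing,
    // but indexing and comparing bases stays simple and fast.
    std::string load_sequence(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::string seq;
        char c;
        while (in.get(c)) {
            // keep recognised bases, drop everything else (newlines etc.)
            if (c == 'A' || c == 'C' || c == 'G' || c == 'T')
                seq += c;
        }
        return seq;
    }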