Parsing through large data sets

May 30, 2013 at 10:21pm
Hi everyone,

So I'm attempting to write a program that will parse through a large file (genome sequences), and I'm wondering which of these options I should consider:

a) store the entire genome in memory and then parse through it
b) parse through a file in small portions

If I go with "a", should I just read the file into a vector and then parse through it? And if I go with "b", would I just use an input stream and read it piece by piece?

Thanks in advance for your help.
May 30, 2013 at 10:48pm
Will it fit in memory? If it will, then a vector might do; it depends on how you want to access the data. If you do use a vector, remember to reserve its capacity (or set its size) before filling it, so it doesn't get reallocated repeatedly as it grows.
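
For instance, something along these lines (just a sketch; the read_all name, the binary-mode open and the vector<char> buffer are my own choices for illustration, not something the original poster specified):

    #include <fstream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Read the whole file into one contiguous buffer, sized up front
    // so the vector is allocated exactly once.
    std::vector<char> read_all(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary);
        if (!in)
            throw std::runtime_error("could not open " + path);

        in.seekg(0, std::ios::end);
        std::vector<char> buf(static_cast<std::size_t>(in.tellg()));
        in.seekg(0, std::ios::beg);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return buf;
    }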
May 30, 2013 at 10:54pm
The human genome is roughly 3 billion base pairs; packed at two bits per pair (with no further compression), that comes to around 700 MB. If you can accept using that amount of RAM, plus some overhead, go with a). If not, go with b).
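
If it turns out not to fit, option b) could look roughly like this (just a sketch; the 1 MB block size and the process_in_chunks name are arbitrary choices of mine):

    #include <fstream>
    #include <vector>

    // Walk the file one block at a time instead of loading it all.
    void process_in_chunks(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> block(1 << 20);   // read 1 MB per iteration
        for (;;) {
            in.read(block.data(), static_cast<std::streamsize>(block.size()));
            std::streamsize got = in.gcount();
            if (got == 0)
                break;                      // nothing left (or the stream failed)
            // ... parse block[0 .. got-1] here ...
        }
    }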
May 31, 2013 at 1:18am
Yes, it would fit in memory. Does it make sense, then, to use the iostream library to read all that data into a vector? I have far more theoretical knowledge of C++ than practical knowledge.

Thanks again!
May 31, 2013 at 1:22am
std::vector<bool> is a specialization of the usual std::vector template that optimizes for memory use; I would recommend using it. Just remember that one AT or GC pair is two elements, not one: one bit for the type (AT or GC) and the other for the orientation (GC vs CG, AT vs TA).
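
For example, one way that encoding could look (a sketch only; the push_pair name and the exact bit assignments are mine, the thread doesn't fix them):

    #include <stdexcept>
    #include <vector>

    // Two elements of vector<bool> per base pair:
    // first bit  = type        (0 -> A/T pair, 1 -> G/C pair)
    // second bit = orientation (0 -> A or G on this strand, 1 -> T or C)
    void push_pair(std::vector<bool>& genome, char base)
    {
        switch (base) {
            case 'A': genome.push_back(false); genome.push_back(false); break;
            case 'T': genome.push_back(false); genome.push_back(true);  break;
            case 'G': genome.push_back(true);  genome.push_back(false); break;
            case 'C': genome.push_back(true);  genome.push_back(true);  break;
            default:  throw std::invalid_argument("unexpected character in sequence");
        }
    }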
May 31, 2013 at 11:22am
std::vector<bool> optimises space but compromises access time. 1 GB of RAM is no big deal these days, so I wouldn't use it here.
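
For instance, the straightforward alternative (a sketch; keeping one character per base in a std::string is my choice here, nothing the thread mandates):

    #include <fstream>
    #include <string>

    // One byte per base: costs more memory than bit-packing,
    // but indexing and comparing bases stays simple and fast.
    std::string load_sequence(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::string seq;
        char c;
        while (in.get(c)) {
            // keep recognised bases, drop everything else (newlines etc.)
            if (c == 'A' || c == 'C' || c == 'G' || c == 'T')
                seq += c;
        }
        return seq;
    }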