What's the total memory used by my set<ulong>?

Greetings,

Is there any way to get the total memory used by a set of ulongs (or any object for that matter)? I'm trying to shove at most 1,045,898,160 unsigned longs into a set. By my count that would come to 4,183,592,640 bytes, or about 3.9 GB. However, my program exceeds my 15.7 GB RAM limit. Why is this? How many more bytes does the set container use for every element?


Thanks in advance.
Several more. std::set is usually implemented as a red-black tree, which means each node contains at least two pointers to its children, its color, and the data itself. Since you appear to be on a 64-bit system, I'd say each node uses 8*2+8+8 = 32 bytes including padding, not counting the overhead of heap allocation. Using these figures, the set would use 31,918.3 MiB.
Actually, it would be far more efficient to store a bitmap 2^29 bytes long describing which integers an equivalent set would contain.

Why do you need to do this, anyway?
WOW, I definitely can't use std::set then. Essentially, I need to know the size of the intersection of two sets of DNA sequences. Each set has about 26 million sequences, all of which are 51 nucleotides long. Since I'm working on a 64-bit system, the longest sequence I can represent as an integer is 32 nucleotides (each of the four bases can be represented as 00, 01, 10, and 11). So there are 20 'words' of size 32 in a 51-nucleotide-long sequence (plus 20 more for the complementary strand). So that's where 1 billion integers comes from (26 million x 40).

Finding the intersection would be really easy if I had all these numbers for each set sorted and printed to a file. Would I be able to hold 1 billion unsigned longs in a std::vector? If so, how long do you think sort() would take? I guess I could try it. I would only need to make minor changes in the code.

I don't really know anything about bitmaps, but would one be able to hold the numbers in a sorted fashion? That would be great, since the bitmap would only use about .5 GB of RAM.
A bitmap is just what its name suggests: an array of bits.
Since all a set does is hold whether a value was added to it or not, when handling integers it's sometimes more efficient to allocate a bitmap and set bits as the integers are found. The problem is that it involves bit twiddling, but the advantage is that it's the most space-efficient representation.
You can use std::vector<bool>, which is a template specialization that creates a vector as a bitmap.
//with std::set:
std::set<unsigned long> set;
set.insert(1478523);
bool exists=set.find(1478523)!=set.end();

//with std::vector<bool>:
std::vector<bool> set(1ull<<32); //2^32 elements long. Will only use 2^29 bytes. (1ull, so the shift doesn't overflow a 32-bit int.)
set[1478523]=true;
bool exists=set[1478523];


EDIT: Also, finding an element in a set takes O(log n) time, while in a vector it takes O(1) time.
Thanks for the input again, but std::vector<bool> isn't going to cut it. I must be able to record the presence of any number between 0 and 4^32 (4^32 being the total number of possible sequences of length 32). A std::vector<bool> of that size would take up an absurd amount of memory. Instead I'm going to shove all the numbers into an array and perform a sort() on it. If the vector doesn't use up extra memory the way std::set does, that would put me right under my RAM limit. Hopefully it doesn't take years to sort.

Alas, if the sequences were not as long as they are, your std::vector<bool> idea would work very nicely.
A std::vector<bool> of that size would take up an absurd amount of memory.
Didn't you see the comment in the std::vector<bool> example above? A std::vector<bool> of n elements uses n/8 bytes (technically, it may use more or less depending on the size of a byte on the platform in question). 2^32 elements would use just 512 MiB.
Is my math off? 51 nucleotides == 102 bits of data == 13 bytes of data.

You claim to need: 20 * 32 = 640 bits of data, or 80 bytes of data to represent 51 nucleotides.

Why the massive difference?

@helios
Sorry I wasn't very clear. Since I need to know which numbers between 0 and 4^32 are in the set, for my purposes the vector declaration above would have to read:

std::vector<bool> set(/* 4^32 = 2^64 elements */);

4^32 bits = 1.84467441 × (10^19) bits = 2.14748365 × (10^9) GiB <- No can do!

@PanGalactic
No, you have it wrong. I can represent a sequence of length 32 as one number (4 bytes, an unsigned long). If you take a sliding window of size 32 over 51 nucleotides, you will find it contains twenty 32-nucleotide-long sequences ((sequence length - window size) + 1). BUT you also have to consider the complementary strand (another 20 sequences). So I need a total of 40 unsigned longs to represent one 51-nucleotide sequence. That comes out to 40*4=160 bytes (1280 bits).
Maybe you could break this up into smaller chunks. Instead of doing everything as a set / vector... perhaps split the task.

Maybe have 0x10000 sets of vectors (or deques). You can use the high bits to determine which set the number belongs in.

This would consume more memory than a single massive vector/deque but would make sorting a bit faster because the sets would be like a preliminary sort.



Although I wonder if a massive set might be the better approach anyway. Even if it sucks up RAM, can't you use virtual memory? There's always room on the HD. 15.7 GB is nothing. I have more than that in music.
I don't get it. What's the point of the sliding window?
You can only represent 16 nucleotides in 4-bytes (32-bits) @ 2 bits per nucleotide.

Why would you expand the data in this manner when memory and memory bandwidth are the primary limitation in modern hardware?
Could this just be done on disk?

For example, in Linux:
> sort -o file1 file1
> sort -o file2 file2
> comm -12 file1 file2 | wc -l

(comm -12 prints the lines common to both sorted files; diff would only show you the differences.)
@Disch
I wouldn't know how to do that. I've only been programming in C for about a year.

@helios
It's biologically relevant. Any portion of two sequences could be the same (e.g. the last 25 of one sequence could be the same as the first 25 of another, etc.). I would like to know this, as it's likely that they were both sequenced from the same microbial genome. I wouldn't be able to tell if I compared the reads as single 51-nucleotide-long sequences.

@PanGalactic
You are right. My ulongs are actually 8 bytes long (I'm using a 64-bit system). So I'm really using 40*8=320 bytes per sequence.

@moorecm
Would that be faster or slower than doing it in memory?

Looking at the STL sort() function, I found that it takes iterators. Looks like I'll have to try a vector<ulong> after all.
Well, you have to make some compromises if you don't want to buy 30 GiB of RAM. What's the problem with using a short array for a sequence and then comparing it to another in every possible way? Sure, it'll take longer, but it's a trade-off.
Well, I didn't have to buy 30 GB of RAM. It took about 9 GB to hold 1 billion 8-byte integers, and it sorted in 2 minutes and 7 seconds. Does that sound about right? I head'ed and tail'ed the file to which I wrote the vector, and it looked like everything was sorted. This surprised me; I thought it would have taken a lot longer! Performing the intersection is now trivial.

Anyway my initial question was answered a while ago so thanks everyone!