std::bad_alloc with large std::map and std::unordered_map

I am writing a program that loops through a filelist of files in a user-specified directory. If they are not present in a reference set, the 31-character strings (called kmers) read from each file are encoded as 64-bit integers (uint64_t) and inserted as keys into a std::map (I have also tried using a std::unordered_map with the same result). Each 64-bit integer is inserted into the map as a key for a boost::dynamic_bitset which has the same length as the length of the filelist; then, each position in the bitset represents whether the key is present/absent in the corresponding file (0 = absent, 1 = present). Based on some analysis in Python, for the dataset I'm using the resulting map will contain ~20 million key-value pairs. The actual code is at the end of this message. I am compiling this program with MinGW on 64-bit Windows.

So here's the problem: about midway through the filelist, the program exits with std::bad_alloc . Based on some cout experiments, I think I've established he line triggering the bad_alloc exception is the line where a new key-value pair is inserted (flagged in the code below with an arrow). bad_alloc would tend to indicate I'm out of memory, but I'm certainly not -- my machine has 16 gigabytes of RAM, and based on TaskManager, at the point in time when the exception is thrown, the program is using 2 gigabytes (a lot, but nowhere near the limit). I've checked using map.max_size and map.max_buckets and both indicate that the maximum number of elements I can insert into the map is >>> than the 20 million key-value pairs I'm wanting to insert. Based on the other posts I've read here and at StackOverflow, I suspect this is some problem with dynamic memory allocation -- I don't see why that should be, however, because I haven't specified the allocator, the std default allocator should be used by the map for insertion, which should be able to allocate memory from the heap without any problems.

Any ideas what I'm doing wrong here? Is there something specific I need to do when inserting new key-value pairs into a std::map or std::unordered_map to ensure memory is allocated from the heap and/or to ensure that the program has access to more than just 2 gigabytes of RAM on my 16 gigabyte machine?

The code below BTW is the function that does all the work and this is where the error occurs. I haven't included main or any of the other functions because they aren't directly relevant and none of them are the ones throwing the error.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
  //Loop through all fna files in the designated target directory and count kmers. If a kmer OR its reverse
//complement is present in the refseq it's ignored. All others are added to an unordered map.
//Later we will loop back through the unordered map and eliminate redundant k-mers.
map<uint64_t, boost::dynamic_bitset<> >  parse_target_genomes( const vector<string>& file_list,
                                                              const unordered_set<uint64_t>& refseq_set ) {
    const int num_genomes = file_list.size();
    map<uint64_t, boost::dynamic_bitset<> >  kmer_directory;
    int position_in_bitstring = num_genomes - 1;
    string last_kmer(kmer_length, ' ');
    string current_kmer_string(kmer_length, ' ');
    uint64_t current_kmer_bitset;
    uint64_t reverse_comp;
    bool failure_code = false;
    char current_char;
    int last_empty_kmer_position = 0;
    ifstream current_fna;

    for(auto itr = file_list.begin(); itr!=file_list.end(); itr++) {
        cout << (*itr) << " in progress..." << endl;
        current_fna.open( (*itr) );

        while( current_fna.get(current_char) ){
        //If current char is '>', we're reading a header. Reset the current_kmer_string to whitespace (to discard
        //the last-read kmer) and ignore all characters up until the next '\n'.
            if( current_char == '>' ){
                current_fna.ignore(5000, '\n');
                current_kmer_string.assign(kmer_length, ' ');
                last_empty_kmer_position = 0;
            }
        //If current char is '\n', we should skip it.
        //If current char is anything else, we should read it into our kmer.
            else if(current_char != '\n') {
                //If there is any whitespace left in current_kmer_string, we don't have a full kmer yet. Add the current_char to the kmer.
                if(last_empty_kmer_position < kmer_length ){
                    current_kmer_string[last_empty_kmer_position] = current_char;
                    last_empty_kmer_position += 1;
                }
                //Otherwise, we have an existing kmer. Convert it to the new kmer by combining the last kmer_length - 1 chars with current_char.
                else{
                    last_kmer = current_kmer_string;
                    current_kmer_string = last_kmer.substr(1, kmer_length - 1) + current_char;
                }
                //Check again whether we now have a full kmer and if so translate it to a binary representation.
                if(last_empty_kmer_position == kmer_length ){
                    current_kmer_bitset = kmer_to_bitset(current_kmer_string, failure_code);
                    //If we got a failure code when trying to convert the kmer, it probably contained an n.
                    //So just skip it. Any kmer containing an 'n' value is not useful to us. That's just a place in
                    //the sequence where information is missing.
                    if(failure_code == true)
                        failure_code = false;
                    //Otherwise, check whether kmer or its reverse complement are in the refeseq set. If in
                    //refseq set, ignore, otherwise...
                    else{
                        reverse_comp = kmer_to_reverse_complement(current_kmer_string, failure_code);
                        //There is no situation where we should get a failure code on the reverse complement that we did NOT
                        //get on the forward string, but just in case...
                        failure_code = false;
                        if (refseq_set.count(current_kmer_bitset) == 0 &&
                            refseq_set.count(reverse_comp) == 0 ){
                            //Ok, so this kmer is NOT in the refseq. But is it in the map we're building? Need to check
                            //reverse complement as well of course. If neither is in map, add just the kmer (not its reverse
                            //complement, do not need to store separately) plus a bitstring to indicate which genomes it is found
                            //in (we do not count # occurrences, simply presence or absence).
                            if (kmer_directory.count(current_kmer_bitset) != 0 )
                                kmer_directory[current_kmer_bitset][position_in_bitstring] = 1;
                            else if (kmer_directory.count(reverse_comp) != 0 )
                                kmer_directory[reverse_comp][position_in_bitstring] = 1;
                            else{
                            kmer_directory[current_kmer_bitset] = boost::dynamic_bitset<> (num_genomes, 0);    //Problem line here!!! <-----------
                                kmer_directory[current_kmer_bitset][position_in_bitstring] = 1;
                            }
                        }
                    }
                }
            }
        }
        current_fna.close();
        //Update the bitstring position so we increment the correct position of the bitstring next time.
        position_in_bitstring--;
    }
    return kmer_directory;
}
32-bit processes can't allocate more than 2 GiB of memory, regardless of how much memory the system has. You need to build your program for 64 bits.
MinGW64 should default to 64 bits. https://mingw-w64.org/doku.php
Thanks! You are correct! I switched to MinGW64 and problem solved -- it works fine now for 20 million entries!!! Sorry, clearly a dumb question
How many files are we talking about here?

As others have said, make sure you're compiling for a 64-bit application.
Topic archived. No new replies allowed.