Fastest way to read data into a vector?

Can someone tell me if there is a faster way than this?

My text file has 40 million lines, so basically I am storing 40 million strings in my vector.

    ifstream infile;
    infile.open("numbers.txt");

    if(!infile)   // if, not while: one failed check is enough to bail out
    {
         exit(EXIT_FAILURE);
    }

    vector<string> lotto;
    string getlottoline;

    while(getline(infile, getlottoline))   // testing getline itself avoids
    {                                      // pushing a stray duplicate last line
         lotto.push_back(getlottoline);
    }
What is the maximum size of the strings?
Do they contain spaces?
How are they stored in the file?

Why do you need to read them all in at once?
Reading directly into a string in the vector may be faster.

#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <cstdlib>
#include <chrono>

std::vector<std::string> read_lines( const std::string& path_to_file )
{
    if( std::ifstream file{path_to_file} )
    {
         std::vector<std::string> lines(1) ;
         // read each line directly into the string at the back of the vector,
         // appending a fresh empty string after every successful read
         while( std::getline( file, lines.back() ) ) lines.emplace_back() ;
         lines.pop_back() ; // discard the empty string left over by the last read
         return lines ;
    }

    else return {} ; // failed to open file, return empty vector
}

int main()
{
    const std::string file = "/tmp/test.txt" ;

    {
        std::ofstream test_file(file) ;
        // build test data: 25 concatenated copies of a large system header
        for( int i = 0 ; i < 25 ; ++i ) test_file << std::ifstream{ "/usr/include/elf.h" }.rdbuf() ;
        // std::system( "sync; echo 3 > /proc/sys/vm/drop_caches" ) ;
    }
    
    const auto start = std::chrono::steady_clock::now() ; 
    const std::vector<std::string> lines = read_lines(file) ;
    const auto end = std::chrono::steady_clock::now() ; 
    const auto millisecs = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() ;
    
    std::cout << lines.size() << " lines were read in " << millisecs << " millisecs\n" ;
}

http://coliru.stacked-crooked.com/a/6b26ef889217a6f2
> Can someone tell me if there is a faster way than this?
And then what?

I mean, "OK, so now you've got 40M records in memory".
Presumably you have a plan to do something with all that data.
How does the performance of "the something" compare to the act of reading the file in the first place?

For example, if "the something" takes 10 minutes, then the plus or minus a few seconds it takes to read the file one way or another isn't going to change the fact that it takes about 10 minutes.

You can only optimise WHOLE programs based on actual profile data, not random fragments based on your gut feeling.

Next, understand that the vector itself isn't necessarily the elephant in the room. Just reading the file and doing nothing else at all is expensive in itself. A spinning disk does one revolution about once every 10 ms. That's an awfully long time compared to the nanoseconds it takes your CPU to execute instructions.

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <chrono>
using namespace std;

string test0(void) {
    ifstream infile;
    infile.open("foo_40m.txt");

    string getlottoline;

    while(getline(infile, getlottoline))
    {
    }
    return getlottoline;
}

string test1(void) {
    ifstream infile;
    infile.open("foo_40m.txt");

    vector<string> lotto;
    string getlottoline;

    while(getline(infile, getlottoline))
    {
         lotto.push_back(getlottoline);
    }
    return lotto[10];
}

string test2(void) {
    ifstream infile;
    infile.open("foo_40m.txt");

    vector<string> lotto;
    lotto.reserve(50000000);
    string getlottoline;

    while(getline(infile, getlottoline))
    {
         lotto.push_back(getlottoline);
    }
    return lotto[10];
}


int main()
{
    typedef string (*fnt)(void);
    struct {
        string description;
        fnt    func;
    } tests[] = {
      { "Basic read", test0 },
      { "Store in vector", test1 },
      { "Store in reserve vector", test2 },
    };
    for ( int i = 0 ; i < 3 ; i++ ) {
        // Thanks to JLBorges for the timing code
        const auto start = std::chrono::steady_clock::now() ;
        std::string result = tests[i].func();
        const auto end = std::chrono::steady_clock::now() ;
        const auto millisecs = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "Test=" << tests[i].description
                  << ", Result=" << result
                  << ", Time=" << millisecs
                  << endl;
    }

    return 0;
}

$ g++ -std=c++11 proj.cpp
$ ./a.out 
Test=Basic read, Result=, Time=1289
Test=Store in vector, Result=Line 10, Time=4677
Test=Store in reserve vector, Result=Line 10, Time=2568
$ ./a.out 
Test=Basic read, Result=, Time=1309
Test=Store in vector, Result=Line 10, Time=4682
Test=Store in reserve vector, Result=Line 10, Time=2571
$ ./a.out 
Test=Basic read, Result=, Time=1365
Test=Store in vector, Result=Line 10, Time=4669
Test=Store in reserve vector, Result=Line 10, Time=2566
$ g++ -O2 -std=c++11 proj.cpp
$ ./a.out 
Test=Basic read, Result=, Time=1111
Test=Store in vector, Result=Line 10, Time=2722
Test=Store in reserve vector, Result=Line 10, Time=1821
$ ./a.out 
Test=Basic read, Result=, Time=1176
Test=Store in vector, Result=Line 10, Time=2720
Test=Store in reserve vector, Result=Line 10, Time=1817
$ ./a.out 
Test=Basic read, Result=, Time=1162
Test=Store in vector, Result=Line 10, Time=2712
Test=Store in reserve vector, Result=Line 10, Time=1814

Compare the unoptimised "read into a vector" with the optimised "read into a reserved vector".

Also note that optimisation basically has no effect on the time it takes to just read the file. That's because -O2 doesn't change any laws of physics. Your disk still spins at the same speed.
Reading into a vector is just a part of my program.

I just wanted to know if there is any faster way than a conventional read.
The strings are stored like this:

1 2 3 4 5 6
7 8 9 10 11
12 13 14 15
16 17 18 19
20 21 21 22
...
and so on

So each element of the vector will hold one line.
Will bookmark this thread.
> Reading into a vector is just a part of my program.
My point exactly, just a part.
You don't know whether it's on the critical path at this moment.

Save your effort and make the whole program work in the first instance using idiomatic constructs.

People spend months chasing microseconds in the wrong place. Your first order of business is delivering something that works. Your optimisation might be very clever, but your users are left waiting because you're late delivering it.

Then (and only then) start to run profiling with real-world data.

Here's why:
1. Your program is complete and as far as you can tell, bug-free.
2. You have real data.
3. You have real test cases.
4. You have a source control system in place (git, perforce, etc).

The real data and test cases, along with the profile data, tell you where the real hot-spots are. These are the cases your users will care about, and the ones they'll blog about, saying your program sucks.

You have a working program under source control. This means you can make changes in a controlled environment. You make a change, make sure the program still works, and then re-profile the tests.

Then you ask yourself some questions:
1. Does this change actually improve things? If it doesn't, you can back out the change and try something else.
2. If it does improve things, does the improvement justify the increased complexity or decreased maintainability of the code?
If you are in control of this stuff, a binary file might be better.
The number 123 as text is 3 bytes, but stored in binary it fits in 1 byte. The spaces, end-of-line markers, and other fluff characters add even more. That whole file is very likely at least 4, if not 8 or 16, times bigger than it needs to be. It's a bit large to be opening in a text editor, and it does not appear to be 'free form' text, so why is it not in binary?
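For illustration, a minimal sketch of that idea, assuming each value fits in one byte (as 1..49-style lottery numbers would); the file name "numbers.bin" is made up:

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    // Write: the text "1 2 3 4 5 6\n" (12 bytes) shrinks to 6 bytes of binary.
    {
        const std::vector<std::uint8_t> draw = { 1, 2, 3, 4, 5, 6 };
        std::ofstream out( "numbers.bin", std::ios::binary );
        out.write( reinterpret_cast<const char*>(draw.data()), draw.size() );
    }

    // Read the whole binary file back with a single read() call.
    std::ifstream in( "numbers.bin", std::ios::binary | std::ios::ate );
    const auto size = static_cast<std::size_t>( in.tellg() );
    std::vector<std::uint8_t> bytes(size);
    in.seekg(0);
    in.read( reinterpret_cast<char*>(bytes.data()), size );
}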

Also, if you are willing to break it apart yourself, one big read (just get every byte in the file into a vector of bytes) is sometimes better than looped getlines (but the I/O is buffered for you, so the difference depends on your hardware and more). Then you have to deal with it being in a clunky format on the back end, though... roughly like the sketch below.
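A rough sketch of that "one big read", using the OP's numbers.txt; whether it actually beats looped getline is hardware-dependent, as said:

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in( "numbers.txt", std::ios::binary | std::ios::ate );
    if( !in ) return 1;

    // one big read: every byte of the file into a single buffer
    std::string bytes( static_cast<std::size_t>(in.tellg()), '\0' );
    in.seekg(0);
    in.read( &bytes[0], bytes.size() );

    // the clunky back end: split the buffer on '\n' yourself
    std::vector<std::string> lines;
    std::size_t start = 0;
    for( std::size_t pos ; (pos = bytes.find('\n', start)) != std::string::npos ; start = pos + 1 )
        lines.emplace_back( bytes, start, pos - start );
    if( start < bytes.size() ) lines.emplace_back( bytes, start ); // no trailing newline
}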

I agree with the above. Get it working first. And do not test with the whole big file; test with the first 10 lines in a new file or something. I see new guys all the time trying to test their code on a full bajillion rows of data to see if it works, and waiting 30 minutes to find out they had a bug. That is for much later... early on, it needs to run in a second or two so you can see any glaring problems without the wait.
Reserve space in the vector for the approximate number of lines as salem C has indicated.

Use C I/O instead of C++. You might experiment with a larger I/O buffer for the C I/O too if your system supports setvbuf().
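A quick sketch of that suggestion, again assuming short lines and the same file name as before; the 1 MiB buffer is an arbitrary starting point to experiment with:

#include <cstdio>
#include <string>
#include <vector>

int main()
{
    std::FILE* f = std::fopen( "numbers.txt", "r" );
    if( !f ) return 1;

    static char iobuf[1 << 20];                       // 1 MiB stdio buffer
    std::setvbuf( f, iobuf, _IOFBF, sizeof iobuf );   // must precede any other I/O on f

    std::vector<std::string> lotto;
    char line[256];                                   // assumes lines are short
    while( std::fgets( line, sizeof line, f ) )
    {
        std::string s(line);
        if( !s.empty() && s.back() == '\n' ) s.pop_back(); // strip the newline
        lotto.push_back( std::move(s) );
    }
    std::fclose(f);
}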
As others have already pointed out, there's a lot to consider with such volume.

Most are posting on the assumption that the data is being read from a hard disk, citing disk rotation and head seek times, but such drives can sustain data transfer in the range of 150 Mbytes per second in a stream under best-case conditions, dipping down to below 5 Mbytes per second for lots of complex reasons. SSDs can more easily sustain data transfer of about 250 Mbytes per second, and some reach beyond 1 Gbyte per second. For scale, 40 million short lines is on the order of half a gigabyte, so the raw transfer alone costs a few seconds at hard-disk speeds.

That said, as one poster noted (at least, I skimmed), the re-allocation of a dynamically expanding vector is a performance issue in itself. Seeing that in your code example makes me wonder what the overall purpose is. This, too, other posters have asked ("what next?"). A lot depends on that context.

Is this a school project, some personal study, a personal or academic research task?

In my own work I've found memory-mapped file I/O is the fastest in such situations, but that involves either OS-specific code or the boost library for memory-mapped files.

When used, you don't even write code to read from a file. The file is mapped to RAM, meaning that your code operates as if you have a pointer to raw memory of the file content, which is subsequently filled by the operating system's virtual memory subsystem. Sometimes this is the best approach, but it depends on what the context is.
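For the POSIX flavour (Linux, macOS), a sketch might look like this; boost::iostreams::mapped_file_source is the portable alternative, and the line-splitting at the end is my own illustration, not anything mmap requires:

#include <iostream>
#include <string>
#include <vector>
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

int main()
{
    const int fd = open( "numbers.txt", O_RDONLY );
    if( fd == -1 ) return 1;

    struct stat sb;
    if( fstat( fd, &sb ) == -1 ) { close(fd); return 1; }

    // map the file; pages are faulted in by the VM subsystem as they are touched
    void* p = mmap( nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if( p == MAP_FAILED ) { close(fd); return 1; }
    const char* data = static_cast<const char*>(p);

    // split the mapped bytes on '\n' without any explicit read calls
    std::vector<std::string> lines;
    const char* begin = data;
    const char* const end = data + sb.st_size;
    for( const char* q = begin ; q != end ; ++q )
        if( *q == '\n' ) { lines.emplace_back( begin, q ); begin = q + 1; }
    if( begin != end ) lines.emplace_back( begin, end ); // no trailing newline

    std::cout << lines.size() << " lines\n";
    munmap( p, sb.st_size );
    close(fd);
}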


