Unsolved problem, don't let it sink: Read huge txt's into memory efficiently?

Pages: 1234
Could you not try finding out how big the file is before reading it? That way, you don't need to resize anything.

Something like:
1
2
3
4
5
6
7
8
size_t fsize(std::ifstream& file)
{
        size_t nbytes = 0;
        file.seekg(0, ios::end);
        nbytes = file.tellg();
        file.seekg(0, ios::beg);
        return nbytes;
}

I would use strtof, it (being a specialized function) will probably be much faster than the generalized stringstream interface.
Given the header looks like:
ncols 5
nrows 3
xllcorner 581585.1569801
yllcorner 4612170.4651427
cellsize 2
NODATA_value -9999
wouldn't it be quicker to read it using:
1
2
3
4
5
6
    DEMFile >> tmpstr >> ncol;
    DEMFile >> tmpstr >> rcol;
    DEMFile >> tmpstr >> xllcorner;
    DEMFile >> tmpstr >> yllcorner;
    DEMFile >> tmpstr >> cellsize;
    DEMFile >> tmpstr >> NODATA_value;


And similarly, read the grid values with:
1
2
3
4
5
6
7
8
9
10
    if (nrows * ncols > 0)
    {
        data.resize(nrows);
        for (unsigned row = 0; row < nrows; ++row)
        {
            data[row].resize(ncols);
            for (unsigned col = 0; col < ncols; ++col)
                DEMFile >> data[row][col];
        }
    }


The idea is to avoid the temporary creation of thousands of istringstream objects, which can't be cheap.
Last edited on
You'll get even more out of it if you iterate over the vectors instead of using operator [] every time.

1
2
3
4
5
6
7
8
9
data.resize( nrows );
for (vector <vector <float> > ::iterator row = data.begin(); row != data.end(); ++row)
    {
    row->resize( ncols );
    for (vector <float> ::iterator col = row->begin(); col != row->end(); ++col)
      {
      f >> *col;
      }
    }

The problem with the speed is that you are doing a zillion string-to-float conversions (118,733,148 according to your OP), and that is going to take a while no matter what.

Do you need to have all that data all at once? Or can you get it in parts? Do you need to have all those floats converted from string at once? Or can you do it as needed?
Besides, the istringstream is the worst way to do this (speed-wise). I have now confirmed this. Using an istringstream takes 11 ns, sscanf takes 10 ns, and strtof takes 5 ns (!). So maybe read the file in large chunks using fread, and use strtof's ability to return a pointer, rather than the istringstream. A general rule for reading is that the larger the chunks you read it in, the faster it is (fewer kernel calls.)
Last edited on
I read it in two 2 minutes 13 seconds using the following:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

#include <deque>
#include <vector>

using namespace std;

/* ////////////////////////////////////////////////////////////////////////////
  timer stuff
//////////////////////////////////////////////////////////////////////////// */

//----------------------------------------------------------------------------
typedef double epoch_time_t;

//----------------------------------------------------------------------------
#ifdef __WIN32__

  #include <windows.h>

  epoch_time_t get_epoch_time()  // Since 1601 Jan 1
    {
    FILETIME f;
    GetSystemTimeAsFileTime( &f );
    LARGE_INTEGER t =
      { {
      f.dwLowDateTime,
      f.dwHighDateTime
      } };
    return t.QuadPart / 10000000.0;  // 100 nano-second intervals
    }

  void waitforkeypress()
    {
    WaitForSingleObject(
      GetStdHandle( STD_INPUT_HANDLE ),
      INFINITE
      );
    }

//----------------------------------------------------------------------------
#else // POSIX

  #include <sys/time.h>

  epoch_time_t get_epoch_time()  // Since 1970 Jan 1
    {
    struct timeval tv;
    if (gettimeofday( &tv, NULL )) return 0.0;
    return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

  #include <unistd.h>
  #include <termios.h>
  #include <poll.h>

  #define INFINITE (-1)

  void waitforkeypress()
    {
    struct termios initial_settings;
    tcgetattr( STDIN_FILENO, &initial_settings );

    struct termios settings = initial_settings;
    settings.c_lflag &= ~(ICANON);
    tcsetattr( STDIN_FILENO, TCSANOW, &settings );

    struct pollfd pls[ 1 ];
    pls[ 0 ].fd     = STDIN_FILENO;
    pls[ 0 ].events = POLLIN | POLLPRI;
    poll( pls, 1, INFINITE );

    tcsetattr( STDIN_FILENO, TCSANOW, &initial_settings );
    }

#endif

//----------------------------------------------------------------------------
ostream& print_epoch_time( ostream& outs, epoch_time_t elapsed_time )
  {
  unsigned minutes    = (unsigned) elapsed_time        / 60;
  unsigned seconds    = (unsigned) elapsed_time        % 60;
  unsigned hundredths = (unsigned)(elapsed_time * 100) % 100;

  outs << "It took you "
       << setfill( '0' ) << minutes    << ":"
       << setw( 2 )      << seconds    << "."
       << setw( 2 )      << hundredths << endl;

  return outs;
  }

/* ////////////////////////////////////////////////////////////////////////////
  loading stuff
//////////////////////////////////////////////////////////////////////////// */

typedef vector <float>     container;
typedef deque  <container> container2;

//----------------------------------------------------------------------------
void load_at_once( const char* filename, string& result )
  {
  ifstream f( filename, ios::binary );
  result.assign( istream_iterator <char> ( f ), istream_iterator <char> () );
  }

//----------------------------------------------------------------------------
struct data_t
  {
  unsigned nrows;
  unsigned ncols;
  double   xllcorner;
  double   yllcorner;
  int      cellsize;
  double   nodatavalue;

  container2 data;
  };

void load_in_pieces( istream& f, data_t& data )
  {
  string s;
  f >> s >> data.ncols
    >> s >> data.nrows
    >> s >> data.xllcorner
    >> s >> data.yllcorner
    >> s >> data.cellsize
    >> s >> data.nodatavalue;
  data.data.resize( data.nrows );
  for (container2::iterator row = data.data.begin(); row != data.data.end(); ++row)
    {
    row->resize( data.ncols );
    for (container::iterator col = row->begin(); col != row->end(); ++col)
      {
      f >> *col;
      }
    }
  }

//----------------------------------------------------------------------------
int main()
  {
  epoch_time_t start, stop;
  string       raw_data;
  data_t       data;

  cout << "Please wait while loading...\n";
  start = get_epoch_time();
  load_at_once( "data.txt", raw_data );
  istringstream iss( raw_data );
  load_in_pieces( iss, data );
  stop = get_epoch_time();
  print_epoch_time( cout, stop - start );

  return 0;
  }

You shouldn't get any performance increase by using C stuff, but it is possible with some current versions of the STL. (I used the GCC's for this.)

I would be interested to know how much difference there is between using subscripting and using iterators to access the vector/deque. I wouldn't really expect too much of a difference. I have been running some test to try and determine if vector<bool> is really any slower than an array of bool. I have been getting almost the same speed, which other programmers I know did not think was possible.

The original code is most likely slow due to using getline plus creating a new istringstream inside the inner loop. You would be executing unnecessary string and istringstream constructors along with copying the actual string in question, possibly more than once. I would expect most of the speed of your code is due to getting the data more directly from the input stream without unnecessary intermediates:

f >> *col;

I would expect that loading the entire file into an istringstream is also an unnecessary intermediate, but that would vary depending on OS, filesystem, and physical media.
I would be interested to know how much difference there is between using subscripting and using iterators
I'd expect iterators to be a bit quicker as you're always saving yourself the unnecessary column lookup.

I have been running some test to try and determine if vector<bool> is really any slower than an array of bool.
I'd expect vector<bool> to be slower as the bits are packed into a byte.

I would expect that loading the entire file into an istringstream is also an unnecessary intermediate, but that would vary depending on OS, filesystem, and physical media.
I can't see how it depends on OS, filesystem or physical media.
Access times are, from most expensive to least: network media (like the internet), external media (like hard disks), RAM, and CPU files (like registers).

The amount of data you are accessing will fit easily in RAM, so load it from disk to RAM first, using the least expensive operation possible, and then go for the expensive processing.


The deque works better for large memory access, because it can fragment its contents to better fit available resources. That is why I used it for row data. The vector is faster, because it guarantees sequentially allocated data. Hence, I used it for column data.

Using iterators does actually shave significant time, as I have learned recently with my bignum challenge, so it is worth it. The operator [] does require calculating and dereferencing an offset, which is obviated by using an iterator.

The amount of data you are accessing will fit easily in RAM, so load it from disk to RAM first, using the least expensive operation possible, and then go for the expensive processing


Even faster would be to load the file in smaller chunks fitting in the L1 or L2 cache (e.g. in 64 kB blocks), and process each block at a time, forming a pipeline (and destroying the blocks after they are processed). Loading the raw data into RAM entirely is just wasting the memory and also wasting the L2 cache.
Last edited on
The operating system should indeed maintain a RAM buffer for each file (network, disk, etc) and that should be enough for local accesses. Probably this is what jimc meant. But so does the fstream implementation I believe. To my knowledge win32 goes further to maintain big cache for disk data. May be other OSes too.

Anyways, congrats for the effort. Looks yummy except for the alignment of the braces. You must be using some of those exotic editors Duoas...
Thank you all very much! I do appreciate your effort and believe the problem has been solved to some extent.
You shouldn't get any performance increase by using C stuff, but it is possible with some current versions of the STL. (I used the GCC's for this.)

Wrong. I should get MASSIVE performance increases by doing C stuff instead of C++ stuff.
I will explain.
istringstream inherits from istream, and can be passed as one. The >> operator is a virtual function. Therefore the vtable slows you down. Additionally, in code like f >> x >> y; the compiler cannot know that f >> x is the same reference as f.
Further, the istringstream must check on various settings before doing anything, whereas strtof just assumes everything is fine. Finally, in f >> x x is passed by reference (invisible pointer). Therefore x must be in *memory*, not a register, which slows you down. In x = strtof(s,NULL); x is returned using the floating point stack.
In general, the C library ways of doing things are faster, more suited to the machine, but less extensible and less safe. I would, instead of a vector or deque or whatever, use a simple malloc, or maybe mmap,to make an array. This is guaranteed not to have any time-wasting checks of things. The problem is, even if you just told vector to be 10 ints long, v[1]=7; *still* checks if 0<=1<10. Iterators do that too. In fact, if v is a vector and a is an array (of ints), v[1] is a function call, which means at minumum pushl $1, pushl %eax, call vector_brackets_rvalue, three instructions plus the function itself, maybe 10 more. a[1] is mov %eax,4(%ebx), ONE instruction.
Bottom line, by its very *nature*, idiomatic C++, with STL, virtual functions, and safety, is slower than idiomatic C, with libc, arrays, and horrible unsafe optimizations. Your compiler is just very, very smart, and knows a lot about what the STL is "supposed" to do.
Last edited on
simeonz wrote:
Looks yummy except for the alignment of the braces. You must be using some of those exotic editors Duoas...
I abhor most IDEs.

I am familiar with quite a few different languages, and my brace style reflects that. I've developed and used it for over 20 years. When writing personal software, I use a style that I find easy to read.

In the C/C++ world's nomenclature, what I use is most closely identified with Whitesmith's style.

rapidcoder wrote:
Even faster would be to load the file in smaller chunks fitting in the L1 or L2 cache (e.g. in 64 kB blocks), and process each block at a time, forming a pipeline (and destroying the blocks after they are processed). Loading the raw data into RAM entirely is just wasting the memory and also wasting the L2 cache.
Yes, I was considering suggesting that as well, but we'll see what kind of hoops the OP wants to jump after he gets a handle on this. I don't consider the speed gains from fully loading the file to be any sort of waste, considering the whole file only takes about 800-900 megabytes, beyond which, it obviously isn't a fully-developed answer.

I admit that my knowledge of L2 caching and the like is limited, but I have no interest in playing compiler.

[edit] Sigh:

rapidcoder wrote:
Wrong.
I really don't want to rehash this again. Your arguments are wrong, and frankly, they expose you for your lack of knowledge. You've been brainwashed into believing crap by anti-C++ propaganda.

istringstream inherits from istream, and can be passed as one. The >> operator is a virtual function.
The extraction operator is not a virtual function. (Duh!) It may use virtual functions, but its exact character is resolved at, pause for halo sound, compile time. That's what templates do for you. Pretty swell stuff, ain't it?

Therefore the vtable slows you down.
It has very minimal impact.

Additionally, in code like f >> x >> y; the compiler cannot know that f >> x is the same reference as f.
Why would the compiler care? The fact is that f >> x might not be the same as f.

Further, the istringstream must check on various settings before doing anything, whereas strtof just assumes everything is fine.
This argument is stupid in four ways. One, strtof() doesn't deal with streams, just addressable memory. Two, the istringstream only needs to "check on various settings" once before doing anything, and O(1) is nothing. Three, strtof() does not assume everything is fine. It has very complete error checking, just like the istringstream. Four, saying mystringstream >> myfloat is nearly identical to saying myfloat = strtof( mystring, ... ) -- oh, except for all the compiler protections against stupid stuff and guarantees against buffer overrun.

Finally, in f >> x x is passed by reference (invisible pointer). Therefore x must be in *memory*, not a register, which slows you down.
Not necessarily. The whole point of references is really sweet optimizations that you don't see.

In x = strtof(s,NULL); x is returned using the floating point stack.
So? istringstream has to use the FPU also. And that fancy reference optimization may actually keep the reference on the FPU stack, too.

In general, the C library ways of doing things are faster, more suited to the machine, but less extensible and less safe.
That's all true except for the part about 'faster'. (Well, maybe not the "more suited to the machine" bit either, but that's another argument.)

I would, instead of a vector or deque or whatever, use a simple malloc, or maybe mmap,to make an array. This is guaranteed not to have any time-wasting checks of things.
Only if you consider secure software to be a waste of time. Fortunately, all these "checks of things" you are so worried about are very few, and very cheap, and they can make your software so much more robust.

The problem is, even if you just told vector to be 10 ints long, v[1]=7; *still* checks if 0<=1<10.
Actually, it doesn't. The standard is pretty specific on this point. http://www.cplusplus.com/reference/stl/vector/operator%5B%5D/

Iterators do that too. In fact, if v is a vector and a is an array (of ints), v[1] is a function call, which means at minumum [nonsense about second guessing the compiler]
Well then, go write in assembly. Chances are my compiler will do a better job than you. The actual fact is that iterators are very highly optimizable, much more than you think. Oh, and v[1] is not likely a function call at all, because it is easily inlined. Even if it were, the performance hit, again, is so insignificant that you are wasting breath. This false argument has been put to rest so many times that it is amazing that people still breath it.

Bottom line, by its very *nature*, idiomatic C++, with STL, virtual functions, and safety, is slower than idiomatic C, with libc, arrays, and horrible unsafe optimizations.
No, the bottom line is that the way you use it is fast or slow.

Here's something I bet you didn't know. A lot of older STL implementations used to just be a wrapper around C library calls. Now that it has matured, a lot of implementations, like Dinkumware and STLPort, don't bother with the C library because it is so flippin' s-l-o-w. Used properly, the STL blows the pants off of the C standard libary.

Don't believe me?
How about experts in the field.
Here's one:
http://www.daniweb.com/forums/thread40450.html

I'm sure you can find more:
http://www.google.com/search?q=iostream+vs+stdio

Seems you forgot about this thread:
http://www.cplusplus.com/forum/general/31527/

/me waits for the extremist anti-C++|anti-me verbiage to start

I'm glad it is so common to come to the forums lately and engage in such a pleasant exchange of ideas and solutions. Makes me want to wretch how abusive some of your language is towards each other.
Here's something I bet you didn't know. A lot of older STL implementations used to just be a wrapper around C library calls. Now that it has matured, a lot of implementations, like Dinkumware and STLPort, don't bother with the C library because it is so flippin' s-l-o-w. Used properly, the STL blows the pants off of the C standard libary.


Sometimes we cannot fault some developers from having certain mindset. It could be that developer dabble in C++ STL in the early years and most likely he will come from C background. He will do comparison definitely and as we all knew, all new initiatives implementation early versions may not be efficient or even complete.

Then the developer get turned off and move on to other areas still having that experience in their mind. A few years later, C++ STL implementations have matured and improved leaps and bounds but that developer never keep up-to-date to the progress and still stick to his own experience views.

This is classic behavoir pattern and I got to admit I fall into this trap with JavaScript. I used to know JavaScript as kiddie language for doing simple input validation on web-pages until recently re-touch JavaScript and it has improved by leaps and bounds.

The moral of the story is as software developer once a while we need to keep ourselves up-to-date with each language progression. We should not use our past bad experience with that language and condemn it forever since unless there is *NO* progression made at all over the years :P

Just want to share my experience.
None of those sources you listed have actual statistics, so I tried making some. Here is test.cpp:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
using namespace std;
#define TIMES 1000000
int main(){
	char const *data = "100.111 12.34 5.23e3 21234 200 0.0001 ";
	string s=data;
	s.resize(s.length()*TIMES+1);
	int i=TIMES-1;
	while(i--)s+=data;
	istringstream iss(s);
	vector <float> v(TIMES*6);
	for(vector<float>::iterator it=v.begin();it != v.end();it++)
		iss >> *it;
}

Here is test.c:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TIMES 1000000

int main(){					
	char const data[]="100.111 12.34 5.23e3 21234 200 0.0001 ";
	int i=0;
	int len=strlen(data);
	char *s=malloc(len*TIMES+1);
	while(i<TIMES){
		strcpy(s+len*i,data);
		i++;
	}
	float *v=malloc(TIMES*6*sizeof(float));
	i=TIMES;
	char *t=s-1;
	while(i--)v[i]=strtof(t+1,&t);
}

And here is my command line:
oren-laptop% pkill -STOP firefox                     
oren-laptop% g++ -O3 test.cpp   
oren-laptop% time ./a.out       
./a.out  0.47s user 0.30s system 98% cpu 0.777 total
oren-laptop% time ./a.out
./a.out  0.52s user 0.26s system 99% cpu 0.783 total
oren-laptop% time ./a.out
./a.out  0.48s user 0.30s system 99% cpu 0.782 total
oren-laptop% time ./a.out
./a.out  0.50s user 0.29s system 99% cpu 0.789 total
oren-laptop% gcc -O3 -std=gnu99 -Wall -Wextra test.c 
oren-laptop% time ./a.out                            
./a.out  0.42s user 0.07s system 98% cpu 0.498 total
oren-laptop% time ./a.out
./a.out  0.44s user 0.06s system 98% cpu 0.500 total
oren-laptop% time ./a.out
./a.out  0.43s user 0.05s system 97% cpu 0.498 total
oren-laptop% time ./a.out
./a.out  0.44s user 0.05s system 99% cpu 0.492 total
oren-laptop% pkill -CONT firefox                     

libc is faster than STL.
BTW, I do use the STL in my programs, but I recognise that it isn't the only way to do things. When I want my program to be simple to understand, and speed is not an issue, then of course I use the strings, vectors, maps, etc. I do prefer printf and printon methods to iostream, but that's a stylistic thing IMO. Also, using my evil macro let() helps a lot when you are having trouble spelling some super long type, as in:
1
2
for(let(it,v.begin());it != v.end();it++)
		iss >> *it;

I invented the let() macro for that purpose.
edit: Sohguanh, I started with Perl, actually, then went to GameMaker, and switched to C because I was trying to do some math really fast, and I heard that C was fast. I was of course astounded at the improvement, from 200 bullets at ~10 fps to 10000 bullets at 60 fps.
Last edited on


int main(){
char const *data = "100.111 12.34 5.23e3 21234 200 0.0001 ";
string s=data;
s.resize(s.length()*TIMES+1);
int i=TIMES-1;
while(i--)s+=data;
}
//above give sys 0m0.158s


You have to do a equivalent C++ with your C code. I focus on the string portion and discover you "mis-use" string class methods. If in advance you already know the size, then you don't need to use the += or append method. You use the replace method.

1
2
3
4
5
6
7
8
9
10
11
12
int main(){
       char const *data = "100.111 12.34 5.23e3 21234 200 0.0001 ";
        string s2(data);
        string s(s2);
        s.resize(s.length()*TIMES+1);
        int i=0;
        while(i<TIMES){
          s.replace((i*s2.length()),s2.length(),s2);
          i++;
        }
}
// above time command give sys     0m0.056s 


I will post the next portion later but above change of using STL string gives drastically different timing!
That's not what I'm getting.
Although I admit that the replace method does seem to be equivalent to strcpy, it's worse:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
using namespace std;

#define TIMES 1000000

int main(){
	char const *data = "100.111 12.34 5.23e3 21234 200 0.0001 ";
        string s2(data);
        string s(s2);
        s.resize(s.length()*TIMES+1);
        int i=0;
        while(i<TIMES){
          s.replace((i*s2.length()),s2.length(),s2);
          i++;
        }
	istringstream iss(s);
	vector <float> v(TIMES*6);
	for(vector<float>::iterator it=v.begin();it != v.end();it++)
		iss >> *it;
}

oren-laptop% pkill -STOP firefox
oren-laptop% g++ -O3 test.cpp                                      
oren-laptop% time ./a.out       
./a.out  6.22s user 0.19s system 99% cpu 6.461 total
oren-laptop% pkill -CONT firefox

besides, there should be a * operator for strings.
aside: I wish this forum had code/tag folding.
Last edited on

istringstream iss(s);
vector <float> v(TIMES*6);
for(vector<float>::iterator it=v.begin();it != v.end();it++)
iss >> *it;



Please comment out above portion first so we can do timing on each step to be clear. I am now working on the above portion.

We focus on the string portion and you will see my version using string.replace has cut down a lot of time.
Pages: 1234