Unsolved problem, don't let it sink: Read huge txt files into memory efficiently?

I am using the method below to read a large space-delimited txt file (about 900 MB). It took 879 s to load the data into memory. I am wondering if there is a more efficient way to read the txt file?

Another associated question: is it a good idea to store such a huge data set in a 2D vector?


Here is my code:
void Grid::loadGrid(const char* filePathGrid)
{
    // 2D vector to contain the matrix
    vector<vector<float>> data;

    unsigned nrows, ncols;
    double xllcorner, yllcorner;
    int cellsize, nodataValue;
    const int nRowHeader = 6;
    string line, strtmp;

    ifstream DEMFile;
    DEMFile.open(filePathGrid);
    if (DEMFile.is_open())
    {
        // read the header (6 lines)
        for (int index = 0; index < nRowHeader; index++)
        {
            getline(DEMFile, line);
            istringstream ss(line);
            switch (index)
            {
                case 0:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> ncols;
                    }
                    break;
                case 1:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> nrows;
                    }
                    break;
                case 2:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> xllcorner;
                    }
                    break;
                case 3:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> yllcorner;
                    }
                    break;
                case 4:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> cellsize;
                    }
                    break;
                case 5:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> nodataValue;
                    }
                    break;
            }
        }

        // Read in the elevation values
        if (ncols * nrows > 0)
        {
            // Set up sizes. (rows x cols)
            data.resize(nrows);
            for (unsigned row = 0; row < nrows; ++row)
            {
                data[row].resize(ncols);
            }

            // Load values in
            unsigned row = 0;
            while (row < nrows)
            {
                getline(DEMFile, line);
                istringstream ss(line);
                for (unsigned col = 0; col < ncols; col++)
                {
                    ss >> data[row][col];
                }
                row++;
            }
            DEMFile.close();
        }
    }
    else cout << "Unable to open file";
}




Below is the sample data:
// header
ncols 19092
nrows 6219
xllcorner 581585.1569801
yllcorner 4612170.4651427
cellsize 2
NODATA_value -9999
//data body
....................
....................
Have you tried with the highest optimization setting? The reason I'm asking is that stream usage often benefits a lot from optimization. Which compiler are you using?
With respect to your second question:

I would have chosen to use a one-dimensional vector, and then index it by (row*ncols+col).

This will at least reduce memory consumption, but it may also have a significant impact on speed.

I don't remember whether a 'vector of vectors' is an idiom endorsed by the standard, but there is a risk that too much copying and memory reallocation goes on if there is no special handling of the 'vector of vectors' case.
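
A minimal sketch of what that flat layout might look like (sizes and names here are only placeholders for illustration, not taken from your code):

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t nrows = 6219, ncols = 19092;   // values from the posted header

    // one contiguous block of nrows*ncols floats instead of a vector of vectors
    std::vector<float> data(nrows * ncols);

    // element (row, col), row-major:
    std::size_t row = 3, col = 7;
    data[row * ncols + col] = 1.5f;
    float v = data[row * ncols + col];
    (void)v;

    return 0;
}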

Have you tried with the highest optimization setting? The reason I'm asking is that stream usage often benefits a lot from optimization. Which compiler are you using?

Sorry, I don't know how to use those "optimization settings" you referred to... The compiler I am using is Visual Studio 2008.
I know nothing about Visual Studio, but for now, you could check out this one:

http://efreedom.com/Question/1-1416891/Optimization-Options-Work-VSCPlusPlus-2008
I also recommend you try to change your implementation to use a one-dimensional vector as per my second post.
With respect to your second question:

I would have chosen to use a one-dimensional vector, and then index it by (row*ncols+col).

This will at least reduce memory consumption, but it may also have a significant impact on speed.

I don't remember whether a 'vector of vectors' is an idiom endorsed by the standard, but there is a risk that too much copying and memory reallocation goes on if there is no special handling of the 'vector of vectors' case.

I am new to C++; I followed the suggestion given by a post in this forum (I could not find it now...) to use a 2D vector to hold the large data set. But I will try to follow your suggestion. Thanks for your help, and have a nice day!
I also recommend you try to change your implementation to use a one-dimensional vector as per my second post.

I modified my 2D vector into a 1D one; however, the speed is almost the same...
Alright,

Then I suggest that you change

while (row < nrows)
{
    getline(DEMFile, line);
    istringstream ss(line);
    for (unsigned col = 0; col < ncols; col++)
    {
        ss >> data[row][col];
    }
    row++;
}


to

istringstream ss;
while (row < nrows)
{
    getline(DEMFile, line);
    ss.clear();   // reset the eof/fail state left over from the previous line
    ss.str(line);
    for (unsigned col = 0; col < ncols; col++)
    {
        ss >> data[row][col];
    }
    row++;
}


It could be quite expensive to recreate a string stream from scratch at every line.

If that does not help significantly, I urge you to find out how to max out your optimization settings, and see what that does.
while (ss >> strtmp) //what are you doing here?
{
	istringstream(strtmp) >> ncols; //the value of ncols will be overwritten
}
while( ss>>ncols ) //quasi-equivalent code
  ;

Try to use a binary file instead of plain text.
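
For instance, something like the sketch below (function and file names are made up for illustration): convert the grid to a raw binary file once, and later runs can reload it with a single bulk read instead of millions of text-to-float conversions.

#include <cstddef>
#include <fstream>
#include <vector>

// write the already-parsed grid out once as raw floats
void saveBinary(const char* path, const std::vector<float>& data)
{
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&data[0]),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
}

// reload it with one bulk read on later runs
std::vector<float> loadBinary(const char* path, std::size_t count)
{
    std::vector<float> data(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(&data[0]),
            static_cast<std::streamsize>(count * sizeof(float)));
    return data;
}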
I am trying to read in a string (e.g. "1234"), and convert it to a number.
It seems that C++ is not capable of loading large txt files into memory efficiently? I have my doubts about that...
Don't use a std::vector. Use a std::deque.
The vector doesn't play well with high memory usage, but the deque does.

Another option is to memory map the file.
Another option is to memory map the file.

Could you please give a simple code example of the "memory map" technique? Sorry, I am new to C++. Thanks!
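
Roughly, memory mapping on Windows (since Visual Studio is being used) looks like the sketch below; the path is a placeholder and error handling is minimal. It only maps the file into the address space and peeks at the first byte, so the parsing itself is still up to you; on a 32-bit build there must also be enough free address space to map ~900 MB in one piece.

#include <windows.h>
#include <iostream>

int main()
{
    // "dem.txt" is just a placeholder path for this sketch
    HANDLE file = CreateFileA("dem.txt", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) { std::cout << "open failed\n"; return 1; }

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL) { CloseHandle(file); return 1; }

    // the whole file is now addressable as one read-only character array
    const char* text = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (text == NULL) { CloseHandle(mapping); CloseHandle(file); return 1; }

    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    std::cout << "mapped " << static_cast<unsigned long>(size.QuadPart)   // fits for files < 4 GB
              << " bytes, first char: " << text[0] << "\n";

    UnmapViewOfFile(text);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}

On POSIX systems the equivalent calls are open, fstat, and mmap.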
Hi all,
Is this problem really a big challenge? Please help, don't let it sink before it is solved, thanks!
+1 for using std::deque as Duoas proposes.

Looking at your code, you load the data in parsed form, so it is never completely loaded into memory per se.

std::deque is better suited for arbitrary growth, because it does not copy all elements every time it has to increase the space. It just allocates additional storage and links it. I think it is implemented as a vector of pointers to fixed-size arrays. When it has to grow, it creates a new array and adds its pointer to the vector.

Regards
Reading 884 MB with simply
while( input.read(&c, sizeof(char)) ) //1 byte at a time
  v.push_back(c);
it took ~1m30s
with 4 bytes at a time -> 36s

Maybe if you get rid of those string-to-number conversions (or get better hardware)
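
For example, something along these lines (only a sketch, with a placeholder file name and no error checking): skip the six header lines, slurp the remaining body into one buffer with a single read, and walk it with strtod, so there is no getline or istringstream per value.

#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("dem.txt", std::ios::binary);   // placeholder path
    std::string line;
    for (int i = 0; i < 6; ++i)                      // skip the six header lines
        std::getline(in, line);

    // read the rest of the file into one buffer with a single read
    std::streamoff start = in.tellg();
    in.seekg(0, std::ios::end);
    std::streamoff finish = in.tellg();
    in.seekg(start);
    std::size_t size = static_cast<std::size_t>(finish - start);

    std::vector<char> buf(size + 1);
    in.read(&buf[0], static_cast<std::streamsize>(size));
    buf[size] = '\0';

    // walk the buffer once with strtod -- no per-value string streams
    std::vector<float> values;                       // reserve nrows*ncols up front in real code
    char* p = &buf[0];
    char* end = p;
    for (;;)
    {
        double d = std::strtod(p, &end);
        if (end == p) break;                         // no more numbers
        values.push_back(static_cast<float>(d));
        p = end;
    }
    return 0;
}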
Reading 884 MB with simply
while( input.read(&c, sizeof(char)) ) //1 byte at a time
  v.push_back(c);
it took ~1m30s
with 4 bytes at a time -> 36s


ne555, your response gives me hope! Could you please kindly help further by providing code to load the following sample data I made? I guess the code you posted just reads in raw characters, right? However, I need to convert the strings of numbers to float values. I tried to mimic your code but failed.

ncols 5
nrows 3
xllcorner 581585.1569801
yllcorner 4612170.4651427
cellsize 2
NODATA_value -9999
1.0 1.1 1.2 1.3 1.4
3.2 3.5 2.3 3.1 4.4
2.3 2.5 2.6 2.9 5.1
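
For what it's worth, here is one way the sample above could be read without a per-line istringstream; just a sketch, with a made-up file name and minimal error checking, reading the header as label/value pairs and the body straight from the stream into a flat row-major vector.

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("sample.txt");                  // placeholder path
    if (!in) { std::cout << "Unable to open file\n"; return 1; }

    std::string label;
    unsigned ncols, nrows;
    double xllcorner, yllcorner;
    int cellsize, nodataValue;

    // each header line is "label value", so read the label and then the value
    in >> label >> ncols;
    in >> label >> nrows;
    in >> label >> xllcorner;
    in >> label >> yllcorner;
    in >> label >> cellsize;
    in >> label >> nodataValue;

    // read the body straight from the stream into a flat row-major vector
    std::vector<float> data(static_cast<std::size_t>(nrows) * ncols);
    for (std::size_t i = 0; i < data.size(); ++i)
        in >> data[i];

    std::cout << "read " << data.size() << " values, last = "
              << data[data.size() - 1] << "\n";
    return 0;
}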
+1 for using std::deque as Duoas proposes.

Looking at your code, you load the data in parsed form, so it is never completely loaded into memory per se.

std::deque is better suited for arbitrary growth, because it does not copy all elements every time it has to increase the space. It just allocates additional storage and links it. I think it is implemented as a vector of pointers to fixed-size arrays. When it has to grow, it creates a new array and adds its pointer to the vector.

Regards

Simeonz and Duoas,
Thanks for your responses. I changed my vector to a std::deque; however, it seems the code is even slower... Probably I am an idiot; I am really new to C++.


Well, actually that is because the vector in your code doesn't grow. My mistake.

EDIT: Is there anything in the number structure that can be exploited? What is the range? Are all numbers given with the same precision, like d.d?
Simeonz, not really. Some of them are integers (-9999), and some have four decimals (dd.ffff).