Hi folks, I'm trying to read a large numeric file in C++, but it is slower than I expected. Is there any way I could speed it up? My code is as follows:
(file is about 3520748(row) x 11(column) )
for (int i=0;i<3520748;i++){ // i, number of rows in the file
getline(crspfile,ssn);
istringstream iss(ssn);
for (float xx; iss>>xx; ){
obs.push_back(xx);
}
obsmatrix.push_back(obs);
obs.erase(obs.begin(),obs.end());
}
Why not just show us a complete program? What is obs and obsmatrix? Also show us a small example of what the input file looks like and tell us what the program is for.
How long does it take, specifically? In the original code you create over 3.5 million vector objects and then copy each one into another 2D vector. Sometimes programs take a long time because there is a lot to be done. At least you seem to know how many objects there need to be, so I would take the advice of others and use reserve on obsmatrix. In the current implementation you could also call obsmatrix.capacity() at the end to see how much storage (counted in elements) was actually allocated. I'll bet that it is much more than you think. Since you hard-code a number in the loop, you must know how much data to expect. For obs, don't use push_back and clear() at all. Simply construct the vector with the number of columns, which appears to be fixed for your program. You can then use operator[], and there is no need to clear the object each time and insert the elements all over again. Just keep overwriting the slots in the array.
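Something along these lines is what I mean. It is only a sketch: read_rows_reserved, expected_rows and cols are names I made up, and I am assuming whitespace-separated numbers with one row per line. Short rows are padded with zeros so that stale values from the previous row don't leak through.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one row per line into a vector of rows.
// expected_rows is only a hint used for reserve(); cols is assumed fixed.
std::vector<std::vector<float>> read_rows_reserved(std::istream& in,
                                                   std::size_t expected_rows,
                                                   std::size_t cols) {
    std::vector<std::vector<float>> obsmatrix;
    obsmatrix.reserve(expected_rows);   // outer vector never has to regrow

    std::vector<float> obs(cols);       // constructed once, reused for every line
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::size_t j = 0;
        for (float xx; j < cols && iss >> xx; ++j) {
            obs[j] = xx;                // overwrite the slot, no push_back
        }
        for (; j < cols; ++j) {
            obs[j] = 0.0f;              // zero any columns the line did not provide
        }
        obsmatrix.push_back(obs);       // copies exactly cols floats
    }
    return obsmatrix;
}
```

You would call it as read_rows_reserved(crspfile, 3520748, 11). Each push_back still copies the row, but the outer vector is no longer reallocated as it grows.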
3520748 x 11 x 8 bytes per entry (a guess off the top of my head, based on the example data he provided) = 309,825,824 bytes (about 295 MB) to read. That should not take long to read: roughly 7 s at 40 MB/s, which is already quite a slow rate.
RAM size doesn't have too much to do with processing speed, unless... what system are you running this on and what are you running on it that's slowing it down so much?
Neither your RAM, your IDE nor your operating system makes any real difference; it's the CPU that matters.
Have you implemented the suggestion already? Resize obsmatrix to 3520748 right at the start, so you can get rid of obs completely. Try and see if reserving or resizing the inner vector to 11 elements makes any difference.
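A rough sketch of the resize variant, writing straight into the preallocated rows with no temporary obs. The function name read_presized and its parameters are my own invention, and it assumes whitespace-separated values with at most cols numbers per line.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: allocates rows x cols up front and fills it in place.
std::vector<std::vector<float>> read_presized(std::istream& in,
                                              std::size_t rows,
                                              std::size_t cols) {
    // All rows and all inner vectors are created here, before any reading.
    std::vector<std::vector<float>> obsmatrix(rows, std::vector<float>(cols));

    std::string line;
    std::size_t i = 0;
    for (; i < rows && std::getline(in, line); ++i) {
        std::istringstream iss(line);
        float xx;
        for (std::size_t j = 0; j < cols && iss >> xx; ++j) {
            obsmatrix[i][j] = xx;       // write directly into the final slot
        }
    }
    obsmatrix.resize(i);                // drop rows the file did not provide
    return obsmatrix;
}
```

Timing this against the original loop should tell you whether the allocations or the stream parsing dominate.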
If it's still not fast enough after this, you might want to relinquish the comfort of istringstream and parse each line yourself. You can actually do this in-place if you replace the spaces with null bytes and then call atof on each part.
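Here is a minimal sketch of the in-place idea. parse_line_inplace is a made-up name, and I am assuming the separators are spaces, tabs or a stray carriage return. Note that atof does no error checking, so malformed tokens silently become 0.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: parses up to max_cols numbers from `line` into `out`.
// The line is modified: the separator after each token is overwritten with '\0'
// so that atof can be called on the token where it sits.
// Returns the number of values written.
std::size_t parse_line_inplace(std::string& line, float* out, std::size_t max_cols) {
    auto is_sep = [](char c) { return c == ' ' || c == '\t' || c == '\r'; };

    std::size_t count = 0;
    char* p = &line[0];
    char* end = p + line.size();        // *end is the string's own terminating '\0'
    while (p < end && count < max_cols) {
        while (p < end && is_sep(*p)) ++p;      // skip leading separators
        if (p == end) break;
        char* token = p;
        while (p < end && !is_sep(*p)) ++p;     // find the end of the token
        if (p < end) {
            *p = '\0';                          // terminate the token in place
            ++p;
        }
        out[count++] = static_cast<float>(std::atof(token));
    }
    return count;
}
```

In the read loop you would call it once per getline, passing a pointer to the row you want to fill, for example parse_line_inplace(tempstr, row, 11).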
The last thing you can do is to use multiple threads to process the lines. This is not as simple as the other measures and will only make any difference if you have more than one CPU/core.
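For what it's worth, a threaded version could look roughly like this. It is a sketch under my own assumptions: the lines have already been read into memory, each thread gets a contiguous block of lines, and parse_parallel and n_threads are hypothetical names. The threads write to disjoint parts of the output, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: parses lines[i] into out[i*cols .. i*cols+cols).
// Missing values stay 0. The result is a flat row-major buffer.
std::vector<float> parse_parallel(const std::vector<std::string>& lines,
                                  std::size_t cols,
                                  unsigned n_threads) {
    std::vector<float> out(lines.size() * cols, 0.0f);
    if (n_threads == 0) n_threads = 1;
    const std::size_t chunk = (lines.size() + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(lines.size(), t * chunk);
        const std::size_t end = std::min(lines.size(), begin + chunk);
        if (begin == end) break;        // nothing left for this worker
        workers.emplace_back([&lines, &out, cols, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                std::istringstream iss(lines[i]);
                float x;
                for (std::size_t j = 0; j < cols && iss >> x; ++j) {
                    out[i * cols + j] = x;   // each thread owns its own rows
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

std::thread::hardware_concurrency() is a reasonable starting value for n_threads. With GCC on Linux you may need to compile with -pthread.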
Thanks for all the responses. I replaced obs with a two-dimensional array, but this doesn't make any difference, so I guess the problem may come from istringstream.
int main(){
string tempstr;
float obs[3520748][11];
ifstream crspfile("crsp.txt");
if(crspfile.fail()){
cout<<"Can't open file crsp.txt.\n"<<endl;
system("pause");
exit(1);
}
for (int i=0;i<3520748;i++){
getline(crspfile,tempstr);
istringstream iss(tempstr);
int j=0;
for (float xx;iss>>xx;j++){
obs[i][j]=xx;
}
}
crspfile.close();
return 0;
}
Try reading the file line by line and doing the splitting and number parsing yourself. It might actually be that istringstream is "slow", since it does all the parsing of the numbers. Your implementation will probably not be much faster.
Thanks folks. Actually, if I define a two-dimensional array of size arry1[3520748][11], my code does not run at all unless I decrease the first dimension below 40000, i.e.,
arry1[40000][11]; that way, the code executes properly. Any ideas? Thank you.
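One likely cause, assuming arry1 is a local variable inside main like obs in the code above: local arrays live on the stack, and float arry1[3520748][11] needs 3520748 x 11 x 4 = 154,912,912 bytes (about 148 MB, taking float as 4 bytes). The default stack is typically only a few megabytes, which would also explain why a much smaller first dimension works. Putting the data on the heap avoids the limit. A minimal sketch, where FlatMatrix is a made-up class that wraps a single std::vector:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a rows x cols matrix stored in one heap allocation.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}   // zero-initialized

    // Element access in row-major order; no bounds checking.
    float& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    float operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<float> data_;   // the elements live on the heap, not the stack
};
```

With that, FlatMatrix obs(3520748, 11); replaces the array declaration, and obs(i, j) = xx; replaces obs[i][j] = xx;. Declaring the array static or global is another way around the stack limit.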