I have an output file which has more than 1,000,000,000 lines. I am accessing this file in another C++ program. While reading the output file using cin, I want to jump directly to, say, the 5,000,000th line and start accessing data from there. Is this possible? Could someone please give me a small C++ example of how to do this?
That is a ridiculously large file. I would think you could even run into problems with some file systems not being able to handle files that large.
Also, if you could identify proximity to the required line from the content of another line then you could use some kind of successive approximation method to cut down access time considerably.
But that very much depends on each line having some kind of location-bearing information relative to the lines above and below it.
I'd recommend using system calls to seek and read the file (e.g. Windows' SetFilePointerEx() and ReadFile()), rather than standard C++ functions. 32-bit implementations tend to have problems handling files larger than 2 GiB.
@ Galik - Thanks for all the help.
Yes, this is my particle data that I had asked about previously. And yes, it is ordered in a particular sequence.
The ordering is done in increasing values of the parameter 'y'.
The output file starts off with y = 0, and I want the particles with y = 10 (say). If I read each line, it takes a massive amount of time to reach y = 10 because of the large number of particles. I also know the number of particles contained in each value of y (= total number of particles / number of layers in the y direction). So if the value of y does not match the value I want, I can actually skip that many particles!
Could you help me out with this?
You say the lines are of fixed length. From your previous data you have this:
21.2342 11.2430 23.5453 0.005 2.25 86
Can you absolutely guarantee that the number of digits in each tab-separated field is fixed? For instance can the last number ever go above 99? Or can any of the xyz values go above 99?
Because if the lines are absolutely fixed length this will be much easier. And I mean not just a fixed number of values but a fixed number of characters.
Well, the problem is that the values of x,y and z are floating point values. But sometimes the value of x may be 21.2342 and some other time it might be 21.23423. So basically, the no. of digits may not always remain fixed, and hence the no. of characters in each line might not remain fixed. Can this problem still be resolved?
I suspected as much. The problem can still be solved, in that the access can be made more time efficient. That is, as long as your file does not exceed the physical capabilities of the iostream library implementation on your system!
Because you have variable length lines it is not possible to hit the exact line with a seekg(). However you can make a reasonable guess. If you were to undershoot then you could guarantee not to miss the data and still get reasonably close to it.
I have another question. Can you predict with accuracy the value of y for each line of data? Does y increment by a fixed amount?
Oh! I thought that I could possibly use seekg() as I was under the impression that it seeks based on the size of the parameters ( I thought that since the values of x, y, z etc are all floating point values, the memory size of those parameters would be the same i.e. 8 bytes, irrespective of the no. of digits it has, and hence seekg() would probably work! )
Anyways, yes we can predict the value of y of each line accurately. And yes, the value of y increases by a fixed amount each time. Will this ease the problem somehow?
We can still use seekg(), but not to hit the exact line as the number of characters varies.
What I would be tempted to do is read in a portion at the beginning of the file to calculate a good average line length, or write a separate program to do that.
Then I would seekg() to a location calculated from the average line length and the number of lines to skip. The number of lines to skip can be calculated from the initial value of y (zero), the value of y you are seeking, and the number of lines per y layer.
Once at the guessed position you could then track back to the start location if you went too far with the seekg().
Sorry, but I didn't get this part -
"Once at the guessed position you could then track back to the start location if you went too far with the seekg()."
Could you please explain what you meant again.
However we are only using an *average* line length so we might be slightly off.
We can align our data to the start of a line with a getline() read:
std::string line;
std::getline(ifs, line); // Align to start of line
We may be one or two lines too far. So we could then make a loop to read lines in reverse until we reach a value of y that is less than the one we are searching for.
Then we know we are at the very beginning of the layer and we can simply do a sequential read forward from there on in.
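Here is a minimal sketch of that idea. Since a text stream cannot easily be read backwards line by line, this version steps back a fixed number of bytes and realigns with getline() until the first complete line has a y below the target. The names parse_y and seek_before_layer and the step_back parameter are hypothetical, and the code assumes y is the second column and the file is sorted by increasing y.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Extract y (assumed to be the second column) from one line of data.
// Returns false if the line does not parse.
bool parse_y(const std::string& line, double& y)
{
    std::istringstream iss(line);
    double x;
    return static_cast<bool>(iss >> x >> y);
}

// Starting from a guessed byte offset, leave `in` positioned at the start
// of a complete line whose y is below target_y, or at the start of the
// file if no such line is found. Returns the final stream position.
std::streamoff seek_before_layer(std::istream& in, std::streamoff guess,
                                 double target_y,
                                 std::streamoff step_back = 4096)
{
    std::streamoff pos = guess;
    for (;;) {
        in.clear(); // drop any eof/fail state from the previous attempt
        if (pos <= 0) {
            in.seekg(0);
            return 0;
        }
        in.seekg(pos);
        std::string line;
        std::getline(in, line); // align to start of line
        const std::streamoff start = in.tellg();
        double y;
        if (std::getline(in, line) && parse_y(line, y) && y < target_y) {
            in.clear();
            in.seekg(start); // rewind to the line we just tested
            return start;
        }
        pos -= step_back; // overshot (or ran past the end): back up, retry
    }
}
```

After the call, read forward with getline(), skip the lines whose y is still below the target, and stop once y goes past it.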
Thanks for the detailed explanation. I took the code that you had previously given me (in the other thread) and modified it a bit for this particular situation. However, the seekg() call doesn't seem to be working. The code compiles and runs, but it still takes a long time to read out the desired particles. Could you please look at the code and tell me what the problem could be?
#include <string>
#include <sstream>
#include <iostream>
#include <stdio.h>

using namespace std;

struct particle
{
    double x;
    double y;
    double z;
    double radius;
    double dencity;
    int type;
};

int main(int argc, char* argv[])
{
    if(argc < 2)
    {
        cerr << "Error, need to supply y as argument." << endl;
        return 1;
    }
    double y;
    bool next = false;
    int count = 0;
    istringstream iss(argv[1]);
    if(iss >> y)
    {
        string line;
        while(getline(cin, line))
        {
            istringstream iss(line);
            particle p;
            iss >> p.x;
            iss >> p.y;
            iss >> p.z;
            iss >> p.radius;
            iss >> p.dencity;
            iss >> p.type;
            if (p.y != y && count == 0) // For jumping once only
            {
                int line = int (2024807438/15000); //no. of particles / layers
                int size = 84; // average size of each line
                int jump = int (y/0.0021334 + 0.5); // layers to be skipped
                int pos = line * size * jump; // total jump
                iss.seekg(pos); // this doesnt seem to work. If I do a cout here to check if its working or not, the cout does output only once, but the program still takes a long amount of time for larger values of y.
                count++;
            }
            getline(cin, line);
            if(p.y == y)
            {
                cout << p.x << '\t';
                cout << p.y << '\t';
                cout << p.z << '\t';
                cout << p.radius << '\t';
                cout << p.dencity << '\t';
                cout << p.type << endl;
                next = true;
            }
            if (p.y != y && next == true)
            {
                exit(EXIT_FAILURE);
            }
        }
    }
    else
    {
        cerr << "Argument y was not valid." << endl;
        return 1;
    }
    return 0;
}