Hello forum,
I have a question: does ifstream::open load the entire contents of the file into memory when it opens it? I need to create a format for a data library, and the files are going to be huge. I am going to reserve the first several lines for indexing the file, so that you can jump exactly where you need to go to access the data you want rather than having to search through the entire file. Since the files are going to be large, I do not want them loaded into memory; I only want to read the info needed for the index, plus the actual data. Could anyone tell me if ifstream::open works like this, or how I could accomplish this if not? Thanks
Thanks Galik, that is what I was hoping to hear. Another issue has come up: with my indexing scheme, the index tells me exactly which line of the file holds the data I need. I assumed there was a built-in function in C++ to jump directly to any line, but I guess there is not. I think I will now have to search line by line, counting newline characters until I reach the desired line. In that case, do you know whether everything I read through before reaching that line gets loaded and kept in memory? Or does the buffer fill up and discard the old data? Also, do you know if this method of counting newline characters is efficient? Some of my files could potentially have tens of millions of lines, each up to 8000 characters long...
It depends on the size of the buffer, which is likely quite small. And if your files are as huge as you say, they are not going to fit in RAM anyway.
If you are using an index then why not store the file offset of the data rather than the line number? Then you can go directly to it using file.seekg().
The data are matrices, so each line will correspond to a row, and each row will have multiple values for the columns. I plan on storing the data in fixed-size blocks of, say, 100 lines, so that even if I had a 2x2 matrix, I would pad it with 98 rows of nothing (or really comment symbols #) to fill the 100 lines. That way, if I know I want the 5th matrix stored, I could just read lines 500-599 and get all the data there. The indexing is slightly more complicated: I have a list of several parameters at the beginning of the file, and from those I can calculate (rather than store) the line where the data will be located. This is my first attempt at building any sort of data library, so if you have any suggestions for a more efficient system, let me know. Thanks again
The ability to go directly to a line is only really possible if the lines are of fixed length. So it might be worth coming up with a format that uses fixed-length lines rather than a fixed number of lines.
So, for instance, you could have one value per line, with the very first two lines giving the dimensions of the array.
If you make every line exactly 10 digits wide, and remember that each line also ends with an end-of-line character '\n' (11 bytes in total), you can find a specific line a bit like this:
std::ifstream ifs("mydata.dat", std::ios::binary);

std::streamoff line = 42;       // the (0-based) line you want; 42 is just an example
std::streampos pos = line * 11; // 10 digits + '\n' = 11 bytes per line
ifs.seekg(pos);
Then you should be able to read in your array something like this:
std::vector<std::vector<int> > array;
int xdim, ydim;

if(ifs >> xdim >> ydim)
{
    array.resize(xdim, std::vector<int>(ydim));

    for(int x = 0; x < xdim; ++x)
        for(int y = 0; y < ydim; ++y)
            ifs >> array[x][y];

    if(ifs) // all reads succeeded
    {
        // use array[0][0] here
    }
}
I used a std::vector rather than a raw array, as a std::vector is like a managed array. But you can use a raw array just as well.
Then your index would need to link each array with its beginning line number. Or you could avoid the index entirely by skipping through the file, reading only the dimensions of each array to calculate the position of the next one, until you reach the one you are looking for.