Reading data from a file - line by line and fast

Hi,

I have a question about the performance of fgets(). My program reads a file and stores each line in a vector so the contents can be sorted later. The input files are usually quite large - around 300 MB of ASCII text with variable-length lines.

Is there a faster way to read the contents of a file and populate a vector than fgets()?

The important bits of the code are as follows:

char inputline[250];
FILE *inputfile;
inputfile = fopen(inputfilename, "r");

vector<string> data;

if(inputfile != NULL)
{
  data.clear();

  while(fgets(inputline, 250, inputfile) != NULL)   // fgets() returns NULL at EOF or error
  {
    data.push_back(inputline);
  }

  fclose(inputfile);
}


A 280 MB file takes around 9 minutes to read.




I'd say first load the file completely into memory (or in chunks larger than one line), then do the parsing in memory. This reduces the number of system calls needed to read the file.
If performance is really important, you can also do the parsing in another thread on one block of the file while the reading thread loads the next block, since disk I/O won't saturate the CPU while waiting.
Or simply use a std::ifstream and see if it is faster.
Thanks bartoli, I did just that - and this thing flies.

char *buffer;
char *mempointer;
char linearray[250];
int lineposition = 0;
long filesize;
vector<string> data;

FILE *inputfile;
inputfile = fopen(inputfilename, "r");

fseek(inputfile, 0, SEEK_END);              //find the file size
filesize = ftell(inputfile);
rewind(inputfile);

buffer = (char*) malloc(filesize + 1);      //allocate mem (+1 for the terminator)
fread(buffer, 1, filesize, inputfile);      //read the whole file into memory
buffer[filesize] = '\0';                    //terminate so the loop below can stop

mempointer = buffer;

while(*mempointer)                          //loop thru the buffer
{
  if(*mempointer == '\r' || *mempointer == '\n')   //when we hit a newline
  {
    if(lineposition > 0)
    {
      linearray[lineposition] = '\0';       //terminate the line (max 249 chars, as before)
      string proper(linearray);             //construct a string
      data.push_back(proper);               // and push it to the vector
      lineposition = 0;                     //reset for the next line
    }
  }
  else
  {
    linearray[lineposition] = *mempointer;  //copy each character into the array
    lineposition++;
  }
  mempointer++;                             // advance the pointer
}


Same 280 MB file - reading it takes maybe 10-15 seconds, and about 30 seconds goes to reconstructing the lines.

Using push_back on the vector, then... that takes quite a long time now. I will have to try this on a different machine with more RAM - this machine is swapping like mad.

If I understand correctly, memory usage (for a 280MB file) should be around:

280MB for copying the file to the memory
+280MB for copying the strings to the vector
-280MB for freeing the "buffer"
+280MB extra while sorting the vector
=560MB peak memory usage?
string proper(linearray);                //construct a string
data.push_back(proper);  


I still remember Scott Meyers' Effective C++ books, and removing unnecessary temporary objects can help performance. I don't know if the above 2 lines of code can be reduced to

data.push_back(string(linearray));

?
I reduced it like this...


string linedata;

while(*mempointer)                          //loop thru the buffer
{
  if(*mempointer == '\r' || *mempointer == '\n')   //when we hit a newline
  {
    if(!linedata.empty())
    {
      data.push_back(linedata);             // push the finished line to the vector
      linedata.clear();                     // and start a fresh one
    }
  }
  else
  {
    linedata.push_back(*mempointer);        //append the character to the string
  }
  mempointer++;                             // advance the pointer
}

free(buffer);


Didn't know that string has push_back...
It works quite well now. Thank you all for your input and ideas.