I have a question about the performance of fgets(). My program reads a file and stores each line in a vector so the contents can be sorted later. The input files are usually quite large, around 300 MB. They are ASCII text with variable-length lines.
Is there a faster way to read the contents of a file and populate a vector than fgets()?
I'd say first load the file completely in memory (or in bigger chunks than one line), then do the parsing in memory. This will reduce system calls to read the file.
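A minimal sketch of that idea, assuming the whole file fits in memory. The function names `read_whole_file` and `split_lines` are made up for illustration; one bulk read pulls the file into a single string, and the line splitting then runs purely in memory.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read an entire file into one string with a single bulk read.
// Returns an empty string if the file cannot be opened or is empty.
std::string read_whole_file(const std::string& path)
{
    // Open in binary mode, positioned at the end, so tellg() gives the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return std::string();
    const std::streamsize size = in.tellg();
    if (size <= 0) return std::string();
    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&buffer[0], size);
    buffer.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
    return buffer;
}

// Split an in-memory buffer into lines on '\n', dropping a trailing '\r'
// so that CR/LF files give the same result as LF files.
std::vector<std::string> split_lines(const std::string& buffer)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos) end = buffer.size();  // last line, no newline
        std::size_t len = end - start;
        if (len > 0 && buffer[start + len - 1] == '\r') --len;
        lines.emplace_back(buffer, start, len);  // copy just this line
        start = end + 1;
    }
    return lines;
}
```

Something like `split_lines(read_whole_file(inputfilename))` would then give the vector to sort. Whether this beats fgets() on a given machine is worth measuring rather than assuming.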
If performance is really important, you can also do the 'parsing' in another thread on one block of the file while the reading thread loads the next block, since the CPU is mostly idle while waiting on disk I/O.
Or simply use a std::ifstream and see if it is faster.
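For comparison, the std::ifstream variant could look roughly like this. It is only a sketch: `read_lines` is a hypothetical helper name, and error handling is reduced to returning an empty vector when the file cannot be opened.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read a text file line by line with std::getline.
// Lines can be any length; no fixed-size buffer is needed.
std::vector<std::string> read_lines(const std::string& path)
{
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        // getline strips '\n'; also drop a '\r' left over from CR/LF files
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

Timing this against the fgets() loop on the same file is the only reliable way to tell which is faster on a particular system.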
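char* buffer;
char linearray[250];
int lineposition = 0;
vector<string> data;
FILE *inputfile;
inputfile = fopen(inputfilename, "rb"); //binary mode so ftell matches the bytes read
fseek(inputfile, 0, SEEK_END); //find the filesize
long filesize = ftell(inputfile);
rewind(inputfile);
buffer = (char*) malloc (filesize + 1); //allocate mem, plus one byte for the terminator
size_t bytesread = fread (buffer,1,filesize,inputfile); //read the file to the memory
fclose(inputfile);
buffer[bytesread] = 0; //terminate the buffer so the loop below stops
char* mempointer = buffer;
while(*mempointer) //loop thru the buffer
{
    if(*mempointer == 13 || *mempointer == 10) //newline ends the current line
    {
        if(lineposition > 0) //skip empty lines and the second half of CR/LF
        {
            linearray[lineposition] = 0; //terminate the line
            string proper(linearray); //construct a string
            data.push_back(proper); // and push it to vector
            lineposition = 0; //init variables
        }
    }
    else if(lineposition < 249) //put every character on array, truncating overlong lines
    {
        linearray[lineposition] = *mempointer;
        lineposition++;
    }
    mempointer++; // advance pointer
}
if(lineposition > 0) //last line had no trailing newline
{
    linearray[lineposition] = 0;
    data.push_back(string(linearray));
}
free(buffer); //the vector holds its own copies now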
Same 280 MB file - reading the file takes maybe 10-15 seconds, and about 30 seconds goes for reconstructing the lines.
Using push_back for the vector then... That takes quite a long time now. I will have to try this on a different machine with more RAM - this machine is swapping like mad.
If I understand correctly, memory usage (for a 280MB file) should be around:
280MB for copying the file to the memory
+280MB for copying the strings to the vector
-280MB for freeing the "buffer"
+280MB extra for sorting the vector out
=560MB total memory usage, so roughly 600MB?
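One way to lower that estimate, sketched here as an alternative and not what the code above does: keep the file buffer alive and sort lightweight views into it. This assumes a C++17 compiler for `std::string_view`, and `sorted_line_views` is a made-up name. std::sort rearranges the elements in place, so no second copy of the vector is needed for the sort itself; the cost on top of the buffer is one small view object per line.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Build one view per line into `buffer` and sort the views.
// No characters are copied. The views are only valid while `buffer`
// is alive and unmodified, so the caller must keep it around.
std::vector<std::string_view> sorted_line_views(const std::string& buffer)
{
    std::vector<std::string_view> lines;
    const std::string_view all(buffer);
    std::size_t start = 0;
    while (start < all.size()) {
        std::size_t end = all.find('\n', start);
        if (end == std::string_view::npos) end = all.size();  // last line, no newline
        lines.push_back(all.substr(start, end - start));
        start = end + 1;
    }
    std::sort(lines.begin(), lines.end());  // in place, byte-wise comparison
    return lines;
}
```

The trade-off is lifetime management: the buffer cannot be freed until the sorted views are no longer needed, and passing a temporary string to this function would leave the views dangling.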
string proper(linearray); //construct a string
data.push_back(proper);
I still remember Scott Meyers' Effective C++ books: removing unnecessary temporary objects can help performance. I don't know whether the two lines of code above can be reduced to