This function is part of a small program whose job is to take a delimited file and create SQL insert statements. It doesn't do much, but the delimited file is about 1 GB and processing that many records takes about an hour. I would like to speed it up to about 10 or 15 minutes. Where do I begin?
void getting_a_record (rec* r, insert* in, std::ifstream& ifs, std::ofstream& ofs)
{
    //extracting record from the source////////////////////////////////
    std::string temp;
    std::getline (ifs, temp);
    if (!ifs.eof ())
    {
        std::istringstream iss {temp};
        if (!iss.good ())
            throw std::runtime_error("no stream");
        std::vector<std::string> v; //temporarily holding the values in a vector
        for (; iss.good (); )
        {
            std::getline (iss, temp, '|');
            v.push_back (temp);
        }
        ///////////////////////////////////////////////////////////////
        r->reset_value (); //Just ensuring that all values from the previous record are cleared.
        //placing the data in the class////////////////////////////////
        for (int i = 0; i < r->size (); ++i)
            r->set_value (v[i], i);
        ///////////////////////////////////////////////////////////////
        //writing it to an out file////////////////////////////////////
        ofs << in->simple_insert () << '\n';
        ///////////////////////////////////////////////////////////////
    }
}
The vector is your problem: it keeps all of its contents contiguous, which means that every time it outgrows its current allocation it has to copy itself in its entirety to a new, larger block of memory.
EDIT: As for improving performance, a 1 GB file should be trivial to hold in memory on a modern machine. You could probably read the whole thing into memory and not even worry about storing it in a vector. Other improvements could be had depending on the platform you are using; on Windows, creating a File Mapping object would let you map the whole file into memory. You could also try marking the memory section as not pageable if you can, although the performance gain from doing that is going to be variable.
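If you want to try the file-mapping route, here is a minimal sketch of mapping the input file read-only with the Win32 API. process_mapped_file and the path handling are placeholders, error handling is kept to a bare minimum, and real code would want RAII wrappers around the handles:

#include <windows.h>
#include <stdexcept>

// Map the whole input file into memory read-only and scan it directly,
// instead of going through ifstream/istringstream.
void process_mapped_file(const char* path)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        throw std::runtime_error("cannot open file");

    LARGE_INTEGER size{};
    GetFileSizeEx(file, &size);          // size.QuadPart = number of bytes mapped

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping)
    {
        CloseHandle(file);
        throw std::runtime_error("cannot create file mapping");
    }

    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (!data)
    {
        CloseHandle(mapping);
        CloseHandle(file);
        throw std::runtime_error("cannot map view of file");
    }

    // data now points at the whole file; scan it for '\n' and '|' here.
    // ... record-splitting logic goes here ...

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}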
Do you know the maximum number of columns in each record?
If so, you can reserve the vector's capacity so that the maximum number of elements is preallocated. That eliminates the reallocations that happen whenever you exceed the capacity the vector starts with.
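For example, a minimal sketch of the splitting step with a preallocated vector (split_record is a made-up helper; 15 is the maximum column count mentioned elsewhere in the thread):

#include <string>
#include <vector>

std::vector<std::string> split_record(const std::string& line)
{
    std::vector<std::string> v;
    v.reserve(15);              // assumed maximum column count;
                                // push_back never reallocates below this capacity
    std::string::size_type start = 0;
    for (;;)
    {
        const auto pos = line.find('|', start);
        v.push_back(line.substr(start, pos - start));
        if (pos == std::string::npos)
            break;
        start = pos + 1;
    }
    return v;
}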
So 15 strings of between 1 and 15 characters and you're still hitting 1 GB of space on the disk? How is the file fragmentation on the disk? What model of HDD is it? Are you going through a network, or is the data kept local on the machine? What platform is this? <- That one is important.
Also, what does the system's performance monitor say is the slowdown? Are you using SSDs? If you're writing to a mirror then that will most likely be your bottleneck regardless of what you try to do.
The source file is a delimited file of about 1 GB. There are about 750k records in the file. I take the delimited file and change each record into a SQL insert statement.
I am on Windows; the data is on a Windows server.
I am not aware of any disk-space problem.
EDIT:
Also, what does the system's performance monitor say is the slowdown?
I am sorry, but I don't know how to read the performance monitor. What should I look for?
EDIT: I wrote this before reading your edit. The problem is that each line in the delimited file is a record and needs a separate insert statement written.
I missed this part earlier. Hitting the disk is expensive even if you are using SSDs. Based on that, we would need to know what your performance monitor says the slowdown is before we'll know whether you can get that drastic an improvement. Ah! You ninja'd me!
Start -> Run or Windows key + R and enter "perfmon" into the Run box. This will bring up the Performance Monitor. You'll have to decide which snap-ins to use based on your application.
Are the import data sources and the SQL table both local? Going across a network connection isn't going to do you any favors here. How much free RAM is on the box before this process kicks off?
I fail to see how this function is the bottleneck of your entire program. From what I can see, the program first reads a line from the input file stream into a string. This string is split on every occurrence of the '|' character and the pieces are fed into a vector. The maximum number of strings split is 15, as you have said. This entire function will not even take up to 10 ms to complete.
If you are looking to tweak the function, there is one thing you can do:
void getting_a_record (rec* r, insert* in, std::ifstream& ifs, std::ofstream& ofs)
{
    //extracting record from the source///////////////////////////////
    if (!ifs.eof ())
    {
        std::string temp, del;
        std::getline (ifs, temp);
        std::istringstream iss (temp);
        if (!iss.good ())
            throw std::runtime_error("no stream");
        r->reset_value (); //Just ensuring that all values from the previous record are cleared.
        for (int i = 0; iss.good (); i++)
        {
            std::getline (iss, del, '|');
            r->set_value (del, i);
        }
        ///////////////////////////////////////////////////////////////
        //writing it to an out file////////////////////////////////////
        ofs << in->simple_insert () << '\n';
        ///////////////////////////////////////////////////////////////
    }
}
You say you have 750K records and are presumably calling this function once per record. That means you're constructing and destroying temp, iss and v 750K times each.
I would suggest trying to eliminate those 2.25M allocations, i.e. move the allocation of temp, iss and v outside your record loop and reuse them (rough sketch below).
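Something along these lines, assuming a hypothetical convert_file driver loop around your code; the r->set_value / in->simple_insert calls from your original function would slot in where the comment indicates:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical driver loop: temp, iss and v live outside the per-record loop
// and are reused, so each is constructed once instead of 750K times.
void convert_file(std::ifstream& ifs, std::ofstream& ofs)
{
    std::string temp;                  // reused line buffer
    std::string field;                 // reused field buffer
    std::istringstream iss;            // reused stream
    std::vector<std::string> v;        // reused field vector
    v.reserve(15);                     // assumed maximum column count

    while (std::getline(ifs, temp))
    {
        iss.clear();                   // reset eof/fail bits left from the last record
        iss.str(temp);                 // point the same stream at the new line
        v.clear();                     // keeps capacity, drops old contents

        while (std::getline(iss, field, '|'))
            v.push_back(field);

        // ... the r->set_value(...) loop and
        //     ofs << in->simple_insert() << '\n'; from the original go here ...
    }
}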
Also, you haven't shown us the rec or insert classes, nor how rec gets to insert, so we can't give you any insight into impact that may have on your problem.
Are the import data sources and the SQL table both local?
Yes, they are both local. I am using sqlite3. I think you might have a misconception about what I am doing; what I am doing is quite simple.
I have a delimited text file that is 1 GB. I am converting this into another text file that has approximately 750k SQL insert statements (one per record) and importing that into sqlite3.
I can do this now, but it takes approximately 1 hour to convert the delimited file into a SQL file (just a text file full of SQL statements). I have to do this approximately 36 times, so that's 36 hours.
To call what I wrote an application is a bit much. I just wrote a simple utility program to place a ton of data in a SQLite database for some simple analysis.
I am not an expert on SQLite and have not used it in a while, however:
When bulk importing data by looping over a data set and calling INSERT over and over, each INSERT is treated as a separate transaction, meaning the table and its indexes must be updated and the database flushed to disk with each INSERT (since SQLite is a single-file database). The disk I/O is doing you in. The solution is to wrap groups of 1,000 or even 10,000 INSERT statements in a single transaction. You may see a speed increase of 100-fold or more.
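In your conversion program that could be as simple as emitting BEGIN/COMMIT markers around each batch as you write the insert statements. A sketch, not your code: write_batched and the statements vector stand in for whatever produces the strings from in->simple_insert(), and the batch size of 1,000 is an assumption:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write the generated INSERT statements in batches of 1000,
// each batch wrapped in a single transaction.
void write_batched(const std::vector<std::string>& statements, std::ofstream& ofs)
{
    const std::size_t batch_size = 1000;   // assumed batch size

    for (std::size_t i = 0; i < statements.size(); ++i)
    {
        if (i % batch_size == 0)
            ofs << "BEGIN TRANSACTION;\n";

        ofs << statements[i] << '\n';

        // close the batch at the batch boundary or at the last statement
        if (i % batch_size == batch_size - 1 || i + 1 == statements.size())
            ofs << "COMMIT;\n";
    }
}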
If you are using the SQLite command-line utility to do the actual insertions, your text file will have transactions that look like this: