reading data from a large text file

Feb 7, 2011 at 11:17am

Hello all

I am reading a large file (39.7 MB) and searching each line for a substring. If the program finds one, it stores that line in a vector. The way I am doing it is too slow. Could anyone please suggest a faster way of doing it? Below is the code that I am currently working with.

while(getline(in, line))
{
	for( vector<string>::const_iterator itr = vect1.begin(); itr != vect1.end(); ++itr )
	{
		str = (*itr).substr(166, 8);
		
		if( line.find(str) != string::npos )
		{		    			
			vect2.push_back(line);	// 413: common between 413 and 403
				
			break;
		}
	}
}

Regards

Feb 7, 2011 at 11:44am

Abramus (285)

It looks like you are actually searching for multiple substrings (vect1) in each line. Is that intended?

You are also doing multiple memory allocations per line, even when “vect2.push_back(line)” is never executed. I would move the creation of the substrings outside of the “while” loop (you are repeating the same operation for every line).

Additionally, I would consider changing the container type of vect2. The name suggests that it is a vector, and a vector can be slow to grow once it already holds a lot of elements, because it has to reallocate and copy them whenever it runs out of capacity. Unfortunately, none of the standard containers is perfect in this situation, but a list may be considerably faster.

Feb 7, 2011 at 11:56am

GulHK (110)

Thanks Abramus

"I would move creation of substrings outside of the “while” loop."

I did it that way as well but the process was still very slow

"list may be considerably faster."

I am new to the STL and have only learnt about vectors so far. Would you be able to give me a small working example of doing the same with a list? I'd be grateful.

Feb 7, 2011 at 12:09pm

Abramus (285)

That depends on which methods of std::vector you use. However, STL containers are designed to have very similar interfaces: "push_back", "begin", and "end" exist for std::list as well. So maybe all you need to do is change the declaration of vect2.

Feb 7, 2011 at 12:10pm

closed account (z05DSL3A)

Your substring is in a fixed position in each of the strings in vect1. Is it also in a fixed position in the read-in line?

Feb 7, 2011 at 12:24pm

GulHK (110)

That's correct, Grey Wolf. In the read-in line it is at the 38th position in a comma-separated string, and in vect1 it is the substring substr(166, 8).

Feb 7, 2011 at 12:42pm

closed account (z05DSL3A)

I was thinking that it may be faster to turn it around a bit: get the substring from the line and see if it is in vect1 (or in a list of substrings generated from it). But I think it would depend on the number of elements in vect1, how many repeats of the substring there are, and so on.
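
Getting the substring out of the line could look like the sketch below. It assumes that "38th position in a comma separated string" means the 38th field counting from 1, and that fields never contain quoted or escaped commas. Neither assumption is confirmed in the thread, and nth_field is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the n-th (1-based) comma-separated field of
// `line`. Returns an empty string if n is 0 or the line has fewer than n
// fields, so a missing field and an empty field look the same to the caller.
std::string nth_field(const std::string& line, std::size_t n)
{
    if (n == 0)
        return std::string();

    std::size_t start = 0;
    for (std::size_t i = 1; i < n; ++i)
    {
        std::size_t comma = line.find(',', start);
        if (comma == std::string::npos)
            return std::string();   // fewer than n fields
        start = comma + 1;          // next field begins after the comma
    }

    std::size_t end = line.find(',', start);
    if (end == std::string::npos)
        return line.substr(start);  // n-th field is the last one
    return line.substr(start, end - start);
}
```

Under those assumptions, nth_field(line, 38) would yield the value to look up among the vect1 substrings.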

Feb 7, 2011 at 1:22pm

GulHK (110)

Grey Wolf!

I have done it this way as well, but the process is still very slow. Is it because there are two loops, the WHILE and the one iterating over the vector?

Feb 7, 2011 at 1:40pm

GulHK (110)

I don't know if I will be able to explain it correctly.

Almost every line contains one vect1 string element, e.g.

vect1[0] = 'abc'
vect1[1] = 'def'
vect1[2] = 'ghi'
vect1[3] = 'jkl'
vect1[4] = 'mno'


line0 = 12s3a er34a 84hsk1 'abc' o987a 76kjgh1	// 'abc' once
line1 = 12s3b er34b 84hsk2 'def' o987b 76kjgh2	// 'def' once
line2 = 12s3c er34c 84hsk3 'ghi' o987c 76kjgh3	// 'ghi' once
line3 = 12s3d er34d 84hsk4 'jkl' o987d 76kjgh4	// 'jkl' once
line4 = 12s3e er34e 84hsk5 'mno' o987e 76kjgh5	// 'mno' once
line5 = 12s3f er34f 84hsk6 'def' o987f 76kjgh6	// 'def' twice
line6 = 12s3g er34g 84hsk7 'abc' o987g 76kjgh7	// 'abc' twice

I have to match each element of vect1 with a sub-string in each line, if a match is found then store that line in another vector called vect2.

while(getline(in, line))
{
	for( vector<string>::const_iterator itr = vect1.begin(); itr != vect1.end(); ++itr )
	{
		str = (*itr).substr(166, 8);
		
		if( line.find(str) != string::npos )
		{		    			
			vect2.push_back(line);
				
			break;
		}
	}
}

Feb 7, 2011 at 2:42pm

closed account (z05DSL3A)

I'm thinking along the lines of:

    set<string> str_set;

    for( vector<string>::const_iterator itr = vect1.begin(); itr != vect1.end(); ++itr )
    {
        str_set.insert( (*itr).substr(166, 8) );
    }
    
    while(getline(in, line))
    {
        if(str_set.find( line.substr(166, 8) ) != str_set.end() )  // use the offset where the substring actually sits in the line
        {
            vect2.push_back(line);  // store line
        }
    }

NB: just a quick thought, may need some work.

Feb 7, 2011 at 3:07pm

GulHK (110)

Grey Wolf

that is MUCH MUCH MUCH faster :D

THANKS A TON :)

A set is a sorted collection of unique values. What if I want to retrieve multiple values? Do I need to use a multiset? Will it affect the speed again???

Multisets are associative containers with the same properties as set containers, but allowing for multiple keys with equal values.

Last edited on Feb 7, 2011 at 3:21pm

Feb 7, 2011 at 3:51pm

closed account (z05DSL3A)

"What if I want to retrieve multiple values? Do I need to use a multiset? Will it affect the speed again???"

I'm not sure I understand the question.

In this scenario, you would pull out the substring from each of the elements in vect1 and store it in str_set. If an attempt is made to insert an identical substring, it is ignored. This is fine because you are not trying to match the read-in line with any particular element in vect1, just seeing whether, say, "abc" is somewhere in vect1.

The main speed increase in the above comes from moving the for loop outside of the while: you go through the for loop once instead of once for every line in the file.
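
A small sketch of the duplicate handling described above, plus one possible reading of the multiset question. std::set::insert returns a pair whose bool member is false when the value was already present, so a repeated key is simply dropped. If "retrieve multiple values" means getting back every vect1 entry that shares a key (an assumption, since the question was never clarified), a std::multimap from key to vect1 index is one option. All function and type names here are invented for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Returns true if `key` was newly inserted, false if it was already present.
bool insert_key(std::set<std::string>& keys, const std::string& key)
{
    return keys.insert(key).second;
}

typedef std::multimap<std::string, std::size_t> KeyIndex;

// Hypothetical helper: map each key to the index of every entry that has it.
// `vect1_keys` holds the already-extracted key of each vect1 element.
KeyIndex build_index(const std::vector<std::string>& vect1_keys)
{
    KeyIndex index;
    for (std::size_t i = 0; i < vect1_keys.size(); ++i)
        index.insert(std::make_pair(vect1_keys[i], i));
    return index;
}

// Hypothetical helper: all indices stored under `key` (empty if none).
std::vector<std::size_t> indices_for(const KeyIndex& index,
                                     const std::string& key)
{
    std::vector<std::size_t> result;
    std::pair<KeyIndex::const_iterator, KeyIndex::const_iterator> range =
        index.equal_range(key);
    for (KeyIndex::const_iterator it = range.first; it != range.second; ++it)
        result.push_back(it->second);
    return result;
}
```

For the filtering task in this thread the plain set is enough, because every read-in line is tested on its own; two lines containing "abc" are both stored even though "abc" appears in the set only once.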

Feb 7, 2011 at 4:31pm

GulHK (110)

Alright Grey Wolf

Thank you sooo much :)

Topic archived. No new replies allowed.