Reading data from a large text file

Hello all

I am reading a large file (39.7 MB) and searching for a substring in each line. If the program finds one, it stores that line in a vector. The way I am doing it is too slow. Could anyone please suggest a faster way of doing it? Below is the code that I am currently working with.

while(getline(in, line))
{
	for( vector<string>::const_iterator itr = vect1.begin(); itr != vect1.end(); ++itr )
	{
		str = (*itr).substr(166, 8);
		
		if( line.find(str) != string::npos )
		{		    			
			vect2.push_back(line);	// 413: common between 413 and 403
				
			break;
		}
	}
}


Regards
It looks like you are actually searching for multiple substrings (one per element of vect1) in each line. Is that intended?

You are also doing multiple memory allocations per line, even if you don’t actually execute “vect2.push_back(line)”. I would move the creation of the substrings outside of the “while” loop (you are repeating the same operation for each line).
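
A minimal sketch of what moving the substring creation out of the loop could look like. The names extract_keys and line_matches are hypothetical (not from the original post), and the offset 166 and length 8 are simply taken from the posted code; entries shorter than that are skipped as an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build the search keys once, before the file is read,
// instead of calling substr() again for every line of the file.
std::vector<std::string> extract_keys(const std::vector<std::string>& vect1,
                                      std::size_t offset = 166,
                                      std::size_t length = 8)
{
    std::vector<std::string> keys;
    keys.reserve(vect1.size());
    for (const std::string& s : vect1) {
        // Assumption: entries too short to hold a full key are skipped.
        // (substr() throws std::out_of_range if offset is past the end.)
        if (s.size() >= offset + length)
            keys.push_back(s.substr(offset, length));
    }
    return keys;
}

// Returns true if any precomputed key occurs anywhere in the line.
bool line_matches(const std::string& line,
                  const std::vector<std::string>& keys)
{
    for (const std::string& key : keys)
        if (line.find(key) != std::string::npos)
            return true;
    return false;
}
```

Inside the “while” loop the body then reduces to: if (line_matches(line, keys)) vect2.push_back(line);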

Additionally, I would consider changing the container type of vect2. The name suggests that it is a vector, and a vector can get slow once it holds a lot of elements: each time it runs out of capacity it has to reallocate and copy them. Unfortunately, none of the standard containers is perfect in this situation, but a list may be considerably faster.
Thanks Abramus

"I would move creation of substrings outside of the “while” loop."


I did it that way as well, but the process was still very slow.

"list may be considerably faster."


I am new to the STL and have only learnt about using vectors so far. Would you be able to give me a small working example of doing the same with a list? I'd be grateful.
It depends on which methods of std::vector you use. However, the STL containers are designed to have very similar interfaces: "push_back", "begin", and "end" exist for std::list as well. So maybe all you need to do is change the declaration of vect2.
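
To illustrate how little changes between the two containers, here is a small sketch. collect_matches is a hypothetical helper written only for this example; it is templated on the container type, so the same loop fills either a std::vector or a std::list through push_back.

```cpp
#include <list>
#include <string>
#include <vector>

// Hypothetical helper: copy every line containing `key` into a container of
// the caller's choice. Only push_back() is required of the container, which
// both std::vector and std::list provide.
template <typename Container>
Container collect_matches(const std::vector<std::string>& lines,
                          const std::string& key)
{
    Container out;
    for (std::vector<std::string>::const_iterator it = lines.begin();
         it != lines.end(); ++it)
    {
        if (it->find(key) != std::string::npos)
            out.push_back(*it);
    }
    return out;
}
```

Calling collect_matches<std::list<std::string> >(lines, "abc") and collect_matches<std::vector<std::string> >(lines, "abc") gives the same elements in the same order; whether the list is actually faster for a given file is something to measure rather than assume.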
closed account (z05DSL3A)
Your substring is in a fixed position in each of the strings in vect1, is it also in a fixed position in the read-in line?
That's correct, Grey Wolf. In the read-in line it is at the 38th position of a comma-separated string, and in vect1 it is the substring substr(166, 8).
closed account (z05DSL3A)
I was thinking that it may be faster to turn it around a bit: get the substring from the line and see if it is in vect1 (or in a list of substrings generated from it). But I think it would depend on the number of elements in vect1, how many repeats of the substring there are, and so on.
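
Getting the substring from the line means pulling out one comma-separated field. A rough sketch of that step follows; nth_field is a hypothetical helper, and it counts fields from zero, so whether the "38th position" corresponds to index 37 or 38 is an assumption to check against the real data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the field at zero-based position `index` in a
// separator-delimited line, or an empty string if the line has too few fields.
std::string nth_field(const std::string& line, std::size_t index, char sep = ',')
{
    std::size_t start = 0;
    // Skip over `index` separators to reach the start of the wanted field.
    for (std::size_t i = 0; i < index; ++i) {
        std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos)
            return std::string();   // fewer than index + 1 fields
        start = pos + 1;
    }
    std::size_t end = line.find(sep, start);
    // If no further separator exists, take everything up to the end of line.
    if (end == std::string::npos)
        return line.substr(start);
    return line.substr(start, end - start);
}
```

The returned field can then be looked up among the substrings taken from vect1 instead of searching the whole line for each of them.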
Grey Wolf!

I have done it this way as well, but the process is still very slow. Is it because there are two loops, the while loop and the for loop that iterates over the vector?
I don't know if I will be able to explain it correctly.

Almost every line contains one of the vect1 string elements, e.g.:

vect1[0] = 'abc'
vect1[1] = 'def'
vect1[2] = 'ghi'
vect1[3] = 'jkl'
vect1[4] = 'mno'


line0 = 12s3a er34a 84hsk1 'abc' o987a 76kjgh1	// 'abc' once
line1 = 12s3b er34b 84hsk2 'def' o987b 76kjgh2	// 'def' once
line2 = 12s3c er34c 84hsk3 'ghi' o987c 76kjgh3	// 'ghi' once
line3 = 12s3d er34d 84hsk4 'jkl' o987d 76kjgh4	// 'jkl' once
line4 = 12s3e er34e 84hsk5 'mno' o987e 76kjgh5	// 'mno' once
line5 = 12s3f er34f 84hsk6 'def' o987f 76kjgh6	// 'def' twice
line6 = 12s3g er34g 84hsk7 'abc' o987g 76kjgh7	// 'abc' twice 


I have to match each element of vect1 against a substring in each line; if a match is found, I store that line in another vector called vect2.

while(getline(in, line))
{
	for( vector<string>::const_iterator itr = vect1.begin(); itr != vect1.end(); ++itr )
	{
		str = (*itr).substr(166, 8);
		
		if( line.find(str) != string::npos )
		{		    			
			vect2.push_back(line);
				
			break;
		}
	}
}
closed account (z05DSL3A)
I'm thinking along the lines of:
    set<string> str_set;

    for( vector<string>::const_iterator itr = vect1.begin(); itr != vect1.end(); ++itr )
    {
        str_set.insert( (*itr).substr(166, 8) );
    }
    
    while(getline(in, line))
    {
        if(str_set.find( line.substr(166, 8) ) != str_set.end() )  //Substring from wherever it is
        {
            //store line
        }
    }
NB: just a quick thought, may need some work.
Grey Wolf

that is MUCH MUCH MUCH faster :D

THANKS A TON :)

A set is a sorted collection of unique values. What if I want to retrieve multiple values? Do I need to use a multiset? Will it affect the speed again?

Multisets are associative containers with the same properties as set containers, but allowing for multiple keys with equal values
closed account (z05DSL3A)
"What if I want to retrieve multiple values? Do I need to use a multiset? Will it affect the speed again?"
I'm not sure I understand the question.

In this scenario, you would pull out the substring from each of the elements in vect1 and store it in str_set. If an attempt is made to insert an identical substring, it is ignored. This is fine because you are not trying to match the read-in line with any particular element in vect1, just seeing if, say, "abc" is somewhere in vect1.
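
A small sketch showing that a duplicate insert into a std::set is simply ignored. count_ignored_duplicates is a hypothetical helper for illustration only; it relies on set::insert returning a pair whose second member is false when an equal key is already present.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: insert every key into `out` and report how many
// inserts were ignored because an equal key was already in the set.
std::size_t count_ignored_duplicates(const std::vector<std::string>& keys,
                                     std::set<std::string>& out)
{
    std::size_t ignored = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // insert() returns pair<iterator, bool>; the bool is false when
        // the set already holds an equal element and nothing was added.
        if (!out.insert(keys[i]).second)
            ++ignored;
    }
    return ignored;
}
```

Because the lookup only asks whether a key is present, the ignored duplicates do not change which lines are matched, so a multiset is not needed for this purpose.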

The main speed increase of the above comes from moving the for loop outside of the while loop: you go through the for loop once instead of once for every line in the file, and each line then needs only a single lookup in the set.
Alright Grey Wolf

Thank you sooo much :)