finding duplicates for part of a string

Hello everyone

I am trying to find duplicates of part of a string, not the whole string. The strings are stored in a file. Each line of the file contains a string, and many of them (though not all) look something like this:

[code removed]


where the '0' at the very beginning is common throughout a block of lines. The next block will have a '1' common throughout the block, and so on. The part of the string from CCD until the end can be duplicated, and I have to find how many such duplicate lines there are for each '0', '1', and so on. The file can contain any combination of strings, not just the ones in the example above, but if it contains duplicates, the part starting at the 'C' of 'CCD' and running to the end is what would be repeated (CCD can be any string).

After I find the duplicates, I have to compare the result with another file, which contains all the unique strings extracted from the first file (the one with duplicates). I want to know whether the file of unique values contains every string that appears in the first file, that is, whether all of the strings have been extracted uniquely and stored in the second file.

Below are some more lines from the file

[code removed]
See boost::split for getting the various parts of each line and boost::join for rejoining the parts that are relevant.
You need a map that maps the first number to a set of strings (the CCD* parts). Add each line to the appropriate set. If the set already contains the string, you have a duplicate, which you can discard or handle as needed.
After that you read the second file into another map and check for each map key/set element pair in the first map if it appears in the second one.
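The map-of-sets idea can also be sketched with the standard library alone, without Boost. This is only a rough illustration under an assumed line layout: the block number is the integer before the first ':' and the part that may repeat is everything after the third comma. The names BlockStats, afterNthComma and countDuplicates are made up for this sketch, and the field position will need adjusting to the real data.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Assumed (hypothetical) line layout: "<block>:<...>,<...>,<...>,<payload>"
// The block number is the integer before the first ':'; the payload is
// everything after the third comma.
struct BlockStats {
    std::set<std::string> unique;  // distinct payloads seen in this block
    int duplicates = 0;            // lines whose payload was already present
};

// Returns the text after the n-th comma, or "" if the line has fewer commas.
std::string afterNthComma(const std::string& line, int n) {
    std::size_t pos = 0;
    for (int i = 0; i < n; ++i) {
        pos = line.find(',', pos);
        if (pos == std::string::npos) return "";
        ++pos;  // step past the comma
    }
    return line.substr(pos);
}

// Reads all lines and groups their payloads by block number.
std::map<int, BlockStats> countDuplicates(std::istream& in) {
    std::map<int, BlockStats> blocks;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // no block number: skip

        std::istringstream number(line.substr(0, colon));
        int block = 0;
        if (!(number >> block)) continue;          // prefix is not an integer

        std::string payload = afterNthComma(line, 3);
        if (payload.empty()) continue;             // too few fields: skip

        BlockStats& stats = blocks[block];
        // insert() reports through .second whether the string was new.
        if (!stats.unique.insert(payload).second) ++stats.duplicates;
    }
    return blocks;
}
```

To use it on a real file, open a std::ifstream and pass it to countDuplicates; the result then holds, for each block number, the set of distinct strings and the number of repeated lines.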
I am totally new to the STL and do not know anything about Boost. Could you please write a small working example for me?
I suppose you'll need to do some reading:
http://www.cplusplus.com/reference/stl/map/
http://www.cplusplus.com/reference/stl/set/

The following code should give you a general idea, however you should treat it as pseudo-code.
explode should be equivalent to boost::split, implode to boost::join and StrToInt to boost::lexical_cast<int>.
http://www.boost.org/doc/libs/1_44_0/doc/html/string_algo.html (split, join)
http://www.boost.org/doc/libs/1_44_0/libs/conversion/lexical_cast.htm (lexical_cast)

template <class StreamClass> void readFile(map<int,set<string> >& result,StreamClass& stream)
{
  foreach(line,stream)
  { //for each line, split it into parts delimited by ,
    auto parts=explode(line,',');
    result[StrToInt(explode(parts[0],':')[0])].insert(implode(parts,',',IDM_NOLASTDELIM,3)); //add "relevant" part
  }
}

int main()
{
  //open file1 and file2 here
  map<int,set<string> > map1,map2;
  readFile(map1,file1);
  readFile(map2,file2);

  bool map2ContainsAllEntries=true;
  foreach(kv_pair,map1)
  { //for each set...
    auto& aset=map2[kv_pair.first]; //cache the corresponding set in the second map (operator[] creates an empty set if the key is missing)
    foreach(str,kv_pair.second)
    { //for each string in the set of the first map, check if it exists in the corresponding set in the second map
      if (aset.find(str)==aset.end()) //set's own find is logarithmic; std::find would scan linearly
      { //if not, we can stop right away
        map2ContainsAllEntries=false;
        break;
      }
    }
    if (!map2ContainsAllEntries)break;
  }
}
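
For comparison, the containment check could look roughly like this in plain standard C++ (C++11 or later). This is a sketch rather than a drop-in replacement: BlockMap and findMissing are names invented here, and instead of stopping at the first mismatch it collects every missing entry, which may be more useful when verifying an extraction.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using BlockMap = std::map<int, std::set<std::string>>;

// Returns every (block, string) pair that occurs in `withDuplicates`
// but is absent from `uniqueOnly`. An empty result means nothing is missing.
std::vector<std::pair<int, std::string>> findMissing(const BlockMap& withDuplicates,
                                                     const BlockMap& uniqueOnly) {
    std::vector<std::pair<int, std::string>> missing;
    for (const auto& entry : withDuplicates) {
        // find() never inserts a key, unlike operator[].
        auto other = uniqueOnly.find(entry.first);
        for (const std::string& str : entry.second) {
            if (other == uniqueOnly.end() || other->second.count(str) == 0)
                missing.emplace_back(entry.first, str);
        }
    }
    return missing;
}
```

If findMissing returns an empty vector, every string from the first map is present under the same key in the second one.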

Thank you so much Athar

I will try to understand the syntax first :P. If I do not understand something then please do help.

Many thanks once again
Hello again Athar. Please go through my problem again and see if you can do it in a simpler way ;'(

Could you please help me compare two files containing lines of strings? The two files contain similar data, except that one contains duplicates while the other contains only unique lines. I have to make sure that the file with unique data contains all the lines present in the file with duplicates.

Both files look something like this, and I have to compare only the lines that have the substring CME right before the first comma:

[code removed]
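
One possible way to approach this simpler version, as a rough sketch only: read each file into a std::set of whole lines, keeping just the lines whose first comma-separated field ends in CME, and then take the set difference. The function names taggedLines and missingLines are invented for this illustration, and it assumes that whole lines are what should be compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Collects the distinct lines in which `tag` sits immediately before the
// first comma. Lines without a comma are ignored.
std::set<std::string> taggedLines(std::istream& in, const std::string& tag) {
    std::set<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t comma = line.find(',');
        if (comma == std::string::npos || comma < tag.size()) continue;
        if (line.compare(comma - tag.size(), tag.size(), tag) == 0)
            lines.insert(line);  // the set silently drops repeated lines
    }
    return lines;
}

// Returns the tagged lines present in the first stream but not in the second.
std::vector<std::string> missingLines(std::istream& withDuplicates,
                                      std::istream& uniqueOnly,
                                      const std::string& tag = "CME") {
    std::set<std::string> expected = taggedLines(withDuplicates, tag);
    std::set<std::string> actual = taggedLines(uniqueOnly, tag);
    std::vector<std::string> missing;
    // Both sets are sorted, which is what set_difference requires.
    std::set_difference(expected.begin(), expected.end(),
                        actual.begin(), actual.end(),
                        std::back_inserter(missing));
    return missing;
}
```

With real files, two std::ifstream objects can be passed in place of the streams; an empty result would mean the file of unique lines covers everything in the file with duplicates.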