Deleting repeated lines in a text file

Jul 30, 2014 at 6:34pm
Hi everyone,

I have a text file with repeated lines, and I would like to get rid of the duplicate information. Can you help me with an algorithm to achieve this?

Example of my text file (the actual file may be huge, up to 500MB):

1
2
3
4
5
6
7
8
9
10
+Hello
+Bye
*The sun
-The moon
+Bye
/One coin
+Bye
+Bye
+Hello
*The sun


And i would expect to get something like this:

1
2
3
4
5
+Hello
+Bye
*The sun
-The moon
/One coin


I know how to open and read a file with fstream and getline(), but I don't know how to make the comparison.

Thanks.
Last edited on Jul 30, 2014 at 6:35pm
Jul 30, 2014 at 7:15pm
Each time you read a line, check whether it is already stored in a list.

If it is, ignore the line and proceed to the next one.

If it is not in the list, add it.

Once all lines from the file have been processed, output the list to a new file, which should then contain no duplicate lines.
Jul 30, 2014 at 7:50pm
Perhaps a set would be useful? As you read a line, try to put the line into a set.
http://www.cplusplus.com/reference/set/set/
It probably won't preserve the order that you see the lines. Would that be a problem?
Jul 30, 2014 at 9:24pm
Actually, a set would work nicely if we directly output each line to the output file after determining that it is not yet in the set and adding it.


Jul 30, 2014 at 9:25pm
At the end, the set itself may not preserve the original order, but by then that no longer matters.
Jul 30, 2014 at 9:31pm
Thanks to both of you, SIK and booradley60. It seems like using sets may solve my problem. Sorry if it was too obvious from the beginning, but I didn't know about the existence of sets.

And yes, the original order is not important.
Last edited on Jul 30, 2014 at 9:31pm
Jul 30, 2014 at 9:31pm
Yup, that would work nicely.