Eliminating duplicate lines in a text file

Feb 22, 2009 at 11:10pm
I have a very large text file, with something like 39 million lines. Each line is a string of 23 integers (but I don't need to treat them as integers for my purposes). Each line will have between 0 and 4 duplicate entries, which is why the file is so big.

What I want to do is to create a new text file which doesn't have any duplicates.

I haven't started writing code for this yet, but hopefully tomorrow...not really sure how to begin. (I was thinking about nested "for" loops -- look at line 1 and compare it to lines 2-39,000,000...then line 2 compared to lines 3-39,000,000. But I don't think I can actually get rid of any duplicates that way...)

If you have any ideas, I'd really appreciate it!

Thanks for your time,
Zachary
Feb 22, 2009 at 11:14pm
Addendum:

As another option, I could go back to my previous program (which is the one that generated the 39 million line text file in the first place) and have it check for duplicate entries before writing the text file. I'm not sure how I'd do that, either. That program basically "does stuff" and then spits the results out to a text file. Is there a way to say "spit this out to a text file, unless it's already on that text file?" (that seems like I would have to be writing to and reading from a text file at the same time, which seems dangerous to me).
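One way to get "write it unless it has already been written" without reading the output file back is to keep the lines written so far in memory. The sketch below assumes the distinct lines fit in RAM, which may not hold for tens of millions of lines. The helper name write_if_new is made up for illustration.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Hypothetical helper: writes `line` to `out` only the first time it is seen.
// `seen` remembers every line written so far, so its memory use grows with
// the number of distinct lines.
// Returns true if the line was written, false if it was a duplicate.
bool write_if_new(std::unordered_set<std::string>& seen,
                  std::ostream& out,
                  const std::string& line)
{
    // insert() reports in .second whether the element was actually added.
    if (seen.insert(line).second) {
        out << line << '\n';
        return true;
    }
    return false;
}
```

In the generating program, `out` would be the std::ofstream already being written to, and each result would go through write_if_new instead of being streamed directly.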

I'm just brainstorming...but I'd be thankful for any comments! :-)
Feb 23, 2009 at 1:58am
The best thing I can think of is just putting stuff into a linked list, checking for duplicates in the list and then, at the end, printing out the list in one go. But I don't exactly know what you're doing, to be honest :P
Feb 23, 2009 at 3:20am
Does it have to be a C++ program? You should be able to do this on the Linux/UNIX command line right quick: sort file.txt | uniq > newfile.txt

It sorts the lines of the file, then uniq removes adjacent duplicates, and the output is redirected to newfile.txt.
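For comparison, the same sort-then-drop-adjacent-duplicates idea can be sketched in C++ with std::sort and std::unique. This assumes all lines fit in memory at once. The function name sorted_unique is made up for illustration, and std::sort compares strings byte by byte, which may order lines differently from the shell's sort under some locales.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts the lines, then removes adjacent duplicates, mirroring `sort | uniq`.
std::vector<std::string> sorted_unique(std::vector<std::string> lines)
{
    std::sort(lines.begin(), lines.end());
    // std::unique moves the distinct elements to the front and returns the
    // new logical end; erase() then drops the leftover tail.
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());
    return lines;
}
```

The lines would be read with std::getline into the vector first, and the returned vector written out one line at a time.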
Last edited on Feb 23, 2009 at 3:21am
Feb 23, 2009 at 4:23pm
Seymore, I like that idea...but I've never used Linux/Unix either. Is there a way to run that command in a Windows environment? (At the very least I think I can use a DOS command to sort the file, but I don't know if there is a DOS command for "uniq").

If not Windows, I can get to an OS X Power Mac easily enough...

Thanks!
Feb 23, 2009 at 4:34pm
Wait, I think I got it. I'm using an SSH shell, and I think that just did it... Fingers crossed.
Feb 23, 2009 at 4:40pm
That did it!!!! Thanks, seymore. I'll have to keep these Unix commands in mind for the future. VERY helpful!
Feb 23, 2009 at 5:45pm
Alright, glad to help out.
Feb 25, 2009 at 1:28am
I do have to admit, I am still curious as to how to do it in a C++ program. I am also working on something that generates large lists of possible combinations, and I have to sort out duplicate combinations from the end result.

For brevity's sake, the generating program I wrote is:

#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    int n, x, r1, r2;

    // 37 candidate characters: a-z, 0-9 and underscore.
    string c[37] = { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
                     "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
                     "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "_" };
    string name;   // std::string avoids overflowing a fixed-size char buffer
    srand((unsigned)time(0));

    cout << "Enter the number of realistic potential passwords to generate :> ";
    cin >> n;
    cout << "Enter potential password length (1-10 chars) :> ";
    cin >> x;
    cout << "Enter destination (entries without path save in local folder) :> ";
    cin >> name;

    // Open the output file once, in append mode, rather than once per iteration.
    ofstream output(name.c_str(), ios::app);

    switch (x) {
    case 1:
        for (int i = 0; i < n; i++)
        {
            // rand() % 37 gives an index 0..36, so "a" (index 0) can be picked too.
            r1 = rand() % 37;
            cout << c[r1] << " " << i << "\n";
            output << c[r1] << " " << "\n";
        }
        break;
    case 2:
        for (int i = 0; i < n; i++)
        {
            r1 = rand() % 37;
            r2 = rand() % 37;
            cout << c[r1] << c[r2] << " " << i << "\n";
            output << c[r1] << c[r2] << " " << "\n";
        }
        break;

    default:
        cout << "value unknown";
    }

    system("PAUSE");   // Windows-only: keeps the console window open
    return 0;
}


The cases repeat (up to r10, though I removed 3-10 for space's sake), with one character added to the output each time (I'm still working on a cleaner way to do that without needing cases).
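One possible way to avoid the repeated cases is a loop that appends one random character per position, so the length becomes a plain parameter. This is only a sketch: the function name random_candidate is made up for illustration, the alphabet is assumed to be non-empty, and it relies on srand having been called once elsewhere, as in the program above.

```cpp
#include <cstdlib>
#include <string>

// Builds one random string of `length` characters drawn from `alphabet`.
// Assumes `alphabet` is not empty (otherwise the modulo divides by zero).
std::string random_candidate(const std::string& alphabet, int length)
{
    std::string result;
    for (int i = 0; i < length; ++i) {
        // Pick an index in the range 0 .. alphabet.size() - 1.
        result += alphabet[std::rand() % alphabet.size()];
    }
    return result;
}
```

With the alphabet stored as a single string such as "abcdefghijklmnopqrstuvwxyz0123456789_", the whole switch could in principle be replaced by calling random_candidate(alphabet, x) inside the one for loop.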

However, since I am picking random letters (seeded from my computer's time) instead of just incrementing the letter by one each step and so on, I get a lot of duplicates.
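The "incrementing the letter by one each step" alternative can be sketched as counting in base alphabet.size(), like an odometer, which visits every string of a given length exactly once and so produces no duplicates. The function name enumerate_all is made up for illustration. Collecting the results in a vector is only practical for short lengths, since the count grows as alphabet.size() to the power of the length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lists every string of exactly `length` characters over `alphabet`,
// in counting order, with the rightmost position changing fastest.
std::vector<std::string> enumerate_all(const std::string& alphabet, int length)
{
    std::vector<std::string> result;
    if (alphabet.empty() || length <= 0) return result;

    // digits[i] is the alphabet index currently used at position i.
    std::vector<std::size_t> digits(length, 0);
    while (true) {
        std::string s;
        for (std::size_t d : digits) s += alphabet[d];
        result.push_back(s);

        // Increment the rightmost digit, carrying to the left on wrap-around.
        int pos = length - 1;
        while (pos >= 0 && ++digits[pos] == alphabet.size()) {
            digits[pos] = 0;
            --pos;
        }
        if (pos < 0) break;   // every digit wrapped round: all strings listed
    }
    return result;
}
```

For large lengths, each string would be written to the file inside the loop instead of being stored.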

I am also running under a Windows environment which means that the Linux shortcut is out.

Also, first post ftw...
Last edited on Feb 25, 2009 at 1:30am
Feb 25, 2009 at 4:47am
Try the STL next_permutation algorithm on for size.
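A minimal sketch of how std::next_permutation is typically used follows. Note that it enumerates the distinct orderings of one given sequence, which is not the same as every string over an alphabet with repeated characters, so it may or may not fit the generator above. The function name all_permutations is made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns every distinct ordering of the characters in `s`, in sorted order.
std::vector<std::string> all_permutations(std::string s)
{
    // Start from the smallest ordering so the loop visits all of them.
    std::sort(s.begin(), s.end());

    std::vector<std::string> result;
    do {
        result.push_back(s);
        // next_permutation rearranges s into the next larger ordering and
        // returns false once it has wrapped back to the smallest one.
    } while (std::next_permutation(s.begin(), s.end()));
    return result;
}
```

Because each ordering is visited exactly once, the output contains no duplicates even when the input has repeated characters.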