Eliminating duplicate lines in a text file

I have a very large text file, with something like 39 million lines. Each line is a string of 23 integers (but I don't need to treat them as integers for my purposes). Each line will have between 0 and 4 duplicate entries, which is why the file is so big.

What I want to do is to create a new text file which doesn't have any duplicates.

I haven't started writing code for this yet, but hopefully tomorrow...not really sure how to begin. (I was thinking about nested "for" loops -- look at line 1 and compare it to lines 2-39,000,000...then line 2 compared to lines 3-39,000,000. But I don't think I can actually get rid of any duplicates that way...)

If you have any ideas, I'd really appreciate it!

Thanks for your time,
Zachary

Zachary (38)

Addendum:

As another option, I could go back to my previous program (the one that generated the 39 million line text file in the first place) and have it check for duplicate entries before writing the text file. I'm not sure how I'd do that, either. That program basically "does stuff" and then spits the results out to a text file. Is there a way to say "spit this out to the text file, unless it's already in that text file"? (It seems like I would have to be writing to and reading from a text file at the same time, which seems dangerous to me.)

I'm just brainstorming...but I'd be thankful for any comments! :-)

Odahk (9)

The best thing I can think of is just putting stuff into a linked list, checking for duplicates in the list and then, at the end, printing out the list in one go. But I don't exactly know what you're doing, to be honest :P

seymore15074 (449)

Does it have to be a C++ program? You should be able to do this on the Linux/UNIX command line right quick: sort file.txt | uniq > newfile.txt

It sorts the lines of the file, then uniq removes adjacent duplicates, and the output is redirected to newfile.txt.

Zachary (38)

Seymore, I like that idea...but I've never used Linux/Unix. Is there a way to run that command in a Windows environment? (At the very least I think I can use a DOS command to sort the file, but I don't know if there is a DOS equivalent of "uniq".)

If not Windows, I can get to an OS X Power Mac easily enough...

Thanks!

Zachary (38)

Wait, I think I got it. I'm using an SSH shell, and I think that just did it... Fingers crossed.

Zachary (38)

That did it!!!! Thanks, seymore. I'll have to keep these Unix commands in mind for the future. VERY helpful!

seymore15074 (449)

Alright, glad to help out.

Leith (1)

I do have to admit, I am still curious how to do it in a C++ program. I am also working on something that generates large lists of possible combinations, and I have to filter the duplicate combinations out of the end result.

For brevity's sake, the generating program I wrote is:

#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // Candidate characters: a-z, 0-9 and underscore (37 entries, indices 0-36).
    const string c[37] = { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
        "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "0", "1", "2", "3",
        "4", "5", "6", "7", "8", "9", /*" ",*/ "_" }; //, "]", "[", ":"};
    int n, x;
    string name; // destination file; a string avoids overflowing a fixed-size buffer
    srand((unsigned)time(0));

    cout << "Enter the number of realistic potential passwords to generate :> ";
    cin >> n;
    cout << "Enter potential password length (1-10 chars) :> ";
    cin >> x;
    cout << "Enter destination (entries without path save in local folder) :> ";
    cin >> name;

    // Open the destination once, in append mode, rather than once per line.
    ofstream output(name.c_str(), ios::app);

    switch (x) {
    case 1:
        for (int i = 0; i < n; i++) {
            // Pick a random index into c; % 37 covers all of 0-36.
            int r1 = rand() % 37;
            string k = c[r1];
            cout << k << " " << i << "\n";
            output << k << " " << "\n";
        }
        break;
    case 2:
        for (int i = 0; i < n; i++) {
            int r1 = rand() % 37;
            int r2 = rand() % 37;
            string k = c[r1] + c[r2];
            cout << k << " " << i << "\n";
            output << k << " " << "\n";
        }
        break;

    default:
        cout << "value unknown";
    }

    system("PAUSE"); // Windows-only: keeps the console window open
    return 0;
}

The cases repeat (up to r10, though I removed 3-10 for space's sake), each one adding one more character to the output (I'm still working on a cleaner way to do that without needing cases).
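One possible way to drop the cases is a loop that appends one random character per position. This is only a sketch: the helper name random_candidate is hypothetical, it takes the alphabet as a single string rather than an array of one-character strings, and it assumes srand() has already been called and that the alphabet is not empty:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: builds a random candidate of `length` characters
// drawn from `alphabet`. Assumes `alphabet` is non-empty.
std::string random_candidate(int length, const std::string& alphabet)
{
    std::string result;
    for (int i = 0; i < length; ++i)
        result += alphabet[std::rand() % alphabet.size()];
    return result;
}
```

With this, a single loop over n can call random_candidate(x, alphabet) for any x, so no per-length case is needed. It does nothing about duplicates by itself; those still have to be filtered separately.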

However, since I am picking letters at random (seeded from my computer's clock) instead of just incrementing the letter by one each step and so on, there is a lot of room for duplicates.

I am also running under a Windows environment which means that the Linux shortcut is out.

Also, first post ftw...

seymore15074 (449)

Try the STL next_permutation algorithm on for size.
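A short sketch of how std::next_permutation can enumerate orderings without ever repeating one. The wrapper name all_permutations is made up for illustration. Note that it only reorders the characters it is given; it does not generate strings that draw repeated characters from a larger alphabet:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns every distinct ordering of the characters in `chars`.
std::vector<std::string> all_permutations(std::string chars)
{
    // next_permutation steps through orderings in lexicographic order,
    // so start from the sorted arrangement to visit all of them.
    std::sort(chars.begin(), chars.end());
    std::vector<std::string> result;
    do {
        result.push_back(chars);
    } while (std::next_permutation(chars.begin(), chars.end()));
    return result;
}
```

Because each ordering is produced exactly once, there is nothing to deduplicate afterwards. The number of orderings grows factorially with the length of the input, so collecting them all in a vector is only practical for short inputs.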
