My program is not running fast!

I have written a C++ program to remove duplicate lines using extensible hashing. The source file contains lines, each holding one record. The program reads from this file into a 1 MB input buffer (a vector<string>). When the buffer is full, each element that is not already in the hash is pushed into the hash structure. Reading and pushing continue until the file is completely read. Once done, the hash elements are pushed into a 1 MB output buffer; when this output buffer becomes full, its contents are written to a file 'distinct.txt'. Writing continues until the hash is empty.

All this takes 5 seconds for a 1 MB source file. Now I want to handle a 1 GB source file, but my criterion is that the running time should not exceed a minute, preferably staying in the range of seconds.

How can I achieve this? My laptop has 1.87 GB of usable RAM and a 32-bit Core i3 at 2.27 GHz. I'm using Code::Blocks on Windows 7, Geany on Ubuntu 11.10, and KDevelop on Fedora 14.

Suggestions regarding memory management and how to increase the heap are also welcome.
> All this takes 5 seconds for 1MB source file.
To clarify: is that the size of the file or the number of words?
It looks incredibly slow either way. How big are your words and the dictionary alphabet?
Start by profiling your code.


$ sort input | uniq > output
A 1 MB source file means the file holds 1 MB / 16 bytes = 65,536 records.
Each record looks like "123 123 123 123\n". With getline(myfile, line) you get exactly 15 bytes: "123 123 123 123". A zip module, zip(string), reduces this record to "123123123123", 12 bytes. This is done so that the hash structure occupies less memory. When retrieving, unzip(string) reverses zip().

By the way, your answer is not related to my question. My focus is on improving my existing code, not just using a Linux shell shortcut...
You didn't show your code, so how the hell could I comment on it?
I gave you something to compare against, so you'd realize how incredibly slow it is.
I'm almost sure your usage of zip is slowing things down. Have you tried running without zipping? Using zip to compress chunks of 15 bytes is a WTF here: you gain nothing by doing so. For zip you should be feeding it chunks of at least a few kB to get an acceptable compression ratio. And by the way, zip is awfully slow, so if you want to use compression to save I/O, you'd be much better off with Snappy or LZW.
Another idea:
Why do you read everything into a vector first? The vector, if not preinitialized to its final size, also spends time on each push_back reallocating and extending its storage.
Topic archived. No new replies allowed.