High speed processor

You just have to count the number of occurrences of each name?
Then, if you have enough memory, you only need one pass:
#include <fstream>
#include <string>
#include <unordered_map>

// One pass over the file: each whitespace-separated word bumps its count.
std::unordered_map<std::string, int> lex;

std::ifstream f("file.txt");
std::string word;
while (f >> word)
{
    ++lex[word];
}
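Once the loop finishes, the counts can simply be written out, e.g. (a small follow-up sketch, needs <iostream>; it only uses the lex map from above):

for (const auto& entry : lex)
    std::cout << entry.first << ' ' << entry.second << '\n';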


If you are limited on memory, I have another, more complex solution, but I want to make sure this is your problem first.
>TheDestoyer
@kbw

The guy didn't share any code, and that's why the only suggestion we can provide is optimisations. You can't judge that the guy has written bad algorithms just because his program takes a long time to run. In my master's thesis, I had to run a program that took 2 days to finish (it involved discretising continuum objects onto a 3D grid of a 1 mm^3 sample at nm resolution). Does this make me a bad programmer or a bad algorithm writer? It's a very subjective thing and you have to give advice with what you have. The guy asked whether he could improve the execution time of his program, and the answer is: with optimisations. Not by accusing him of being a bad algorithm writer.
(Comparing reference counting and the space time continuum is not comparing apples and oranges)

The OP stated it does 50 million passes through the dataset. If those are actually 50 million passes through I/O then yes, it is a terrible algorithm; if they are 50 million passes through memory, that's a different story. Assuming a name is 30 characters long (might be more, might be less), this is only ~2.49 GB. 10 days to process this tells me the algorithm is in fact doing 50 million passes through the file, and yes, that is bad. It's OK to call it what it is; we are just trying to help with the little info we have. All it's doing is counting names, not calculating implied volatility for an option chain.

To the OP: as others have stated, if the entire dataset can be contained in memory, start there, with one pass through the file. If it can't be, go through the file up to your memory limit, process that chunk, then go through the file again for the next chunk, and so on (one way to do this is sketched below).
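For the memory-limited case, here is one way the chunked idea above could look (just a sketch, not the OP's code: the file name "file.txt", the whitespace-separated format, and the chunk count are all assumptions). Each pass only counts the names whose hash falls into the current chunk, so memory stays at roughly 1/NUM_CHUNKS of the unique names, at the cost of NUM_CHUNKS passes over the file:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    const std::size_t NUM_CHUNKS = 4;   // tune to how much memory you have
    std::hash<std::string> hasher;

    for (std::size_t chunk = 0; chunk < NUM_CHUNKS; ++chunk)
    {
        std::unordered_map<std::string, long long> counts;
        std::ifstream f("file.txt");
        std::string name;
        while (f >> name)
        {
            // only count the names belonging to this chunk on this pass
            if (hasher(name) % NUM_CHUNKS == chunk)
                ++counts[name];
        }
        for (const auto& entry : counts)
            std::cout << entry.first << ' ' << entry.second << '\n';
    }
}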

I once had a colleague write some code that did some processing on a set of files... it took 45 minutes. When he showed us his work and told us the execution time, our jaws dropped. Looking at the code, he was doing I/O operations on the files over and over; simply changing it to a single pass through each file reduced the execution time to ~12 seconds. I'm fairly certain the I/O portion of your program is the bottleneck, not your CPU.
There's no point in emphasizing optimisations if you don't know where the program is spending its time. If 5% of the time is spent doing processing and 95% is spent doing I/O, compiling with some kind of optimisation or blaming the CPU isn't really going to help.


50 million passes through a large dataset sounds like there's room for improvement algorithmically, IMO. However, your point is exactly right. The bottleneck must be identified before wasting time on ANY optimizations.

Step 1 is to use a profiler (like gprof) to identify where the time is being spent.
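If setting up a profiler is a hassle, a crude manual split with std::chrono can at least show whether the time goes to I/O or to processing. This is only an illustrative sketch; the two commented placeholders stand in for the OP's actual reading and counting code:

#include <chrono>
#include <iostream>

int main()
{
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    // ... read the file / dataset here ...
    auto t1 = clock::now();
    // ... do the counting / processing here ...
    auto t2 = clock::now();

    std::cout << "I/O time:        "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
    std::cout << "Processing time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}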
It's kind of hard to determine where the speed is being lost, given the info in your post. Could you start out really simple, with just a fraction of the entire data set? That way you can tinker with the algorithm and see if there are some corners to cut, or some completely different way of calculating the stuff you have to calculate.

Is it possible to parallelize the algorithm? Could you share it?
OP is long gone, it seems.
Hopefully the OP is profiling their application!
In that case he should be back sometime next week, because he probably ended up using his original data set for the profiling.
Thanks all for your valuable replies...

Well, actually this is my problem... I'm just preprocessing my dataset, so I'm going to run the algorithm only once; that's why I wrote an algorithm that reads from the file over and over. That's also why, in my original question, I was asking if there is a high-speed server available on the Internet that I could use to store my file and run the algorithm. However, since that option isn't available, as you all said, I have to improve my algorithm (even if I'm going to use it only once) so that it reads from the file only once.

Aquaz, yes, I just have to count the number of occurrences of each name. My computer is now running another algorithm and might finish by tomorrow. I'll try your code tomorrow and see if I have enough memory or not...

Thanks again...
I'll try your code tomorrow and see if I have enough memory or not...

If you've been using neither std::map nor std::unordered_map, that's the first thing you should try...
It depends on the number of unique names you have, but this shouldn't really take much longer than an hour.
Why C++?