>TheDestoyer
@kbw
The guy didn't share any code, and that's why the only suggestion we can provide is optimisations. You can't judge that the guy has written bad algorithms just because his program takes a long time to process. In my master's thesis, I had to run a program that took 2 days to finish (it involved discretising continuum objects to a 3D grid of a 1 mm^3 sample in nm resolution). Does this make me a bad programmer or a bad algorithm writer? It's a very subjective thing and you have to give advice with what you have. The guy asked whether he could improve the execution time of the program, and the answer is with optimisations, not with accusing him of being a bad algorithm writer.
(Comparing reference counting and the space time continuum is not comparing apples and oranges)
The OP stated it does 50 million passes through the dataset. If that is actually 50 million passes through I/O, then yes, it is a terrible algorithm; if it's 50 million passes through memory, that's a different story. Assuming a name is 30 characters long (might be more, might be less), this is only ~2.49 Gigs. Taking 10 days to process that tells me the algorithm is in fact doing 50 million passes through the file, and yes, that is bad. It's OK to call it what it is. We are just trying to help with the little info we have, and all it's doing is counting names, not calculating implied volatility for an option chain.
To OP: as others have stated, if the entire dataset can be contained in memory, start there, with one pass through the file. If it can't be, read as much of the file as fits in memory (your max size), process that, then read the next chunk up to your max size, and so on until you reach the end of the file.
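Since the OP didn't post code, here is only a rough sketch of what "one pass, in chunks" could look like in Python. Everything in it is an assumption on my part: the file is taken to be plain text with one name per line, and `count_names` and `batch_lines` are made-up names for illustration. It reads the file exactly once, pulling in a bounded number of lines at a time, and keeps the running counts in memory.

```python
from collections import Counter
from itertools import islice


def count_names(path, batch_lines=1_000_000):
    """Count how often each name occurs, reading the file exactly once.

    Assumes (hypothetically) one name per line; blank lines are skipped.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        while True:
            # Pull at most batch_lines lines into memory at a time.
            batch = list(islice(f, batch_lines))
            if not batch:
                break  # end of file reached
            # Update the running totals from this batch only.
            counts.update(line.strip() for line in batch if line.strip())
    return counts
```

Calling `count_names("names.txt").most_common(10)` would then give the ten most frequent names. Note this still needs the set of distinct names to fit in memory, even if the file itself does not.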
I once had a colleague write some code that did some processing on a set of files... it took 45 minutes. When he showed us his work and told me the execution time, our jaws dropped. Looking at the code, he was doing I/O operations on the files over and over; simply changing it to one pass through each file reduced the execution time to ~12 seconds. I'm certain the I/O portion of your program is the bottleneck, not your CPU, so take a closer look there.