@mbozzi
Thanks for your detailed and informative reply :+)
Yes -- the hash function is the hotspot that @Cubbi was mentioning.
A while ago, I had a discussion with JLBorges about std::vector vs std::unordered_map. IIRC we had 1 million random ints. It turned out quicker with the vector to insert and sort them, as opposed to just insertion into the unordered map. So I learnt that the hash function takes a relatively long time.
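For anyone curious, here is roughly the kind of comparison we ran - just a sketch from memory, not the original code, and the container choices and sizes are mine (I am using std::unordered_set here since the data was plain ints):

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <unordered_set>
#include <vector>

int main()
{
    constexpr std::size_t N = 1'000'000;

    // Generate 1 million random ints up front so both containers see the same data.
    std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> dist;
    std::vector<int> data(N);
    for (auto& x : data) x = dist(gen);

    using Clock = std::chrono::steady_clock;

    // Vector: push_back everything, then sort once at the end.
    auto t0 = Clock::now();
    std::vector<int> v;
    v.reserve(N);
    for (int x : data) v.push_back(x);
    std::sort(v.begin(), v.end());
    auto t1 = Clock::now();

    // Unordered set: every insertion pays for hashing (plus occasional rehashes).
    std::unordered_set<int> s;
    for (int x : data) s.insert(x);
    auto t2 = Clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "vector + sort : " << ms(t1 - t0).count() << " ms\n"
              << "unordered_set : " << ms(t2 - t1).count() << " ms\n";
}
```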
What do you mean by "incremental resizing"?
I missed a link for what I was reading:
https://en.wikipedia.org/wiki/Hash_table
wiki wrote:
Incremental resizing
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:
During the resize, allocate the new hash table, but keep the old table unchanged.
In each lookup or delete operation, check both tables.
Perform insertion operations only in the new table.
At each insertion also move r elements from the old table to the new table.
When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.
Disk-based hash tables almost always use some scheme of incremental resizing, since the cost of rebuilding the entire table on disk would be too high.
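To check my understanding of that scheme, here is a toy sketch of it - all the names are mine, it is built on top of std::unordered_map purely for brevity, and it glosses over plenty of real-world details (the erase path, when to trigger the resize, etc.):

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <utility>

// Toy illustration of the incremental-resizing scheme quoted above.
template <typename K, typename V>
class IncrementalTable
{
    std::unordered_map<K, V> old_table;   // the table being drained
    std::unordered_map<K, V> new_table;   // the table being filled
    std::size_t r = 4;                    // elements migrated per insertion

public:
    // Begin a resize: take over the old table unchanged and pre-size the new one.
    IncrementalTable(std::unordered_map<K, V> existing, std::size_t expected_size)
        : old_table(std::move(existing))
    {
        new_table.reserve(expected_size);
    }

    // Insertions go only to the new table, and each one also moves
    // up to r elements across from the old table.
    void insert(const K& key, const V& value)
    {
        new_table.insert_or_assign(key, value);
        for (std::size_t i = 0; i < r && !old_table.empty(); ++i)
            new_table.insert(old_table.extract(old_table.begin()));
    }

    // Lookups have to check both tables until the old one is empty.
    std::optional<V> find(const K& key) const
    {
        if (auto it = new_table.find(key); it != new_table.end()) return it->second;
        if (auto it = old_table.find(key); it != old_table.end()) return it->second;
        return std::nullopt;
    }
};
```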
Having read that, I had an idea of my own ... But a huge disclaimer: there are a lot of very smart and experienced people out there - I am not one of them :+D
Say we have 1 million items in a hash table, then on, say, 5 separate occasions we insert a further 1 million each time, giving a total of 6 million at the end. Also, we have no way to predict the final size.
So my proposal is to chain the tables together - have a vector of hash tables: each time we have bulk insertions (or the size of a table is exceeded), create and append a new hash table, and append the relevant hash function to a vector. The downside is that in order to find an item, one might need to apply up to 6 hash functions on as many tables. But this might be better than having to rehash 6 + 5 + 4 + 3 + 2 = 20 million items across the resizes, versus inserting 6 million once with the chained tables.
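To make the idea a bit more concrete, something along these lines is what I have in mind (rough, untested, all names are mine, and each table just uses its own std::hash rather than a separately stored hash function):

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Rough sketch of the "vector of hash tables" idea: instead of rehashing
// existing items on growth, start a fresh table for each bulk insertion.
template <typename K, typename V>
class ChainedTables
{
    std::vector<std::unordered_map<K, V>> tables;

public:
    // Call this before each bulk insertion: new items land in a fresh table,
    // so the items already stored never need to be rehashed.
    void start_new_table(std::size_t expected_size)
    {
        tables.emplace_back();
        tables.back().reserve(expected_size);
    }

    // Note: re-inserting an existing key only updates it in the newest table;
    // any older copy is shadowed by the newest-first lookup, not erased.
    void insert(const K& key, const V& value)
    {
        if (tables.empty()) start_new_table(0);
        tables.back().insert_or_assign(key, value);
    }

    // The downside: a lookup may have to probe every table in the chain,
    // newest first, so the cost grows with the number of bulk insertions.
    std::optional<V> find(const K& key) const
    {
        for (auto it = tables.rbegin(); it != tables.rend(); ++it)
            if (auto hit = it->find(key); hit != it->end())
                return hit->second;
        return std::nullopt;
    }
};
```

So an item is only ever hashed into one table, but every unsuccessful lookup pays for probing the whole chain.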
This approach would have its limitations, and like everything would have to be tested and compared. Just wondering whether that idea has any merit - or would it be a lead balloon compared to the really smart methods already implemented?
Thanks for the articles - I will have to ruminate on those for a while :+)
So you're certainly onto something.
Well, not from any particular knowledge or inventiveness on my part :+)
JLBorges recently posted a quote from Stroustrup about how only the real experts know what happens in detail, especially with regard to cache effects.
Cubbi mentioned the concurrency - I guessed it was bound to offer some improvement.
With move semantics, I just remembered that they have a big effect on other containers.
With this whole Topic, it seems to me that it is not quite enough to look at the pros and cons of std containers in terms of big O notation. With all these other factors going on, one really has to time and compare the various options. I guess this goes back to Cubbi's comment about why engineers care about the actual number of steps involved. I am sure JLBorges has mentioned the importance of timing over theory before.
I guess the clever part is trying to design something that will turn out to be better in particular situations.
Anyway, I look forward to any replies and hope all is well at your end - Regards :+)