Hi, I've written a rudimentary Linear probing hash table and I was wondering if you could suggest any optimizations for my search function.
Essentially, my function takes a key and hashes it, then searches through my table from that index up to the maximum expected number of records. If it hits that number, it searches from the 0th element back up to the index (my idea here was to make the search circular in the array). However, the program is super slow when dealing with a large number of records, so I was wondering if you could suggest any optimizations:
template <class TYPE>
int LPTable<TYPE>::search(const string& key){
    // creating index from the key given
    int indexie = hashedIndex(key);
    // comparing key to the record's key at position indexie
    if (records_[indexie] == nullptr && size_ < maxExpected_){
        return indexie;
    }
    else if (records_[indexie]->key_ == key){
        return indexie;
    }
    else{
        int beginningToIndexie = indexie;
        bool indexToMaxExpectedFull = false;
        if (indexie <= maxExpected_){
            indexie++;
        }
        // looking for a null element in records_
        while (records_[indexie] != nullptr){
            // first checks from indexie to maxExpected_;
            // if no null elements found & key not found,
            // search from 0 to indexie
            if (records_[indexie]->key_ == key){
                // found the key in the table
                return indexie;
            }
            else if ((!indexToMaxExpectedFull && indexie <= maxExpected_) ||
                     (indexToMaxExpectedFull && indexie <= beginningToIndexie)){
                indexie++;
            }
            else{
                // resets indexie to 0 and searches
                // from 0 to indexie to prevent overlapping
                // in the search
                if (indexToMaxExpectedFull){
                    // table is full, return -1 (not found)
                    return -1;
                }
                indexie = 0;
                indexToMaxExpectedFull = true;
            }
        }
        // found a null element in the table, can insert new data here!
        return indexie;
    }
}
(Didn’t read code) I’m not sure I understand the problem.
Another way of viewing a hash table is a bucket table. That is, when you hash a key it reduces that key to the index of a bucket. Put your object in that bucket.
Here’s the catch: your hash + table size must accommodate more buckets than necessary. That is, your table should never be more than 75 to 80 percent full. If it begins to overfill, you need to resize the table and rehash everything in it. (This is the ugly non-O(1) part of hash tables.)
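To illustrate that resize step, here is a minimal sketch of a linear-probing insert that grows and rehashes once the table passes roughly 75% full. The class and member names (ProbeTable, slots_, size_) are made up for the example, not taken from the code above:

#include <functional>
#include <string>
#include <vector>

// Sketch of a load-factor check plus rehash for a linear-probing table
// of string keys. Names here are illustrative only.
class ProbeTable {
public:
    explicit ProbeTable(std::size_t capacity = 16) : slots_(capacity) {}

    void insert(const std::string& key) {
        // keep the table no more than ~75% full; grow before inserting
        if (size_ + 1 > slots_.size() * 3 / 4)
            rehash(slots_.size() * 2 + 1);
        insertNoGrow(key);
    }

private:
    struct Slot { bool used = false; std::string key; };

    void insertNoGrow(const std::string& key) {
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        while (slots_[i].used) {
            if (slots_[i].key == key) return;   // already present
            i = (i + 1) % slots_.size();        // circular probe
        }
        slots_[i] = { true, key };
        ++size_;
    }

    // Allocate a bigger array and re-insert every existing key, because
    // each key's home index changes when the capacity changes.
    void rehash(std::size_t newCapacity) {
        std::vector<Slot> old = std::move(slots_);
        slots_.assign(newCapacity, Slot{});
        size_ = 0;
        for (const Slot& s : old)
            if (s.used) insertNoGrow(s.key);
    }

    std::vector<Slot> slots_;
    std::size_t size_ = 0;
};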
For some datasets (particularly when you are dealing with large, sparse ones) your table’s buckets are themselves hash tables — a nested hash table, if you will. You only need to have a separate key-hash function appropriate for each level.
For large data it is worth your time to send just the keys through the hash algorithm and observe the number of collisions (the distribution), so you can tweak it until that is minimal. Efficient hashing requires only three things: unique keys for the data, more storage space than records by a wide margin, and a hash algorithm that gives a uniform distribution even when the keys are very similar (one or two bits different across the entire key).
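If it helps, here is a rough, self-contained harness for that kind of check: it pushes a set of keys through a hash (djb2 is used purely as a stand-in for whatever function you are actually testing) and reports the collision count and the most crowded bucket:

#include <iostream>
#include <string>
#include <vector>

// Placeholder hash for testing distribution; swap in your own.
std::size_t hashOf(const std::string& key, std::size_t tableSize) {
    std::size_t h = 5381;                        // djb2, just as an example
    for (unsigned char c : key) h = h * 33 + c;
    return h % tableSize;
}

int main() {
    const std::size_t tableSize = 1024;
    std::vector<std::string> keys = { "key1", "key2", "key3" };  // load your real keys here

    std::vector<std::size_t> bucketCount(tableSize, 0);
    for (const std::string& k : keys)
        ++bucketCount[hashOf(k, tableSize)];

    // Any bucket hit more than once is a collision; a good hash keeps the
    // maximum count low and spreads hits across the whole table.
    std::size_t collisions = 0, worst = 0;
    for (std::size_t c : bucketCount) {
        if (c > 1) collisions += c - 1;
        if (c > worst) worst = c;
    }
    std::cout << "collisions: " << collisions
              << ", worst bucket: " << worst << '\n';
}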
Linear probing specifically needs a huge table. I would argue it should never be more than 1/3 full. Other approaches (a linked list under each bucket, or a second hash table under each bucket in a 2-D structure) can shrink the space needed by spending more time per record to store and fetch it. But probing has a compounding set of problems. First, if you have a collision, it fills another slot, which causes another collision, which fills another slot, which... it chains, so unless every other slot is empty (or better), it chains badly and quickly degenerates into just a linear list. Second, linear searching is poor to begin with: the worst case you would normally accept for a search is binary search, lg(n), and a long probe run is worse than that. Third, if your hash algorithm does not give a uniform distribution, it will make linear clusters in one area of the table and leave all the empty space in another area, and it will start doing that right away, after only a few keys are entered.
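For comparison, the "linked list under each bucket" alternative (separate chaining) looks roughly like this sketch; the names are invented for the example, and the point is that a collision just lengthens one bucket's list instead of spilling into neighbouring slots:

#include <functional>
#include <list>
#include <string>
#include <vector>

// Sketch of separate chaining: each bucket holds a small list of keys.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t buckets) : buckets_(buckets) {}

    void insert(const std::string& key) {
        buckets_[indexOf(key)].push_back(key);
    }

    bool contains(const std::string& key) const {
        for (const std::string& k : buckets_[indexOf(key)])
            if (k == key) return true;
        return false;
    }

private:
    std::size_t indexOf(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::string>> buckets_;
};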
This echoes the above: a bigger table will solve it if your hash algorithm is OK. Or don't use linear probing; use a better approach for big data.
Increasing the table size improved speed pretty dramatically, so thank you. I also made some tweaks to better handle easy cases, such as an insert attempted when the table is no longer accepting records (maxExpected_ has been reached), and that also helped its speed.
I would seriously consider making it allocate maxExpected_ * X slots (you can search around for a good X; 1.25 to 3 is the usual range, and bigger will do better, but there is a happy medium).
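Something along these lines, as a rough sketch; the function name and the default factor of 2.0 are just placeholders to tune:

#include <cstddef>

// Illustrative only: pick the table capacity from the expected record
// count and an over-allocation factor (1.25 to 3 per the range suggested
// above; 2.0 is just a starting point).
std::size_t tableCapacity(std::size_t maxExpected, double factor = 2.0) {
    return static_cast<std::size_t>(maxExpected * factor) + 1;
}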