Good Hashing

I had created a few encryption algorithms (whether they are strong or weak compared to the pro algorithms I havn't a clue), and I've come accross this problem before (without knowing just what the technical name for it was until now).

I want to know: Just what do you think is the best way to hash a string?

Sorry for the lack of mathmatic symbols... here is the one I read in a book:

E^{key_size - 1} KEY[key_size - i - 1] * 37ⁱ .
^{I = 0}

Any Suggestions, or is this good?

Please note, that the string is the key to some arbitrary piece of data.

Last edited on

Catfish4 (666)

Perhaps you'd want to implement a "real" hash function:
http://en.wikipedia.org/wiki/Cryptographic_hash_function

I'm hoping for someone with math know-how to come along and say if your hash function thing, that you posted above, is good or bad. Personally, I would take the clumsy approach of trying to plot it somehow.

That is, generate lots and lots of strings, hash them, then use the results as 2D or 3D coordinates. This thread might help in that direction:

http://www.cplusplus.com/forum/lounge/112320/

IWishIKnew (1364)

Well, it's not my hash algorithm, it's one from a text book. I can remember strugging direly for a solution for collisions, without mich success. I eventually turned to converting the string directly into a number (instead of 0-9, think 0-256). It had flaws, of course.

It is hard to have a perfect hash function, and since I'm obviously not qualified to judge your statement, I must ask yu to elaborate on why it is clumsy.

Also, I'm in college pre-algebra, but I use my critical thinking skills and learn fast.

Side note: this isn't for cryptography, If I didn't make that clear, my apologies. I'm just looking to see what people use, and maney discover somthin I might like.

Last edited on

IWishIKnew (1364)

No one?

Catfish4 (666)

Perhaps you should move this topic to the Lounge, as it's not that much related to C++ in the first place, and people such as helios might reply.

I must ask yu to elaborate on why it is clumsy.

Well to begin, how do you transform the hash into meaningful 2D or 3D coordinates, without losing information?

Plotting a hash seems to me like a stupid idea, even if I was the one who suggested it in the first place. There has got to be a more elegant, if math heavy method to determine the probability of collisions, than plotting.

Side note: this isn't for cryptography, If I didn't make that clear, my apologies. I'm just looking to see what people use, and maney discover somthin I might like.

If you want "good hashing" with low probability of collisions you should use tried-and-true algorithms (see Wiki again) instead of inventing your own.

Yamakuzure (2)

As I understand it, you want a hashing function to use a string as a key in a hash table?

The one I am using works as follows:
- Loop through the string and take each character as an unsigned 8bit integer (aka uint8_t).
- binary OR each in reverse on a uint32_t shifted by position % 4 * 8 bit to the left.
- Every four characters take the part and add it to the sum, determined by the position % 3.
0: sum += part >> 2
1: sum ^= part << 4
2: sum |= part >> 1
- If any part is left when the string is finished (last part has less than 4 characters), add it to the sum shifted right by 4 bit.

The result is a nice uint32_t hash.

The generation of 100,000,000 random C-Strings resulted in the following:

   ------------+-------------+----------+-------------+----------+------- 
   Type        | Unique rand |    Quota | Unique Hash |    Quota | Result 
   ------------+-------------+----------+-------------+----------+------- 
   C-String    | 100,000,000 | 100.00 % |  98,723,200 |  98.72 % | Passed 
   ------------+-------------+----------+-------------+----------+-------

When used in a chained hash table and an open hash table with 50,000 keys, each, the maximum number of entries in any chain was 3, and the maximum number of hops to find an empty place was 4. So clustering and secondary clustering seem to be alright.
But I have to admit that I limit open hash tables to a maximum load factor of 0.8, and my implementation uses a "Robin Hood insertion" approach.

Btw: I tried the "plotting" to find heavy clustering, but recording the maximum chain length / hops in a textfile is far more convenient.

Edit: I tried some "tried-and-true" algorithms, but for hashing strings they all went rather bad. However, if you want to hash integer numbers, I'd recommend the methods described by Thomas Wang and Robert Jenkins. Unfortunately Thomas's site is down (http://www.concentric.net/~ttwang/tech/inthash.htm) but Roberts is still there. (http://burtleburtle.net/bob/hash/integer.html)

A good read would be http://burtleburtle.net/bob/hash/ btw.

Last edited on

IWishIKnew (1364)

you know, I tried an interesting approach to hashing strings. I needed a good way to encrypt strings, so I came up with an interesting solution:

take a 2d array, fill each with chars 0-255, and shuffle one of them based on a string. The resulting 2D array could then be referenced both ways to encrypt and decrypt. I used it with much success!

number of unique encryption 'keys' (or 2D arrays) that can be generated: 255!. If you do that math, you will fill the screen of your calculator, lol. Ask me if you would like to see the code. I would also like to know if this is used in hashing at all. I did come up with it on my own, so I have no clue if anyone else has thought of it.

Also, thanks for the replies so far. I appreciate it.

Last edited on

naraku9333 (2163)

I'm not sure I understand what you are describing, but it doesn't sound like a hashing technique. I would be interested in seeing the code.

255!

Do you mean 255 factorial? If so, how did you come up with that number?

EDIT: Also, I believe std::unordered_map and std::unordered_set are implemented as hash maps. So if you're looking for an efficient hashing container you should probably use one of those, unless this is a learning exercise.

Last edited on

helios (17607)

take a 2d array, fill each with chars 0-255, and shuffle one of them based on a string. The resulting 2D array could then be referenced both ways to encrypt and decrypt. I used it with much success!

This is known as a substitution cipher. It's the easiest cipher to break because the ciphertext maintains all the statistical properties of the plaintext.

number of unique encryption 'keys' (or 2D arrays) that can be generated: 255!.

256!. Or are you not shuffling the zero, too?

I would also like to know if this is used in hashing at all.

Nope.

I did come up with it on my own, so I have no clue if anyone else has thought of it.

See http://en.wikipedia.org/wiki/Caesar_cipher for a dumber variant.

When I want a simple string hash, I tend to go with this:

unsigned multiplication_hash(const char *s){
    const unsigned factor = 0xDEADBEEF;
    unsigned ret = 0x8BADF00D;
    for (; *s; s++){
        ret *= factor;
        ret ^= *s;
    }
    return ret;
}

It's extremely fast and works generally well. It has a collision ratio of 0.23%.

Catfish4 (666)

It has a collision ratio of 0.23%.

Not to derail the thread, but how exactly do you calculate that?

helios (17607)

You generate a bunch of strings and divide collisions by total.

Catfish4 (666)

Ha ha, I was hoping for some statistics probability magic formula thing.

Yamakuzure (2)

When I want a simple string hash, I tend to go with this:
(...)
It's extremely fast and works generally well. It has a collision ratio of 0.23%

Just tried this, but changed to use explicit uint32_t.

 $ ./hashbuilder hashbuilder_out 100000000 char
Building hash lists in "hashbuilder_out".
Writing 100000000 values into "hashbuilder_out/hash_string.csv" : (1167526 ms) done
 $ cd hashbuilder_out && ./genLists.sh
hash_string.csv - sort - hash done
 100000000 hash_string.csv
  98844108 hash_string.csv_hash
 100000000 hash_string.csv_num
 298844108 total
 $ echo "(100000000-98844108)/1000000" | bc -l
1.15589200000000000000

So with 100M random strings the collision ratio is ~1.16%
Which, with all due respect, is simply great!

Note:
hash_string.csv is the file containing 100M lines of the form <string>;<resulting hash> ( * )

hash_string.csv_num is a sorted file with double entries eleminated. Having the same line count means, that my random string generator generates 100M unique strings in 100M iterations.

hash_string.csv_hash is a sorted file of only the hash values with double entries eliminated.
/Note)

( * ) The result looks like this:

 $ head hash_string.csv
bLknDZqLwpAYmuzU;1175423451
FXzyIvkERzVdYwzC;2093137218
jKNyFyxiqGZWXhtr;3962238790
ILZphdXblCTNtWMM;2006186418
djJFdPArAZrOBCTF;2694416189
PueuvnbevOTyDAnB;2208001120
ACQTgSrAbiNxOwmV;2826307077
OnWKhrtfbGdmEdJf;4128667791
TAQoSuBOwEIKmyAm;3087550481
GkWyMbHPVaZNXlQh;3118438482

Last edited on

IWishIKnew (1364)

@helios:

Yes it is 256 factorial. I'm curious, then, how is a cipher like mine so easily broken?? Every single element in the array is swapped with another using every character in the string given (resulting in a thoroughly shuffled array). I'm reluctant to post the code though... I might do some tests to see what kinds of collisions I get.

Theoretically, though, there should be absolutly no identical 'keys' created, since the order and content of a string is taken into account... unless I missed somthing.

so, to mabey demonstrate:

abcdefg...
abcdefg...

would turn into somthing like this:

abcdefg...
iskqocy...

of course, if your password consisted of all the same letters (i.e. "aaa") it would result in a shifted 'Ceasar Cipher'.

Last edited on

helios (17607)

I'm curious, then, how is a cipher like mine so easily broken?

Suppose you know that the plaintext is in English. The sequence "the" occurs in English text with a known frequency. If you can find a three character sequence that occurs with a similar frequency in the ciphertext, you've broken part of the key. Once a message is partly decrypted, decrypting the rest is usually just a matter of time.

IWishIKnew (1364)

That makes sense.

Topic archived. No new replies allowed.