Collision free hash

Feb 28, 2013 at 3:41am
closed account (D4S8vCM9)
Hi,

I have following method to calculate a 16bit hash on a per definition unique string:

1
2
3
uint16_t IReferenceHolder::getHash(const std::string &ref) {
	return std::inner_product(ref.begin(), ref.end(), ref.begin(), (uint16_t)0) % 65437;
}


Actually it calculates an inner product on each character adding it to the last result (see http://www.cplusplus.com/reference/numeric/inner_product/).

65437 is the nearest prime to 2^16. In fact I never need the full range of 16 bit, therefore 65437 different hashes are enough.

Empirically it seems - on unique strings - to be collision free. But if not, can somebody construct me an example where not.

NOTE the strings have an maximum size of 255.
Last edited on Feb 28, 2013 at 3:41am
Feb 28, 2013 at 3:45am
There is no such thing as a collision free hash.

This is the classic problem of trying to fit too many things into a fixed number of slots. You cannot represent every possible string with just a single 16-bit integer.

The absolute best case scenario is 2^16 unique strings before you have a collision.

To reproduce a collision, just hash (2^16)+1 strings. At least 2 of them will collide (though you're likely to have a collision much sooner):

1
2
3
4
5
6
for(int i = 0; i < 0x10001; ++i)
{
   stringstream ss;
   ss << i;
   hash( ss.str() );
}




EDIT:

Also, given the simplicity of your algorithm, I can tell that "ab" and "ba" will collide. So there's a test case for you.
Last edited on Feb 28, 2013 at 5:23am
Feb 28, 2013 at 6:21am
closed account (D4S8vCM9)
Ups, you are true :-[ 2 Byte string already start to collide.
You're also true with the fact "There is no such thing as a collision free hash.", I was just too lax in describing my problem, wanted to say "almost" collision free.

The absolute best case scenario is 2^16 unique strings before you have a collision.

This condition is ensured. But:

In your example the pitfall is the commutativity of the plus and multiply operator. That pitfall I encountered in my first try, but I openly admit, I didn't used that small strings to test.

But minus falls out because it can result in negative numbers and division can result in real numbers.

Any ideas how I can get at least "less" collisions?
Feb 28, 2013 at 6:46am
closed account (D4S8vCM9)
Took my Segdewick book again and found another simple, but better algorithm:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
unsigned int RSHash(const std::string& str)
{
   unsigned int b    = 378551;
   unsigned int a    = 63689;
   unsigned int hash = 0;

   for(std::size_t i = 0; i < str.length(); i++)
   {
      hash = hash * a + str[i];
      a    = a * b;
   }

   return hash;
}


Run for AB and BA:

#1: hash = 0 * 63689 + 65 = 65
a = 24109534639

#2: hash = 65 * 24109534639 + 66 = 1567119751601
a = 24109534639

#1: hash = 0 * 63689 + 66 = 66
a = 24109534639

#2: hash = 66 * 24109534639 + 65 = 1591229286239
a = 24109534639


Just should take internally 64 bit in order to avoid integer overflows, but else it seems more stable but yet simple.


EDIT: Implemented it as following:

Declaration in ireferenceholder.h:
1
2
3
4
5
6
typedef struct hashCalculator : std::binary_function<uint64_t, uint64_t, uint16_t> {
	inline hashCalculator() : a(63689u), ab(63689u * 378551u) {}
	uint64_t operator()(uint64_t, uint16_t);
	uint64_t a;
	const uint64_t ab;
} HASHFUNC;

Implementation in ireferenceholder.cpp:
1
2
3
4
5
6
7
8
9
uint64_t IReferenceHolder::hashCalculator::operator()(uint64_t hash, uint16_t chr) {
	uint64_t h = hash * a + chr;
	a = ab;
	return h;
}

uint16_t IReferenceHolder::getHash(const std::string &ref) {
	return std::accumulate(ref.begin(), ref.end(), (uint64_t)0, HASHFUNC())  % 65437u;
}
Last edited on Feb 28, 2013 at 8:03am
Topic archived. No new replies allowed.