question about hashing

our you allow to use rand() when your using std::hash because im trying to make a simple hash function where it generate a 6 unique digit or is there other ways to do it without using rand()

jonnin (11410)

rand is to be avoided actually, use: http://www.cplusplus.com/reference/random/

<map> is effectively a hashed container that does the work for you. It may or may not be applicable to your code.

You can also extract various bytes of something like SHA algorithms to generate a data-driven key (you can %1000000 on 64 bits from the value ... etc) that can be reproduced across systems (unlike rand, and if not using c++ project wide, for example, it would let any language generate the same keys with the same algorithm).

Rand works too, its just no longer preferred, so its not 'wrong'.

Ganado (6802)

Using rand() (or <random>, for that matter) would not work, and is bad for a Hash data structure because it defeats the whole purpose of the same data always hashing to the same thing. You want hash_function("banana") to always hash as the same thing, and not be dynamic between calls.

The point of a hash data structure is to keep the internal array size relatively low while providing a quick, efficient way of deciding which index a particular object should be stored in. Then, after the efficient hash function is called, the equals (==) operator is called if there are any collisions after hashing.

<unordered_map> would actually be what you can use with C++ to use the standard Hash, which means amortized O(1) insertion and access, assuming few collisions. <map> is O(log(n)) insertion and access, so it's a different underlying process.
<map> uses a balanced tree. unordered_map uses std::hash as the default hash function in its implementation.

btw: std::hash/maps and cryptological hashing are two pretty different use cases. One is meant to be used as part of an efficient data structure , the other is for cryptography and secure certification. Of course, you could use a hash function like SHA256, and truncate like jonnin said, but this seems like overkill.

Last edited on

ghost1111 (81)

im working on a project where i dont know what to do anymore im trying to make a unique id that generates 6 number, when i used rand() it did work,but i dont know what to use, and the only thing i can use is hash and size_t and i know im trying to hash an interger,and sometimes it would be 00 in example output.txt and sometimes if i did another way i would get that same number over and over, but i want it to be different
input.txt
bob 4517898 102 tell rd

gill 4787456 103 hmm rd

output.txt

0 bob 4517898 102 tell rd

0 gill 4787456 103 hmm rd

struct customer

{

std::string name;

std::string phone;

std::string address;

std::size_t hash() const;

};

Last edited on

Ganado (6802)

You're asking for two different things. In your first sentence, you're asking for a unique id. In your other sentence(s?), you're asking for some sort of hash for the customer.

You could concatenate name+phone+address, and then hash that, if you want a hash.

#include <iostream>
#include <string>
//#include <unordered_map>
 
 
struct customer
{
    std::string name;
    std::string phone;
    std::string address;
 
    std::size_t hash() const;
};
 
std::size_t customer::hash() const
{
    static std::hash<std::string> hasher;
    return hasher(name + phone + address);
}
 
int main()
{
    customer bob { "Bob",   "555-555-1234", "5 Fake Lane" };
    customer jen { "Jenny", "555-867-5309", "7 Fake Lane" };
 
    std::cout << bob.hash() << std::endl;
    std::cout << jen.hash() << std::endl;
}

If you need the first 6 digits or whatever, do hash % 1000000.
However, note that modding by a prime number will give you less collisions, though. So mod by something like 999983 (or 1000003, if you require the upper bound).

return hasher(name + phone + address) % 999983;

Last edited on

jonnin (11410)

the reason I like to use sha for this is that similar inputs give vastly different outputs, making for nice clean hashing. Its overkill, but its also opensource code, so work involved is minimal. The idea is that if data looks like abc, abd, and aac, etc, you could not begin to guess what the hash key would be. A bad algorithm for those 3 data elements might give 5, 7, and 8 for the keys instead of 1003, 18, 597. It is desirable to have the bigger spread.

Another way to get the same result without adding more code is to abuse the built in stuff: both random libraries can be abused :)

What I mean is that both rand and random can be keyed off a seed and the same seed makes the same stream of random values. But that stream can be ONE value. in rand syntax...

srand(somethingfromyourdata);
key = rand();
//repeat the above for every piece of data, that is, constantly reset the stream. This will give the same answer for the same data every time (on the same compiler, etc). Its a pretty solid hash, actually. Random can do the same thing, but I am not yet able to write that off the top of my head.

this is a slight improvement on the above idea to key off the data, instead, I am saying to seed off the data and use the random key off that. The collisions will be reduced this way.

Resetting the random generator ... I am not sure how much time this takes if you had millions of items. Srand was plenty fast for this, and I expect its a non issue, but if you are doing big data consider the hash algorithm's performance before you commit to something.

don't get fancy turning strings to numbers. *Carefully* take an int64 pointer to some offset in the string, its a number, and a fine seed to the random library.

Last edited on

ghost1111 (81)

@Ganado i dont want my number to be the same for example sorry for not explaining
when it print it should be
123457 Bob 555-555-1234 5 Fake Lane
456779 Jenny 555-867-5309 7 Fake Lane
instead it print like this
123457 Bob 555-555-1234 5 Fake Lane
123457 Jenny 555-867-5309 7 Fake Lane

@jonnin
what the int64 pointer?could you show me example

Ganado (6802)

@jonnin
Oh, I see what you mean, by constantly resetting the seed. Interesting, although seems to put another layer of complexity.

@ghost1111
I'm at a loss then, because they hash to different things for me. Maybe your compiler's implementation of std::hash is different. My code returns 545146 and 342385, respectively, but those numbers are implementation-defined (different on different compilers/versions).

std::size_t customer::hash() const
{
    static std::hash<std::string> hasher;
    return hasher(name + phone + address) % 999983;
}
 
int main()
{
    customer bob { "Bob",   "555-555-1234", "5 Fake Lane" };
    customer jen { "Jenny", "555-867-5309", "7 Fake Lane" };
 
    std::cout << bob.hash() << std::endl; // returns 545146
    std::cout << jen.hash() << std::endl; // returns 342385
}

Why don't you post your updated source code so that we can actually see what you're doing, and not be guessing?

jonnin (11410)

something like this:

unsigned long long * pull; //this may or may not be 64 bits on your machine, but lets say it is.
string blah = "stringdatathathastobeatleast64bitslong";
const char * cp = blah.c_str(); //not sure, would &string[3] work here?
pull = (unsigned long long *)(&cp[3]);
pull[0];//this is now a legit integer stolen from the bytes of your string. it won't work if the string is too short (less than 64 bits long) so you will need to correct for that if you use this approach. This is the very fast and very lazy way to generate a number from a string for hashing type purposes. Be sure to pick parts of the string data that actually change a lot for different records.

all you are doing is brute force shoving the characters (bytes) into the integer (bytes). All legal combinations of bytes are legal integers, so that works well enough. Then you can feed this integer as a seed to a random function of some sort to get a hash key.

the question is... do you want the same number for the same data or not?
if you have
bob 123 address
and
bob 123 address
do you want to see

854367 bob 123 address
97234 bob 123 address
OR
854367 bob 123 address
854367 bob 123 address
854367 bob 123 address //same every time??

Last edited on

ghost1111 (81)

@ganado
i get what i did wrong
i made an object outta customer and call the phone etc...

1
2

customer s;

an i shouldnt do that

Topic archived. No new replies allowed.

C++

Forum

question about hashing