Hi, I am trying to learn about hash tables, but everything is watch or read is either just saying what they are, or its really advanced and i dont understand a word that is being said. So hoping someone might be able to clarify some questions i have regarding them.
One video i watched said the keys have to be unique, so I am wondering how this would work with something like a phonebook (storing, name, number, street etc), if the persons name is the key. There would obviously be several people with the same name?
Can you only access the information that is stored using the key. Again say it is being used for a phonebook, but instead of a name, i want to do a search based on a street name, is that possible?
One video i watched said the keys have to be unique, so I am wondering how this would work with something like a phonebook (storing, name, number, street etc), if the persons name is the key. There would obviously be several people with the same name?
Keys don't have to be unique, but the most common application of hash tables is implementing a dictionary (a.k.a. map, or associative array) data structure, and for those you don't want duplicate keys. You can implement a hash table with duplicate keys, it's just not going to be a dictionary.
Can you only access the information that is stored using the key. Again say it is being used for a phonebook, but instead of a name, i want to do a search based on a street name, is that possible?
The only way to quickly access the information is to use the key. You can traverse the entire table looking for the information you want, but that's no faster than an unsorted vector.
Whether a hash table is suitable for your problem depends on what's the most common lookup. For a phonebook, I think a hash table is probably a poor fit. A hash table works best when you know the exact value you want to look for. With a phonebook, the most common situation is that the user doesn't know the exact name of who they're looking for, or they know some other detail of that person. Hash tables do not work well for approximate searches, at least not without carefully designing the hashing function.
Also, phonebooks are usually fairly small. A simple linear search will be much, much simpler, and it will probably be about as fast.
you can mix and match data structures.
a 2-d hash table where the first dimension is the key access, and the second dimension is just a sorted (or not) vector, for example.
then say you had 3 people named john smith at different addresses. they would all hash to the same location in the first dimension, and you would then have a vector of all the john smiths, which you can search with iteration. Searching 3 names vs the 2 million in your city is still screaming right along: the most redundant name in your city is still going to only have a relatively small bucket to deal with.
this is hard for new students to grasp, so I will say it flat out: there does NOT EXIST any data structure that can be searched efficiently on 2 different keys. So no, you cannot search on address efficiently if you stored it by name, and you can't search by name if you stored it by address. Just trust me on that for now :) If the 2 things are correlated, you may be able to do it.
It is perfectly acceptable to have a second data structure that re-indexes the data so you can search it on another key. If you do this with pointers, it does not cost much additional space even for fat objects to do this.
approximate (typo tolerant) searching is a very complex problem usually. This is difficult to do in hashes because a hash algorithm that is nicely scattered for the data won't cluster similar data together, and that is desired in hashing, but to be fault tolerant, you need them to cluster, which is bad for hashing.... you can't really get both cleanly for nonstatic data.
Can you only access the information that is stored using the key
pretty much. That could be a group of records, though: see my list of john smiths.
if you store by address, that becomes the key, and you can do that, but then you can't find by name. The only answer to this is to store 2 different hash tables or some other redundant approach, again, unless the data has some feature that lets you work some magic. The same is true if you had them in a sorted array and did binary search: sorted by name is random by address.
you may be overthinking this.
all a hash table is, is an array where you can compute the index where a piece of data is stored.
you have an array:
person pplz[10];
you put john smith at location 3.
later, you look for john smith: hash("john smith") returns a 3.
you go to pplz[3] and there he is. That is all there is to the core concept. the rest of it is mechanics to make this work and strategies to handle what happens when mary jane is also assigned 3 and so on.
Ok so in that case I would not be able to have the persons name as the key, since it would be a duplicate? So how would i find a person through a key without being slow and having to go through the entire list.
A phonebook was just an example, I am not trying to implement this, just trying get my head around how it works. Say the problem was something like a school system, that has all the students listed. And i wanted to be able to search for students by name, class, grade etc etc. Then a hash table would not be good for that, because of the other search functions?
What type of data structures would something like this need then? I find a lot of website and apps allows you to search or sort things by certain criteria, so there must be a data structure that allows the ability?
EDIT:Somehow missed Jonnin message before replying, reading it now
you can mix and match data structures.
a 2-d hash table where the first dimension is the key access, and the second dimension is just a sorted (or not) vector, for example.
then say you had 3 people named john smith at different addresses. they would all hash to the same location in the first dimension, and you would then have a vector of all the john smiths, which you can search with iteration. Searching 3 names vs the 2 million in your city is still screaming right along: the most redundant name in your city is still going to only have a relatively small bucket to deal with.
Sorry If i am misunderstanding. But would john smith be the key? If so this would be the one type that helios mentioned that does not need unique keys? (is there a name for this type of hash table so i can look into it more)? Also is this similar to solving collisions (Do not know much about it but seen a video where they were using a linked list to solve keys that were in the same location?
It is perfectly acceptable to have a second data structure that re-indexes the data so you can search it on another key. If you do this with pointers, it does not cost much additional space even for fat objects to do this.
Do you mean basically just have the 2 data structures with the same information, but different keys. So say the first hash table keys are the persons name, the second hash table is the address of the person?
Is it best to do the two data structures each with their own search function, or have the one hash table search for the data via other information (ie - address) and just have it be slow. I assume having the two data structures would be best?
Sorry If i am asking things that has already been asked, sometimes it takes me a while to grasp concepts, explains why i have been learning programming for 5 years and still dont know data structures.
Say the problem was something like a school system, that has all the students listed.
Normally, such a system would be implemented using a regular old relational database, not an in-memory data structure.
I find a lot of website and apps allows you to search or sort things by certain criteria, so there must be a data structure that allows the ability?
Again, those ones use an RDBMS.
It sounds like what you really want is to store a large volume of permanent information and then retrieve it efficiently. There's multiple ways to design such a system, but let's suppose you with a traditional client-server RDBMS such as MariaDB.
* You install the database in your server.
* You log in to your database and design your tables. How to do that exceeds the scope of this post. There are entire books on database design.
* You write your program such that it can connect to the server, perform its queries, and update the contents of the database.
A database is set up to do exactly what I was saying: They are set up to be indexed on multiple keys, that is, they have a small redundancy for each key they allow you to search with so the data is effectively sorted by each key in those 'lightweight copies' allowing them to get to the data quickly from multiple keys. If/when you get to them, you will encounter their want for 'keys' and 'indexes' .... and it will be exactly what I told you, only the tool does the work for you, based off what you asked it to do.
But would john smith be the key?
-- yes.
-- it depends on what exactly you are doing how this all works:
1) if you did my example, your 'unique key' of 'john smith' would get you a vector of records for all the different john smiths in that phone book. one unique key, one thing returned, it just happens that the thing returned is itself {a container of records}.
2) you could also have not unique keys. the first john smith may hash to location 1234. The second one shows up, and your code sees that location 1234 is in use, so it then runs a second hash algorithm and this time gets location 4567. And so on. (The re-hash can be the same algorithm where you do something cute like append an ascii '1', '2', ... to 'john smith' to force it to make a different value). You can also do various "go to 1234+ offset" approaches to find an empty slot.
I do not know that there are distinct web searchable names for these techniques. Sorry. It was all lumped up under hashing in a short chapter when I studied it.
Both of my examples above are solving for collisions. One approach hunts for an empty slot and one stores a container (could be list, type is not important). The linked list is like my first example.
Do you mean basically just have the 2 data structures with the same information, but different keys. So say the first hash table keys are the persons name, the second hash table is the address of the person?
--almost. usually, if the data has a lot of size to it (and most data does) you have 1 copy of the data in something and 2 hash tables that have pointers to the data that are fast-searchable. Pointers do not take up much space. If the data were in a vector as the master copy, an array index is smaller than a pointer and could be used instead.
what is best depends on your needs. Most databases allow searches on data that is not indexed but it takes a lot longer, going from seconds to 10s of min for a big database. Allowing this or not is a design choice based off how long it will take (no one wants to wait more than a few seconds these days) and other factors (how much of your resources are burned doing it, etc).
Sorry If i am asking things that has already been asked, sometimes it takes me a while to grasp concepts, explains why i have been learning programming for 5 years and still dont know data structures.
Please, ask until you get it. Hashing and its friends (counting sort/ bucket sort) are among the most useful data structures / algorithms I know.
if you have big, persistent data, you want a database. Many people I work with will load a throwaway database just to query it for a while. They are good at what they do. Hashing is a intermediate tool, when you don't quite have enough need to deal with a database but want that kind of tool available in your code.
First of all, thank you for all the help. It has been very helpful. Spent the past few days trying to get my head around it all.
I have been thinking though. Instead of using the non unique keys, is it possible to still use the name, but for every duplicate entry add 001. So say, johnSmith001, johnSmith002 etc etc. and then use that as the key. Then use a search function that would go through the table, but only search the name, missing out the number part? Or is that not how it would work with keys?
If i think about it, I am imagining it would not work, because you need the exact key to get the location, but there may be a way that I just dont know, or something I am not thinking of.
you can try key tampering. This is similar to my re-hash description above. Basically, by changing the data itself (as you suggest) or the hash function (this is better, because tampering with the data takes more CPU effort) you can handle collisions by just searching a few hashes. The problem with doctoring the data is when to quit... searching for john smith, you look for 001, 002, 003, … nnn and you can only truly quit when you find an unused cell in the table. That is true for changing the hash function as well. In both cases you search until you find an empty cell. Its better than a linear search, and should be much better than a binary search too, otherwise you did not do your data analysis right when you picked a hash table over a sorted table. The bigger your table, the more empty cells, and the better things work across the board.
for re-hashing, think on it this way...
the basic crc32 algorithm, as an example (its redundancy makes it a bad hash function, but its simple so as an example..) … say you had a loop inside.. and mad it like this..
int hashit(unsigned char* data, int length, int reshash = 1)
{
for(int j = 0; j < length*rehash; j++)
{
crc= crc of data[j%length];
}
}
the multiple passes over the data when reshash is not 1 give different answers. A good scrambler (again, crc isn't a great choice) would give radically different answers and work great just by running it until it finds and empty cell, incrementing rehash every time. (this may need some common sense adaptation if data is very long to begin with).
I have been thinking though. Instead of using the non unique keys, is it possible to still use the name, but for every duplicate entry add 001. So say, johnSmith001, johnSmith002 etc etc. and then use that as the key. Then use a search function that would go through the table, but only search the name, missing out the number part?
The end result would be the same as if you'd implemented a dictionary that allows duplicates, but worse.
* What happens if the key ends in digits?
* What happens after someone inserts a thousand "johnSmith"s, or a thousand "johnSmith42"s? How do you know where the string ends and the postfix begins?
Sorry If it seems like I am jumping around with my questions here, somethings just come a bit later to me.
a 2-d hash table where the first dimension is the key access, and the second dimension is just a sorted (or not) vector, for example.
then say you had 3 people named john smith at different addresses. they would all hash to the same location in the first dimension, and you would then have a vector of all the john smiths, which you can search with iteration.
Regarding this. Is this the same thing as chaining? I have recently came across this term and it sounds similar to what you mention here?