a large-scale search for symbols

Hi there, I have an interesting question below. Let me know if you have any thoughts. Assume we have a function which returns an index:

int lookup(string symbol)
{
	vector<string>::iterator i;
	i = find(symbols.begin(), symbols.end(), symbol);
	if (i == symbols.end())
		throw exception();
	return i - symbols.begin();
}


"symbols" is defined as vector<string> symbols.
The code is pretty simple: it looks up a "symbol" from "symbols". Now my question is: what if we have more than 1 million records in "symbols"?
It is time-consuming if we need to go through every single symbol in "symbols".

My first thought is to apply binary search.
My idea is to use a hash function to convert each symbol into an integer. Then I would sort those keys as <integer, index> pairs, where the integer is the hash value and the index is the symbol's position in the original vector. Next, binary search could be applied to find the matching symbol.
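For illustration, that <integer, index> idea might be sketched like this (the function name is mine; the hash values are sorted once, binary-searched with std::lower_bound, and the actual strings are compared only when the hashes match, since hashes can collide):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Sketch of the <hash, index> idea: precompute sorted (hash, index)
// pairs, binary-search the hash, then confirm by comparing strings.
int lookup_sorted_hashes(const std::vector<std::string>& symbols,
                         const std::string& symbol)
{
    std::hash<std::string> h;

    // Build and sort once; in real use this table would be cached.
    std::vector<std::pair<std::size_t, int>> keys;
    keys.reserve(symbols.size());
    for (int i = 0; i < static_cast<int>(symbols.size()); ++i)
        keys.emplace_back(h(symbols[i]), i);
    std::sort(keys.begin(), keys.end());

    // Binary search for the first entry with a matching hash.
    const std::size_t target = h(symbol);
    auto it = std::lower_bound(keys.begin(), keys.end(),
                               std::make_pair(target, 0));

    // Walk the (usually tiny) run of equal hashes to rule out collisions.
    for (; it != keys.end() && it->first == target; ++it)
        if (symbols[it->second] == symbol)
            return it->second;
    throw std::runtime_error("symbol not found");
}
```

With the table cached, each lookup costs one hash plus an O(log n) binary search.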

Is there any way we can reduce the computational complexity?
Let me know if there is anything I need to explain.
Thanks in advance.
If you can make a hash out of that string, you could just use a hash table directly, as in std::unordered_map<std::string, int> symbols;, which would map a symbol to its index.
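As a rough sketch of that approach (the function names are my own): build the map once up front, and each lookup becomes an average O(1) hash-table probe instead of an O(n) scan:

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Build a symbol -> index map once from the original vector.
std::unordered_map<std::string, int>
build_index(const std::vector<std::string>& symbols)
{
    std::unordered_map<std::string, int> index;
    index.reserve(symbols.size()); // avoid rehashing for ~1M entries
    for (int i = 0; i < static_cast<int>(symbols.size()); ++i)
        index.emplace(symbols[i], i); // keeps the first index on duplicates
    return index;
}

// Each lookup is then an average O(1) probe instead of a linear scan.
int lookup(const std::unordered_map<std::string, int>& index,
           const std::string& symbol)
{
    auto it = index.find(symbol);
    if (it == index.end())
        throw std::runtime_error("symbol not found");
    return it->second;
}
```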

Alternatively, there are many data structures for efficient storage and search of multiple strings. Take tries for example: http://en.wikipedia.org/wiki/Trie
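A minimal trie sketch, just to show the shape of the idea (my own toy code, not a production implementation): each node maps a character to a child, so a lookup costs O(length of the key) regardless of how many symbols are stored:

```cpp
#include <map>
#include <memory>
#include <string>

// A bare-bones trie node: children keyed by character, plus the index
// of the symbol that ends at this node (-1 if no symbol ends here).
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int index = -1;
};

// Insert a symbol, creating nodes along its character path as needed.
void trie_insert(TrieNode& root, const std::string& s, int index)
{
    TrieNode* node = &root;
    for (char c : s) {
        auto& child = node->children[c];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->index = index;
}

// Walk one node per character; return the stored index, or -1 if absent.
int trie_lookup(const TrieNode& root, const std::string& s)
{
    const TrieNode* node = &root;
    for (char c : s) {
        auto it = node->children.find(c);
        if (it == node->children.end())
            return -1; // no path for this character: not found
        node = it->second.get();
    }
    return node->index;
}
```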
Do you care about collisions?
Sorry guys... I need to make it clear.
The hash is my first concern, but it might not be the best way to achieve the task.
Thanks Cubbi. A trie could be a good way.

But in fact, this problem turns out to be a question from an interview.
I, personally, do not think anyone can manage to write a trie in a short time, especially during an interview. Any simpler answer?
Anyway, thanks again.
> But in fact, this problem turns out to be a question from an interview.

Before I try to answer the question, I would ask the interviewer a few questions.

1. How large is each string? Are most of the strings of roughly the same size? Are there many duplicate strings in the collection?

2. Are the strings in the collection modifiable? If they are, how often are they modified?

3. Would strings need to be added or removed from the collection frequently?

4. Is searching for strings in this collection done far more frequently than the changes described in 2. or 3.?

5. Would many of these strings have large common substrings as either prefixes or suffixes? (For example, as with a set of web urls)

6. Does the 80-20 or 90-10 rule apply? (Is it the case that 90% of the searches are for 10% of the strings?)


PS. If I were the interviewer, I would be very keen to know what questions the candidate asked before attempting a solution.

That is very impressive, JLBorges. Thanks a lot.
The reason I posted this topic is to have a brainstorm or discussion.
I want to find out any possible solutions or tips behind the question, not limited to the hashtable.

I think I got rejected by the company, since I have not heard anything from them until now.
That is fine. Maybe I did not ask them the above questions before I rushed to my hashtable. lol

Anyway, the reason I think about the hash is that, to the best of my knowledge, there is not any searching algorithm that can be applied to a "string" data structure directly. Thus I need to convert the string into an integer or a scalar before applying any advanced searching algorithm, such as binary search.

Is that correct? Is there any way we can search for a string in a vector directly?

Here let me repeat my question, in case of any misunderstanding.
Assume we have a function which returns an index:

int lookup(string symbol)
{
	vector<string>::iterator i;
	i = find(symbols.begin(), symbols.end(), symbol);
	if (i == symbols.end())
		throw exception();
	return i - symbols.begin();
}



The code looks up a "symbol" from "symbols". However, the computational complexity is O(n), since we need to go through every single symbol in "symbols".
Is there any way we can reduce the computational complexity?

> Thus I need to convert the string into an integer or a scalar before applying any advanced searching algorithm, such as binary search.

std::string has overloaded comparison operators, so two strings can be compared directly with str1 < str2 or str1 == str2. This would be somewhat time-consuming, particularly if the strings are long.
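For instance, since strings compare directly with <, a sorted copy of the vector can be binary-searched with no hashing at all. A sketch (the function name is mine; note it returns the position in the sorted copy, so to recover the original index one would sort <string, index> pairs instead):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Binary-search a sorted vector of strings directly, using the
// built-in operator< of std::string. O(log n) comparisons.
int lookup_sorted(const std::vector<std::string>& sorted_symbols,
                  const std::string& symbol)
{
    auto it = std::lower_bound(sorted_symbols.begin(),
                               sorted_symbols.end(), symbol);
    if (it == sorted_symbols.end() || *it != symbol)
        return -1; // not found
    return static_cast<int>(it - sorted_symbols.begin());
}
```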

Comparing using an integral hash value would be faster: std::hash<std::string>{}(str1) == std::hash<std::string>{}(str2). For multiple comparisons, the hash value has to be computed only once. In this case, the actual strings have to be compared only if their hashes compare equal. See: http://en.cppreference.com/w/cpp/utility/hash
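A small sketch of that compare-hashes-first idea (my own helper, not a standard facility): the cheap integer comparison rejects most mismatches, and the full string comparison runs only when the hashes agree:

```cpp
#include <cstddef>
#include <string>

// Compare two strings whose hashes were computed once and cached.
// Different hashes => definitely different strings; equal hashes
// still require a full comparison, because hashes can collide.
bool equal_with_cached_hash(const std::string& a, std::size_t a_hash,
                            const std::string& b, std::size_t b_hash)
{
    if (a_hash != b_hash)
        return false; // cheap integer reject, no string comparison
    return a == b;    // confirm, in case of a hash collision
}
```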

Assuming that the answers to questions 2. and 3. are "no" or the answer to question 4. is "yes", a hashtable is a strong possibility. std::unordered_set<std::string> if there are no duplicates or an appropriate std::unordered_multimap<> otherwise.

For specific cases there would be more efficient solutions - for example, if the answer to either question 5 or 6 is "yes". One would always go for the simplest possible solution that meets the performance requirements. Programmer time is more expensive than machine time.

When in doubt, use brute force
- Ken Thompson (prefer simple, robust, and portable algorithms over brittle ‘smart’ ones).

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
- Brian Kernighan


lol...
That is awesome, JLBorges...
Fully detailed and convincing...
Not only did I learn about std::hash, but also a clever way to answer the question.
You even mention Brian Kernighan and Ken Thompson... so touching.