### Binary search of a dynamic array of strings

Hey yall, so for school I have to make a recursive word ladder program, and I wanna put a binary search in it. I have a function that checks the inputted word in the dictionary of words and returns a bool. I had it as an iterative function but going through a 187000-word dynamic array just to find 1 isn't very fast. I've been trying to make the binary search work but it just doesn't, I know its something with the word compares but this is probably version 10 of my function and it still sucks. any help or tips would be extremely appreciated. Thank you!

global_count is just a tracker for dev stuff, WORDCOUNT is 187720 which is the number of words in the dynamic array wordsArray.
 ``12345678910111213141516171819202122232425262728293031`` ``````bool validWord(string word, string *&wordsArray) { cout << "running valid word: " << global_count << endl; ++global_count; int mid; int low = 0; int high = WORDCOUNT; while (low <= high) { // find the middle index mid = low + ((high - low) >> 1); if (word == wordsArray[mid]) // word found { return true; } else if (word > wordsArray[mid]) // word may be on the left half { high = mid - 1; } else // word may be on the right half { low = mid + 1; } } // key not found return false; }``````
This is a classic bisecting search.
Try this (not tested):
 ``12345678910111213141516171819202122232425262728293031323334`` ``````#include #include constexpr int WORDCOUNT = 187720; int validWord(const std::string word, const std::string*& wordsArray) { int mid; int low = 0; int high = WORDCOUNT; int iterations = 0; mid = low + ((high - low) >> 1); // Initial midpoint while (low != high) { iterations++; if (word == wordsArray[mid]) // word found { return iterations; } else if (word > wordsArray[mid]) // word may be in second half { low = mid; // Can't be lower than old midpoint mid = low + ((high - low) >> 1); // New midpoint } else // word may be in first half { high = mid; // Can't be higher than old midpoint mid = low + ((high - low) >> 1); // New midpoint } } // key not found return 0; }``````

Note: It returns the number of iterations required.
Last edited on
OP was very close to a working solution

 ``123456789101112131415161718192021222324252627282930`` ``````bool validWord2(string word, string *&wordsArray) { int mid; int low = 0; // subtract one to avoid ever accessing wordsArray[WORDCOUNT] int high = WORDCOUNT - 1; while (low <= high) { // find the middle index mid = low + ((high - low) >> 1); if (word == wordsArray[mid]) // word found { return true; } // Change greater-than to less-than else if (word < wordsArray[mid]) // word may be on the left half { high = mid - 1; } else // word may be on the right half { low = mid + 1; } } // key not found return false; }``````

Alternatively, here's a complete example:
 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344`` ``````#include #include #include int constexpr WORDCOUNT = 26; using namespace std; bool validWord(string const& word, string const *wordsArray) { int begin = 0, end = WORDCOUNT - 1; while (begin <= end) { auto const middle = begin + (end - begin) / 2; auto const result = word <=> wordsArray[middle]; if (result == 0) { return true; } if (result > 0) { begin = middle + 1; } if (result < 0) { end = middle - 1; } } return false; } int main(int argc, char** argv) { string phonemes[] = { "alpha" , "bravo" , "charlie", "delta", "echo" , "foxtrot" , "golf" , "hotel", "indigo" , "juliet" , "kilo" , "lima" , "mike" , "november", "oscar" , "papa" , "quebec" , "romeo" , "sierra" , "tango", "uniform", "victor" , "whiskey", "xray" , "yankee" , "zulu" }; static_assert(size(phonemes) == WORDCOUNT); for (int i = 1; i < argc; ++i) { std::cout << "\"" << argv[i] << "\" is " << (validWord(argv[i], phonemes)? "": "not ") << "a NATO code word\n"; } }``````

Demo:
https://coliru.stacked-crooked.com/a/ceef5d8550cae8db

https://en.cppreference.com/w/cpp/algorithm/binary_search
Last edited on
 ``12345678910111213141516171819`` ``````#include #include #include int main() { std::vector dico { "Cat", "Dog", "Fish", "Snake", "Bird" }; std::vector words { "Fish", "Dog", "Scorpion" }; for (auto word : words) { std::cout << "Searching for " << word << '\n'; if (std::binary_search(dico.begin(), dico.end(), word)) std::cout << "Found the word " << word << '\n'; else std::cout << word << " does not exist in the dictionary!\n"; } }``````
This alternative using binary_search could be useful for you. It loads more than 370K words in less than 1 seconds (according to the specs of my new PC) and it can find a word in the vector words in a few microseconds. For this test, I use a text file which is available at the address below. Seems good enough ++

v2 : I added some perf measurements...

https://github.com/dwyl/english-words/blob/master/words_alpha.txt

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152`` ``````#include #include #include #include #include #include int main() { std::vector words; std::ifstream infile("dictionary.txt"); std::string newWord; if (!infile.is_open()) { std::cout << "Could not open the dictionary!" << std::endl; exit(EXIT_FAILURE); } // for perf auto start = std::chrono::high_resolution_clock::now(); // create your vector with all words from your dictionary while (std::getline(infile, newWord)) { transform(newWord.begin(), newWord.end(), newWord.begin(), ::tolower); words.push_back(newWord); } infile.close(); auto end = std::chrono::high_resolution_clock::now(); std::cout << "Our dictionary with " << words.size() + 1 << " words has been read in " << std::chrono::duration_cast(end - start).count() << " ms" << std::endl; while (1) { std::string input; std::cout << "Give me a word : " << input; std::cin >> input; transform(input.begin(), input.end(), input.begin(), ::tolower); if (input == "exit") break; auto start = std::chrono::high_resolution_clock::now(); if (std::binary_search(words.begin(), words.end(), input)) std::cout << "The word " << input << " has been found :)" << std::endl; else std::cout << input << " does not exist in the dictionary!" << std::endl; auto end = std::chrono::high_resolution_clock::now(); std::cout << "Operation done in " << std::chrono::duration_cast(end - start).count() << " ns" << std::endl; }; return 0; }``````

 ``` Our dictionary with 370106 words has been read in 998 ms Give me a word : albatroos albatroos does not exist in the dictionary! Operation done in 94600 ns Give me a word : albatros The word albatros has been found :) Operation done in 176900 ns ```
Last edited on
 It load more 350K words in less than 1 seconds

Take a look at this thread, which talks a little about read performance and has a little measurements. It's the same thing, reading a dictionary file:
https://cplusplus.com/forum/general/280508/
Last edited on
 ``12345678910111213141516171819202122232425262728293031`` ``````#include #include #include #include #include #include template < std::ranges::forward_range RANGE, typename T, std::indirect_strict_weak_order< T*, std::ranges::iterator_t > CMP = std::ranges::less > std::ranges::iterator_t binary_find( RANGE&& range, const T& value, CMP cmp = {} ) // TO DO: add support for projection { const auto iter = std::ranges::lower_bound( range, value, cmp ) ; if( iter != std::ranges::end(range) && !cmp( value, *iter ) ) return iter ; else return std::ranges::end(range) ; } int main() { std::vector dictionary { "Cat", "Dog", "Fish", "Cow", "Crocodile", "Snake", "Bird" }; std::ranges::sort(dictionary) ; for( std::size_t i = 0 ; i < dictionary.size() ; ++i ) std::cout << i << ". " << dictionary[i] << '\n' ; std::cout << "\n----------------\n\n" ; for( const char* const word : { "Fish", "Snake", "Scorpion", "Bird", "Crocodile" } ) { if( auto iter = binary_find( dictionary, word ) ; iter != dictionary.end() ) std::cout << "Found the word '" << word << "' at position " << iter - dictionary.begin() << '\n'; else std::cout << "word '" << word << "' does not exist in the dictionary!\n"; } }``````

https://coliru.stacked-crooked.com/a/e11ce4ec6a220393
Why hold a dictionary in a std::vector and not a std::set? Consider (as C++20):

 ``1234567891011121314151617181920`` ``````#include #include #include #include int main() { const std::set dictionary { "Cat", "Dog", "Fish", "Cow", "Crocodile", "Snake", "Bird" }; for (size_t i {}; const auto& d : dictionary) std::cout << ++i << ". " << d << '\n'; std::cout << "\n----------------\n\n"; for (const char* const word : { "Fish", "Snake", "Scorpion", "Bird", "Crocodile" }) { if (dictionary.contains(word)) std::cout << "Found the word " << std::quoted(word) << '\n'; else std::cout << "word " << std::quoted(word) << " does not exist in the dictionary!\n"; } }``````

 ``` 1. Bird 2. Cat 3. Cow 4. Crocodile 5. Dog 6. Fish 7. Snake ---------------- Found the word "Fish" Found the word "Snake" word "Scorpion" does not exist in the dictionary! Found the word "Bird" Found the word "Crocodile" ```

@Geckoo - For the search, you are also timing the cout statement! Consider:

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273`` ``````#include #include #include #include #include #include #include #include struct comma_facet : public std::numpunct { explicit comma_facet(char del, uint8_t group) :g(group), delim(del) {} virtual char do_thousands_sep() const { return delim; } virtual std::string do_grouping() const { return std::string(1, g); } uint8_t g {}; char delim {}; }; struct set000s { public: explicit set000s(char del = ',', uint8_t grp = 3) : group(grp), delim(del) {} private: uint8_t group {}; char delim {}; friend std::ostream& operator<<(std::ostream& os, const set000s& grp) { os.imbue(std::locale(os.getloc(), new comma_facet(grp.delim, grp.group))); return os; } }; int main() { const auto lc { [](unsigned char ch) {return static_cast(std::tolower(ch)); } }; std::set words; if (std::ifstream infile { "words1.txt" }) { const auto start { std::chrono::high_resolution_clock::now() }; for (std::string newWord; std::getline(infile, newWord); words.insert(newWord)) std::transform(newWord.begin(), newWord.end(), newWord.begin(), lc); const auto end { std::chrono::high_resolution_clock::now() }; std::cout << "Our dictionary with " << set000s() << words.size() << " words has been read in " << std::chrono::duration_cast(end - start).count() << " ms" << std::endl; } else return (std::cout << "Could not open the dictionary!\n"), EXIT_FAILURE; while (true) { std::string input; std::cout << "Give me a word : " << input; std::cin >> input; std::transform(input.begin(), input.end(), input.begin(), lc); if (input == "exit") break; const auto start { std::chrono::high_resolution_clock::now() }; const auto fnd { words.contains(input) }; const auto end { std::chrono::high_resolution_clock::now() }; if (fnd) std::cout << "The word " << std::quoted(input) << " has been found :)\n"; else std::cout << std::quoted(input) << " does not exist in the dictionary!\n"; std::cout << "Operation done in " << set000s() << std::chrono::duration_cast(end - start).count() << " ns\n"; } }``````

which for my laptop gives these timings:

 ``` Our dictionary with 466,546 words has been read in 298 ms Give me a word : albatros The word "albatros" has been found :) Operation done in 2,371 ns Give me a word : albatroos "albatroos" does not exist in the dictionary! Operation done in 2,766 ns Give me a word : zebra The word "zebra" has been found :) Operation done in 3,952 ns Give me a word : zebar "zebar" does not exist in the dictionary! Operation done in 3,556 ns Give me a word : exit ```

Last edited on
> Why hold a dictionary in a std::vector and not a std::set?

For a large stable dictionary, searching in a sorted vector may be (typically would be) somewhat more efficient.

 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344`` ``````#include #include #include #include #include #include #include #include int main() { const std::size_t N = 2'000'000 ; const std::string prefix = "ABCDEFGH" ; std::vector test_data ; for( std::size_t i = 0 ; i < N ; ++i ) test_data.push_back( prefix + std::to_string(i) ) ; std::vector vector = test_data ; std::ranges::sort(vector) ; std::set set ; for( auto& str : test_data ) set.insert(str) ; const std::size_t INC = 2 ; std::ranges::shuffle( test_data, std::mt19937( std::random_device{}() ) ) ; { const auto start = std::clock() ; volatile int cnt = 0 ; for( std::size_t i = 0 ; i < N ; i+= INC ) cnt = cnt + std::ranges::binary_search( vector, test_data[i] ) ; const auto end = std::clock() ; std::cout << std::fixed << "vector: " << double(end-start) / CLOCKS_PER_SEC << " seconds\n" ; } { const auto start = std::clock() ; volatile int cnt = 0 ; for( std::size_t i = 0 ; i < N ; i+= INC ) cnt = cnt + ( set.find( test_data[i] ) != set.end() ) ; const auto end = std::clock() ; std::cout << std::fixed << " set: " << double(end-start) / CLOCKS_PER_SEC << " seconds\n" ; } }``````

Typical run on coliru:
 ```vector: 1.141108 seconds set: 1.425130 seconds```

https://coliru.stacked-crooked.com/a/8a873387396c95f3
Yes. It's a good alternative too. However I have lower perf reading the dictionary file. I guess that the ordered process costs some resources. Obviously there are many ways to create such kind of prog ++

 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950`` ``````#include #include #include #include #include #include int main() { std::set words; std::ifstream infile("dictionary.txt"); std::string newWord; if (!infile.is_open()) { std::cout << "Could not open the dictionary!" << std::endl; exit(EXIT_FAILURE); } // for perf auto start = std::chrono::high_resolution_clock::now(); // create your vector with all words from your dictionary while (std::getline(infile, newWord)) { transform(newWord.begin(), newWord.end(), newWord.begin(), ::tolower); words.insert(newWord); } infile.close(); auto end = std::chrono::high_resolution_clock::now(); std::cout << "Our dictionary with " << words.size() + 1 << " words has been read in " << std::chrono::duration_cast(end - start).count() << " ms" << std::endl; while (1) { std::string input; std::cout << "Give me a word : " << input; std::cin >> input; transform(input.begin(), input.end(), input.begin(), ::tolower); if (input == "exit") break; auto start = std::chrono::high_resolution_clock::now(); words.contains(input) ? std::cout << "The word " << input << " has been found :)" << std::endl : std::cout << input << " does not exist in the dictionary!" << std::endl; auto end = std::chrono::high_resolution_clock::now(); std::cout << "Operation done in " << std::chrono::duration_cast(end - start).count() << " ns" << std::endl; }; return 0; }``````

 ``` Our dictionary with 370106 words has been read in 2166 ms Give me a word : albatroos albatroos does not exist in the dictionary! Operation done in 213500 ns Give me a word : albatros The word albatros has been found :) Operation done in 200200 ns ```
Last edited on
I am very impressed by the perf of your laptop.
Which CPU you have inside?

---------------------------------------------
A few minutes after my previous question :/

Forgive me. I am stupid. Compiling as release - I have better perf of course...

 ``` Our dictionary with 370106 words has been read in 116 ms Give me a word : albatroos albatroos does not exist in the dictionary! Operation done in 146000 ns Give me a word : albatros The word albatros has been found :) Operation done in 137500 ns ```
Last edited on
 Which CPU you have inside?

Intel dual core I7 2.6 GHz

but it's just a standard rotating disk - not SSD! :cry:

Try taking the cout statements outside of the search measurement and you'll see even better performance for the search - as std::cout is really very slow...
> However I have lower perf reading the dictionary file.

The vector performance would be an order of magnitude faster than that of the set if the input is already sorted. Otherwise, we would have to sort the vector, and there may not be a huge difference.

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758`` ``````#include #include #include #include #include #include #include #include struct timer { explicit timer( std::string label = {} ) : label( std::move(label) ) {} ~timer() { const auto end = std::clock() ; std::cout << std::fixed << label << ": " << double(end-start) / CLOCKS_PER_SEC << " seconds\n" ; } const std::clock_t start = std::clock() ; const std::string label ; }; int main() { const std::size_t N = 2'000'000 ; const std::string prefix = "ABCDEFGH" ; std::vector test_data ; for( std::size_t i = 0 ; i < N ; ++i ) test_data.push_back( prefix + std::to_string(i) ) ; std::vector vector ; { timer t( "vector insert and sort" ) ; vector = test_data ; std::ranges::sort(vector) ; } std::set set ; { timer t( " set insert" ) ; for( auto& str : test_data ) set.insert(str) ; } const std::size_t INC = 2 ; std::ranges::shuffle( test_data, std::mt19937( std::random_device{}() ) ) ; { timer t( " vector search" ) ; volatile int cnt = 0 ; for( std::size_t i = 0 ; i < N ; i+= INC ) cnt = cnt + std::ranges::binary_search( vector, test_data[i] ) ; } { timer t( " set search" ) ; volatile int cnt = 0 ; for( std::size_t i = 0 ; i < N ; i+= INC ) cnt = cnt + ( set.find( test_data[i] ) != set.end() ) ; } }``````

On a typical run on coliru:
 ```vector insert and sort: 1.014847 seconds set insert: 1.175835 seconds vector search: 1.085210 seconds set search: 1.408074 seconds```

https://coliru.stacked-crooked.com/a/23f4b94e513c3941
 For a large stable dictionary, searching in a sorted vector may be (typically would be) somewhat more efficient.

Ah yes. vector is continuous memory and set is a tree (black/red) constructed from linked memory nodes. I now seem to recall a similar discussion a while ago. But my memory isn't what it was... a similar discussion a while ago. But my memory isn't what it was...

Consider:

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455`` ``````#include #include #include #include #include #include #include #include #include int main() { const auto lc { [](unsigned char ch) {return static_cast(std::tolower(ch)); } }; std::set words; std::vector vs; if (std::ifstream infile { "words1.txt" }) { const auto start { std::chrono::high_resolution_clock::now() }; for (std::string newWord; std::getline(infile, newWord); words.insert(newWord), vs.push_back(newWord)) std::transform(newWord.begin(), newWord.end(), newWord.begin(), lc); std::ranges::sort(vs); const auto end { std::chrono::high_resolution_clock::now() }; std::cout << "Our dictionary with " << set000s() << words.size() << " words has been read in " << std::chrono::duration_cast(end - start).count() << " ms" << std::endl; } else return (std::cout << "Could not open the dictionary!\n"), EXIT_FAILURE; while (true) { std::string input; std::cout << "Give me a word : " << input; std::cin >> input; std::transform(input.begin(), input.end(), input.begin(), lc); if (input == "exit") break; const auto start { std::chrono::high_resolution_clock::now() }; const auto fnd { words.contains(input) }; const auto end { std::chrono::high_resolution_clock::now() }; const auto start1 { std::chrono::high_resolution_clock::now() }; const auto fnd1 { std::ranges::binary_search(vs, input)}; const auto end1 { std::chrono::high_resolution_clock::now() }; if (fnd && fnd1) std::cout << "The word " << std::quoted(input) << " has been found :)\n"; else std::cout << std::quoted(input) << " does not exist in the dictionary!\n"; std::cout << "Set operation done in " << set000s() << std::chrono::duration_cast(end - start).count() << " ns\n"; std::cout << "Vector operation done in " << set000s() << std::chrono::duration_cast(end1 - start1).count() << " ns\n"; } }``````

 ``` Our dictionary with 466,546 words has been read in 466 ms Give me a word : albators "albators" does not exist in the dictionary! Set operation done in 2,371 ns Vector operation done in 1,580 ns Give me a word : zebra The word "zebra" has been found :) Set operation done in 4,346 ns Vector operation done in 1,976 ns Give me a word : albatroos "albatroos" does not exist in the dictionary! Set operation done in 4,346 ns Vector operation done in 1,976 ns Give me a word : qwerty The word "qwerty" has been found :) Set operation done in 3,556 ns Vector operation done in 1,580 ns Give me a word : exit ```

Last edited on
 I've been trying to make the binary search work but it just doesn't, I know its something with the word compares but this is probably version 10 of my function and it still sucks

@dorito200 - Following the interlude, in cases like this use the debugger to trace through the code to see what's happening. You can step through code, set breakpoints and watch the contents of variables. You can then see where the code execution is deviating from what is expected. In this case L19.

Note that you don't need an else clause if the final statement of the if clause is a return.

Try an `unordered_set`.
:O)
OK. We have a winner! Obviously that's what I meant to say when I said set before.... :) :)

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182`` ``````#include #include #include #include #include #include #include #include #include struct comma_facet : public std::numpunct { explicit comma_facet(char del, uint8_t group) :g(group), delim(del) {} virtual char do_thousands_sep() const { return delim; } virtual std::string do_grouping() const { return std::string(1, g); } uint8_t g {}; char delim {}; }; struct set000s { public: explicit set000s(char del = ',', uint8_t grp = 3) : group(grp), delim(del) {} private: uint8_t group {}; char delim {}; friend std::ostream& operator<<(std::ostream& os, const set000s& grp) { os.imbue(std::locale(os.getloc(), new comma_facet(grp.delim, grp.group))); return os; } }; int main() { const auto lc { [](unsigned char ch) {return static_cast(std::tolower(ch)); } }; std::unordered_set words; std::vector vs; if (std::ifstream infile { "words1.txt" }) { const auto start { std::chrono::high_resolution_clock::now() }; for (std::string newWord; std::getline(infile, newWord); words.insert(newWord), vs.push_back(newWord)) std::transform(newWord.begin(), newWord.end(), newWord.begin(), lc); std::ranges::sort(vs); const auto end { std::chrono::high_resolution_clock::now() }; std::cout << "Our dictionary with " << set000s() << words.size() << " words has been read in " << std::chrono::duration_cast(end - start).count() << " ms" << std::endl; } else return (std::cout << "Could not open the dictionary!\n"), EXIT_FAILURE; while (true) { std::string input; std::cout << "Give me a word : " << input; std::cin >> input; std::transform(input.begin(), input.end(), input.begin(), lc); if (input == "exit") break; const auto start { std::chrono::high_resolution_clock::now() }; const auto fnd { words.contains(input) }; const auto end { std::chrono::high_resolution_clock::now() }; const auto start1 { std::chrono::high_resolution_clock::now() }; const auto fnd1 { std::ranges::binary_search(vs, input) }; const auto end1 { std::chrono::high_resolution_clock::now() }; if (fnd && fnd1) std::cout << "The word " << std::quoted(input) << " has been found :)\n"; else std::cout << std::quoted(input) << " does not exist in the dictionary!\n"; std::cout << "Set operation done in " << set000s() << std::chrono::duration_cast(end - start).count() << " ns\n"; std::cout << "Vector operation done in " << set000s() << std::chrono::duration_cast(end1 - start1).count() << " ns\n"; } } ``````

 ``` Our dictionary with 466,546 words has been read in 404 ms Give me a word : albatros The word "albatros" has been found :) Set operation done in 395 ns Vector operation done in 1,976 ns Give me a word : zebra The word "zebra" has been found :) Set operation done in 1,581 ns Vector operation done in 1,580 ns Give me a word : qwerty The word "qwerty" has been found :) Set operation done in 790 ns Vector operation done in 1,975 ns Give me a word : albatroos "albatroos" does not exist in the dictionary! Set operation done in 790 ns Vector operation done in 2,371 ns Give me a word : albators "albators" does not exist in the dictionary! Set operation done in 396 ns Vector operation done in 1,975 ns Give me a word : exit ```

Topic archived. No new replies allowed.