C++ word library?

Pages: 123

Looks like official tournament dictionaries are hidden behind copyrighted books I can't readily find. I did find an unofficial txt file: https://raw.githubusercontent.com/benhoyt/boggle/master/word-list.txt
That list is more reasonable than the 'aaa' list.

mbozzi (3945)

On the topic of read performance, see this video from 2019, it's <8 minutes:
https://www.youtube.com/watch?v=n7kVMKkwa_s

Last edited on

jonnin (11497)

Tough crowd... I got a small one and everyone pointed out there were bigger ones, so I got a bigger one and everyone points out the AAA crap :)
the point was the load time using just a simple approach, not the awesome word list.
boggle seems to lack a competition dictionary, and apparently, even the cube distro varies across versions by a fair amount.

Ganado (6896)

lol that is funny, I didn't notice. You just can't win :)

AbstractionAnon (6957)

@agent max - Sorry for the late reply. Just saw your post.

agent max wrote:
AbstractAnon wrote: ...eliminating characters that repeat from the previous word... I'm not sure I understand, could you give an example?

My objective was to have a small footprint for the file on disk, yet keep it easy to load.
Every word on disk starts with an 4 bit word length (WL) and a 4 bit repeat count (RC).
4 bits is sufficient for each since the longest word in the Scrabble dictionary is 15 letters.
For example,
WL=8 RC=2 RDVARK (AA repeated from previous word).
WL=9 RC=8 S (AARDVARK repeated from previous word)
As you can see, the plural of AARDVARK takes only two bytes.

agent max wrote:
...and compressing diphthongs. Why? And how?

Again the objective was to decrease the footprint of the file on disk.

A diphthong is normally two consecutive vowels. I expanded on that to mean common two letter pairs, The Scrabble dictionary only uses upper case ASCII (26 values). Using unsigned chars, that leaves 230 values unused per byte. I did an analysis of the dictionary and found the 230 most common two letter pairs and hard coded those into a table.

Going back to my AARDVARK example, we get
0x82, 0xC0, 0xF4, RK
0x82 being the WL and RC. 0xC0 being the value assigned to the letter pair "RD" and 0xF4 the the value assigned to "VA". We're down to 5 bytes from 8. Not a huge savings, but it adds up.

Last edited on

seeplus (6653)

Isn't the main issue the search time - not the space used on disk or it's initial load time? Presumably the data will be loaded once from the disk into a structure and then this structure searched many times? Why is the file size/load time so important?

jonnin (11497)

it is not that important. we got sidetracked.
search time won't be interesting either, this problem lends itself to very fast lookups.

JRManx (603)

The main problem I had was finding a C++ word library of some kind. The load time was something I thought could be an issue when I found out I'd have to just load a list into the program.

AbstractAnon,
Heh, sorry for my late reply. I was on a short vacation and didn't have access to internet. Thanks for the explanation– that makes sense. I'm not sure how I'd do that though, since my C++ experience is something less than expert. Although, doing that to all the words in the list, wouldn't that take so much time as to pretty much defeat the entire purpose?

jonnin wrote:
I got a small one and everyone pointed out there were bigger ones, so I got a bigger one and everyone points out the AAA crap :)

Well...aaa isn't a word in Boggle, I'm pretty sure. I know Scrabble allows some weird stuff, but we always played Boggle with a Merriam Websters dictionary. A small list is fine too– less time to load ;)

AbstractionAnon (6957)

My compression scheme is not for everyone, but I think it is relatively efficient for achieving a compression factor of 75%. I tried running ZIP compression over the text word list and got 0% compression.

I use 0-26 to encode A-Z in the dictionary. That leaves 27-255 as available for two letter pairs. A quick check if a value is 27 or greater indicates a byte is an index to a two letter pair table rather than an ASCII character. If it's greater than 26, subtract 26 from t and use it as an index to the diphthong table which is a simple int array preloaded with two letter pairs. If it is 26 or less, add 101 to it to get an upper case ASCII letter.

The dictionary is loaded by a maintenance program that can load a text word list and add or delete words from the dictionary.

I originally wrote this 20 years ago on a mainframe. Storage was more important to me at the time than CPU cycles. Given how cheap storage has become it's probably time to rethink that strategy.

jonnin (11497)

the bigger one can cut half a mb just by moving to 1 byte end of lines.
it compresses 4.2mb to 1.2mb with bzip2 -9
I also remember when a MB was a lot, everything was measured in kb..

Last edited on

AbstractionAnon (6957)

I removed the diphthong logic. Not surprisingly, the size of the file went up from 460KB yo 601KB. Surprisingly, the elapsed time to read and load into a set went up from 1.4 seconds to 1.77 seconds.

mbozzi (3945)

Here's some output from my attempt, using @jonnin's 4.2Mb word list. Which, by the way, is not in lexicographic order.

Observations:
- I didn't make any optimization effort (aside from lines 104-117).
- I/O is half as fast as the prior measurements suggest is possible.
- The solver algorithm traces out all the words on the board, depth-first, and backtracks when encountering a dead end.
- Fixed-width strings could cut down on searching and sorting time.
- Binary searching the entire sequence each time is not necessary
- This program hardly utilizes the hardware, there are 31 processors sitting idle on my machine
- This is a release build under MS Windows
- The time spent "printing output" is dominated by waiting for MS Windows' abysmally-slow terminal; it is at least 100 times slower than sending the output to a file.

Overall lots of room for improvement, but consider:

W:\boggle>build release
release build

clang++ -Wall -Wextra -pedantic-errors -std=c++17 -O3 -march=native solver.cpp -o solver.exe
W:\boggle>solver.exe wordlist_sorted.txt zlfsrlreaaioritaheaorhapi
                 a                aa               aah               aas                ae
            aeolia               aer             aerie            aerier            aeries
</snip>
              tope             toper              topi             topia          topiaria
         topiaries                 z
read 4234903 bytes from the dictionary file in 2727 us (measured 1552953061 byte/s)
parsed dictionary in 5423 us
ensured dictionary was sorted in 5310 us
found all solutions in 3109 us
presented results in 169238 us

AbstractionAnon's std::set measurements start to make more sense assuming
a. they're in "debug" mode, which is a substantial (10x slowdown or worse?) pessimization,
b. they're using a large dictionary that's not already sorted, not the 170k or 600k one they were talking about earlier.

I have to split this post into two now, because after fixing a typo it is too long. See below:

mbozzi (3945)

// 5x5 "Boggle" solver.
// Usage: ./boggle_solver dictionary_file board_string

// For example, try executing
// ./boggle_solver wordlist_medium.txt etbtwefetemyfnaececrsldlo

#include <algorithm>
#include <string>
#include <fstream>
#include <vector>
#include <iostream>
#include <chrono>
#include <filesystem>
#include <string_view>

int constexpr board_dim = 5;
int constexpr board_length = board_dim * board_dim;

std::vector<std::string> results;
std::vector<std::string_view> dict_entries;

void search_recursive(std::string b, int search_root, std::string prefix)
{
  // if the search_root has already been visited, terminate, otherwise visit it
  if (b[search_root] == '\0') return; else prefix.push_back(std::exchange(b[search_root], 0));

  auto const lb = std::lower_bound(dict_entries.begin(), dict_entries.end(), prefix);
  if (lb == dict_entries.end()) return;

  auto const candidate = *lb;
  auto const min_size = std::min(candidate.size(), prefix.size());

  if (std::equal(candidate.begin(), candidate.begin() + min_size, prefix.begin()))
  { // `prefix` is a prefix of `candidate`
    if (candidate.size() == prefix.size())
    { // `prefix` is exactly `candidate`
      results.push_back(std::string(candidate));
    }
  }
  else return;

  bool const has_left_neighbor  = search_root % board_dim != 0;
  bool const has_right_neighbor = search_root % board_dim != board_dim - 1;
  bool const has_upper_neighbor = search_root >= board_dim;
  bool const has_lower_neighbor = search_root < board_length - board_dim;

  if (has_upper_neighbor)
  {
    search_recursive(b, search_root - board_dim, prefix);

    if (has_left_neighbor) search_recursive(b, search_root - board_dim - 1, prefix);
    if (has_right_neighbor) search_recursive(b, search_root - board_dim + 1, prefix);
  }

  if (has_lower_neighbor)
  {
    search_recursive(b, search_root + board_dim, prefix);

    if (has_left_neighbor) search_recursive(b, search_root + board_dim - 1, prefix);
    if (has_right_neighbor) search_recursive(b, search_root + board_dim + 1, prefix);
  }

  if (has_left_neighbor) search_recursive(b, search_root - 1, prefix);
  if (has_right_neighbor) search_recursive(b, search_root + 1, prefix);
}

int main(int argc, char** argv)
{
  if (argc != 3)
  {
    std::cerr << "usage: " << *argv << " dictionary_file board\n";
    return 0;
  }

  std::string const b = argv[2];
  if (b.size() != board_length)
  {
    std::cerr << "error: board is not " << board_length << " characters in length\n";
    return 1;
  }

  std::chrono::high_resolution_clock::time_point t1, t2, t3, t4, t5, t6;

  t1 = std::chrono::high_resolution_clock::now();
  ////////////////////////////////////////////////////////////////////////
  // Perform I/O
  std::ifstream dict_stream { argv[1], std::ios::binary };
  if (! dict_stream)
  {
    std::cerr << "error: opening dictionary_file \"" << argv[1] << "\"\n";
    return 0;
  }

  auto const file_size = std::filesystem::file_size(argv[1]);
  char* dict_fulltext = new char[file_size];
  dict_stream.read(dict_fulltext, file_size);
  auto const dict_fulltext_size = dict_stream.gcount();

  t2 = std::chrono::high_resolution_clock::now();
  ////////////////////////////////////////////////////////////////////////
  // Parse dictionary contents
  dict_entries.reserve(1'000'000); // some reasonable guess

  char const* word_start = dict_fulltext;
  char const* word_end   = dict_fulltext;
  char const* dict_end   = dict_fulltext + dict_fulltext_size;

  auto const is_space = [](char c) { return c == '\r' || c == '\n'; };

  while (true)
  {
    while (word_end != dict_end &&  is_space(*word_end)) ++word_end;
    word_start = word_end;
    while (word_end != dict_end && !is_space(*word_end)) ++word_end;
    if (word_end == dict_end) break;
    dict_entries.emplace_back(std::string_view(std::exchange(word_start, word_end),
                                               word_end - word_start));
  }

  auto const delta = [](auto start, auto end)
  {
      return std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  };

  t3 = std::chrono::high_resolution_clock::now();
  ////////////////////////////////////////////////////////////////////////
  // Make sure the dictionary is sorted.
  if (auto it = std::is_sorted_until(dict_entries.begin(), dict_entries.end());
      it != dict_entries.end())
  {
    std::cerr << "warning: dictionary_file \"" << argv[1] << "\" is not properly sorted.\n";
    std::cerr << "warning: first out-of-order word is \"" << *it
              << "\", which should have appeared before \"" << *(it - 1) << "\".\n";

    std::sort(dict_entries.begin(), dict_entries.end());
    dict_entries.erase(std::unique(dict_entries.begin(), dict_entries.end()), dict_entries.end());
  }

  t4 = std::chrono::high_resolution_clock::now();
  ////////////////////////////////////////////////////////////////////////
  // Search the dictionary for each possible choice on the board
  for (int i = 0; i < board_length; ++i)
    search_recursive(b, i, "");

  t5 = std::chrono::high_resolution_clock::now();
  ////////////////////////////////////////////////////////////////////////
  // Present results
  std::sort(results.begin(), results.end());
  results.erase(std::unique(results.begin(), results.end()), results.end());

  for (std::size_t i = 0; i < results.size();)
  {
    std::cout << std::setw(18) << results[i++];
    if (i % 5 == 0) std::cout << '\n';
  }
  std::cout << '\n';

  t6 = std::chrono::high_resolution_clock::now();
  ////////////////////////////////////////////////////////////////////////
  // Show timings
  auto const read_time = delta(t1, t2);
  std::clog << "read " << file_size << " bytes from the dictionary file in " << read_time.count()
            << " us (measured " << (file_size * std::micro::den / read_time.count()) << " byte/s)\n";
  std::clog << "parsed dictionary in " << delta(t2, t3).count() << " us\n";
  std::clog << "ensured dictionary was sorted in " << delta(t3, t4).count() << " us\n";
  std::clog << "found all solutions in " << delta(t4, t5).count() << " us\n";
  std::clog << "presented results in " << delta(t5, t6).count() << " us\n";
}

AbstractionAnon (6957)

My code is currently a mix of release build and debug build. the DICT class which is a wrapper for std:set is release build. The rest is debug build.

The dictionary is in lexicographic order. Since the input to loading the set is ordered, a hint is used on the set.insert statement.

I had been using the >> operator to read from the dictionary, but removed that to avoid an extra function call.

jonnin (11497)

Boggle specific thing: the board input needs to handle the [qu] (one tile face) special case if you get serious with it. Also the original is 4x4, 5x5 came later. Given what you have, I would just have the put 'q' in the input board and treat it as "qu" when searching legit words, watching out for the exchange (I think you just have to manually append qu and zero it out, there may not be a clever way here?).

I ran a 4x4 :
read 3752488 bytes from the dictionary file in 6733 us (measured 557327788 byte/s)
parsed dictionary in 4861 us
ensured dictionary was sorted in 72926 us
found all solutions in 2078 us
presented results in 12060 us

and mbozzi's test cast 5x5: (the one in the file)
read 3752488 bytes from the dictionary file in 2105 us (measured 1782654631 byte/s)
parsed dictionary in 4987 us
ensured dictionary was sorted in 70983 us
found all solutions in 5158 us
presented results in 36889 us

and the one before the code with results (z...)
read 3752488 bytes from the dictionary file in 1998 us (measured 1878122122 byte/s)
parsed dictionary in 4919 us
ensured dictionary was sorted in 71540 us
found all solutions in 7823 us
presented results in 57886 us

it probably cached the word file or something.
the string sort isnt as slick on this (cygwin g++) compiler.

This is sort of depressing. Even if you discard the junky words in the expanded dictionary, the computer finds many, many times more than I ever would in the time limit. If nothing else I can only write so fast.

Last edited on

JRManx (603)

Wow, and all I asked was if anyone knew of a C++ word library...

@mbozzi, what exactly is the -march=native flag you set in your compile command? I've never seen it before.

This is starting to look more and more complicated. I've never even used some of those headers– <algorithm>, <chrono>, <filesystem>, and <string_view>. I've never even heard of <filesystem> or <string_view>. Guess I need to do some hard studying.

AbstractAnon wrote:
I originally wrote this 20 years ago on a mainframe. Storage was more important to me at the time than CPU cycles. Given how cheap storage has become it's probably time to rethink that strategy.

Wow, a mainframe? I think my college had one, but I don't remember ever using it. I had my Powerbook and that was all I needed for typing up homework and playing Myst.

After starting some pseudocode, I'm starting to realize just how complicated this project is going to be. I'm not even sure if I have time to do it anymore.

deleted account xyzzy (5768)

I've never even heard of <filesystem> or <string_view>. Guess I need to do some hard studying.

Both libraries were added in C++17, and are well worth the time to muck around with learning how to use them.

https://en.cppreference.com/w/cpp/filesystem
https://en.cppreference.com/w/cpp/string/basic_string_view

For that matter the other libraries, <algorithm> and <chrono> should be looked at as well.
https://en.cppreference.com/w/cpp/algorithm
https://en.cppreference.com/w/cpp/chrono

<chrono> is the C++ replacement/enhancement to the C library <ctime>, though for some uses isn't as easy to do things. Lots more features as well as an entirely different way of doing date/time computations. Similar "issues" with differences in generating random numbers between <cstdlib> and <random>.

jonnin (11497)

Wow, a mainframe? I think my college had one,

the word triggers people like punch cards... they think it means something ancient.
I just a year ago I worked at a major corp that had at least 2 mainframes running. They have been upgraded and are quite powerful (one of them is possibly the fastest machine I have ever worked on, for its one-trick-pony area of specialty), just use an archaic "operating system" and programming style that requires digging up someone who knows how to do it and getting a few younguns trained in that to keep it going. They are, for better or worse, still alive and well in the industry. The UI is straight out of 1982, but there are overlays you can hide it with and avoid all that nonsense.

I actually hacked mbozzi's code to let me play for 3 min then tell me how bad I did by printing the board (back to 4x4 original), fixing the QU problem, and getting the letter distro for the game (and the actual cubes, for that matter, but I didn't want to deal with coding each cube face and randomly arranging all that so its not an exact replica). I get about 20 words in 3 min. The computer gets about 150 in an eyeblink.

Last edited on

mbozzi (3945)

I didn't know the QU rule, thanks. Clearly I had not played "Boggle" before yesterday. To fix, modify line 26 to check the letter pushed onto prefix on the previous line. If it was a 'q', additionally push a 'u'. This assumes that all "Q" tiles are actually "QU" tiles:
if (prefix.back() == 'q') prefix.push_back('u');

Also please pardon my stupid memory leak, etc. the program needs a review.

What exactly is the -march=native flag you set in your compile command?

Normally the compiler limits the code it generates to a common subset of instructions supported by all the processors in a family, for compatibility purposes. This allows compiled code to execute on any machine with a similar processor architecture.

-march=native tells the compiler not to limit itself to the lowest common denominator, but rather to the (extended) capabilities of the native processor. So it sacrifices code compatibility for the possibility of increased speed or reduced power consumption.

P.S. you can web search for "-march=native" (with quotes) and get results, but if you omit the quotes you will get nothing.

jonnin (11497)

yea all the dice with a q got a u, so that is right.
I didn't use march native... wonder if that is why my sort was so bad...

Pages: 123