I don't see how it would be any faster.
There are not even 200k words in English, and a chunk of them are not used in such games due to the PC police. It would be fast enough.
In your tool, I would load the file and then immediately cull anything that can't be made purely from the available letters. Then, with whatever is left, you can do a few passes... one pass to get rid of what can't be made due to adjacency, and so on. You want the smallest possible list by the time you get down to the 'can I make this one?' phase.
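A minimal sketch of that first cull, assuming the word list and the board's letters are already lowercase strings (the names here are only illustrative):

#include <array>
#include <string>
#include <vector>

// Count how many of each letter a-z a string contains.
std::array<int, 26> letter_counts(const std::string& s)
{
    std::array<int, 26> counts{};
    for (char c : s)
        if (c >= 'a' && c <= 'z')
            ++counts[c - 'a'];
    return counts;
}

// Keep only words that could possibly be spelled from the board's letters.
// Adjacency is ignored here; that is a later, more expensive pass.
std::vector<std::string> cull(const std::vector<std::string>& word_list,
                              const std::string& board_letters)
{
    const auto avail = letter_counts(board_letters);
    std::vector<std::string> kept;
    for (const auto& w : word_list)
    {
        const auto need = letter_counts(w);
        bool possible = true;
        for (int i = 0; i < 26; ++i)
            if (need[i] > avail[i]) { possible = false; break; }
        if (possible)
            kept.push_back(w);
    }
    return kept;
}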
I normally avoid trees, but here you might consider that if you find a prefix, you may find more: finding 'the' may mean 'then' or 'them' are present, and you can't find either without finding the first... so if your word list were built into a tree to reflect that, your searching could find an advantage.
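A bare-bones sketch of that kind of prefix tree (trie), assuming lowercase a-z words only; the structure and function names are just illustrative:

#include <array>
#include <memory>
#include <string>

// One node per letter position; "the", "then" and "them" share their first three nodes.
struct TrieNode
{
    bool is_word = false;
    std::array<std::unique_ptr<TrieNode>, 26> next{};
};

void add_word(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        auto& child = node->next[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_word = true;
}

// Walk the prefix; returns nullptr if no word in the dictionary starts with it.
const TrieNode* find_prefix(const TrieNode& root, const std::string& prefix)
{
    const TrieNode* node = &root;
    for (char c : prefix)
    {
        node = node->next[c - 'a'].get();
        if (!node)
            return nullptr;
    }
    return node;   // check node->is_word to see whether the prefix is itself a word
}

During a board search, a nullptr result means the whole branch can be abandoned immediately, which is where the advantage comes from.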
Most of the time, if you find 'eat' you also find 'ate' and 'tea'... anagram exploitation, and partial anagram exploitation.
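One rough way to exploit that: key each word by its sorted letters, so a single lookup for the sorted key of 'eat' hands back 'ate' and 'tea' as well (names are illustrative):

#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<std::string>> group_anagrams(
    const std::vector<std::string>& words)
{
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& w : words)
    {
        std::string key = w;
        std::sort(key.begin(), key.end());   // "eat", "ate", "tea" all map to "aet"
        groups[key].push_back(w);
    }
    return groups;
}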
You could iterate the dictionary and search the board, or iterate the board and search the dictionary.
The latter is probably easier to code. Fewer data structures.
The SOWPODS list has 267,753 words and is 2.83MB. In the scheme of things this is not a big file. If you had, say, a Dictionary class with the right structure, then once it was built from the file, searching would be pretty fast.
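For example, a minimal sketch of such a Dictionary class, assuming one word per line in the file (the class and file names are only illustrative):

#include <fstream>
#include <set>
#include <string>

class Dictionary
{
public:
    explicit Dictionary(const std::string& filename)
    {
        std::ifstream in(filename);
        std::string word;
        while (in >> word)
            words_.insert(word);
    }

    bool contains(const std::string& word) const
    {
        return words_.count(word) != 0;
    }

private:
    std::set<std::string> words_;   // O(log n) lookups once built
};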
Boggle is easier than Scrabble, though: it has no blanks, I don't think three Zs are ever possible, Q without U isn't possible since the two are tied together on the die, and so on.
Several online dictionaries have around 470K entries. Are words that are spelt the same but have different meanings to be classified as different words?
English, especially the American variant, borrows words from other languages quite ruthlessly.
It is "colour" in Brit-speak, "color" in 'Murrican. Same meaning, but spelling makes two different words.
Yeah, I saw that too. It depends on how you want to define it.
English: the Oxford English Dictionary, Second Edition, has 171,476 words, plus 47,156 archaic/obsolete ones, for a total of about 218k.
Webster's Third is around 470k, but it contains multi-word entries.
The million-plus counts include multi-word entries, adopted foreign words, jargon, and so on.
That would probably take a while... I might be able to put them into a CSV file, although I don't know if that would be any quicker.
Just to beat this dead horse further: let's suppose for the sake of argument that it is indeed too slow, as determined by the requirements of the application. What would be faster? You need to load (or at least sift through) N elements at some point (even if only once, initially) to check matches. There might be micro-optimizations you could do, like making the file binary rather than plaintext, with prepended lengths instead of whitespace delimiters, but at the end of the day you still need to iterate through the data.
For fun, you could try taking in a plaintext file of words and transforming it into a binary file, and then measure to see if it saves time to load it. If reading from disk is determined to be the bottleneck (again, from real measuring, not guessing), the dictionary data could be compressed first, and then decompressed in memory (using a compression library like zlib).
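If you went the zlib route, the in-memory decompression step might look roughly like this; the uncompressed size would have to be stored alongside the compressed blob, and error handling is kept minimal here:

#include <vector>
#include <zlib.h>

// Decompress a zlib-compressed dictionary blob entirely in memory.
std::vector<char> decompress_blob(const std::vector<char>& compressed,
                                  uLongf uncompressed_size)
{
    std::vector<char> out(uncompressed_size);
    int rc = uncompress(reinterpret_cast<Bytef*>(out.data()), &uncompressed_size,
                        reinterpret_cast<const Bytef*>(compressed.data()),
                        static_cast<uLong>(compressed.size()));
    if (rc != Z_OK)
        out.clear();   // caller treats an empty result as failure
    return out;
}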
The Official Scrabble Players wordlist licensed from North American Scrabble Players Association (NASPA) is 1.8MB as a simple plaintext file. I store it in a compressed format, eliminating characters that repeat from the previous word and compressing diphthongs. The result is a 173KB file. Takes me just over 1.5 seconds to load into a std::set.
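The exact scheme isn't spelled out here, but "eliminating characters that repeat from the previous word" sounds like classic front coding. A decoding sketch under that assumption, where each record stores the count of characters shared with the previous word plus the differing suffix (this illustrates the general idea, not the actual file format):

#include <cstddef>
#include <string>
#include <vector>

struct Record
{
    std::size_t shared;    // characters copied from the previous word
    std::string suffix;    // the rest of this word
};

std::vector<std::string> decode(const std::vector<Record>& records)
{
    std::vector<std::string> words;
    std::string prev;
    for (const auto& r : records)
    {
        std::string word = prev.substr(0, r.shared) + r.suffix;
        words.push_back(word);
        prev = word;
    }
    return words;
}

For instance, under this scheme "then" following "the" would be stored as shared = 3 plus the suffix "n".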
Well, "too slow" depends on the application, not on the actual time. A process can be fast enough that any further optimizations are a waste of effort.
My first impression is that 1.5 seconds is too slow by three orders of magnitude.
Yeah, I agree it's too long, however, the load is only done once as part of program startup. Do you seriously think it would be possible to load a file of that size in 15 milliseconds?
I have no doubt I could come up with a file structure I could read directly into memory, eliminating decompression and insertion into a set. I would be trading file size for loading speed, but I doubt I could read that into memory in 15 milliseconds even using large block binary transfers.
Your post reminded me that set.insert() accepts a hint. After adding a hint to the insert call because the disk file is ordered and removing some unnecessary consistency checks, the load time dropped to 0.41 seconds. The load was barely noticeable. Thanks for the reminder.
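The relevant change is small; a sketch of a hinted load from an already-sorted word file (names are illustrative):

#include <fstream>
#include <set>
#include <string>

std::set<std::string> load_sorted_words(const std::string& filename)
{
    std::set<std::string> words;
    std::ifstream in(filename);
    std::string word;
    while (in >> word)
        words.insert(words.end(), word);   // hint: sorted input always belongs at the end
    return words;
}

With a correct hint, each insertion is amortized constant time instead of O(log n).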
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    // Read the whole word file into memory in one block.
    std::ifstream ifs("words.txt", std::ios::binary);
    auto s = std::filesystem::file_size("words.txt");
    std::vector<char> v(s);
    ifs.read(v.data(), static_cast<std::streamsize>(s));

    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "time in ms:"
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    return 0;
}
Output: worst result 11 ms; typical output:
time in ms:2
This is on an SSD / i9 / 64GB RAM box.
'metathetical' is halfway down the list and takes 1ms to find using strstr().
Not bad for single-threaded brute force (clearly a binary search, a map, etc. would do far better).
If you had, say, a Dictionary class with the right structure, then once it was built from the file, searching would be pretty fast.
Yeah, that's pretty much what I was planning to do.
AbstractAnon wrote:
...eliminating characters that repeat from the previous word...
I'm not sure I understand, could you give an example?
...and compressing diphthongs.
Why? And how?
mbozzi wrote:
My first impression is that 1.5 seconds is too slow by three orders of magnitude.
Heh, is that all? I thought it would take at least a minute, or more.
Ganado wrote:
There might be micro-optimizations you could do, like making the file binary rather than plaintext, with prepended lengths instead of whitespace delimiters, but at the end of the day you still need to iterate through the data.
Making it a bin file would probably help. And with istream::seekg() I could get pretty quick results, I think.
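A rough sketch of what that could look like, assuming a purely illustrative format where each word is stored as a one-byte length followed by its characters, with word offsets known from a separate index:

#include <cstdint>
#include <fstream>
#include <string>

// Jump to a known offset in the binary file and read one length-prefixed word.
std::string read_word_at(std::ifstream& in, std::streamoff offset)
{
    in.seekg(offset);
    std::uint8_t len = 0;
    in.read(reinterpret_cast<char*>(&len), 1);   // prepended length byte
    std::string word(len, '\0');
    in.read(&word[0], len);
    return word;
}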
Yeah, I agree it's too long, however, the load is only done once as part of program startup. Do you seriously think it would be possible to load a file of that size in 15 milliseconds?
15ms is more reasonable for a rotating disk. I originally suggested 1.5ms, assuming you have an SSD.
Conservatively, NVM Express SSDs have a read latency in the order of 100 microseconds, and a corresponding throughput of about 1GB/s. For example winsat measures the following throughputs on my Windows machine:
W:\>winsat disk -drive w -ran -read | findstr Disk
> Disk Random 16.0 Read 1555.19 MB/s 9.1
W:\>winsat disk -drive w -seq -read | findstr Disk
> Disk Sequential 64.0 Read 3215.68 MB/s 9.3
On this system a 173k file could very conservatively be in core in under a millisecond. Consistent with the estimation, a 170k binary file takes 200 microseconds to read via the C++ library on the aforementioned device.
There are multiple layers of buffering between the drive's hardware, device driver, the HAL, kernel, and the C++ library. If these middlemen weren't present you could get closer to the theoretical maximum by accessing the drive directly.
There isn't even a fundamental need to wait for the entire read to complete. The computer could, in principle, start processing as soon as some data arrives (absent any problem constraints). Are you willing to share your program?