I don't see how it would be any faster.
There are not even 200k words in English, and a chunk of them are not used in such games due to the PC police. It would be fast enough.
In your tool, I would load the file and then immediately cull anything that can't be made purely from the available letters. Then, with whatever is left, you can do a few passes... one pass to get rid of what can't be made due to adjacency, and so on. You want the smallest possible list by the time you get down to the 'can I make this one?' phase.
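A minimal sketch of that first cull, assuming the word list and the board's letters are already lowercase strings (the names here are only illustrative):

#include <array>
#include <string>
#include <vector>

// Count how many of each letter a-z a string contains.
std::array<int, 26> letter_counts(const std::string& s)
{
    std::array<int, 26> counts{};
    for (char c : s)
        if (c >= 'a' && c <= 'z')
            ++counts[c - 'a'];
    return counts;
}

// Keep only words that could possibly be spelled from the board's letters.
// Adjacency is ignored here; that is a later, more expensive pass.
std::vector<std::string> cull(const std::vector<std::string>& word_list,
                              const std::string& board_letters)
{
    const auto avail = letter_counts(board_letters);
    std::vector<std::string> kept;
    for (const auto& w : word_list)
    {
        const auto need = letter_counts(w);
        bool possible = true;
        for (int i = 0; i < 26; ++i)
            if (need[i] > avail[i]) { possible = false; break; }
        if (possible)
            kept.push_back(w);
    }
    return kept;
}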
I normally avoid trees, but here you might consider that if you find a prefix, you may find more: finding 'the' may mean 'then' or 'them' are present, and you can't find either without finding the first... so if your word list were built into a tree to reflect that, your searching could find an advantage.
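A bare-bones sketch of that kind of prefix tree (trie), assuming lowercase a-z words only; the structure and function names are just illustrative:

#include <array>
#include <memory>
#include <string>

// One node per letter position; "the", "then" and "them" share their first three nodes.
struct TrieNode
{
    bool is_word = false;
    std::array<std::unique_ptr<TrieNode>, 26> next{};
};

void add_word(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        auto& child = node->next[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_word = true;
}

// Walk the prefix; returns nullptr if no word in the dictionary starts with it.
const TrieNode* find_prefix(const TrieNode& root, const std::string& prefix)
{
    const TrieNode* node = &root;
    for (char c : prefix)
    {
        node = node->next[c - 'a'].get();
        if (!node)
            return nullptr;
    }
    return node;   // check node->is_word to see whether the prefix is itself a word
}

During a board search, a nullptr result means the whole branch can be abandoned immediately, which is where the advantage comes from.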
Most of the time, if you find 'eat' you also find 'ate' and 'tea'... anagram exploitation, and partial anagram exploitation.
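One rough way to exploit that: key each word by its sorted letters, so a single lookup for the sorted key of 'eat' hands back 'ate' and 'tea' as well (names are illustrative):

#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<std::string>> group_anagrams(
    const std::vector<std::string>& words)
{
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& w : words)
    {
        std::string key = w;
        std::sort(key.begin(), key.end());   // "eat", "ate", "tea" all map to "aet"
        groups[key].push_back(w);
    }
    return groups;
}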
You could iterate the dictionary and search the board, or iterate the board and search the dictionary.
The latter is probably easier to code. Fewer data structures.
The SOWPODS list has 267,753 words and is 2.83MB. In the scheme of things this is not a big file. If you had, say, a Dictionary class with the right structure, then once it was built from the file, searching would be pretty fast.
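For example, a minimal sketch of such a Dictionary class, assuming one word per line in the file (the class and file names are only illustrative):

#include <fstream>
#include <set>
#include <string>

class Dictionary
{
public:
    explicit Dictionary(const std::string& filename)
    {
        std::ifstream in(filename);
        std::string word;
        while (in >> word)
            words_.insert(word);
    }

    bool contains(const std::string& word) const
    {
        return words_.count(word) != 0;
    }

private:
    std::set<std::string> words_;   // O(log n) lookups once built
};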
Boggle is easier than Scrabble, though: it has no blanks, I don't think three Zs are ever possible, Q without U isn't possible since the two are tied together on the die, and so on.
Several online dictionaries have around 470K entries. Are words that are spelt the same but have different meanings to be classified as different words?
English, especially the American variant, borrows words from other languages quite ruthlessly.
It is "colour" in Brit-speak, "color" in 'Murrican. Same meaning, but spelling makes two different words.
Yeah, I saw that too. It depends on how you want to define it.
English: the Oxford English Dictionary, Second Edition, has 171,476 words, plus 47,156 archaic/obsolete ones, for a total of about 218k.
Webster's Third is around 470k, but it contains multi-word entries.
The million-plus counts include multi-word entries, adopted foreign words, jargon, and so on.
That would probably take a while... I might be able to put them into a CSV file, although I don't know if that would be any quicker.
Just to beat this dead horse further: let's suppose for the sake of argument that it is indeed too slow, as determined by the requirements of the application. What would be faster? You need to load (or at least sift through) N elements at some point (even if only once, initially) to check matches. There might be micro-optimizations you could do, like making the file binary rather than plaintext, with prepended lengths instead of whitespace delimiters, but at the end of the day you still need to iterate through the data.
For fun, you could try taking in a plaintext file of words and transforming it into a binary file, and then measure to see if it saves time to load it. If reading from disk is determined to be the bottleneck (again, from real measuring, not guessing), the dictionary data could be compressed first, and then decompressed in memory (using a compression library like zlib).
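If you went the zlib route, the in-memory decompression step might look roughly like this; the uncompressed size would have to be stored alongside the compressed blob, and error handling is kept minimal here:

#include <vector>
#include <zlib.h>

// Decompress a zlib-compressed dictionary blob entirely in memory.
std::vector<char> decompress_blob(const std::vector<char>& compressed,
                                  uLongf uncompressed_size)
{
    std::vector<char> out(uncompressed_size);
    int rc = uncompress(reinterpret_cast<Bytef*>(out.data()), &uncompressed_size,
                        reinterpret_cast<const Bytef*>(compressed.data()),
                        static_cast<uLong>(compressed.size()));
    if (rc != Z_OK)
        out.clear();   // caller treats an empty result as failure
    return out;
}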
The Official Scrabble Players wordlist licensed from North American Scrabble Players Association (NASPA) is 1.8MB as a simple plaintext file. I store it in a compressed format, eliminating characters that repeat from the previous word and compressing diphthongs. The result is a 173KB file. Takes me just over 1.5 seconds to load into a std::set.
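The exact scheme isn't spelled out here, but "eliminating characters that repeat from the previous word" sounds like classic front coding. A decoding sketch under that assumption, where each record stores the count of characters shared with the previous word plus the differing suffix (this illustrates the general idea, not the actual file format):

#include <cstddef>
#include <string>
#include <vector>

struct Record
{
    std::size_t shared;    // characters copied from the previous word
    std::string suffix;    // the rest of this word
};

std::vector<std::string> decode(const std::vector<Record>& records)
{
    std::vector<std::string> words;
    std::string prev;
    for (const auto& r : records)
    {
        std::string word = prev.substr(0, r.shared) + r.suffix;
        words.push_back(word);
        prev = word;
    }
    return words;
}

For instance, under this scheme "then" following "the" would be stored as shared = 3 plus the suffix "n".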
Well, "too slow" depends on the application, not on the actual time. A process can be fast enough that any further optimizations are a waste of effort.
My first impression is that 1.5 seconds is too slow by three orders of magnitude.
Yeah, I agree it's too long, however, the load is only done once as part of program startup. Do you seriously think it would be possible to load a file of that size in 15 milliseconds?
I have no doubt I could come up with a file structure I could read directly into memory, eliminating decompression and insertion into a set. I would be trading file size for loading speed, but I doubt I could read that into memory in 15 milliseconds even using large block binary transfers.
Your post reminded me that set.insert() accepts a hint. After adding a hint to the insert call because the disk file is ordered and removing some unnecessary consistency checks, the load time dropped to 0.41 seconds. The load was barely noticeable. Thanks for the reminder.
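The relevant change is small; a sketch of a hinted load from an already-sorted word file (names are illustrative):

#include <fstream>
#include <set>
#include <string>

std::set<std::string> load_sorted_words(const std::string& filename)
{
    std::set<std::string> words;
    std::ifstream in(filename);
    std::string word;
    while (in >> word)
        words.insert(words.end(), word);   // hint: sorted input always belongs at the end
    return words;
}

With a correct hint, each insertion is amortized constant time instead of O(log n).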
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    // Read the whole word file into memory in one block.
    std::ifstream ifs("words.txt", std::ios::binary);
    auto s = std::filesystem::file_size("words.txt");
    std::vector<char> v(s);
    ifs.read(v.data(), static_cast<std::streamsize>(s));

    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "time in ms:"
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    return 0;
}
Output: worst result 11 ms; typical output:
time in ms:2
This is on an SSD / i9 / 64GB RAM box.
'metathetical' is halfway down the list and takes 1ms to find using strstr().
Not bad for single-threaded brute force (clearly a binary search, a map, etc. would do far better).
If you had, say, a Dictionary class with the right structure, then once it was built from the file, searching would be pretty fast.
Yeah, that's pretty much what I was planning to do.
AbstractAnon wrote:
...eliminating characters that repeat from the previous word...
I'm not sure I understand, could you give an example?
...and compressing diphthongs.
Why? And how?
mbozzi wrote:
My first impression is that 1.5 seconds is too slow by three orders of magnitude.
Heh, is that all? I thought it would take at least a minute, or more.
Ganado wrote:
There might be micro-optimizations you could do, like making the file binary rather than plaintext, with prepended lengths instead of whitespace delimiters, but at the end of the day you still need to iterate through the data.
Making it a bin file would probably help. And with istream::seekg() I could get pretty quick results, I think.
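A rough sketch of what that could look like, assuming a purely illustrative format where each word is stored as a one-byte length followed by its characters, with word offsets known from a separate index:

#include <cstdint>
#include <fstream>
#include <string>

// Jump to a known offset in the binary file and read one length-prefixed word.
std::string read_word_at(std::ifstream& in, std::streamoff offset)
{
    in.seekg(offset);
    std::uint8_t len = 0;
    in.read(reinterpret_cast<char*>(&len), 1);   // prepended length byte
    std::string word(len, '\0');
    in.read(&word[0], len);
    return word;
}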
Yeah, I agree it's too long, however, the load is only done once as part of program startup. Do you seriously think it would be possible to load a file of that size in 15 milliseconds?
15ms is more reasonable for a rotating disk. I originally suggested 1.5ms, assuming you have an SSD.
Conservatively, NVM Express SSDs have a read latency in the order of 100 microseconds, and a corresponding throughput of about 1GB/s. For example winsat measures the following throughputs on my Windows machine:
W:\>winsat disk -drive w -ran -read | findstr Disk
> Disk Random 16.0 Read 1555.19 MB/s 9.1
W:\>winsat disk -drive w -seq -read | findstr Disk
> Disk Sequential 64.0 Read 3215.68 MB/s 9.3
On this system a 173k file could very conservatively be in core in under a millisecond. Consistent with the estimation, a 170k binary file takes 200 microseconds to read via the C++ library on the aforementioned device.
There are multiple layers of buffering between the drive's hardware, device driver, the HAL, kernel, and the C++ library. If these middlemen weren't present you could get closer to the theoretical maximum by accessing the drive directly.
There isn't even a fundamental need to wait for the entire read to complete. The computer could, in principle, start processing as soon as some data arrives (absent any problem constraints). Are you willing to share your program?