An assignment that I had was to write a program that reads a text file and outputs each unique word and the number of occurences of each word. I noticed that the program didn't (couldn't) see that "sample," "Sample" "sample." etc. weren't really different words. It also couldn't split up contractions into two words (e.g., don't into do not.
What would be the best way to break up contractions into their principle words? I can't think of any single algorithm to do it, as there are too many exceptions. The only other way I can think of is to include a "dictionary" of all the contractions and their uncontracted counterparts, but that seems so...unelegant. Any help would be appreciated!
EDIT: For clarification, the assignment only wanted a program that counted occurences of words; the parsing of contractions is something extra I want to (try to) solve.
@ModShop:
Yes, I tolower()'ed everything, as well as erase()'ed trailing punctuation. The thing that's proving fiendishly difficult is the parsing of contractions into their uncontracted counterparts. :(
For the contractions, a very simple solution would be to have a hash of contractions and what they resolve to. However, if you look at a list of contractions ( http://grammar.about.com/od/words/a/EnglishContractions.htm ), you will notice they all follow the same rules. Instead of the hash containing every word on the list, it could just contain the contracted parts.
Like this:
's = is
n't = not
'm = am
'd = had
'll = will
're = are
've = have
Now your hash only has 7 entries, and will still be able to resolve all contractions it encounters.