map<string,int> question

An assignment that I had was to write a program that reads a text file and outputs each unique word and the number of occurences of each word. I noticed that the program didn't (couldn't) see that "sample," "Sample" "sample." etc. weren't really different words. It also couldn't split up contractions into two words (e.g., don't into do not.

What would be the best way to break up contractions into their principle words? I can't think of any single algorithm to do it, as there are too many exceptions. The only other way I can think of is to include a "dictionary" of all the contractions and their uncontracted counterparts, but that seems so...unelegant. Any help would be appreciated!

EDIT: For clarification, the assignment only wanted a program that counted occurences of words; the parsing of contractions is something extra I want to (try to) solve.
Last edited on
> What would be the best way to break up contractions into their principle words?

Some kind of a stemming algorithm.
http://en.wikipedia.org/wiki/Stemming
atropos wrote:
I noticed that the program didn't (couldn't) see that "sample," "Sample" "sample." etc. weren't really different words.


A simple way to get around that would be to convert the loaded word to all lower/upper case before putting it in the map.
@ModShop:
Yes, I tolower()'ed everything, as well as erase()'ed trailing punctuation. The thing that's proving fiendishly difficult is the parsing of contractions into their uncontracted counterparts. :(
For the contractions, a very simple solution would be to have a hash of contractions and what they resolve to. However, if you look at a list of contractions ( http://grammar.about.com/od/words/a/EnglishContractions.htm ), you will notice they all follow the same rules. Instead of the hash containing every word on the list, it could just contain the contracted parts.

Like this:

's = is
n't = not
'm = am
'd = had
'll = will
're = are
've = have


Now your hash only has 7 entries, and will still be able to resolve all contractions it encounters.
Topic archived. No new replies allowed.