Concordance

If somebody would put me in the right direction with this problem, I would greatly appreciate it, thank you. I don't know where to begin with this, Please help!

A "concordance" is an alphabetical list of the words present in a text with a count of how often each word appears and citations of where each word appears in the text (e.g., page number). Write a program -- in the programming language of your choice -- that will generate a concordance of an arbitrary text document written in English: the text can be read from stdin, and the program should output the concordance to stdout or a file. For each word, it should print the count and the sorted list of citations, in this case the zero-indexed sentence number in which that word occurs. You may assume that the input contains only spaces, newlines, standard English letters, and standard English punctuation marks.

Below is an example of what your program would compute for the paragraph above. (The particular formatting below isn't required, as long as you output the word, number of occurrences, and list of sentence numbers.)

a {6:0,0,0,1,1,1} of {6:0,0,0,1,1,2}
alphabetical {1:0} often {1:0}
an {2:0,1} only {1:3}
and {4:0,1,2,3} or {1:1}
appears {2:0,0} output {1:1}
arbitrary {1:1} page {1:0}
assume {1:3} present {1:0}
be {1:1} print {1:2}
can {1:1} program {2:1,1}
case {1:2} programming {1:1}
choice {1:1} punctuation {1:3}
citations {2:0,2} read {1:1}
concordance {3:0,1,1} sentence {1:2}
contains {1:3} should {2:1,2}
count {2:0,2} sorted {1:2}
document {1:1} spaces {1:3}
e.g. {1:0} standard {2:3,3}
each {3:0,0,2} stdin {1:1}
english {3:1,3,3} stdout {1:1}
file {1:1} text {4:0,0,1,1}
for {1:2} that {3:1,2,3}
from {1:1} the {10:0,0,1,1,1,1,2,2,2,3}
generate {1:1} this {1:2}
how {1:0} to {1:1}
in {6:0,0,1,1,2,2} where {1:0}
input {1:3} which {1:2}
is {1:0} will {1:1}
it {1:2} with {1:0}
language {1:1} word {4:0,0,2,2}
letters {1:3} words {1:0}
list {2:0,2} write {1:1}
marks {1:3} written {1:1}
may {1:3} you {1:3}
newlines {1:3} your {1:1}
number {2:0,2} zero-indexed {1:2}
occurs {1:2}

Hmm, interesting exercise; I might have to implement this myself just for fun!
Anyways, it seems to me like this might give you an idea of where to start:
http://www.cplusplus.com/reference/map/map/
Create a class X that contains a string and an int. Make a vector of X and parse your input into the vector. Sort by the string and position. You define something like
1
2
3
4
5
6
7
8
 bool operator< (const X& lhs, const X& rhs)
{
  if (lhs.text<rhs.text) 
           return true;
  else if (lhs.text==rhs.text) 
           return lhs.position<rhs;
  return false;
 }

Once your list is sorted, start with the first word and store occurrences until the word changes. Then go to next
Here's what I did (note that this is probably not the most efficient algorithm, but at least it (kind of) works):

1. Go through the entire text and strip out all of the punctuation (I replaced them with space characters). You'll have to be careful, since hyphenated words like "zero-indexed" are still one word, as well as abbreviations/acronyms like "e.g.". I don't know if you also want to deal with other special cases like "..." and such, but you get the point....

2. To make things easier, you might as well also convert everything to lowercase (so it doesn't think "this" and "This" are two different words).

3. I used a std::map<std::string, std::vector<int>> to make things a bit easier (perhaps). The key is the word and the value is the list of occurrences of that word (you can get the number of occurrences using the std::vector::size() member function). Note that std::map automatically keeps all of the words in order, so you don't have to worry about sorting it yourself.

So basically, you just go through each word and record its occurrence by adding it to the std::vector, and at the end, you just output the contents of the std::map.

Hopefully that makes sense.
Topic archived. No new replies allowed.