Finding the Subject and Object in a sentence

Apr 28, 2011 at 6:17am

I'm writing a program that finds the subjects and objects of a sentence and labels them as Sub, Indirect Obj and Direct Obj. I'm frankly running out of ideas on how to implement this effectively. Initially, I started by capturing every word that begins with an uppercase letter, since those are names and hence nouns, and then I used an array to match all the pronouns as well. However, common nouns like dog, cow, etc. are too hard to find this way, and I can't use arrays for them all. Does anyone have an idea on how to implement this better, or any links to information that might help me out? Thanks a lot in advance.

Apr 28, 2011 at 9:17am

Abramus (285)

This seems like a hard task. I've never done anything like this, but if I had to, I would probably try something like this:

The first, easier part is to detect the basic parts of speech (nouns, verbs, etc., perhaps with more detail, such as verb form). This can be done using a dictionary that contains words along with flags indicating which part of speech each word is. Obviously the dictionary should be kept in a container that allows fast searching, like a hash table or a balanced binary search tree.
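A minimal sketch of such a dictionary, assuming a hash table (`std::unordered_map`) and bit flags so that one word can carry several parts of speech. The flag names, the `lookup` helper and the handful of sample words are made up for illustration; real data would come from a lexicon.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>

// Part-of-speech flags; a word may carry more than one (e.g. "book").
enum Pos : unsigned {
    POS_NONE       = 0,
    POS_NOUN       = 1u << 0,
    POS_VERB       = 1u << 1,
    POS_PRONOUN    = 1u << 2,
    POS_DETERMINER = 1u << 3
};

// Hypothetical toy dictionary; a real one would be loaded from a file.
const std::unordered_map<std::string, unsigned>& dictionary() {
    static const std::unordered_map<std::string, unsigned> dict = {
        {"dog", POS_NOUN},
        {"cow", POS_NOUN},
        {"book", POS_NOUN | POS_VERB},
        {"likes", POS_VERB},
        {"she", POS_PRONOUN},
        {"the", POS_DETERMINER}
    };
    return dict;
}

// Lower-case a copy of the word so that lookups ignore capitalization.
std::string toLower(std::string word) {
    for (char& c : word)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return word;
}

// Return the flags stored for a word, or POS_NONE if the word is unknown.
unsigned lookup(const std::string& word) {
    const auto& dict = dictionary();
    auto it = dict.find(toLower(word));
    return it == dict.end() ? POS_NONE : it->second;
}
```

With this, `lookup("Dog")` yields `POS_NOUN`, and `lookup("book") & POS_VERB` is non-zero because "book" is stored as both noun and verb. Swapping `std::unordered_map` for `std::map` would give the balanced-tree variant with the same interface.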

Now the difficult part. Once you know what part of speech each word in the sentence is, you could try to write a grammar parser to detect the basic parts of the sentence (subjects, objects, etc.). In this parser, you would simply list the possible sentence structures, taking advantage of the fact that English is a fairly well-structured language. Something like this (I use the Yacc/Bison parser generator):

sentence
   : NOUN VERB NOUN { subject($1); object($3); } /* e.g. "Andy is a singer." */
   | DO NOUN VERB NOUN { subject($2); object($4); } /* e.g. "Does Andy like football?" */
   ;

You get the idea. I don't know if this method is really practical or even possible. Also, it would require quite a lot of work and a good knowledge of English grammar.
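To make the idea concrete without a parser generator, here is a hand-written C++ sketch that checks the same two sentence shapes as the grammar above. It is not Bison output, only a stand-in; `Tag`, `Token`, `Roles`, `hasShape` and `labelRoles` are names invented for this example, and the input is assumed to be already tagged.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Token classes used by the two rules above (assumed to be pre-tagged).
enum class Tag { Noun, Verb, Do };

struct Token {
    std::string text;
    Tag tag;
};

// Result of labelling; matched stays false if no rule applies.
struct Roles {
    bool matched = false;
    std::string subject;
    std::string object;
};

// True if the tokens carry exactly the given sequence of tags.
bool hasShape(const std::vector<Token>& tokens, const std::vector<Tag>& shape) {
    if (tokens.size() != shape.size()) return false;
    for (std::size_t i = 0; i < shape.size(); ++i)
        if (tokens[i].tag != shape[i]) return false;
    return true;
}

// Try each known sentence shape in turn and pick out subject and object.
Roles labelRoles(const std::vector<Token>& tokens) {
    Roles r;
    if (hasShape(tokens, {Tag::Noun, Tag::Verb, Tag::Noun})) {
        // NOUN VERB NOUN: subject is word 1, object is word 3.
        r.matched = true;
        r.subject = tokens[0].text;
        r.object = tokens[2].text;
    } else if (hasShape(tokens, {Tag::Do, Tag::Noun, Tag::Verb, Tag::Noun})) {
        // DO NOUN VERB NOUN: subject is word 2, object is word 4.
        r.matched = true;
        r.subject = tokens[1].text;
        r.object = tokens[3].text;
    }
    return r;
}
```

For the tagged tokens Does/Do, Andy/Noun, like/Verb, football/Noun, `labelRoles` reports "Andy" as subject and "football" as object. Every new sentence shape needs another branch, which is exactly the bookkeeping a parser generator takes over.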

Last edited on Apr 28, 2011 at 9:17am

Apr 28, 2011 at 5:12pm

jrok (24)

Thanks so much for the information, Abramus. I already have a word dictionary because I'm using WordNet, and the words come already tagged as nouns, verbs, etc.

My real problem is parsing my sentences and determining the subjects and objects. The Yacc/Bison parser you mentioned seems like it could help. I could tweak it to recognize indirect and direct objects. It will be a lot of work, but I could make it work. Can you give me some more info on it: where can I get it, and is it open source? Thanks.

Apr 29, 2011 at 7:09am

Abramus (285)

Yes, Bison is open source (GPL). It comes with a detailed manual. You can find it here:
http://www.gnu.org/software/bison/
Windows version:
http://gnuwin32.sourceforge.net/packages/bison.htm

There are alternatives, but I haven't used them, so I cannot tell whether they are better or worse:

ANTLR Parser generator:
http://www.antlr.org/

boost::spirit:
http://www.boost.org/doc/libs/1_46_1/libs/spirit/doc/html/index.html

One issue is that I'm not sure you will be able to create an unambiguous grammar. It depends on your requirements, I guess. Human languages, unlike programming languages, are not strict, and programming languages are what the tools mentioned above were created for.

Basically, I'm trying to warn you: it may go well for some time, but at some point (when your parser is no longer small and simple) you may find that adding the next rule creates an ambiguity that cannot be resolved. Keep this in mind.

Apr 29, 2011 at 9:05am

jrok (24)

Thanks a lot, Abramus. You've been a great help. I'll keep your warning in mind and see what I can do to avoid that problem. Thanks again.

Topic archived. No new replies allowed.