Parsing TREC Questions


Andreas Haller

Dec 2, 2010, 8:34:43 PM
to link-grammar
Hi,

I am evaluating whether Link Grammar can help me give a reliable answer
to this question:
Is some input string a proper question?

I don't care about meaning. I just want to see if the string is a
question or not.

My approach is quite simple:
1. Can the Link Grammar parser parse the question?
2. Does the sentence include an interrogative word, or does it look like
a tag question?

I was thinking about using a syntax/grammar checker instead of going in
the direction of question analysis as done in Q&A systems, because it
looks easier, and a very simple experiment using a POS tagger did not
work well. So I thought: let's not care about content, just care about
form, and then check for a few fixed points (interrogative words).

For testing, I am using the TREC-10 question track data. It's a list
of 500 questions, which is available here:
http://cogcomp.cs.illinois.edu/Data/QA/Trec10questions.txt

I am using Link Grammar 1.4.7.
With a max_null_count of 0, the parser cannot parse 40 out of the 500
questions. This includes questions like
"LEFT-WALL how far.a [away] is.v the moon.n ?"
or
"LEFT-WALL material.n-u called.v-d linen.n-u is.v made.v-d from
[what] [plant] ?"

So I increase max_null_count to, say, 3, and then it fails on only about 10.
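
In case it is useful, the check I am doing is roughly this (a rough
sketch using the current Python bindings rather than the wrapper I am
actually calling; file handling and names are illustrative):

  # Count how many TREC-10 questions fail to get a full parse.
  from linkgrammar import Dictionary, ParseOptions, Sentence

  EN_DICT = Dictionary('en')

  def parses_fully(text, max_null_count=0):
      po = ParseOptions()
      po.max_null_count = max_null_count   # 0 = strict; >0 allows unlinked words
      linkages = list(Sentence(text, EN_DICT, po).parse())
      return len(linkages) > 0

  with open('Trec10questions.txt') as f:
      questions = [line.strip() for line in f if line.strip()]

  print(sum(1 for q in questions if not parses_fully(q, 0)))  # ~40 strict failures
  print(sum(1 for q in questions if not parses_fully(q, 3)))  # ~10 with nulls allowed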

But this is the wrong way to do it, right? It might loosen the parsing
to the point where the result no longer means much.

(How) would you use Link Grammar to solve my problem?

I am no linguistics expert, so maybe I am a little off here.

Andreas

Dan Brian

Dec 3, 2010, 1:56:44 PM
to link-g...@googlegroups.com
On Thu, Dec 2, 2010 at 6:34 PM, Andreas Haller <andrea...@gmail.com> wrote:
> Hi,
>
> I am evaluating whether Link Grammar can help me give a reliable answer
> to this question:
>  Is some input string a proper question?

Are you just trying to determine whether an input string is a
question? What form does "proper" take?

Determining whether an input string is a question or a statement is
not very difficult in English. A trailing question mark and certain
leading words (what/where/when/how) are pretty clear indicators.

Raising the min/max_null_count is entirely appropriate when you aren't
getting a valid parse, but that raises the question of what you are
after with regard to a "proper" question.

jf

Dec 3, 2010, 2:58:11 PM
to link-g...@googlegroups.com
I'm curious to see how this works out for you. Certainly the ending question mark is a giveaway, but without it this seems quite difficult for a shallow parse. Consider these two sentences without punctuation:

1) Thanksgiving will be a fun holiday
2) If I eat turkey on Thanksgiving will I gain weight

Depending on the accuracy requirements, you may want to take cues from Dan's approach and skip Link Grammar altogether.  Regular expressions can often provide the same results (or better) without the significant overhead of Link Grammar (and I do mean significant when you see just how long it takes LG to parse some sentences).
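
For instance, something like this already gets you a long way (just a sketch; the word lists are illustrative and far from complete):

  import re

  # Regex-only question detector: leading wh-word or auxiliary verb,
  # plus a trailing question mark.
  WH_WORDS = r"who|whom|whose|what|which|where|when|why|how"
  AUX_VERBS = r"is|are|was|were|am|do|does|did|can|could|will|would|shall|should|may|might|must|have|has|had"

  QUESTION_RE = re.compile(
      r"^\s*(?:%s|%s)\b.*\?\s*$" % (WH_WORDS, AUX_VERBS),
      re.IGNORECASE)

  def looks_like_question(text):
      return bool(QUESTION_RE.match(text))

  print(looks_like_question("How far away is the moon?"))           # True
  print(looks_like_question("Thanksgiving will be a fun holiday"))  # False
  print(looks_like_question("If I eat turkey on Thanksgiving will I gain weight"))  # False: no '?'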

Best of luck!

Jay




Andreas Haller

Dec 3, 2010, 3:57:04 PM
to link-g...@googlegroups.com
It's for my bachelor's thesis. I am building a Jeopardy-like game to generate question-answer pairs, which *might* be used to support artificial QA systems. The topic is Human Computation (see Luis von Ahn et al., www.gwap.com etc.). (Short intro: http://www.channels.com/episodes/show/2599604/Luis-von-Ahn-Human-Computation)
An early working version of the game is online at: http://webpardy.com. You can play it/test it, if you want!

So we are asking the player: think of a question describing the image/text. If you can, you will be rewarded with x-hundred points. If you just enter "baloney baloney", you will be punished with negative points.

Anyway, I tried looking for question words / wh-words ("Where is ...", "Who invented ...?") or checking whether the sentence starts with an auxiliary/modal verb ("Is this the right ...?", "Should Andreas go to ...?"). That was easy enough. But there are many other variants, like the one you mentioned, where the wh-word or interrogative auxiliary verb isn't at the beginning of the sentence, and so on: "Who discovered India?" is OK, "Who India discovered?" is not.
So I thought something more sophisticated like LG could help me. LG catches errors like the one mentioned, but has other problems, e.g. with "How far is the moon?".

I changed the implementation to this (don't cheat!), see the sketch below:
IF LG can parse the sentence AND it includes a question word (lookup in a word list) AND it includes a known verb (word.v, not word[?].v)
THEN: Win!
ELSE: Fail!
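
In code it is roughly this (a sketch using the Python bindings; the question-word list and the .v/[?] test are simplified):

  from linkgrammar import Dictionary, ParseOptions, Sentence

  EN_DICT = Dictionary('en')
  QUESTION_WORDS = {"who", "whom", "whose", "what", "which", "where",
                    "when", "why", "how", "is", "are", "do", "does",
                    "did", "can", "could", "will", "would", "should"}

  def is_accepted_question(text):
      po = ParseOptions()
      po.max_null_count = 0                       # require a full parse
      linkages = list(Sentence(text, EN_DICT, po).parse())
      if not linkages:
          return False                            # LG cannot parse it -> Fail!

      tokens = [t.lower() for t in text.strip(" ?").split()]
      has_question_word = any(t in QUESTION_WORDS for t in tokens)

      # Known verbs show up with a .v subscript in the linkage words,
      # while unknown words are marked with [?].
      linkage_words = linkages[0].words()
      has_known_verb = any(".v" in w and "[?]" not in w for w in linkage_words)

      return has_question_word and has_known_verb  # Win!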

So far this seems to work. But as mentioned, there are some (a lot of?) edge cases.

Performance-wise this looks OK, maybe because the input should only be a few words long. But I don't know if it scales.

If you have any more input or comments, let me know.

jf

Dec 3, 2010, 4:50:34 PM
to link-g...@googlegroups.com
The underlying problem is that the accuracy of LG is not great. I'm working on a project in which LG is used to determine whether a sentence is grammatically correct. Unfortunately, the accuracy rate is not good enough (50-70%), even when LG is given plenty of time to process linkages. Perhaps you will have better luck with very short sentences.

Regards,
Jay

Linas Vepstas

Dec 6, 2010, 11:42:00 PM
to link-g...@googlegroups.com
On 3 December 2010 15:50, jf <jfi...@gmail.com> wrote:
> The underlying problem is that the accuracy of LG is not great.  I'm working
> on a project in which LG was used to determine if a sentence is
> grammatically correct.  Unfortunately, the accuracy rate is not good enough
> (50-70%) even when giving plenty of time to process linkages.  Perhaps you
> will have better luck with very short sentences.

Can you tell me more about this project? Is the problem that link-grammar
is giving valid parses for invalid sentences, or that too many valid sentences
are not parsed?

I've tried to make the parser coverage broader (i.e. have it accept a larger
number of "good" sentences), but the cost of this is that it now accepts a
far higher rate of "bad" sentences as well. (Too many people want to use LG
to parse tweets.)

--linas

jf

Dec 7, 2010, 9:34:55 AM
to link-g...@googlegroups.com
Hi Linas,
Thanks for your inquiry. I believe the accuracy issue exists in both directions (valid sentences marked as invalid, and vice versa), but my project is more concerned with ensuring that sentences marked as invalid actually are invalid. The project splits a document into sentences and attempts to determine whether each sentence is grammatically correct by feeding it to LG. If LG does not find valid linkages, the sentence is considered not grammatically correct. Here are some sentences that have no valid linkages in LG, but which appear to be grammatically correct:

- Chief imagines himself lost in a fog when he feels overwhelmed by the demands of society.

- Although society has excluded the patients in the ward for their unique qualities, they feel 'safer' trying to fit in because they receive approval from nurses and the representatives of society.

- He realizes that he and McMurphy can challenge Big Nurse, but cannot change the beliefs of popular society as a whole.

- Chief describes and analyzes the character of the families in a village, remarking that they all act completely the same.

- They feel a "normal" person conforms to, and becomes imperceptible in, society.

- The inability for the rejected to laugh signifies their paranoia at being noticed.


These sentences represent approximately 1/3 of the sentences with no valid linkages.  Any thoughts as to how to increase the accuracy?

Thanks,
Jay




Linas Vepstas

Dec 8, 2010, 12:15:36 AM
to link-g...@googlegroups.com
On 7 December 2010 08:34, jf <jfi...@gmail.com> wrote:
> Hi Linas,

>
> - Chief imagines himself lost in a fog when he feels overwhelmed by the
> demands of society.
[...]

>
> These sentences represent approximately 1/3 of the sentences with no valid
> linkages.  Any thoughts as to how to increase the accuracy?

I fixed the first sentence and checked the changes into the svn repo;
they'll appear in version 4.7.1.

There are two approaches to fixes. The first is handling them, case by case,
in the dictionary file. This ranges from easy to sometimes quite
difficult, and is often rather tedious. But it is the only practical
short-term approach.

The second is longer-term: I have ideas for solutions involving certain
complex collections of graphs gleaned from text, but no time to pursue them.
I'd love to do this work, but am unfortunately employed doing something
completely different.

-- Linas

Linas Vepstas

Dec 8, 2010, 1:34:00 AM
to link-g...@googlegroups.com
On 3 December 2010 15:50, jf <jfi...@gmail.com> wrote:
> The underlying problem is that the accuracy of LG is not great.  I'm working
> on a project in which LG was used to determine if a sentence is
> grammatically correct.  Unfortunately, the accuracy rate is not good enough
> (50-70%) even when giving plenty of time to process linkages.

I'm rather surprised by this low figure. I've got several batches of test
sentences, about 3000 total, and accuracy hovers around 90%, even though
I regularly add new, bad sentences to the batches.

Hmm. But I now realize that you must be parsing 'One Flew Over
the Cuckoo's Nest'. Link-grammar fails horribly on dialogue, because it
has no idea where quotations start and end. It will not do well on
novels, and maybe only a little better on screenplays.

One way to fix this is to pre-process text, to convert input such as

Then John said "Mary, please go now"

into a pair of sentences:

Then John said X

and

Mary, please go now

I think that could raise percentages significantly.
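
A crude pre-processing pass along those lines (just a sketch; the placeholder word and the quote-matching regex are illustrative, and ignore nested or unbalanced quotes):

  import re

  QUOTE_RE = re.compile(r'"([^"]*)"')

  def split_dialogue(sentence):
      # Replace each quoted span with a placeholder in the carrier
      # sentence, and return the quoted spans as separate sentences.
      quotes = QUOTE_RE.findall(sentence)
      carrier = QUOTE_RE.sub("X", sentence)
      return [carrier] + quotes

  print(split_dialogue('Then John said "Mary, please go now"'))
  # ['Then John said X', 'Mary, please go now']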

--linas

jf

Dec 8, 2010, 8:43:06 AM
to link-g...@googlegroups.com
Linas,
Thanks for taking a look and for the suggested approaches. The stats I mentioned already included pre-processing by removing sentences with dialogue. In the batch that I showed you via email, some of the sentences have a word in quotes, but removing the quotes yields the same linkage count from LG. Most of the documents that we run through it are essays on topics ranging from literature to science. We used a pre-processing step of skipping very short sentences, very long sentences, sentences with colons, and sentences with quotations, as these commonly cause issues with LG (see the rough sketch below). This helped some, but the accuracy was still not at an acceptable level, especially given the number of sentences that ended up being skipped.
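
Roughly (the length thresholds here are illustrative, not the exact values we used):

  def worth_checking(sentence, min_words=4, max_words=40):
      # Skip the cases that commonly trip up LG: very short or very long
      # sentences, and sentences containing colons or quotation marks.
      n_words = len(sentence.split())
      if n_words < min_words or n_words > max_words:
          return False
      if ":" in sentence or '"' in sentence:
          return False
      return True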

I completely understand about being employed and not having time to pursue the long-term solutions.  I think LG is an amazing tool and I'm thankful for the work that its contributors have put in.

Thanks again!
Jay

