Long, complex, but grammatically correct sentence - how to gauge parse time before the run?


JRowe

Feb 6, 2011, 4:01:37 PM
to link-grammar
I recently re-read the Hitchhiker's Guide to the Galaxy series, and
encountered this gem:

"The problem is, or rather one of the problems, for there are many, a
sizeable proportion of which are continually clogging up the civil,
commercial, and criminal courts in all areas of the Galaxy, and
especially, where possible, the more corrupt ones, this."

This sentence takes longer than 10 minutes to parse, and was at 250 MB of
RAM usage before I shut it down. Is there a way to pre-process a
sentence to split it up into more manageable chunks?

I was thinking that analyzing the "or" and "for" and "and" clauses in
relation to their preceding commas would give you some sort of
leverage, but I'm not seeing a way through.

In this case, I think that everything hinges on the most deeply nested
phrase, the "...,and especially, where possible, the more corrupt
ones..."

At any rate, simplified versions of the sentence don't parse, so it's
doubtful that the full version will go through, even after it finishes
the first run. How would (can?) I extend the parser to handle this
case?

Ben Goertzel

Feb 6, 2011, 6:01:34 PM
to link-g...@googlegroups.com
That sentence has a lot of commas; conjunction handling is a tricky
thing for most dependency parsers, including the link parser...

Linas recently replaced the original link parser's "fat links" system
with plain old link grammar rules for commas and other conjunctions

Assuming the issue is a grammar shortcoming rather than a software
bug, there seem to be two possibilities regarding this sentence:

1)
The link parser never finds the right parse

2)
The link parser finds a huge abundance of parses, the right one is way
down the list, and finding them all takes way too long

In case 1, the solution is to edit the link parser dictionary entries
corresponding to commas

In case 2, the solution is to implement better parse ranking, probably
taking probabilities derived from a corpus into account (Ruiting Lian
is now playing with this, though in the context of simpler sentences
than that.)
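(A rough sketch of what such corpus-based ranking could look like
against the public C API follows; the corpus_logprob() lookup is a
hypothetical placeholder, not part of link-grammar. The idea is simply
to re-score each linkage by the summed log-probability of its link
labels and keep the best one.)

#include <link-grammar/link-includes.h>

/* Hypothetical placeholder: a real implementation would look up
   link-label frequencies gathered from a parsed corpus. */
static double corpus_logprob(const char *link_label)
{
    (void) link_label;
    return 0.0;
}

/* Score a linkage as the sum of the log-probabilities of its links. */
static double score_linkage(Linkage lk)
{
    double score = 0.0;
    int i, nlinks = linkage_get_num_links(lk);
    for (i = 0; i < nlinks; i++)
        score += corpus_logprob(linkage_get_link_label(lk, i));
    return score;
}

/* Return the index of the highest-scoring linkage. */
static int best_linkage_index(Sentence sent, int num_linkages, Parse_Options opts)
{
    int i, best = 0;
    double best_score = -1e300;
    for (i = 0; i < num_linkages; i++) {
        Linkage lk = linkage_create(i, sent, opts);
        double s = score_linkage(lk);
        if (s > best_score) { best_score = s; best = i; }
        linkage_delete(lk);
    }
    return best;
}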

The same analysis applies to simplified versions of the sentence.

-- Ben G


--
Ben Goertzel, PhD
CEO, Novamente LLC and Biomind LLC
CTO, Genescient Corp
Chairman, Humanity+
Adjunct Professor of Cognitive Science, Xiamen University, China
Advisor, Singularity University and Singularity Institute
b...@goertzel.org

"My humanity is a constant self-overcoming" -- Friedrich Nietzsche

Linas Vepstas

Feb 7, 2011, 12:47:11 AM
to link-g...@googlegroups.com
On 6 February 2011 15:01, JRowe <jro...@gmail.com> wrote:
> I recently re-read the Hitchhikers guide to the Galaxy series, and
> encountered this gem:
>
> "The problem is, or rather one of the problems, for there are many, a
> sizeable proportion of which are continually clogging up the civil,
> commercial, and criminal courts in all areas of the Galaxy, and
> especially, where possible, the more corrupt ones, this."
>
> This sentence takes longer than 10 minutes to parse, and was at 250MB
> ram usage before I shut it down.

!
Very interesting.

On my (now aging) machine, it took 1:37 min and maxed at 81 MB,
after which it reported:
Found 688260 linkages (147 of 1000 random linkages had no P.P.
violations) at null count 4
Linkage 1, cost vector = (UNUSED=4 DIS=12 FAT=0 AND=0 LEN=129)

Then I tried turning on fat links, and indeed, I'm 7+ minutes and
170+ MB in, and still going. Yuck. Try turning off fat links by
saying !use-fat at the prompt. Strange, though, because fat links
should be disabled by default; are you using the 4.7.x series?

> Is there a way to pre-process a
> sentence to split it up into more manageable chunks?

Not really, not that I know of. In a certain sense, this is what
parsing is all about ...
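(If the immediate goal is just to keep a runaway sentence from eating
the machine, the parse options can at least cap the work spent on it.
Below is a minimal sketch against the public C API; the function names
assume the 4.x-era headers, and the 30-second limit is an arbitrary
example value.)

#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    const char *text =
        "The problem is, or rather one of the problems, for there are many, "
        "a sizeable proportion of which are continually clogging up the civil, "
        "commercial, and criminal courts in all areas of the Galaxy, and "
        "especially, where possible, the more corrupt ones, this.";

    Dictionary dict = dictionary_create_default_lang();
    Parse_Options opts = parse_options_create();
    Sentence sent;
    int num_linkages;

    /* Give up after 30 seconds instead of grinding on indefinitely. */
    parse_options_set_max_parse_time(opts, 30);

    sent = sentence_create(text, dict);
    num_linkages = sentence_parse(sent, opts);

    if (parse_options_timer_expired(opts))
        printf("Timer expired; %d linkages found so far.\n", num_linkages);
    else
        printf("Found %d linkages.\n", num_linkages);

    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}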

> I was thinking that analyzing "or" and "for" and "and" clauses with
> relation to their preceding commas would give you some sort of
> leverage, but I'm not seeing a way through.

The sentence is "The problem is this" with three comma-nested
parenthetical remarks in the middle of it.

The problem is, or rather one of the problems, this.

The problem is, or rather one of the problems, for there are many, this.

The problem is, or rather one of the problems, for there are many, a
sizeable proportion of which are continually clogging up the civil,
commercial, and criminal courts in all areas of the Galaxy, this.

The above contains a comma-separated list. And then, for good
measure, there's yet another parenthetical emphasis thrown in.
Pulling this sentence apart is tricky.

There are two bugs: the shortened sentence above fails to parse,
with "or rather" blowing up. The other bug is with "and, where possible".
Fixing these two would improve parse speed. Fixing them correctly
could be a bit tricky, since the phrases seem idiomatic to me; certainly
more than what I could do in half an hour.

> At any rate, simplified versions of the sentence don't parse, so it's
> doubtful that the full version will go through, even after it finishes
> the first run. How would (can?) I extend the parser to handle this
> case?

Usually, one finds the shortest possible failing sentence, and
fixes that. In this case, I believe the fix could be quite tricky,
and would break your head if you've never done these before.
Starting on something simpler would be much better, although
I suspect that most of the simpler bugs have been fixed.

My goal at this point is to use statistical techniques to automatically
find & fix these things, but this is a research project on which I have
yet to make much progress.

-- linas

Linas Vepstas

Feb 7, 2011, 12:57:41 AM
to link-g...@googlegroups.com
On 6 February 2011 17:01, Ben Goertzel <b...@goertzel.org> wrote:
>
> Linas recently replaced the original link parser's "fat links" system
> with plain old link grammar rules for commas and other conjunctions

Which improved performance significantly, BTW.

Note, however, that relex has not been updated to know about
the new link types; it still expects/needs fat links to do things,
and might actually have new bugs as a result. I haven't looked
closely.

> 2)
> The link parser finds a huge abundance of parses, the right one is way
> down the list,

We've talked about this before; I'm still not sure if I have any clear-cut
examples of this; did you send me one recently?

> finding them all takes way too long

At the moment, it doesn't report any until it's found them all ...

> In case 1, the solution is to edit the link parser dictionary entries
> corresponding to commas

I think commas are relatively healthy these days; they handle
a large variety of structures correctly.

> In case 2, the solution is to implement better parse ranking,

... which helps ranking, not performance :-(

BTW, now that fat links are gone, the main stumbling block
impeding the SAT solver is gone. Tests with the SAT solver
should now be a fully head-to-head comparison.

--linas

Ben Goertzel

Feb 7, 2011, 6:46:48 AM
to link-g...@googlegroups.com
> BTW, now that fat links are gone, the main stumbling block
> impeding the SAT solver is gone. Doing some tests with SAT
> should now be fully head-to-head comparable.
>
> --linas

Unfortunately, the guy who wrote the SAT parser has moved on to other things...

My thinking remains that the SAT parser is the right way to move
forward, because it provides a clear path to integrating semantic
(e.g. statistical linguistics based) biasing into the parsing
process...

It would be awesome if you could find the time to do some testing with
the SAT link parser...

ben

Linas Vepstas

Feb 7, 2011, 12:10:57 PM
to link-g...@googlegroups.com
On 7 February 2011 05:46, Ben Goertzel <b...@goertzel.org> wrote:
>> BTW, now that fat links are gone, the main stumbling block
>> impeding the SAT solver is gone. Doing some tests with SAT
>> should now be fully head-to-head comparable.
>
> Unfortunately, the guy who wrote the SAT parser has moved on to other things...

No matter; the code has been integrated in, it compiles, and it worked
the last time I tried it. There's a minor integration glitch, where it
exits after every parse; this is annoying but can't be hard to fix.

> My thinking remains that the SAT parser is the right way to move
> forward, because it provides a clear path to integrating semantic
> (e.g. statistical linguistics based) biasing into the parsing
> process...

I'm still very much interested in a Viterbi-based parser, for the
same reason; I also believe it could be blazingly fast. I also
want to redefine the combinatoric-explosion problem by defining
a format in which alternate parses of short sentence segments are
packetized.

I admit that I have not studied SAT, and so don't have a clear
mental picture of how it works. My enthusiasm might change if I did.

Do you know of any particularly good, readable overview of SAT?

> It would be awesome if you could find the time to do some testing with
> the SAT link parser...

I'll try to do that soon. It shouldn't be hard.

--linas

Ben Goertzel

Feb 7, 2011, 12:19:26 PM
to link-g...@googlegroups.com
This is a really good book on stochastic local search, including
standard SAT solving methods...

http://www.sls-book.net/

http://www.amazon.com/Stochastic-Local-Search-Applications-Intelligence/dp/1558608729

That's where I learned the details. I don't know any really good
review papers offhand; just that book plus a bunch of technical
papers....

I'll offer you the following deal: I'll buy you a copy of the book
(think of it as a belated Xmas present) if you'll promise to read it
;-D .... These are quite powerful methods and I really think they're
the "right" way to do parsing given the scope of algorithms currently
available...

-- Ben

Linas Vepstas

Feb 10, 2011, 1:24:50 PM
to link-g...@googlegroups.com
On 6 February 2011 15:01, JRowe <jro...@gmail.com> wrote:
> I recently re-read the Hitchhikers guide to the Galaxy series, and
> encountered this gem:
>
> "The problem is, or rather one of the problems, for there are many, a
> sizeable proportion of which are continually clogging up the civil,
> commercial, and criminal courts in all areas of the Galaxy, and
> especially, where possible, the more corrupt ones, this."

I investigated further, and it turns out that there was a bug in
the hash-table algorithms that are heavily used throughout
link-grammar; they were generating poor hashes in many cases.
Fixing these seems to yield a 2x or 3x performance improvement
for the handful of long sentences I've tested with.
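(For readers wondering what "poor hashes" means in practice, here is a
generic illustration; it is not the actual link-grammar code. A hash
that merely sums characters collides heavily on similar keys, so buckets
fill unevenly and lookups degrade toward linear scans; a mixing hash
such as FNV-1a spreads similar keys apart.)

#include <stdint.h>

/* Summing bytes: "bat", "tab" and "abt" all hash identically. */
static uint32_t poor_hash(const char *s)
{
    uint32_t h = 0;
    while (*s) h += (unsigned char) *s++;
    return h;
}

/* FNV-1a: xor each byte in, then multiply by the FNV prime, so every
   byte perturbs all of the bits and similar keys land far apart. */
static uint32_t fnv1a_hash(const char *s)
{
    uint32_t h = 2166136261u;   /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char) *s++;
        h *= 16777619u;         /* FNV prime */
    }
    return h;
}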

I've checked the fixes into svn, but have yet to start more
extensive testing. If that passes, then there will be another release
in a week or two.

--linas

Rich Cooper

Feb 10, 2011, 1:52:48 PM
to link-g...@googlegroups.com
Linas,

That's good news! Run times on long sentences are presently way too long
for interactive use, though a faster parser on long sentences could still
run in background discovery mode for a lot of linguistic applications, such
as corpus analysis.

JMHO,
-Rich

Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2


Linas Vepstas

Feb 12, 2011, 6:55:34 PM
to link-g...@googlegroups.com
On 7 February 2011 11:19, Ben Goertzel <b...@goertzel.org> wrote:
> This is a really good book on stochastic local search, including
> standard SAT solving methods...
>
> http://www.sls-book.net/
>
> http://www.amazon.com/Stochastic-Local-Search-Applications-Intelligence/dp/1558608729
>
> That's where I learned the details.  I don't know any really good
> review papers offhand; just that book plus a bunch of technical
> papers....
>
> I'll offer you the following deal: I'll buy you a copy of the book
> (think of it as a belated Xmas present) if you'll promise to read it
> ;-D ....

Hmm. I think I should take you up on that offer.

At the moment, things look rather grim for the SAT-solver code
for link-grammar. I just fixed several bugs that were causing
it to crash. It seems to be horribly slow on certain long sentences,
and then starts behaving erratically, including a hard-to-reproduce
crash. I'll take a few more pokes at it ...

--linas

Ben Goertzel

Feb 12, 2011, 7:39:48 PM
to link-g...@googlegroups.com
Hi Linas,

>> I'll offer you the following deal: I'll buy you a copy of the book
>> (think of it as a belated Xmas present) if you'll promise to read it
>> ;-D ....
>
> Hmm.  I think I should take you up on that offer.

You got it ;)

> At the moment, things look rather grim for the SAT-solver code
> for link-grammar.  I just fixed several bugs that were causing
> it to crash.  It seems to be horribly slow on certain long sentences,
> and then starts behaving erratically, including a hard-to-reproduce
> crash. I'll take a few more pokes at it ...

Well, bugs aside, the current version is quite crude in some ways --
e.g. we intended to change it to proceed linearly through the sentence,
but that was never done... that would likely improve speed
dramatically...

ben

Linas Vepstas

Feb 12, 2011, 9:34:37 PM
to link-g...@googlegroups.com
On 12 February 2011 17:55, Linas Vepstas <linasv...@gmail.com> wrote:
>
> At the moment, things look rather grim for the SAT-solver code
> for link-grammar.  I just fixed several bugs that were causing
> it to crash.  It seems to be horribly slow on certain long sentences,

I spoke too soon. valgrind is a nice tool; it found an "obvious" bug
in minisat. Once fixed, the sat solver blasts through text at a
rate that is significantly faster than the regular parser. (And, from
what I can tell, it does seem to do the right thing.) There's yet
another crash, though, and the sat-solver API is different, so that's
causing some pain as I go to integrate it properly.

--linas

Ben Goertzel

Feb 12, 2011, 11:36:57 PM
to link-g...@googlegroups.com
Very cool!

BTW it would be pretty easy to write a good academic journal paper on
the SAT link parser (with Filip as the lead author, I suppose), based
on excerpting from Filip's excellent documentation.... If you want to
do some systematic tests of it, documenting its superior performance,
then I could find time sometime in the next few months to write the
paper...

ben

Linas Vepstas

Feb 13, 2011, 12:16:01 AM
to link-g...@googlegroups.com
On 12 February 2011 20:34, Linas Vepstas <linasv...@gmail.com> wrote:
> On 12 February 2011 17:55, Linas Vepstas <linasv...@gmail.com> wrote:
>>
>> At the moment, things look rather grim for the SAT-solver code
>> for link-grammar.  I just fixed several bugs that were causing
>> it to crash.  It seems to be horribly slow on certain long sentences,
>
> I spoke too soon.

... again. During refactoring, I missed a critical step in the solution
process, so the solver wasn't actually being run. Now that I've put it
back in, it's slowww again. :-/

--linas

Ben Goertzel

Feb 13, 2011, 12:17:38 AM
to link-g...@googlegroups.com
;-D ... and so it goes!

I'll refrain from repeating my prior message about the changes that
needed to be done to make it fast ;) ...

Linas Vepstas

Feb 13, 2011, 12:57:56 PM
to link-g...@googlegroups.com
On 12 February 2011 23:17, Ben Goertzel <b...@goertzel.org> wrote:
> ;-D ... and so it goes!

Practically speaking, a current bottleneck is that relex depends
on "fat links" to handle conjunctions properly. Turning off fat links
in link-grammar results in a 1.3x to 3.0x perf improvement, depending
on the text.

Also, as a practical matter, these days relex overall accounts for
the vast majority of cpu time, esp. on shorter sentences. If performance
were the goal, that would be the thing to work on.

I'm somewhat more interested in coverage and correctness, so all
this performance work is a bit of a distraction.

--linas
