Learning Syntactic Categories

Scott Frye

Mar 23, 2010, 9:12:04 AM
to Natural Language Processing Virtual Reading Group
I've been reading (and re-reading) this paper a lot lately:

"An Incremental Bayesian Model for Learning Syntactic Categories by
Parisien, Fazly and Stevenson.
(http://www.aclweb.org/anthology/W/W08/W08-2112.pdf)

I can't remember where I first saw this paper but I think it has a
REALLY exciting approach.

Basically, they present a technique for automatically learning
syntactic categories that seem to line up pretty well with the big
categories (verb, noun, adjective).

The idea is to look at a "frame" of 5 words at a time and then figure
out a category to put the frame into. A frame is a word plus the two
words to its left and the two words to its right. The categories are
all based on the middle (head) word and a calculation over the
surrounding words. A bit of probability is reserved for creating a
new category, and that option becomes less and less likely the more
categories you accumulate. This means the approach can be started
with NO categories, making it an unsupervised, or bootstrap,
technique.
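
To make that concrete, here's a rough Python sketch of the frame
extraction and the reserved new-category probability. This is NOT
their exact model; I'm just assuming a simple CRP-style prior here,
where the new-category option gets probability alpha / (n + alpha)
and shrinks as the number n of frames already assigned grows.

    from collections import Counter

    def frames(tokens, width=2):
        """Yield (left_context, head, right_context) for every token,
        padding the edges so each token gets a full 5-word frame."""
        pad = ["<pad>"] * width
        padded = pad + list(tokens) + pad
        for i in range(width, len(padded) - width):
            yield (tuple(padded[i - width:i]),          # two words to the left
                   padded[i],                           # the head word
                   tuple(padded[i + 1:i + 1 + width]))  # two words to the right

    def category_prior(sizes, alpha=1.0):
        """Prior over existing categories plus one reserved "new" option.
        sizes maps category id -> number of frames assigned to it."""
        n = sum(sizes.values())
        prior = {k: c / (n + alpha) for k, c in sizes.items()}
        prior["<new>"] = alpha / (n + alpha)   # shrinks as n grows
        return prior

    sizes = Counter()             # start with NO categories at all
    print(category_prior(sizes))  # {'<new>': 1.0} -- the bootstrap case

Starting from an empty Counter, the "<new>" option is the only one
with non-zero probability, which is exactly the bootstrap property I
find exciting.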

The calculation of the best category for a frame, based on the
surrounding words, has two parts. The first assigns a higher
probability the more of the surrounding words match what the category
has already seen. The second looks at the categories of the words
that surround the head word. They also have a mechanism that
periodically checks all the categories and merges any that are alike
enough. A rough sketch of the word-based part is below.
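
Here's how I picture the word-based part. The per-slot counts and the
smoothing are my own invention for the sketch; the paper does this
with a proper Bayesian mixture, so treat this as an illustration of
the idea rather than their actual math.

    import math
    from collections import Counter

    SLOTS = ("l2", "l1", "r1", "r2")   # the four context positions

    def new_category():
        """A category here is just per-slot word counts."""
        return {slot: Counter() for slot in SLOTS}

    def word_loglik(frame, category, smoothing=0.1):
        """Log-likelihood of a frame's context words under a category,
        using smoothed per-slot relative frequencies. The score rises
        with the number of surrounding words the category has already
        seen in those positions."""
        left, _head, right = frame
        score = 0.0
        for slot, word in zip(SLOTS, left + right):
            seen = category[slot]
            total = sum(seen.values())
            vocab = len(seen) + 1       # +1 leaves room for unseen words
            p = (seen[word] + smoothing) / (total + smoothing * vocab)
            score += math.log(p)
        return score

    def add_frame(frame, category):
        """Update a category's per-slot counts with a newly assigned frame."""
        left, _head, right = frame
        for slot, word in zip(SLOTS, left + right):
            category[slot][word] += 1

The category-based evidence would do the same thing over the category
labels previously assigned to the neighboring words instead of the
words themselves, and I imagine the merge step compares these
per-slot distributions between pairs of categories.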

Their numbers seem to show that using the exact words surrounding the
head word (the word-based technique) gets the system accurate pretty
fast (after examining 10,000 words). The category-based technique for
the surrounding words takes a lot longer to be effective, but it
passes the word-based technique at around 40,000 words, and after
60,000 words it isn't far behind the combination of the two
techniques. They also note that the technique learns nouns best, then
verbs, and isn't too good at picking up adjectives.


They talk about extending the model in various ways, like examining
the morphology of the words in the frame.

I think this is a very promising technique, mainly because it seems
to be completely unsupervised, continuously improves itself, and
doesn't rely on any specific domain. I've been playing around with
implementing this on a character-by-character basis to see how well
it determines the categories of its input: for instance, whether it
can determine that spaces and carriage returns are separate from
punctuation, letters, numbers, and symbols. A tiny example of what
those character frames look like is below.
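
For a sense of what the character-level variant would feed the model,
here's the same frame idea applied to characters (the sentinel
padding character is my own choice, not anything from the paper):

    def char_frames(text, width=2):
        """The 5-word frame idea applied to characters: each character
        is a head with two characters of context on either side."""
        pad = ["\x00"] * width          # sentinel character for the edges
        chars = pad + list(text) + pad
        for i in range(width, len(chars) - width):
            yield chars[i - width:i], chars[i], chars[i + 1:i + 1 + width]

    for left, head, right in char_frames("Hi there,\nworld!"):
        print(left, repr(head), right)

If it works the way I hope, the whitespace characters should land in
a different category from the letters and the punctuation.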

Unfortunately I haven't had as much time to play with it as I would
like.

Thoughts? Opinions? Questions? Criticisms?

-Scott Frye

Ronald Hobbs

Mar 25, 2010, 9:50:24 AM
to Natural Language Processing Virtual Reading Group
That does sound exciting, especially within unlabeled,
domain-specific languages where regular NLP methods tend to perform
poorly.

I'll have to read through the paper before being able to say anything
coherent, but I really like the bootstrap-ability of this.
