UCAM-CL-TR-743: Optimising the speed and accuracy of a Statistical GLR Parser

tech-r...@cl.cam.ac.uk

Mar 27, 2009, 8:18:09 AM
Publication announcement:

Optimising the speed and accuracy of a Statistical GLR Parser

Rebecca F. Watson

Technical report UCAM-CL-TR-743, University of Cambridge,
Computer Laboratory, PhD thesis, March 2009, 145 pages.

This document is now available at

http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-743.html

Abstract:

The focus of this thesis is to develop techniques that optimise both the
speed and accuracy of a unification-based statistical GLR parser.
However, we can apply these methods within a broad range of parsing
frameworks. We first aim to optimise the level of tag ambiguity resolved
during parsing, given that we employ a front-end PoS tagger. This work
provides the first broad comparison of tag models as we consider both
tagging and parsing performance. A dynamic model achieves the best
accuracy and provides a means to overcome the trade-off between tag
error rates in single-tag-per-word input and the increase in parse
ambiguity over multiple-tag-per-word input. The second line of research
describes a novel modification to the inside-outside algorithm, whereby
multiple inside and outside probabilities are assigned for elements
within the packed parse forest data structure. This algorithm enables us
to compute a set of 'weighted GRs' directly from this structure. Our
experiments demonstrate substantial increases in parser accuracy and
throughput for weighted GR output.

Finally, we describe a novel confidence-based training framework that
can, in principle, be applied to any statistical parser whose output is
defined in terms of its consistency with a given level and type of
annotation. We demonstrate that a semi-supervised variant of this
framework outperforms both Expectation-Maximisation (when both are
constrained by unlabelled partial-bracketing) and the extant (fully
supervised) method. These novel training methods utilise data
automatically extracted from existing corpora. Consequently, they
require no manual effort on behalf of the grammar writer, facilitating
grammar development.

--
University of Cambridge, Computer Laboratory,
Technical Reports (ISSN 1476-2986)
http://www.cl.cam.ac.uk/techreports/
