Citation context analysis for information retrieval
Anna Ritchie
Technical report UCAM-CL-TR-744, University of Cambridge,
Computer Laboratory, PhD thesis, March 2009, 119 pages.
This document is now available at
http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-744.html
Abstract:
This thesis investigates taking words from around citations to
scientific papers in order to create an enhanced document representation
for improved information retrieval. This method parallels how anchor
text is commonly used in Web retrieval. In previous work, words from
citing documents have been used as an alternative representation of the
cited document but no previous experiment has combined them with a
full-text document representation and measured effectiveness in a
large-scale evaluation.
The contributions of this thesis are twofold: firstly, we present a
novel document representation, along with experiments to measure its
effect on retrieval effectiveness, and, secondly, we document the
construction of a new, realistic test collection of scientific research
papers, with references (in the bibliography) and their associated
citations (in the running text of the paper) automatically annotated.
Our experiments show that the citation-enhanced document representation
increases retrieval effectiveness across a range of standard retrieval
models and evaluation measures.
In Chapter 2, we give the background to our work, discussing the various
areas from which we draw together ideas: information retrieval,
particularly link structure analysis and anchor text indexing, and
bibliometrics, in particular citation analysis. We show that these
areas share closely related ideas but that these ideas have not been
fully explored experimentally. Chapter 3 discusses the test
collection paradigm for evaluation of information retrieval systems and
describes how and why we built our test collection. In Chapter 4 we
introduce the ACL Anthology, the archive of computational linguistics
papers that our test collection is centred around. The archive contains
the most prominent publications since the beginning of the field in the
early 1960s: one journal plus conferences and workshops, totalling over
10,000 papers. Chapter 5 describes how the PDF papers
are prepared for our experiments, including identification of references
and citations in the papers, once converted to plain text, and
extraction of citation information to an XML database. Chapter 6
presents our experiments: we show that adding citation terms to the
full text of the papers improves retrieval effectiveness by up to 7.4%,
that weighting citation terms higher relative to paper terms increases
the improvement and that varying the context from which citation terms
are taken has a significant effect on retrieval effectiveness. Our main
hypothesis, that citation terms enhance a full-text representation of
scientific papers, is thus proven.
There are some limitations to these experiments. The relevance
judgements in our test collection are incomplete but we have
experimentally verified that the test collection is, nevertheless, a
useful evaluation tool. Using the Lemur toolkit constrained the method
that we used to weight citation terms; we would like to experiment with
a more realistic implementation of term weighting. Our experiments with
different citation contexts did not identify an optimal citation
context; we would like to extend the scope of our investigation. Now
that our test collection exists, we can address these issues in our
experiments and leave the door open for more extensive experimentation.
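As a rough illustration of the idea summarised above, the following
minimal sketch takes the words around a citation in a citing paper and
adds them, with a higher weight, to the term counts of the cited paper.
It is not the thesis's implementation: the function names, the <CIT>
placeholder, the fixed word window and the additive weighting scheme are
all assumptions made for this example.

```python
from collections import Counter


def citation_context(tokens, cite_index, window=3):
    """Return up to `window` tokens on each side of the citation marker.

    The fixed-size word window is an assumption; other context
    definitions (e.g. sentence-based) are equally possible.
    """
    left = tokens[max(0, cite_index - window):cite_index]
    right = tokens[cite_index + 1:cite_index + 1 + window]
    return left + right


def enhanced_representation(full_text_tokens, contexts, citation_weight=2):
    """Combine full-text term counts with weighted citation-context terms.

    Each citation term adds `citation_weight` to its count (hypothetical
    weighting scheme, chosen here only for simplicity).
    """
    counts = Counter(full_text_tokens)
    for context in contexts:
        for term in context:
            counts[term] += citation_weight
    return counts


# Toy citing sentence; "<CIT>" is a hypothetical citation placeholder.
citing = "we use the parser of <CIT> for dependency analysis".split()
ctx = citation_context(citing, citing.index("<CIT>"), window=2)

# Toy full text of the cited paper.
cited_doc = "a statistical parser for english".split()
rep = enhanced_representation(cited_doc, [ctx], citation_weight=2)
```

Here the term "dependency" never occurs in the toy cited paper but
enters its representation through the citation context, which is the
effect the enhanced representation is meant to capture.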
--
University of Cambridge, Computer Laboratory,
Technical Reports (ISSN 1476-2986)
http://www.cl.cam.ac.uk/techreports/