CFP: Structure Extraction from Digitized Books

Antoine Doucet

Jun 25, 2008, 1:10:56 PM
to IAPR TC5 Benchmarking and Software
[Feel free to forward this CFP - Apologies for cross-posting]

We would like to draw your attention to the Structure Extraction Task of
the Book Track at INEX 2008, which aims to promote and evaluate methods
for the extraction of structure from digitized books. The task offers
participants the following means of evaluating extracted tables of
contents:

* Direct comparison to a manually built ground truth,
* Qualitative evaluation by manual assessment using a graded quality
  scale,
* Indirect evaluation via participation in book search tasks:
  investigating the use of extracted structures to improve retrieval
  performance.


*****************************************************************
Call for Participation in the Structure Extraction Task
of the Book Track at INEX 2008

Book track: http://inex.otago.ac.nz/tracks/books/books.asp
INEX: http://inex.otago.ac.nz/tracks/
*****************************************************************

Since 2002, the Initiative for the Evaluation of XML Retrieval (INEX) has
been evaluating focused retrieval approaches on collections of XML
documents.

The Book track was launched in 2007 with the goals of studying
book-specific relevance ranking strategies for informational search
tasks and of investigating user interface issues and user behaviour,
exploiting book-specific features such as back-of-book indexes provided
by authors and associated metadata such as library catalogue
information.

In 2008, the Book track aims to look beyond these goals and bring
together researchers and practitioners in Digital Libraries, Information
Retrieval, Human Computer Interaction, and eBooks to explore common
challenges and opportunities around digitized book collections.

Towards this goal, the track investigates, among other topics, mechanisms
to increase accessibility to the contents of digitized books. This is the
purpose of the Structure Extraction Task introduced this year.

GOALS
The goal of this task is to test and compare automatic techniques for
deriving structural information from digitized books in order to build a
hyperlinked table of contents.

MOTIVATION
Current digitization and OCR technologies produce the full text of
digitized books with only minimal structure information. Pages and
paragraphs are usually identified and marked up in the OCR, but more
sophisticated structures, such as chapters and sections, are currently
not recognised.

TASK DESCRIPTION
The task is to build tables of contents for a sample collection of 100
digitized books of different genres and styles, using information from
the OCR (in DjVu XML format), PDF or JPEG image files. The generated
tables of contents may be used by an e-book reader system and presented
to users as a hyperlinked hierarchy. Users expect to see the section
titles as entries and should be able to click on an entry and jump to the
start of the selected section in the book.
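
To make the idea of a hyperlinked hierarchy concrete, here is a minimal
sketch in Python. It is only an illustration and not a submission format
defined by the task: the TocEntry class, the render_toc function and the
"#page-N" anchor scheme are all hypothetical names chosen for this
example.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class TocEntry:
    """One table-of-contents entry (hypothetical structure, not a task format)."""
    title: str                      # section title shown to the user
    page: int                       # page on which the section starts
    children: List["TocEntry"] = field(default_factory=list)  # sub-sections


def render_toc(entries: List[TocEntry]) -> str:
    """Render entries as nested HTML lists whose links point at page anchors.

    The '#page-N' anchor scheme is an assumption made for this sketch.
    """
    if not entries:
        return ""
    items = []
    for entry in entries:
        link = f'<a href="#page-{entry.page}">{escape(entry.title)}</a>'
        # Recurse so that sub-sections appear as a nested list inside the item.
        items.append(f"<li>{link}{render_toc(entry.children)}</li>")
    return "<ul>" + "".join(items) + "</ul>"
```

Each level of nesting in the output corresponds to one depth of the table
of contents, so a reader system could collapse or expand levels as
needed.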

The tables of contents created by participants will be compared to a
manually built ground truth (from the PDF of a book), and will be
evaluated using recall/precision-like measures at different structural
levels (i.e., different depths in the table of contents). In addition,
because the ground truth may not necessarily be optimal, we intend to
evaluate the quality of the created tables of contents independently:
participants will be asked to grade them on a multi-level quality scale.
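
As a rough illustration of what a recall/precision-like measure per
structural level could look like, consider the Python sketch below. The
exact measures are not specified in this call, so everything here is an
assumption: entries are modelled as (depth, title, start page) tuples,
and a submitted entry counts as correct only if its whitespace- and
case-normalised title and its start page both match a ground-truth entry
at the same depth. The function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed entry representation: (depth, title, start page).
Entry = Tuple[int, str, int]


def _keys_at_depth(entries: Iterable[Entry], depth: int) -> Set[Tuple[str, int]]:
    """Collect (normalised title, page) pairs for entries at one depth."""
    return {
        (" ".join(title.lower().split()), page)
        for d, title, page in entries
        if d == depth
    }


def precision_recall_by_depth(
    submitted: Iterable[Entry], ground_truth: Iterable[Entry]
) -> Dict[int, Tuple[float, float]]:
    """Return {depth: (precision, recall)} under the matching rule assumed above."""
    submitted = list(submitted)
    ground_truth = list(ground_truth)
    depths = sorted({d for d, _, _ in submitted} | {d for d, _, _ in ground_truth})
    scores = {}
    for depth in depths:
        sub = _keys_at_depth(submitted, depth)
        truth = _keys_at_depth(ground_truth, depth)
        hits = len(sub & truth)
        # Report 0.0 rather than dividing by zero when a side has no entries.
        precision = hits / len(sub) if sub else 0.0
        recall = hits / len(truth) if truth else 0.0
        scores[depth] = (precision, recall)
    return scores
```

A stricter or more lenient matching rule (for example, tolerating small
title differences caused by OCR errors) would change the scores, which is
one reason the manual quality grading is a useful complement.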

APPLICATION IN OTHER TASKS OF THE BOOK SEARCH TRACK
Participants may apply their table of contents extraction techniques to
the main corpus of the Book track (50,000 books) and submit runs to the
Book Retrieval task that exploit this additional information. Note,
however, that in the case of the main corpus only the OCR text, in OCRML
format, will be provided as input (no PDF or JPEG is available). Also
note that the OCRML markup is rather different from the DjVu format: the
basic structural elements are named differently and additional structure
is also marked up in OCRML.

Participants may also enhance runs they plan to submit to the Page in
Context task, as long as the resulting XPaths conform to the structure of
the OCRML files. The generated tables of contents may also be used and
evaluated through user studies in the Active Reading task.

IMPORTANT DATES
The schedule of the Structure Extraction Task is as follows:
27/06/2008 Sample set of 100 Books is available for download
25/07/2008 Submission deadline for table of contents (ToC)
01/08/2008 Ground truth and evaluation results distributed
25/08/2008 User assessment of relative quality of ground truth and
           submitted ToC
20/10/2008 Evaluation and assessments distributed
24/11/2008 Submission deadline for papers for pre-proceedings
08/12/2008 Release of workshop pre-proceedings
15-18/12/2008 INEX Workshop in Dagstuhl

RELEVANCE ASSESSMENTS
Participants will be required to provide quality judgements on the set of
generated tables of contents submitted by all participants. Furthermore,
participants will be asked to provide relevance judgements for at least
one test topic for the book search tasks. Please note that the assessment
of one topic may take one person between half a day and two days to
complete.

WORKSHOP AND PROCEEDINGS
Participants will be invited to publish their approaches and results in
the INEX workshop pre-proceedings, and to present their work at the INEX
workshop to be held in December at Schloss Dagstuhl, Germany. Selected
and revised papers will appear in the INEX post-workshop proceedings,
expected to be published by Springer as part of the Lecture Notes in
Computer Science (LNCS) series.


Best regards,

Gabriella Kazai, Antoine Doucet and Monica Landoni
Organizers of the INEX Book Track 2008


For queries related to the structure extraction task, please contact
Antoine Doucet: antoine...@info.unicaen.fr