Discussion group for Interlinear standard format


Alexandre Arkhipov

Feb 19, 2011, 5:04:09 AM
to interli...@googlegroups.com
Dear friends,

Following some discussions during the recent ICLDC conference, I'm posting a proposal for a standard XML format for interlinear glossed texts (IGT) based on Fieldworks (FLEx) Interlinear XML, and I invite you to comment on it. I'm also sending the slides from my ICLDC presentation.

Since the number of people concerned was high, I took the liberty of creating this discussion group and subscribing you to it; if you don't wish to receive messages from this group, please excuse me and drop me a line at sar...@mail.ru. Please also do so if you know of someone who'd be interested in joining the group.
The list currently includes the following:

Alexander Arkhipov (MGU)
Michael Daniel (MGU)
Oleg Belyaev (MGU)
Valentin Goussev (Institute of Linguistics RAS; SearchToo)
Dmitry Gorshkov (SearchToo)
Han Sloetjes (MPI Nijmegen, ELAN)
Beth Bryson (FLEx)
Susanna Imrie (FLEx)
John Thomson (FLEx)
John Hatton (WeSay)
Mike Maxwell (U of Maryland)
Jeff Good (U of Buffalo)
Alexander Nakhimovsky (U of Colgate)
Tom Myers (N-Topus Software)
Sebastian Drude (U of Frankfurt)
Sebastian Nordhoff (MPI Leipzig)
Johannes Helmbrecht (U of Regensburg)
Peter Bouda (U of Munich)
Nick Thieberger (U of Melbourne)

Best wishes,
Sasha Arkhipov

interlinear-proposal.zip

Nick Thieberger

Feb 19, 2011, 2:23:06 PM
to interlinear-xml, sarkipo
Sasha,

This list is a good idea, thanks. There was a similar attempt made by
Alexis Palmer a few years ago
(http://groups.google.com/group/igt-anno/topics only a few messages
before it fell into disuse) to get consensus on a schema for IGT.

I have been working on a representation of IGT called EOPAS and would
have presented it at ICLDC if I had gone. It allows you to upload XML
files of IGT together with media and to read the IGT. It calls
timecodes in media using HTML5 media calls (i.e., no Flash or media
server is needed). The page describing it is here:
http://www.linguistics.unimelb.edu.au/research/projects/eopas.html

The requirements and specifications are here:
http://www.linguistics.unimelb.edu.au/research/projects/eopasReqs.html

This is a much simpler system than the one Sasha proposes and is
really aimed at getting samples of monologic IGT and media online. It
is, however, an open-source system and one that could be used as the
basis for adding higher-level units (paragraphs) or multi-participant
discourse.

I'm sure there will be simple (!) XSL conversions between the various
proposed systems.

I look forward to this group having a productive discussion,

Nick


Alexandre Arkhipov

Feb 20, 2011, 4:13:57 PM
to interli...@googlegroups.com
Hi Nick,

Thank you very much for your reply. I was so sure I'd meet you at the
conference; I hope all is well with you now!
Thanks for the links, I'll check them all. I've heard about EOPAS but not of
Alexis Palmer's initiative.
And yes, I hope that independently of the format details there will be
simple XSL transforms between the things we all need.

All best,
Sasha

Sebastian Nordhoff

Feb 23, 2011, 8:53:45 AM
to interli...@googlegroups.com
Dear all,
I very much welcome this discussion group. Good initiative, Sasha!
At the moment, most of the theoretical discussions about interlinear text
and its representation are derived from manipulating corpora, with ELAN,
FLEx, Toolbox, or whatever. It is of course desirable to have well-defined
interchange and storage formats that different programs can understand.
On the other hand, there are uses of interlinear text not directly related
to corpora. In my work, I am dealing with the representation of legacy
grammars, i.e. grammatical descriptions in book form. These works
obviously contain a lot of IGTs, but only a fraction of them correspond to
the things we find in corpora (<40%). Furthermore, these examples often
make use of typographical features which are not used when representing
corpora.

Among the deviations, we find

* Examples where the number of lines does not equal three. For instance,
ungrammatical examples often only have one line. Four-liners and higher
are found when adding phonetics, intonation, or several languages. It
should be possible to take care of this with the type="" attribute of the
FLEx schema. One could perhaps inventory the different types of extra
lines in order to constrain the possible representations

* Speaking of ungrammatical examples, one would probably want an attribute
judgment="*", judgment="?", judgment="OK", in order to separate semantics
from presentation

* Some examples have an introduction line, such as "relative clause with
argument in genitive" or the like, preceding the actual IGT

* Another problem is what I call etymological examples, e.g.

(1) uomo < homo

Here, we are not even dealing with one phrase, but the content is found
on one line nevertheless. Obviously, the language changes in the middle of
the line

* A related issue is very short examples where the gloss is on the same
line as the object language text

(2) bateau `ship'
It would be nice if this typographic arrangement could be recovered from
the XML

* phonological examples are also sometimes compressed on one line

(3) CVVCV: baaru [ba:ru] `new'
(4) /baru/ --> [ba:ru]

* subexamples are also something which is not found in corpora

(5)a. The government has decided ...
b. The government have decided ...

Within subexamples, several lines can be shared, e.g. the translation
line
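
To make the judgment="..." idea from the list above concrete, here is a minimal Python sketch of how a converter might separate a leading typographic mark from the example text. The function name, the mark set and the "OK" default are assumptions for illustration only; they are not part of the FLEx schema or of any existing tool.

```python
import re

# Hypothetical mapping from a leading typographic mark to a judgment value.
# An empty mark is treated as a grammatical ("OK") example.
JUDGMENT_MARKS = {"*": "*", "?": "?", "": "OK"}


def split_judgment(example):
    """Return (judgment, text) for an example that may start with * or ?."""
    match = re.match(r"\s*([*?]?)\s*(.*)", example, flags=re.S)
    mark, text = match.group(1), match.group(2)
    return JUDGMENT_MARKS[mark], text
```

With this separation the mark is stored as data, and a stylesheet can decide how (or whether) to print it.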

I do not have any solution at hand at the moment (I hope to have some at
the end of this year), but I wanted to raise awareness of the fact that
IGT is not limited to corpora alone, and that when developing standards,
one could have those other uses in mind.

Best wishes
Sebastian

On Wed, 23 Feb 2011 12:56:56 +0100, peterbouda <peb...@googlemail.com> wrote:

> Hi,
>
> indeed, a good start, although I am not familiar with FLEx and which
> needs it addresses...
>
> I am currently working on using the ELAN files itself as a base to add
> morpho-syntactic annotations. I developed a simple "Interlinear Editor"
> we used in a DOBES project ("Minderico"):
>
> http://www.cidles.eu/ltll/poio-ile
>
> It just adds tiers for those levels we want to annotate, namely
> "morpheme", "gloss" and "translation" ("phrase" and "word" should
> already be there and were created with ELAN), and allows to edit them
> in an interlinear style. Everything is based on EAF, if you save a file
> you can open it in ELAN again (see the demo video on the page).
>
> I am currently working on an analyzation tool to search those files on
> the different tiers, this will work for ELAN, Toolbox and Kura files.
>
> I am still wondering what is the best way to "tell" the software which
> tiers contain the interlinear information to display. I was thinking
> about some kind of additional XSD, that not only checks for basic
> structure, but also some kind of "logic" in the tiers, like: Is there
> some kind of "tier tree": "phrase" -> "word" -> "morpheme" -> "gloss"?
> This would allow me to just quickly check the file if it is displayable
> in my window.
>
> To be able to support different file formats I created a Python module
> PyAnnotation (http://www.cidles.eu/ltll/poio-pyannotation), that is my
> "proxy" for the annotation formats. I just derive a common (Python)
> data structure from the data files. See the examples on the page, how
> the common structure looks like. I use those data structure to display
> the data in different views (currently a read-only view for
> analyzation, and an edit view for the interlinear editor; the read-only
> view supports quick loading for a batch of EAF and Toolbox files, the
> editor is quite slow for big files right now).
>
> Peter



peterbouda

Feb 23, 2011, 6:56:56 AM
to Interlinear XML

Michael Maxwell

Feb 23, 2011, 10:42:26 AM
to interli...@googlegroups.com, Michael Maxwell
Sebastian Nordhoff wrote:
> * Examples where the number of lines does not equal three...
> Four-liners and higher
> are found when adding phonetics, intonation, or several languages.

We're using four-liners in our grammars of languages that use right-to-left scripts (in our case, Perso-Arabic scripts). The first line is the utterance in the right-to-left script, while the remaining lines are the usual (utterance in a left-to-right transcription, morpheme-level glosses, free translation). The 2nd and 3rd lines are aligned at the word level, the first and fourth lines are unaligned. The motivation is that it's difficult for the reader to make sense of glosses ordered right-to-left when the glosses themselves are written in a left-to-right script. (I can illustrate this if it's unclear.)

I think in general the structure of interlinear glossed text is
unaligned_text_line*
aligned_text*
unaligned_text_line*
There are of course limitations--if you have any aligned text lines, there must be at least two of them, otherwise it doesn't make sense to speak of alignment; and at least one of the above must be present, or you have an empty interlinear! And preferably there must be at least two lines, or it isn't interlinear. But these are hard to constrain, and I think unnecessary to make explicit.
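
As a rough illustration of the pattern above, the check can be written as a regular expression over line kinds. This is only a sketch: the 'U'/'A' encoding and the function name are invented here, and it enforces just the two hard limitations (at least one line, and aligned lines coming in groups of two or more), not the "preferably two lines" preference.

```python
import re


def is_valid_igt(kinds):
    """Check a top-to-bottom list of line kinds against
    unaligned* aligned* unaligned*.

    kinds: list of 'U' (unaligned line) or 'A' (aligned line).
    """
    layout = "".join(kinds)
    # Non-empty, and any aligned block must contain at least two lines.
    return bool(layout) and re.fullmatch(r"U*(AA+)?U*", layout) is not None
```

For instance, the four-liner described above (script line, transcription, glosses, free translation) is `["U", "A", "A", "U"]`.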

With regard to the aligned_text in the above structure, there are two ways of looking at it. One is that aligned text consists of two or more lines, which must contain equal numbers of tokens, and that the alignment is implicit. The other is that aligned text consists of one or more bundles of tokens, where each bundle has the same number of tokens (at least two); here the alignment is explicit, while the lines are implicit. Both representations have their place, and inter-conversion is not too difficult.
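
The inter-conversion between the two views is essentially a transpose. A minimal Python sketch, with placeholder tokens and hypothetical function names:

```python
def lines_to_bundles(lines):
    """Turn aligned lines (lists of tokens) into per-position bundles."""
    if len({len(line) for line in lines}) > 1:
        raise ValueError("aligned lines must contain equal numbers of tokens")
    return [tuple(column) for column in zip(*lines)]


def bundles_to_lines(bundles):
    """Turn bundles back into lines; the inverse of lines_to_bundles."""
    return [list(row) for row in zip(*bundles)]


# Placeholder data: one line of word tokens, one line of glosses.
lines = [["tok1", "tok2"], ["gloss1", "gloss2"]]
bundles = lines_to_bundles(lines)
```

The length check is where the implicit alignment of the line-based view becomes an explicit constraint.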

> * Some examples have an introduction line, such as "relative clause with
> argument in genitive" or the like, preceding the actual IGT.

The issue of examples, as opposed to the interlinear text itself, becomes quite messy. For instance, yesterday I saw a grammar in which some of the examples had explanatory text right-justified to the right of the first line of the example--sort of like a call-out. And in our work we allow footnotes on the interlinears (we use them to present "correctly" spelled versions of the vernacular, when the example is taken from a "freely" spelled corpus of written text). Not to mention the sub-examples, phonological examples etc. that Sebastian mentions in his email below, some of which aren't even glossed interlinear in the usual sense.

My suggestion would be that we concentrate on the interlinear text itself, and not on the examples. We may or may not want to include what I would call in-line examples, like /barco/ "ship" (I think this is what Sebastian means by his bateau `ship' example). The structure of these is quite similar to glossed interlinear text, and the main difference (that they are generally formatted with the two or three parts as running text, rather than vertically aligned interlinear) is a presentation issue, not a structure issue. (That said, in our own work we did provide two different XML schemas, one for standard interlinears, and one for in-line text. But we probably didn't need to.)

Mike Maxwell
Area Director, Technology Use
CASL/ University of Maryland
"Digital data lasts forever -- or five years, whichever comes first."
--Jeff Rothenberg, 1997

Michael Maxwell

Feb 23, 2011, 10:48:15 AM
to Interlinear XML, Michael Maxwell
Peter Bouda wrote:
> ... Is there some kind of "tier tree":
> "phrase" -> "word" -> "morpheme" -> "gloss"?

The difficulty with the "phrase" level is that it (unlike the word/ morpheme/ gloss levels, at least in most theories) is recursive. That is, a context-free (at least) grammar, not a finite state grammar. So if you want to ensure that brackets match (phrases are properly nested), it becomes much more difficult.

Are there good use cases of phrase marking in glossed interlinear texts? Perhaps in practice the marking is not nested?
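
For the flat, non-nested case, the tier-tree check Peter describes might look like the Python sketch below. It assumes the EAF convention that each TIER element carries a TIER_ID and, for dependent tiers, a PARENT_REF attribute. The tier names are project-specific, and the sample document is a stripped-down illustration rather than a complete EAF file.

```python
import xml.etree.ElementTree as ET

# Stripped-down illustration; a real EAF file has more elements and attributes.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="phrase"/>
  <TIER TIER_ID="word" PARENT_REF="phrase"/>
  <TIER TIER_ID="morpheme" PARENT_REF="word"/>
  <TIER TIER_ID="gloss" PARENT_REF="morpheme"/>
</ANNOTATION_DOCUMENT>"""


def tier_parents(eaf_xml):
    """Map each tier ID to its parent tier ID (None for top-level tiers)."""
    root = ET.fromstring(eaf_xml)
    return {t.get("TIER_ID"): t.get("PARENT_REF") for t in root.iter("TIER")}


def has_chain(parents, chain):
    """True if every tier in chain exists and each is the parent of the next."""
    if any(name not in parents for name in chain):
        return False
    return all(parents[child] == parent
               for parent, child in zip(chain, chain[1:]))


displayable = has_chain(tier_parents(SAMPLE),
                        ["phrase", "word", "morpheme", "gloss"])
```

This only inspects the declared tier hierarchy; it says nothing about whether the phrase annotations themselves are nested.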

Mike Maxwell
Area Director, Technology Use
CASL/ University of Maryland
"Digital data lasts forever -- or five years, whichever comes first."
--Jeff Rothenberg, 1997



Alexandre Arkhipov

Mar 5, 2011, 3:33:30 PM
to interli...@googlegroups.com
Hi Peter,

FLEx (or Fieldworks Language Explorer) is part of the SIL Fieldworks suite,
designed primarily for field linguists. It is a complex tool which lets you
create a dictionary and interlinearize texts, with tight links between
the two (every morpheme occurrence in the texts is linked to its entry in
the dictionary). It also lets you train a rule-based parser which will use
your morpheme dictionary and the constraints you specify to automate
morphological analysis. Here is their home page:
http://fieldworks.sil.org/flex/.


We're currently using both FLEx and ELAN in our documentation projects, and
there are colleagues around who use either but also Toolbox or just MS Word.
So we had an idea very similar to your Poio Analyzer, i.e. a lightweight
search tool which would make searches across multiple files of all these
kinds. A standalone tool is necessary to avoid limitations of both ELAN
(heavy media files slow down single result display, limited output options)
and FLEx (search only one project=language at a time, no search for
constructions across tiers). We chose to bring them first to an intermediary
format, and that was a starting point of my paper at ICLDC and of this
discussion group.

The tool is called SearchToo, it is available for download here (it has an
English interface but the manual is only in Russian):
http://www.iling-ran.ru/searchtoo (work in progress...). Currently it takes
files exported from FLEx, and can transform Toolbox files or Toolbox-like
Word documents into the same kind of XML. SearchToo does not limit the tiers
to use, you can search on any tiers present in your file. That's why I haven't
thought of inferring the tier relations from data. I believe it should be
easy to get the hierarchy itself from Elan files, but if there are multiple
tiers at the same level (e.g. phrase number, literal translation, free
translation at phrase level; underlying morpheme form, English gloss,
Russian gloss at morpheme level) there is probably no way to guess which
one(s) you need. Do you limit the set of tiers displayed for display/ease
of processing reasons, or are there some deeper considerations?

But it will probably be of use anyway to store a description of the tiers'
meaning somewhere?

Sasha


Han Sloetjes

Apr 3, 2011, 4:51:48 PM
to interli...@googlegroups.com
Dear Sasha, all,

Establishing a standard is often a time-consuming process. This will
probably not be different for a standard for interlinear glossed texts.
In the meantime we can work on some of the other points that you
mention, concerning the interoperability between FLEx and ELAN.

For the ELAN part of it, one of the most important issues is retaining
time-alignment information, your point 5.
It would be good to know whether there is a chance that FLEx/the FLEx
xml will have dedicated elements or attributes for that. (These elements
or attributes can have more generic names than "time-from" and "time-to"
in order to support other types of media, e.g. text, to be annotated. If
that would make sense for the FLEx tool, that is.)
If it is unlikely that such elements/attributes will be added to FLEx,
we can work on some convention for encoding/decoding alignment
information in e.g. "note" items or maybe there are better candidates.
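
One possible shape for such a convention, sketched in Python. The bracketed note syntax and the millisecond unit are invented here purely for illustration; only the names "time-from" and "time-to" come from the discussion above, and nothing below reflects what FLEx or ELAN actually do.

```python
import re

# Hypothetical note syntax: "[align time-from=<ms> time-to=<ms>]"
NOTE_PATTERN = re.compile(r"^\[align time-from=(\d+) time-to=(\d+)\]$")


def encode_alignment(start_ms, end_ms):
    """Pack a time interval (in milliseconds) into a note string."""
    return f"[align time-from={start_ms} time-to={end_ms}]"


def decode_alignment(note):
    """Return (start_ms, end_ms), or None if the note is an ordinary note."""
    match = NOTE_PATTERN.match(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A fixed, anchored pattern lets an importer tell alignment notes apart from free-text notes, which pass through unchanged.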

Instead of concatenating "note" items the importer in ELAN could create
numbered note-n tiers, like in your example. There is an Interlinear
Viewer in ELAN which needs to be improved a lot and then it might be a
better choice for imported FLEx data than the default Timeline viewer
(because the visualization is based on the length of the text of the
annotations).
"punct" items are now added to the corresponding "text" tier. This
differs from the way most other items are dealt with and it has to be
seen how this can be exported again correctly.
But before starting to work on an export function for FLEx it would be
good to have an idea of what can be expected (and what not) with respect
to the alignment notation issue.

Best wishes,

Han
