Following some discussions during the past ICLDC conference, I'm posting a proposal for a standard XML format for interlinear glossed texts (IGT) based on Fieldworks (FLEx) Interlinear XML, and I invite you to comment on it. I'm also sending the slides from my ICLDC presentation.
Since the number of people concerned was high, I took the liberty of creating this discussion group and subscribing you to it; if you don't wish to receive messages from this group, please excuse me and drop me a line at sar...@mail.ru. Please also do so if you know of someone who'd be interested in joining the group.
The list currently includes the following:
Alexander Arkhipov (MGU)
Michael Daniel (MGU)
Oleg Belyaev (MGU)
Valentin Goussev (Institute of Linguistics RAS; SearchToo)
Dmitry Gorshkov (SearchToo)
Han Slotjes (MPI Nijmegen, ELAN)
Beth Bryson (FLEx)
Susanna Imrie (FLEx)
John Thomson (FLEx)
John Hatton (WeSay)
Mike Maxwell (U of Maryland)
Jeff Good (U of Buffalo)
Alexander Nakhimovsky (U of Colgate)
Tom Myers (N-Topus Software)
Sebastian Drude (U of Frankfurt)
Sebastian Nordhoff (MPI Leipzig)
Johannes Helmbrecht (U of Regensburg)
Peter Bouda (U of Munich)
Nick Thieberger (U of Melbourne)
Best wishes,
Sasha Arkhipov
This list is a good idea, thanks. There was a similar attempt made by
Alexis Palmer a few years ago
(http://groups.google.com/group/igt-anno/topics only a few messages
before it fell into disuse) to get consensus on a schema for IGT.
I have been working on a representation of IGT called EOPAS and would
have presented it at ICLDC if I had gone. It allows you to upload XML
files of IGT together with media and to read the IGT. It calls
timecodes in media using HTML5 media calls (i.e., no Flash or media
server is needed). The page describing it is here:
http://www.linguistics.unimelb.edu.au/research/projects/eopas.html
The requirements and specifications are here:
http://www.linguistics.unimelb.edu.au/research/projects/eopasReqs.html
This is a much simpler system than the one Sasha proposes and is
really aimed at getting samples of monologic IGT and media online. It
is, however, an open-source system and one that could be used as the
basis for adding higher-level units (paragraphs) or multi-participant
discourse.
I'm sure there will be simple (!) XSL conversions between the various
proposed systems.
I look forward to this group having a productive discussion,
Nick
2011/2/19 Alexandre Arkhipov <sar...@mail.ru>:
> Dear friends,
>
> Following some discussions during the past ICLDC conference, I’m posting a proposal for a standard XML format for interlinear glossed texts (IGT) based on Fieldworks (FLEx) Interlinear XML and invite you to comment upon. I'm also sending slides from my ICLDC presentation.
>
> Since the number of people concerned was high I took the liberty create this discussion group and subscribe you to it; if you don’t wish to receive the messages from this group, please excuse me and drop me a line at sar...@mail.ru. Also please do so if you know of someone who’d be interested in joining the group.
Thank you very much for your reply. I was so sure I'd meet you at the
conference; I hope all is well with you now!
Thanks for the links, I'll check them all. I've heard about EOPAS but not of
Alexis Palmer's initiative.
And yes, I hope that independently of the format details there will be
simple XSL transforms between the things we all need.
All best,
Sasha
Among the deviations, we find
* Examples where the number of lines does not equal three. For instance,
ungrammatical examples often only have one line. Four-liners and higher
are found when adding phonetics, intonation, or several languages. It
should be possible to take care of this with the type="" attribute of the
FLEx schema. One could perhaps draw up an inventory of the different types
of extra lines in order to constrain the possible representations
* Speaking of ungrammatical examples, one would probably want an attribute
judgment="*", judgment="?", judgment="OK", in order to separate semantics
from presentation
* Some examples have an introduction line, such as "relative clause with
argument in genitive" or the like, preceding the actual IGT
* Another problem is what I call etymological examples, e.g.
(1) uomo < homo
Here, we are not even dealing with one phrase, but the content is found
on one line nevertheless. Obviously, the language changes in the middle of
the line
* A related issue is very short examples where the gloss is on the same
line as the object language text
(2) bateau `ship'
It would be nice if this typographic arrangement could be recovered from
the XML
* phonological examples are also sometimes compressed on one line
(3) CVVCV: baaru [ba:ru] `new'
(4) /baru/ --> [ba:ru]
* subexamples are also something which is not found in corpora
(5)a. The government has decided ...
b. The government have decided ...
Within subexamples, several lines can be shared, e.g. the translation
line
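Purely as an illustration of the type="" and judgment ideas above (not a
proposed solution), here is a minimal Python sketch. The element names
"phrase" and "item" are only loosely modelled on the FLEx interlinear
export, and the judgment attribute, the "template" and "phonetic" line
types, and the helper names are my own assumptions, not part of any
existing schema.

```python
import xml.etree.ElementTree as ET


def make_example(lines, judgment="OK"):
    """Build one hypothetical <phrase> element.

    lines: list of (line_type, lang, text) tuples in display order.
    The number of lines is free, so one-liners and four-liners both fit.
    """
    phrase = ET.Element("phrase", {"judgment": judgment})
    for line_type, lang, text in lines:
        item = ET.SubElement(phrase, "item", {"type": line_type, "lang": lang})
        item.text = text
    return phrase


def line_types(phrase):
    """Return the type of every line of an example, in order."""
    return [item.get("type") for item in phrase.findall("item")]


# Example (2): object-language text and gloss, two lines only.
short_example = make_example([("txt", "fr", "bateau"), ("gls", "en", "ship")])

# Example (3): a phonological example with four pieces of information.
# "und" is the standard language tag for an undetermined language.
phon_example = make_example([
    ("template", "und", "CVVCV"),
    ("txt", "und", "baaru"),
    ("phonetic", "und", "ba:ru"),
    ("gls", "en", "new"),
])

print(ET.tostring(short_example, encoding="unicode"))
print(line_types(phon_example))
```

Keeping the judgment as an attribute rather than as a "*" or "?" typed into
the text line means that a stylesheet can decide how to print it, and a
search tool can filter on it.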
I do not have any solution at hand at the moment (I hope to have some at
the end of this year), but I wanted to raise awareness of the fact that
IGT is not limited to corpora alone, and that when developing standards,
one could have those other uses in mind.
Best wishes
Sebastian
On Wed, 23 Feb 2011 12:56:56 +0100, peterbouda <peb...@googlemail.com>
wrote:
> Hi,
>
> indeed, a good start, although I am not familiar with FLEx and which
> needs it addresses...
>
> I am currently working on using the ELAN files themselves as a base
> for adding morpho-syntactic annotations. I developed a simple
> "Interlinear Editor" that we used in a DOBES project ("Minderico"):
>
> http://www.cidles.eu/ltll/poio-ile
>
> It just adds tiers for the levels we want to annotate, namely
> "morpheme", "gloss" and "translation" ("phrase" and "word" should
> already be there, created with ELAN), and allows editing them in an
> interlinear style. Everything is based on EAF; if you save a file you
> can open it in ELAN again (see the demo video on the page).
>
> I am currently working on an analysis tool to search those files on
> the different tiers; this will work for ELAN, Toolbox and Kura files.
>
> I am still wondering what the best way is to "tell" the software
> which tiers contain the interlinear information to display. I was
> thinking about some kind of additional XSD that checks not only the
> basic structure but also some kind of "logic" in the tiers, like: Is
> there some kind of "tier tree": "phrase" -> "word" -> "morpheme" ->
> "gloss"? This would allow me to quickly check whether the file is
> displayable in my window.
>
> To be able to support different file formats I created a Python
> module, PyAnnotation (http://www.cidles.eu/ltll/poio-pyannotation),
> that is my "proxy" for the annotation formats. I just derive a common
> (Python) data structure from the data files. See the examples on the
> page for what the common structure looks like. I use this data
> structure to display the data in different views (currently a
> read-only view for analysis and an edit view for the interlinear
> editor; the read-only view supports quick loading for a batch of EAF
> and Toolbox files, while the editor is quite slow for big files right
> now).
>
> Peter
FLEx (or Fieldworks Language Explorer) is part of the SIL Fieldworks suite,
designed primarily for field linguists. It is a complex tool that lets you
create a dictionary and interlinearize texts, with tight links between
the two (every morpheme occurrence in the texts is linked to its entry in
the dictionary). It also lets you train a rule-based parser which will use
your morpheme dictionary and the constraints you specify to automate
morphological analysis. Here is their home page:
http://fieldworks.sil.org/flex/.
We're currently using both FLEx and ELAN in our documentation projects, and
there are colleagues around who use either of them, or Toolbox, or just MS
Word. So we had an idea very similar to your Poio Analyzer, i.e. a
lightweight search tool that would search across multiple files of all these
kinds. A standalone tool is necessary to avoid limitations of both ELAN
(heavy media files slow down single result display, limited output options)
and FLEx (search only one project=language at a time, no search for
constructions across tiers). We chose to bring them first to an intermediary
format, and that was a starting point of my paper at ICLDC and of this
discussion group.
The tool is called SearchToo, it is available for download here (it has an
English interface but the manual is only in Russian):
http://www.iling-ran.ru/searchtoo (work in progress...). Currently it takes
files exported from FLEx, and can transform Toolbox files or Toolbox-like
Word documents into the same kind of XML. SearchToo does not limit the tiers
to use, you can search on any tiers present in your file. That's why I haven't
thought of inferring the tier relations from data. I believe it should be
easy to get the hierarchy itself from Elan files, but if there are multiple
tiers at the same level (e.g. phrase number, literal translation, free
translation at phrase level; underlying morpheme form, English gloss,
Russian gloss at morpheme level) there is probably no way to guess which
one(s) you need. Do you limit the set of tiers displayed for display or
ease-of-processing reasons, or are there some deeper considerations?
But it will probably be of use anyway to store a description of the tiers'
meaning somewhere?
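To show what I mean by getting the hierarchy from Elan files, here is a
minimal Python sketch. It relies on the TIER_ID and PARENT_REF attributes
that EAF uses on its TIER elements; the sample document is heavily trimmed
(no annotations, no header), and the tier names and the function names are
only assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical EAF fragment: only the TIER elements are kept.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="phrase" LINGUISTIC_TYPE_REF="utterance"/>
  <TIER TIER_ID="word" PARENT_REF="phrase" LINGUISTIC_TYPE_REF="words"/>
  <TIER TIER_ID="morpheme" PARENT_REF="word" LINGUISTIC_TYPE_REF="morphs"/>
  <TIER TIER_ID="gloss-en" PARENT_REF="morpheme" LINGUISTIC_TYPE_REF="gloss"/>
  <TIER TIER_ID="gloss-ru" PARENT_REF="morpheme" LINGUISTIC_TYPE_REF="gloss"/>
  <TIER TIER_ID="translation" PARENT_REF="phrase" LINGUISTIC_TYPE_REF="free"/>
</ANNOTATION_DOCUMENT>"""


def tier_children(eaf_xml):
    """Map every tier id to the list of its direct child tier ids."""
    root = ET.fromstring(eaf_xml)
    children = {}
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        children.setdefault(tier_id, [])
        parent = tier.get("PARENT_REF")
        if parent is not None:
            children.setdefault(parent, []).append(tier_id)
    return children


def ambiguous_levels(children):
    """Return the parents with several child tiers, where guessing fails."""
    return {parent: kids for parent, kids in children.items() if len(kids) > 1}


hierarchy = tier_children(EAF_SAMPLE)
print(hierarchy["word"])
print(ambiguous_levels(hierarchy))
```

In this sample the tree itself is recovered without trouble, but "phrase"
and "morpheme" each have more than one child tier, which is exactly the
case where the software cannot know which tier to show unless a
description of the tiers is stored somewhere.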
Sasha
----- Original Message -----
From: "peterbouda" <peb...@googlemail.com>
To: "Interlinear XML" <interli...@googlegroups.com>
Sent: Wednesday, February 23, 2011 2:56 PM
Subject: [interlinear: 3] Re: Discussion group for Interlinear standard
format
Establishing a standard is often a time-consuming process. This will
probably not be different for a standard for interlinear glossed texts.
In the meantime we can work on some of the other points that you
mention, concerning the interoperability between FLEx and ELAN.
For the ELAN part of it, one of the most important issues is retaining
time-alignment information, your point 5.
It would be good to know whether there is a chance that FLEx/the FLEx
xml will have dedicated elements or attributes for that. (These elements
or attributes can have more generic names than "time-from" and "time-to"
in order to support other types of media, e.g. text, to be annotated. If
that would make sense for the FLEx tool, that is.)
If it is unlikely that such elements/attributes will be added to FLEx,
we can work on some convention for encoding/decoding alignment
information in e.g. "note" items or maybe there are better candidates.
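To give an idea of what such a convention could look like, here is a
minimal Python sketch. The text layout "time-from=...;time-to=..." with
values in milliseconds is purely an assumption for the sake of discussion,
as are the function names; nothing like this exists in FLEx or ELAN today.

```python
import re

# Hypothetical convention: the whole note consists of two integer
# millisecond values, e.g. "time-from=1200;time-to=3450".
NOTE_PATTERN = re.compile(r"^time-from=(\d+);time-to=(\d+)$")


def encode_alignment(time_from_ms, time_to_ms):
    """Encode a time interval as the text of a "note" item."""
    if time_from_ms < 0 or time_to_ms < time_from_ms:
        raise ValueError("invalid interval")
    return f"time-from={time_from_ms};time-to={time_to_ms}"


def decode_alignment(note_text):
    """Return (time_from_ms, time_to_ms), or None for an ordinary note."""
    match = NOTE_PATTERN.match(note_text.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


note = encode_alignment(1200, 3450)
print(note)
print(decode_alignment(note))
print(decode_alignment("speaker hesitates here"))
```

Ordinary notes that do not follow the pattern are left alone (the decoder
returns None), so an importer could turn them into note tiers as before and
use only the matching ones to restore the alignment.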
Instead of concatenating "note" items the importer in ELAN could create
numbered note-n tiers, like in your example. There is an Interlinear
Viewer in ELAN which needs to be improved a lot and then it might be a
better choice for imported FLEx data than the default Timeline viewer
(because the visualization is based on the length of the text of the
annotations).
"punct" items are now added to the corresponding "text" tier. This
differs from the way most other items are dealt with and it has to be
seen how this can be exported again correctly.
But before starting to work on an export function for FLEx it would be
good to have an idea of what can be expected (and what not) with respect
to the alignment notation issue.
Best wishes,
Han