FYI: Toolbox/ELAN > FLEx import with glosses


Alexandre Arkhipov

Dec 15, 2023, 9:03:26 AM
to flex...@googlegroups.com, Riaposov, Alexander

Dear FLEx community,

It has been a long-standing problem that interlinear texts can only be imported into FLEx with their sentence-level information (e.g. transcription and translations), but word or morpheme analyses are discarded.
For smaller text collections this can be disregarded if one can just repeat all the analysis in FLEx. However, there exist larger collections of texts glossed in Toolbox or ELAN, for which the burden of reanalysis is too heavy.

This year, we succeeded in writing a set of scripts that allowed us to import a very important and valuable collection of IGT, originally glossed in Toolbox, into an existing FLEx project, using the ELAN version of the collection as the source. The imported collection contained over 110,000 tokens. The work was done in the INEL project at Hamburg University (mainly by my colleague Aleksandr Riaposov); thanks go to Ken Zook for consultations.
We are very much satisfied with the result, which I had considered impossible for years.

Now we'd like to ask whether some of you would be interested in this kind of import. The current set of scripts is tailored for our particular setup, and it will need some work to make it more generally applicable. So our question is, how much this is still needed (how many Toolbox collections are still waiting to be FLEx-ed).
So if you're interested, please let us know. We cannot promise anything beforehand, but we're willing to fit some work on this conversion into our schedule if it's indeed useful for the community.

All best,
Alexandre Arkhipov

Ken K

Dec 16, 2023, 12:59:07 PM
to flex...@googlegroups.com, Riaposov, Alexander
Hi Alexandre, 

This sounds great! Are you considering making the scripts available to the larger community? 

The linguistic community could really benefit from this! 

Best wishes, 
Ken 


Emmanuel Adegbuyi

Dec 17, 2023, 6:13:53 PM
to FLEx list
Hello Alexandre,

This is really great!

I am looking forward to getting further updates on this.

Françoise Rose

Dec 18, 2023, 2:37:08 AM
to flex...@googlegroups.com, Riaposov, Alexander

Dear Alexander,

I think that the process you’ve made possible is valuable not only for people who want to shift from Toolbox to Flex, but also for people who use both Flex and Elan.

Getting the transcription/translation from Elan to Flex goes fine, and once the analysis is made in Flex it can easily be exported to Elan, but the cycle stops there: if one makes changes in Elan, the content of the files cannot be fully imported back into Flex, since the interlinear analysis is lost in the process. So the actual fluidity between the two programs is limited for now.

I would be interested if your script enables me to reimport data from Elan to Flex without losing information.

Françoise

 

 


Claire Bowern

Dec 18, 2023, 7:13:06 AM
to flex...@googlegroups.com, Riaposov, Alexander
Hi Françoise,
Check out Flibl (a tool that Amalia Skilton and Sunny Ananthanarayan have been working on); it may meet your needs for working between flex and elan,
Claire

Beth-docs Bryson

Dec 18, 2023, 9:42:34 AM
to flex...@googlegroups.com, Riaposov, Alexander
Is there a link to download flibl and try it out?

-Beth

Alexandre Arkhipov

Dec 18, 2023, 12:02:03 PM
to flex...@googlegroups.com

Dear Françoise,

Yes, potentially it could also be useful for ELAN <> FLEx roundtrip.
For now I'm cautious because the way it is currently implemented is not too quick & easy to run -- it is ok as a one-off action but too tricky for a daily routine (which would be nice for a roundtrip).
But, if you just need to do it once or twice, we can surely look at it. Actually, for data which were already once exported from FLEx, it should be even easier than for our own use case.

All best,
Alexandre


Hugh

Mar 9, 2024, 5:40:36 PM
to FLEx list
Hi, I would strongly encourage good documentation and prominent accessibility for these scripts. I work in language archives. Moving data from legacy formats in older workflows to current workflows is something every archive should be interested in, but sadly many don’t know enough about the formats to really prepare well.

So even if only a limited number of scholars are interested today, archivists often don’t know to ask this question, and it is likely that future scholars will appreciate your work.

Kind regards,
Hugh

Alexandre Arkhipov

Mar 14, 2024, 11:45:34 PM
to flex...@googlegroups.com

Dear all,

Thanks for your interest. I'll now explain a bit how it works.

0. Our use case is this:
We have a large collection of texts glossed in Toolbox, now imported into ELAN.
On the other hand, we have a FLEx project for the same language, initially independent, with other texts glossed in a slightly different way. The goal was to merge the two in FLEx, keeping the glosses from both parts while making them consistently follow the conventions of the current FLEx project.

So in addition to 'just' importing glosses into FLEx, we are also doing some replacements to get the same transcription conventions, glosses, POS labels and speaker codes as we have in the existing FLEx project.
Other users may not have this problem; it will be easier to import everything as is.
However, you do need to have an existing FLEx project before starting with the import. Crucially, you will need to know FLEx codes for all the tiers and writing systems you will be importing.

In our case, we already had two glossing lines in Toolbox, with English and Russian glosses. The importing script checks whether a given morph with the given POS and glosses already exists in the FLEx lexicon.
If you only have one glossing line, it will be simpler (less to check). If you have more than two, the script will need to be adapted to import and check more than two glosses.

1. Preprocessing stage
At this stage, we get from ELAN files to flextext files with all the necessary tier and writing system codes.
We're actually using a series of four XSLT scripts (using XSLT 3.0 with XPath 3.1, which I process with Saxon 9.9 PE via Oxygen; however, the free edition of Saxon, SaxonC-HE or SaxonJ-HE, should do perfectly well).
(i) systematic rearrangement and renaming of the ELAN tiers, replacements in speaker codes
(ii) replacements in transcription (tx, mb), POS labels (ps: parts of speech and morphological slots, like "n", "n>v", "n:case"), and glosses (ge, also gr in our case)
(iii) capitalize the first word in a sentence, add a full stop at the end of a sentence
>> At this moment, the ELAN files are ready for export into flextext. Done in ELAN with "Export multiple files" command.
(iv) tweaking the flextext: extracting punctuation from inside a word into a separate word with type="punct", adding the complete text of a sentence as a phrase/item element
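
For readers who would rather not use XSLT, steps (ii) and (iii) can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the INEL stylesheets: the replacement table and the function names are hypothetical, and the rule for when to add a full stop (only if the sentence does not already end in ".", "?", "!" or "…") is a guess at a sensible default.

```python
# Hypothetical replacement table: label used in the Toolbox data -> label used
# in the target FLEx project. The real tables depend on the project.
POS_REPLACEMENTS = {"N": "n", "V": "v", "N>V": "n>v"}


def replace_label(label: str, table: dict[str, str]) -> str:
    """Return the mapped label, or the label unchanged if it has no mapping."""
    return table.get(label, label)


def normalize_sentence(text: str) -> str:
    """Capitalize the first letter and add a full stop if none is present."""
    text = text.strip()
    if not text:
        return text
    # Only the first character is changed; the rest of the sentence is kept.
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!…":
        text += "."
    return text
```

The same `replace_label` helper can be reused for glosses or speaker codes by passing a different table.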

My guess is that most users will not necessarily need (i)-(iii) and can just start with the export from ELAN. All the tier and writing system codes can be specified in the export dialog window, although it can take some clicks.
But (iv) is probably necessary to get punctuation treated the right way.
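
The punctuation part of step (iv) could look roughly like the following in Python. It is a sketch, not the actual XSLT: it assumes the usual flextext layout in which each `<word>` holds an `<item type="txt">` with the word form and punctuation is a `<word>` whose item has `type="punct"`, and it only detaches leading and trailing punctuation. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Leading non-word characters, the core of the token, trailing non-word characters.
TOKEN_RE = re.compile(r"^(\W*)(.*?)(\W*)$", re.DOTALL)


def punct_word(text: str, lang: str) -> ET.Element:
    """Build a <word> that carries only a punctuation item."""
    word = ET.Element("word")
    item = ET.SubElement(word, "item", {"type": "punct", "lang": lang})
    item.text = text
    return word


def split_punctuation(words: ET.Element, lang: str) -> None:
    """Rewrite the children of a <words> element in place."""
    rebuilt = []
    for word in list(words):
        item = word.find("item[@type='txt']")
        if item is None or not item.text:
            rebuilt.append(word)  # nothing to split, keep as is
            continue
        lead, core, trail = TOKEN_RE.match(item.text).groups()
        if not core:
            # The token is punctuation only: just relabel it.
            item.set("type", "punct")
            rebuilt.append(word)
            continue
        if lead:
            rebuilt.append(punct_word(lead, lang))
        item.text = core
        rebuilt.append(word)
        if trail:
            rebuilt.append(punct_word(trail, lang))
    words[:] = rebuilt
```

Word-internal characters such as hyphens or apostrophes are left untouched, since only the edges of the token are examined.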

2. Import stage
The importer script is written in Python (by my colleague Aleksandr Riaposov, not by myself, so I won't be as detailed as for the preprocessing step). Any recent version of Python 3 should be ok.
It should be possible to run the XSLT scripts from within Python and thus avoid a separate preprocessing step for the user if starting from (iv), but we haven't yet tried to.
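
One possible way to do this (untested in our setup) is to call the Saxon command line from Python and chain the stylesheets. The sketch below assumes Java and a Saxon-HE jar are installed; the jar path, file names and function names are hypothetical, while `-s:`, `-xsl:` and `-o:` are the standard Saxon command-line options for source, stylesheet and output.

```python
import subprocess
from pathlib import Path


def saxon_command(saxon_jar: Path, source: Path, stylesheet: Path, output: Path) -> list[str]:
    """Build the command line for one Saxon transformation."""
    return [
        "java", "-jar", str(saxon_jar),
        f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}",
    ]


def run_stylesheets(saxon_jar: Path, source: Path, stylesheets: list[Path], workdir: Path) -> Path:
    """Apply the stylesheets in order, feeding each output into the next step."""
    workdir.mkdir(parents=True, exist_ok=True)
    current = source
    for step, stylesheet in enumerate(stylesheets, start=1):
        output = workdir / f"step{step}_{source.name}"
        # check=True stops the chain as soon as one transformation fails.
        subprocess.run(saxon_command(saxon_jar, current, stylesheet, output), check=True)
        current = output
    return current
```

Keeping every intermediate file makes it easier to see which stylesheet introduced a problem.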

This is also not a single script but a collection of Python scripts. They analyze and modify the *.fwdata file directly.
That is, one must naturally make a backup and close FLEx, then preferably copy the *.fwdata file from the ProgramData/SIL/Fieldworks/Projects folder elsewhere and run the scripts.
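
The copy step can be scripted so that it is never forgotten. This is a minimal sketch with hypothetical names; it only copies the file and does not replace a proper backup made from within FLEx.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_working_copy(fwdata: Path, workdir: Path) -> tuple[Path, Path]:
    """Copy the project file twice: a timestamped backup and a working copy.

    FLEx must be closed before this is called. The scripts should then be run
    on the working copy only, never on the file in the Projects folder.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = workdir / f"{fwdata.stem}.{stamp}.bak{fwdata.suffix}"
    working = workdir / fwdata.name
    shutil.copy2(fwdata, backup)   # untouched copy to revert to
    shutil.copy2(fwdata, working)  # copy that the import scripts will modify
    return backup, working
```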

First, the scripts analyze the database to make lists of existing objects of all kinds (morphs, wordforms, analyses, texts, speakers, etc.).
Then, they proceed to add new texts to the *.fwdata file.
In every text, every wordform is checked against the existing ones. If there is already a wordform with the same analysis, just a link to the existing object is added. Otherwise, a new wordform is created.
The same goes for other types of objects, e.g. morphs: each morph is checked against the lexicon. If a morph already exists with the same form (either main form or alternate form (allomorph)), the same part of speech and the same glosses, then just a link to it is added. Otherwise, a new lexical entry is created. Homograph numbers are adjusted if necessary, but it is better to reassign them after the import by launching the corresponding utility.
All the newly added objects are appended to the end of the *.fwdata file.
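
The logic described above (inventory first, then "link if it exists, otherwise create") can be illustrated with a short Python sketch. It is not the actual importer. It assumes that a *.fwdata file is XML with a flat list of `<rt>` elements carrying `class` and `guid` attributes, and it represents lexicon objects as plain dictionaries rather than real FLEx objects; `MorphIndex` and the example forms are hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET
from collections import Counter


def count_objects_by_class(fwdata_source) -> Counter:
    """Count <rt> objects per class (assumed *.fwdata layout)."""
    counts = Counter()
    for _event, elem in ET.iterparse(fwdata_source, events=("end",)):
        if elem.tag == "rt":
            counts[elem.get("class", "?")] += 1
            elem.clear()  # drop the object's children to keep memory use low
    return counts


class MorphIndex:
    """Maps (form, part of speech, glosses) to the guid of a lexicon object."""

    def __init__(self):
        self.by_key = {}
        self.new_objects = []  # objects to append to the end of the file

    def register(self, guid, forms, pos, glosses):
        """Index an existing entry under its main form and every allomorph."""
        for form in forms:
            self.by_key[(form, pos, tuple(glosses))] = guid

    def get_or_create(self, form, pos, glosses):
        """Return (guid, created): link to a match or queue a new object."""
        key = (form, pos, tuple(glosses))
        if key in self.by_key:
            return self.by_key[key], False
        guid = str(uuid.uuid4())
        self.by_key[key] = guid
        self.new_objects.append(
            {"guid": guid, "form": form, "pos": pos, "glosses": tuple(glosses)}
        )
        return guid, True


# Hypothetical example with two glossing languages, as in our data.
index = MorphIndex()
index.register("guid-1", ["kala", "kal"], "n", ["fish", "рыба"])
guid, created = index.get_or_create("kal", "n", ["fish", "рыба"])  # allomorph match
```

Because the glosses are stored as a tuple in the key, the same structure works for one, two or more glossing lines.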

When the scripts are done, the modified *.fwdata should be copied in place of the old one.
Naturally, there is a risk that something will get broken, so you must have a backup to revert to the stage before running the scripts. So far, we haven't had any serious issues with the modified project file *after* the scripts have completed successfully (i.e. without stopping with an error message). Some built-in FLEx utilities should probably be run afterwards, for instance to merge duplicate wordforms and analyses and to reassign homograph numbers.

Please feel free to ask further questions. We will try to make the scripts customizable (e.g. by providing the writing system codes through a dialog or config file) and add some documentation, and let you know when it's worth trying.

All best,
Alexandre

06/03/2024 00:46, Patricia A wrote:
I am also very interested in this set of scripts. Particularly curious to see the mechanism you used to move the data back into FLEx.

I am not worried about the complexity or the need for customization - I have a software engineering background and would likely be retooling this for another use. 

Are these scripts sharable? What programming language are they written in? 

Thx!

Claire Bowern

Mar 16, 2024, 2:05:21 PM
to flex...@googlegroups.com
Hi all,
The flibl team is currently working on a paper and documentation for the code and it should be available soon,
Claire

Bart-Jacqueline Eenkhoorn

Mar 18, 2024, 9:01:35 AM
to asigwan via FLEx list
Hello Alexandre,

 You recently asked:
> Now we'd like to ask whether some of you would be interested in this kind of import
The answer is an affirmative YES!

One of the reasons why we stayed so long on Toolbox was the inability to import glosses into FLEx. Once we finally (2023) did convert to FLEx we had to leave our texts behind ....

I can very well imagine that many of those who invested years into compiling language data in Toolbox have reluctantly done the same, and were just not able to invest the (extra) time to work through thousands of lines of text. So I think there will indeed be many SFM text databases out there, just waiting to be unearthed and integrated. If the scripts could be used without ELAN, that would be an advantage for the Toolbox-only users.

Thank you for investing and filling this great void, and thank you in advance for making these scripts available.
Bart.
