Dear FLEx community,
It has been a long-standing problem that interlinear texts can
only be imported into FLEx with their sentence-level information
(e.g. transcription and translations), but word or morpheme
analyses are discarded.
For smaller text collections this can be disregarded if one can
just repeat all the analysis in FLEx. However, there exist
larger collections of texts glossed in Toolbox or ELAN, for
which the burden of reanalysis is too heavy.
This year, we succeeded in writing a set of scripts which allowed
us to import a very important and valuable collection of IGT,
originally glossed in Toolbox, via their ELAN version as the
source, into an existing FLEx project. The size of the imported
collection was over 110,000 tokens. The work was done in the
INEL project at Hamburg University (mainly by my colleague
Aleksandr Riaposov); thanks go to Ken Zook for consultations.
We are very satisfied with the result, which I had
considered impossible for years.
Now we'd like to ask whether some of you would be interested in
this kind of import. The current set of scripts is tailored for
our particular setup, and it will need some work to make it more
generally applicable. So our question is how much this is still
needed (how many Toolbox collections are still waiting to be
FLEx-ed).
So if you're interested, please let us know. We cannot promise
anything beforehand, but we're willing to fit some work on this
conversion into our schedule if it's indeed useful for the
community.
All best,
Alexandre Arkhipov
--
"FLEx list" messages are public. Only members can post.
flex_d...@sil.org
http://groups.google.com/group/flex-list.
---
You received this message because you are subscribed to the Google Groups "FLEx list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flex-list+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flex-list/bbead2f3-e671-44ed-ad95-36a7e7e5e361%40mail.ru.
Dear Alexander,
I think that the process you've made possible is valuable not only for people who want to shift from Toolbox to FLEx, but also for people who use both FLEx and ELAN.
Getting the transcription/translation from ELAN to FLEx goes fine, and once the analysis is made in FLEx it can easily be exported to ELAN, but here the cycle stops: if one makes changes in ELAN, the content of the files is not fully importable into FLEx, since the interlinear analysis will be lost in the process. So the actual fluidity between the two programs is limited for now.
I would be interested if your script enables me to reimport data from Elan to Flex without losing information.
Françoise
From: 'Alexandre Arkhipov' via FLEx list <flex...@googlegroups.com>
Sent: Friday, 15 December 2023 15:03
To: flex...@googlegroups.com
Cc: Riaposov, Alexander <aleksandr...@uni-hamburg.de>
Subject: [FLEx] FYI: Toolbox/ELAN > FLEx import with glosses
Dear Françoise,
Yes, potentially it could also be useful for an
ELAN <> FLEx roundtrip.
For now I'm cautious because the way it is currently implemented
is not too quick & easy to run -- it is ok as a one-off
action but too tricky for a daily routine (which would be nice
for a roundtrip).
But, if you just need to do it once or twice, we can surely look
at it. Actually, for data which were already once exported from
FLEx, it should be even easier than for our own use case.
All best,
Alexandre
Dear all,
Thanks for your interest. I'll now explain a bit how it works.
0. Our use case is this:
We have a large collection of texts glossed in Toolbox, now
imported into ELAN.
On the other hand, we have a FLEx project for the same language,
initially independent, with other texts glossed in a slightly
different way. The goal was to merge both in FLEx, keeping the
glosses in both parts but making them consistently follow the
conventions of the current FLEx project.
So in addition to 'just' importing glosses into FLEx, we are
also doing some replacements to get the same transcription
conventions, glosses, POS labels and speaker codes as we have in
the existing FLEx project.
Other users may not have this problem; it will be easier to
import everything as is.
However, you do need to have an existing FLEx project before
starting with the import. Crucially, you will need to know FLEx
codes for all the tiers and writing systems you will be
importing.
In our case, we had two glossing lines already in Toolbox, with
English and Russian glosses. The importing script checks whether
a given morph with the given POS and glosses already exists in
the FLEx lexicon.
If you only have one glossing line, it will be simpler (less to
check). If you have more than two, the script will need to be
adapted to import and check more than two glosses.
1. Preprocessing stage
At this stage, we get from ELAN files to flextext files with all
the necessary tier and writing system codes.
We're actually using a series of four XSLT scripts (using XSLT
3.0 with XPath 3.1, which I process with Saxon 9.9 PE via
Oxygen; however the free edition of Saxon, SaxonC-HE or SaxonJ-HE, should do
perfectly well).
(i) systematic rearrangement and renaming of the ELAN tiers,
replacements in speaker codes
(ii) replacements in transcription (tx, mb), POS labels (ps:
parts of speech and morphological slots, like "n", "n>v",
"n:case"), and glosses (ge, also gr in our case)
(iii) capitalizing the first word in a sentence, adding a full stop
at the end of a sentence
>> At this point, the ELAN files are ready for export
into flextext. This is done in ELAN with the "Export multiple
files" command.
(iv) tweaking the flextext: extracting punctuation from inside a
word into a separate word with type="punct", adding the complete
text of a sentence as a phrase/item element
My guess is that most users will not necessarily need (i)-(iii)
and can just start with the export from ELAN.
All the tier and writing system codes can be specified in the
export dialog window, although it can take some clicks.
But (iv) is probably necessary to get punctuation treated the
right way.
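To give an idea of the punctuation part of (iv), here is a minimal Python sketch (our actual step is an XSLT script, so this is only an illustration). It assumes the usual flextext layout in which each word element holds an item with type="txt", and it puts the type="punct" marker on the item of the new word, as FLEx's own flextext export does. It only handles trailing punctuation; the function name and the list of punctuation characters are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only these trailing characters are split off; extend as needed.
TRAILING_PUNCT = re.compile(r"^(.*?)([.,;:!?]+)$")


def split_trailing_punct(words_elem, lang):
    """Move trailing punctuation of each word into a separate punct word.

    words_elem is a flextext <words> element; lang is the writing system
    code to put on the new punctuation item.
    """
    for word in list(words_elem):  # iterate over a copy, we insert while looping
        item = word.find("item[@type='txt']")
        if item is None or not item.text:
            continue
        match = TRAILING_PUNCT.match(item.text)
        if not match or not match.group(1):
            continue  # nothing to split, or the token consists of punctuation only
        item.text = match.group(1)
        punct_word = ET.Element("word")
        punct_item = ET.SubElement(punct_word, "item", {"type": "punct", "lang": lang})
        punct_item.text = match.group(2)
        # Place the punctuation word right after the word it came from.
        position = list(words_elem).index(word)
        words_elem.insert(position + 1, punct_word)
```

Punctuation at the start or in the middle of a token would need extra rules, which depend on the transcription conventions of the collection.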
2. Import stage
The importer script is written in Python (by my colleague
Aleksandr Riaposov, not by myself, so I won't go into as much
detail as for the preprocessing step). Any recent version of
Python 3 should be ok.
It should be possible to run the XSLT scripts from within Python
and thus avoid a separate preprocessing step for the user if
starting from (iv), but we haven't tried that yet.
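As a rough idea of what running the XSLT from Python could look like (untested on our side, as said above), the sketch below calls the Saxon command line through subprocess, using Saxon's standard -s:, -xsl: and -o: options. The jar path, the stylesheet names and the helper names are placeholders, and Java must be installed.

```python
import subprocess
from pathlib import Path


def saxon_command(saxon_jar, source, stylesheet, output):
    """Build the command line for one Saxon transformation."""
    return [
        "java", "-jar", str(saxon_jar),
        f"-s:{source}",        # input document
        f"-xsl:{stylesheet}",  # stylesheet to apply
        f"-o:{output}",        # where to write the result
    ]


def run_chain(saxon_jar, source, stylesheets, workdir, runner=subprocess.run):
    """Apply several stylesheets in sequence; each step reads the previous output.

    runner is injectable so the chain can be checked without Java installed.
    Returns the path of the final output file.
    """
    name = Path(source).name
    current = Path(source)
    for step, stylesheet in enumerate(stylesheets, start=1):
        output = Path(workdir) / f"step{step}_{name}"
        runner(saxon_command(saxon_jar, current, stylesheet, output), check=True)
        current = output
    return current
```

For a user starting from (iv), a call such as run_chain("saxon-he.jar", "text.flextext", ["tweak_flextext.xsl"], "work") would be enough; the file names here are invented.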
This is also not a single script but a collection of Python
scripts. They analyze and modify the *.fwdata file directly.
That is, one must naturally make a backup and close FLEx, then
preferably copy the *.fwdata file from the
ProgramData/SIL/Fieldworks/Projects folder elsewhere and run the
scripts.
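The copying step can be scripted too. This is a small sketch, not part of our scripts: it makes a timestamped backup plus a working copy in a separate folder, and leaves the original in the Projects folder untouched. The function name and the naming of the backup file are arbitrary, and FLEx must be closed before running it.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_working_copy(project_file, work_dir):
    """Copy a *.fwdata file into work_dir twice: a backup and a working copy.

    Returns (backup_path, working_path). The original file is not modified.
    """
    project_file = Path(project_file)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = work_dir / f"{project_file.stem}.{stamp}.backup{project_file.suffix}"
    working = work_dir / project_file.name

    shutil.copy2(project_file, backup)   # kept as is, to revert if needed
    shutil.copy2(project_file, working)  # the file the import scripts will modify
    return backup, working
```

This does not replace a regular FLEx backup of the project, which is still the safest way to revert.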
First, the scripts analyze the database to make lists of
existing objects of all kinds (morphs, wordforms, analyses,
texts, speakers, etc.).
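A much simplified illustration of this inventory step follows (the real scripts collect far more than this). It assumes the usual *.fwdata layout, where every object is a top-level rt element with class and guid attributes; the function name is invented for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def inventory(fwdata_source):
    """Group the guids of all objects in a *.fwdata file by their class.

    fwdata_source is a file path or a binary file object.
    Returns a dict mapping class name -> set of guids.
    """
    by_class = defaultdict(set)
    root = ET.parse(fwdata_source).getroot()
    for rt in root.findall("rt"):  # assumption: objects are direct children of the root
        by_class[rt.get("class")].add(rt.get("guid"))
    return by_class
```

For example, len(inventory("Demo.fwdata")["LexEntry"]) would give the number of lexical entries. Parsing the whole file at once needs a lot of memory for big projects, so an incremental parser may be a better choice there.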
Then, they proceed to add new texts to the *.fwdata file.
In every text, every wordform is checked against the existing
ones. If there is already a wordform with the same analysis,
just a link to the existing object is added. Otherwise, a new
wordform is created.
The same goes for other types of objects, e.g. morphs: each morph
is checked against the lexicon. If a morph already exists
with the same form (either main form or alternate form
(allomorph)), the same part of speech and the same glosses, then
just a link to it is added. Otherwise, a new lexical entry is
created. Homograph numbers are adjusted if necessary, but it is
better to reassign them after the import by launching the
corresponding utility.
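The "link or create" logic can be pictured with the following sketch. It is not the actual importer code: the class and method names are invented, and a real FLEx entry involves several linked objects (entry, sense, allomorphs, grammatical info) rather than one guid. It only shows the lookup by form, part of speech and glosses.

```python
import uuid


class MorphIndex:
    """Look up morphs by (form, part of speech, glosses); create missing ones."""

    def __init__(self):
        self._by_key = {}   # (form, pos, glosses) -> guid of the entry
        self.created = []   # guids of entries created during the import

    @staticmethod
    def _key(form, pos, glosses):
        return (form, pos, tuple(glosses))

    def register(self, guid, forms, pos, glosses):
        """Record an existing entry under its main form and all its allomorphs."""
        for form in forms:
            self._by_key.setdefault(self._key(form, pos, glosses), guid)

    def find_or_create(self, form, pos, glosses):
        """Return the guid to link to, creating a new entry if none matches."""
        key = self._key(form, pos, glosses)
        guid = self._by_key.get(key)
        if guid is None:
            guid = str(uuid.uuid4())  # stand-in for building a new lexical entry
            self._by_key[key] = guid
            self.created.append(guid)
        return guid
```

With two glossing lines, glosses would be a pair such as ("house", "дом"); with more lines the tuple simply gets longer, which is the adaptation mentioned under point 0.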
All the newly added objects are appended to the end of the
*.fwdata file.
When the scripts are done, the modified
*.fwdata should be copied in place of the old one.
Naturally, there is a risk that something will get broken, so
you must have a backup to revert to the stage before running the
scripts. So far, we haven't had any serious issues with the
modified project file *after* the scripts have completed
successfully (i.e. without stopping with an error message). Some
built-in FLEx utilities should probably be run afterwards, for
instance to merge duplicate wordforms and analyses and to
reassign homographs.
Please feel free to ask further questions.
We will try to make the scripts customizable (e.g. by providing
the writing system codes through a dialog or config file) and
add some documentation, and let you know when it's worth trying.
All best,
Alexandre
I am also very interested in this set of scripts. Particularly curious to see the mechanism you used to move the data back into FLEx.
I am not worried about the complexity or the need for customization - I have a software engineering background and would likely be retooling this for another use.
Are these scripts sharable? What programming language are they written in?
Thx!
On Monday, December 18, 2023 at 9:02:03 AM UTC-8 Alexandre Arkhipov wrote: