Displaying alternative analyses in text


Andy Black

Jul 2, 2009, 11:25:40 AM
to FieldWorks Language Explorer Discussion
(Since this message is not about foreign/borrowed words, I'm starting a
new thread.)

On 7/1/2009 3:44 PM, Alexandre Arkhipov wrote:
> This is to add a vote for this feature; I've just heard from my
> colleagues in St Petersburg that the two major problems restraining
> them from using FLEx are (1) no convenient way to display alternative
> analyses in text simultaneously before the homonymy is resolved, and
> (2) no convenient way to mark borrowings and their origin.

Alexandre:

On item (1) above, please tell us more about what your colleagues expect
to see or would like to see. What might they consider to be a
convenient way to see the alternatives?

Thanks,

--Andy

Alexandre Arkhipov

Jul 2, 2009, 1:00:29 PM
to flex...@googlegroups.com
Hi Andy,

I will forward your question to them, but I can tell you what they
already tried. They were entering all the alternative analyses for
homographs in one field separated by slashes -- producing something
like

can
be.able.Prs/container.Nom.Sg

so that they can have a printout of a very roughly analyzed text and
yet always have a correct analysis (although among some incorrect
ones) for each morpheme. But this means they have to rework most of
the dictionary as the text analysis progresses.

I guess this representation with slashes is more or less what they
would like as a display option.
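
As a rough illustration of that display option, here is a minimal Python sketch. It assumes the parser's alternatives for a wordform are already available as a plain list of gloss strings; the function name `slash_gloss` and the `analyses` dictionary are made up for this example and are not part of FLEx.

```python
def slash_gloss(alternatives):
    """Join the alternative glosses of one wordform with slashes.

    Order is preserved and exact duplicates are dropped.
    """
    unique = []
    for gloss in alternatives:
        if gloss not in unique:
            unique.append(gloss)
    return "/".join(unique)


# Hypothetical data: wordform -> glosses proposed for it.
analyses = {"can": ["be.able.Prs", "container.Nom.Sg"]}

for wordform, glosses in analyses.items():
    print(wordform)
    print(slash_gloss(glosses))
```

The point of keeping the alternatives in a list and joining them only for display is that the dictionary entries themselves would not need to be reworked as the analysis progresses.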

Best,
Alexandre


Doug Higby

Jul 3, 2009, 2:01:32 PM
to Alexandre Arkhipov
Hi Alexandre,

Thursday, July 2, 2009, 5:00:29 PM, you wrote:

Alexandre> I will forward your question to them, but I can tell you what they
Alexandre> already tried. They were entering all the alternative analyses for
Alexandre> homographs in one field separated by slashes -- producing something
Alexandre> like

Alexandre> can
Alexandre> be.able.Prs/container.Nom.Sg

Alexandre> so that they can have a printout of a very roughly analyzed text and
Alexandre> yet always have a correct analysis (although among some incorrect
Alexandre> ones) for each morpheme. But this means they have to rework most of
Alexandre> the dictionary as the text analysis progresses.

I'm not sure how practical a request this is when considering the screen
area. The current method is for people to hop through the analysis a word
at a time and select the correct sense from a drop down list. At times
there can be a dozen alternatives, and I'm not sure anyone would want all
twelve parading across the screen separated by slashes.

Or are they thinking in terms of a printed view?

--
Doug mailto:Doug_...@sil.org
Language Software Coordinator
SIL Africa Area


Oleg Belyaev

Jul 5, 2009, 11:52:41 AM
to FLEx list
Hello everyone,

I guess I'm one of the 'colleagues from St. Petersburg' that Alexandre
was talking about (btw thanks for bringing up our problem - I tend to
forget that this discussion group exists), even though I'm actually in
Moscow (but the other guy is from St. Petersburg) :)

In fact, I've already had a small exchange of letters with Susanna
about this feature (along with some other issues), and from what she
wrote I understood that such a feature is not something really
required for most users and should not end up really high on the list
of top priorities. But maybe my explanation was not clear enough. In
any case, since this has been brought up, I will describe what we
would like to see, exactly.

You are correct to say that we are thinking in terms of print view.
Essentially the idea here is that for each wordform the parser should
generate all the possible analyses (and not just one, preferred or
otherwise) and display all of them concurrently _in the text itself_ for
the wordform in question. I. e. in Sasha's example, there should be
two glosses for the word 'can' displayed simultaneously in the
'analysis'/print view.

The idea is that the FLEx parser should be able to automatically
generate glosses for each wordform in the text, at least one of which
should always be correct. Then, a human can optionally resolve this
homonymy (by selecting an analysis appropriate in context), but by
default there should be all the possible variants available.

The idea is to create a large corpus of glossed text without human
supervision or at least with a minimal amount thereof. So by default
you have a set of homonymous analyses for each wordform _in text_ (and
it's an important point that it should be in text - I'm aware of the
wordform analyses section of FLEx), and then the homonymy can be
resolved at a later stage (like in the Russian National Corpus for
example, only in a subsection of which the morphological homonymy is
resolved).

Of course it goes without saying that this should be optional.

I hope I'm clear enough, pardon me for my English if it's hard to
understand. If you have any questions, I'll be glad to answer them.

Best,

--Oleg.

Alexandre Arkhipov

Jul 5, 2009, 12:21:39 PM
to flex...@googlegroups.com
Hi Oleg and all,

I don't have the same task to handle as you do, so my observations would
probably be inappropriate.
Anyway, if such a feature is implemented, I think it should distinguish
between all the analyses generated by the parser and what is shown on the
screen/in print. For some morphemes, especially for short (and frequent)
ones, you can easily arrive at five, ten or a dozen homographs (this, of
course, depends on the language in question). Having them all displayed at
once for each occurrence would render the text unreadable. I'd then suggest
displaying two or three most probable analyses plus a sort of an ellipsis
(...) to indicate there are more of them.
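
A minimal Python sketch of that truncated display, assuming the alternatives are already ranked from most to least probable (how to rank them is a separate question). The function name `abbreviated_gloss` and the default limit of three are illustrative choices, not an existing FLEx option.

```python
def abbreviated_gloss(ranked_alternatives, limit=3):
    """Show at most `limit` alternatives, followed by "..." if more exist."""
    shown = list(ranked_alternatives[:limit])
    if len(ranked_alternatives) > limit:
        shown.append("...")
    return "/".join(shown)


# Hypothetical ranked glosses for one short, frequent morpheme.
print(abbreviated_gloss(["GEN", "INESS", "3SG", "ADJ", "PL"], limit=2))
print(abbreviated_gloss(["GEN", "INESS"]))
```

The full list stays available in the data; only the display is shortened.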

Best,
Alexandre

Andy Black

Jul 6, 2009, 11:35:46 AM
to flex...@googlegroups.com
On 7/5/2009 8:52 AM, Oleg Belyaev wrote:
> ... But maybe my explanation was not clear enough. In
> any case, since this has been brought up, I will describe what we
> would like to see, exactly.

Oleg, I think I understand what you are asking for. What I'm still
puzzled about is the motivation behind the request. Please help me
understand why you all want this capability in the printed output. What
do you plan to do with the printed result? Maybe have someone go
through and circle the correct analysis in context? If so, after they
do this, what is the plan?

Thanks,

--Andy

Oleg Belyaev

Jul 6, 2009, 1:25:55 PM
to FLEx list
Well, this applies not only to print view, but to analysis/gloss, too.
Basically for each ambiguous wordform in text there should be several
analyses at once, and these should also be exported into XML etc.

E. g. we have the wordform 'Iriston-i (Ossetia-GEN/INESS)', where -i
can mean either genitive or inessive. What would be desired is for the
parser to generate both analyses for this word at once and display
them both side by side by default (i.e. without human intervention).
There are of course more complicated cases, e. g. with different
possible morpheme boundaries etc.

So that you would better understand the motivation behind this, I will
just explain our specific case.

We have a (relatively) large collection of plain text Ossetic texts
(literary works, journals, newspapers etc) which we would like to make
a morphologically annotated corpus from. We could of course go the
standard way - input all of it into FLEx, run the parser, and then
browse through all the ambiguous cases (basically through all the
wordforms since you can never be sure), resolving the homonymy by
selecting the correct interpretations in context etc. But as I said
the corpus is fairly large, with several million tokens. This task is
simply not feasible with the resources we have - our team is only
three and of these three only two can actively participate in the
project at the moment.

So the idea is to have the corpus usable even with just automatic
processing by the parser. Right now you have to examine all the parser-
generated analyses in the texts to have reliable data. On the other
hand, unresolved ambiguity is certainly not very pretty, but it is
more useful than having the parser arbitrarily select one analysis and
have it there like this until it is explicitly approved (or changed)
by the user.

Even larger projects have the same problem. I am sure you are aware of
the fact that even for 'industry-level' corpora providing several
analyses for a wordform if the homonymy has not been resolved yet is a
standard practice. I have already given the example of the Russian
National Corpus; another example could be the Eastern Armenian corpus
(www.eanc.net), the developers of which have by far more resources
than we do. You can see how they show several analyses for each
ambiguous wordform in the popup window, that's close to what we are
aiming for.

What we want in the end is to export the texts into XML for further
processing (since FLEx unfortunately lacks some important corpus-
specific features at the moment, e. g. search could see improvement),
but to do this we want to have the XML contain the data for all the
analyses available for each wordform.
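
To make the kind of output we have in mind more concrete, here is a small Python sketch. The element and attribute names (`word`, `analysis`, `form`, `status`) are invented for this illustration and are not FLEx's actual export format; the glosses are taken from the 'Iriston-i' example above.

```python
import xml.etree.ElementTree as ET


def word_to_xml(form, analyses, approved=None):
    """Build one <word> element that carries every candidate analysis.

    `approved` is the index of the human-approved analysis, or None
    if the homonymy has not been resolved yet.
    """
    word = ET.Element("word", form=form)
    for index, gloss in enumerate(analyses):
        status = "approved" if approved == index else "guess"
        node = ET.SubElement(word, "analysis", status=status)
        node.text = gloss
    return word


# Unresolved: both readings of -i are kept in the export.
word = word_to_xml("Iriston-i", ["Ossetia-GEN", "Ossetia-INESS"])
print(ET.tostring(word, encoding="unicode"))
```

A later processing step could then filter on the `status` attribute, so resolved and unresolved wordforms can live in the same file.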

I hope it is more or less clear now.

I can understand your puzzlement though: creating such a large corpus
is probably not what FLEx is originally geared towards, and that's why
I'm a bit hesitant to bother you with our problems. But I believe FLEx
can be really useful for this task and especially for creating corpora
of minority languages like Ossetic. With some improvement to the
search function and adding some statistical capabilities, it can be
used very efficiently as a corpus, and using it as such really saves
time and resources (compared to e. g. writing your own parser for the
language in question or fine-tuning/training an existing one, to name
but one of the possible complications).

Thanks,

--Oleg

Beth

Jul 6, 2009, 1:25:50 PM
to flex...@googlegroups.com
On Jul 5, 2009, at 9:21 AM, Alexandre Arkhipov wrote:

> Having them all displayed at once for each occurrence would render the
> text unreadable. I'd then suggest displaying two or three most probable
> analyses plus a sort of an ellipsis (...) to indicate there are more of
> them.

This is an interesting idea.

One thing I have asked for is that the display would make some visual
distinction between words that have only one parse and those that
have more than one. That way, when I go back to the text to work on
ambiguities, I can quickly see which ones are the ambiguous ones.
The unambiguous guesses also need to be confirmed, but I would
separate that process from checking the ambiguous ones.

I hadn't been asking to have more than one analysis displayed; just
make it visually clear when a word has more than one.
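
Just to illustrate the idea in plain text (the real display would presumably use some visual cue rather than a character), here is a tiny Python sketch. The function name `mark_ambiguous`, the asterisk marker and the sample parses are all assumptions made for this example.

```python
def mark_ambiguous(words, parses, marker="*"):
    """Append `marker` to every word that has more than one parse."""
    marked = []
    for word in words:
        if len(parses.get(word, [])) > 1:
            marked.append(word + marker)
        else:
            marked.append(word)
    return " ".join(marked)


# Hypothetical parses: only "can" is ambiguous here.
parses = {"I": ["1SG"], "can": ["be.able.Prs", "container.Nom.Sg"]}
print(mark_ambiguous(["I", "can"], parses))
```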

For anyone who has ever worked with the CARLAstudio program, or any
of the tools behind it (AMPLE, etc.), there is a "manual
disambiguator" that takes the output of AMPLE and displays it as
text, but with all the ambiguities showing. The user then moves
through the text, and for each ambiguous one, chooses one of the
alternatives. The program then edits the underlying (SFM) database
so that now it has only that one choice. It might be illustrative to
see what that display is like. I think that one shows a max of 5
ambiguities, but makes it possible to get to all of them.

That tool is intended as a tool for choosing among the ambiguities
and recording those choices in the database, which is then used for
other things. It is about the data, not about printing, so it seems
a little different from what you guys are wanting.

-Beth


Andy Black

Jul 6, 2009, 2:06:15 PM
to flex...@googlegroups.com
Thanks, Oleg.  That does clear things up for me.

--Andy

John_T...@sil.org

Jul 6, 2009, 3:50:20 PM
to flex...@googlegroups.com

> ...For some morphemes, especially for short (and frequent)
> ones, you can easily arrive at five or ten or dozen homographs (this, of
> course, depends on the language in question). Having them all displayed at
> once for each occurrence would render the text unreadable. I'd then suggest
> displaying two or three most probable analyses plus a sort of an ellipsis
> (...) to indicate there are more of them.
>
"Unreadable" was my first reaction, too. It's easy to make a small, simple demonstration of this idea involving only a couple of lines of annotation and a few ambiguities. But it's easy to make a grammar that will generate thousands of analyses for common words. Your idea of showing the "most probable" ones sounds promising, if we can pin down what it means.

So far, I think, the examples have only shown multiple glosses for a complete word. Is that all you need, or is it also important to be able to show this ambiguity in more complex interlinear displays with morphology showing? If we need to do it for morphology, we need to consider how to display alternative analyses that involve a different way of breaking the word into morphemes, and possibly how we deal with different amounts of ambiguity at different levels. For example, "banks" has at least two main analyses (verb + 3ps, noun + plural), so there are two alternatives to show for each morpheme on the POS line. But there are many more for the morpheme gloss line (side of river, place to put money, row [of switches], tilt aircraft, deposit money, cover[fire], etc.) and at least as many for the word gloss line--perhaps more, if we are deriving a word gloss from lexeme glosses and don't have enough grammar to rule out ones like *tilt aircraft.plural.

Also, we'd have to decide what we mean by "most probable". For example, if some text has been fully hand-annotated, we could start by picking the analyses that have most frequently been confirmed by the human annotator. In fact, we could give the tool a mode where we show the most popular five human-confirmed analyses of a word (if there are that many), and allow the user to jump quickly to words that have NO human-confirmed analyses, in hopes that we can get at least one such analysis for every word in the text. Maybe we could even provide a way for the human to indicate that the first occurrence was an unusual analysis, and request a concordance of other occurrences so as to make sure the common one doesn't get missed. But even that much hand-annotation might be too much work for the available annotators.

If we have to come up with "most-probable" analyses for words that have never been human-confirmed, about all I can see to do is to pick the most popular human-confirmed meaning of each morpheme (or failing that, the first five senses in the lexical entries for each one...or maybe a higher priority should go to the first sense of each grammatically-possible homograph?).

Of course we'd have to somehow combine these algorithms if some analyses and senses have human approvals and others do not. And somehow prioritize different ways of breaking the word into morphemes if that is possible.
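
One simple way to combine the two cases can be sketched in Python as follows. This is only an illustration under stated assumptions: `confirmed_counts` is a hypothetical table of how often a human annotator has confirmed each analysis, the candidates are assumed to arrive in lexical-entry order, and nothing here addresses how to prioritize different morpheme breaks.

```python
def most_probable(candidates, confirmed_counts, limit=5):
    """Pick up to `limit` analyses for display.

    Analyses confirmed most often by a human come first; analyses that
    were never confirmed keep their original (lexical-entry) order.
    """
    indexed = list(enumerate(candidates))
    ranked = sorted(
        indexed,
        key=lambda pair: (-confirmed_counts.get(pair[1], 0), pair[0]),
    )
    return [analysis for _, analysis in ranked[:limit]]


# Hypothetical counts: only one reading of "banks" has ever been confirmed.
candidates = ["bank.PL", "bank.PRS.3SG", "tilt.aircraft.PRS.3SG"]
confirmed = {"bank.PRS.3SG": 4}
print(most_probable(candidates, confirmed, limit=2))
```

Words for which every candidate has a count of zero are exactly the ones with no human-confirmed analysis, so the same table could drive the "jump to unconfirmed words" mode.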

This is not to say something can't be done, just that it has some non-trivial details to work out.

John Thomson

Oleg Belyaev

Jul 6, 2009, 3:57:15 PM
to FLEx list
Hi Beth,

No, in fact data is what we are wanting, printing is of secondary
importance. I was talking of print view as opposed to e. g. 'word
analyses' section of 'words and texts' tab (since in the beginning
there was some misunderstanding about what we wanted, exactly). Sorry,
I should have stressed this a bit more.

Thanks a lot for the suggestion about CARLAstudio, I didn't know it
had this feature of having multiple analyses. From your description it
is pretty close to what we are looking for. Is 5 an internal limit on
data or just a display limitation in that tool? Do you think that
maybe CARLA would be more useful for our goals? Although it appears
that its primary aim is rather different (and it looks like it's no
longer being developed), and FLEx just seems to have more potential in
the long run and has lots of other nifty features that really make
life easier. Besides, migrating to another tool at this stage would be
a pain.

In any case I believe FLEx could also profit from the feature we are
suggesting, especially since gloss ambiguity should be managed somehow
in any case.

Thanks again to everyone!

-- Oleg

Oleg Belyaev

Jul 6, 2009, 4:30:22 PM
to FLEx list
Hello John,

Thanks for the really in-depth reply! Sorry, didn't see it when I was
writing mine.

Morphology is important for us, yes, and I understand your concern. As
for the word 'banks' for example, I'm thinking more in terms of
cross-categorial and grammatical homonymy. That is, only the two main
analyses should be displayed - bank.PRS.3SG and bank.PL. All the
different nominal and verbal senses should probably not be displayed,
so as not to clutter everything too much (those should then be manually
selected from the drop-down menu, as it stands now). Although maybe
ideally, all of this should be customizable to some extent. Perhaps
the homonymy between lexemes/roots of the same category could
be displayed in some different way?

As for the most probable ones: frequency works well enough for
selecting the more probable interpretations of specific morphemes, I
guess (and a pretty large amount of manual glossing/analysis is
required in any case - the idea is that basically from a certain point
the process can be made automatic), but I can't currently think of how
exactly it can be tied to different ways of breaking up the word into
morphemes. This especially applies to wordforms that have never been
encountered before. To be sure, in the language we are studying this
does not appear to be that much of a problem, at least not on the
scale of dozens, but I can of course easily see how this can get out of
control (especially with things like derivational affixes and
compounds).

There should maybe be some kind of rule that the 'simpler' analyses
come first, e. g. those with the fewest morphemes in them.
Also, inflectional morphemes come before derivational. Compounds go in
the very end. This should rule out some if not most of the more
questionable stuff. Exceptions can then be specified ad-hoc for
certain lexemes.
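
A rough Python sketch of such an ordering rule follows. The `Analysis` record and its fields are hypothetical, and the sketch reads "inflectional before derivational" as "among analyses with the same number of morphemes, those with fewer derivational affixes come first", which is only one possible interpretation. Ad-hoc exceptions for particular lexemes are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    gloss: str
    morphemes: int         # number of morphemes in this break
    derivational: int = 0  # how many of them are derivational affixes
    compound: bool = False


def simplicity_key(analysis):
    # Non-compounds sort before compounds, then fewer morphemes,
    # then fewer derivational affixes.
    return (analysis.compound, analysis.morphemes, analysis.derivational)


def order_by_simplicity(analyses):
    """Return the analyses with the 'simpler' ones first."""
    return sorted(analyses, key=simplicity_key)


# Hypothetical candidate analyses for one wordform.
candidates = [
    Analysis("stem+stem", 2, compound=True),
    Analysis("stem-NMLZ", 2, derivational=1),
    Analysis("stem-PL", 2),
    Analysis("stem", 1),
]
for analysis in order_by_simplicity(candidates):
    print(analysis.gloss)
```

Because Python's sort is stable, analyses that tie on all three criteria keep the order in which the parser produced them.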

This is certainly something to be more thoroughly discussed.

Best,

-- Oleg.

Beth

Jul 7, 2009, 1:42:15 AM
to flex...@googlegroups.com
On Jul 6, 2009, at 10:25 AM, Oleg Belyaev wrote:

> Well, this applies not only to print view, but to analysis/gloss, too.
> Basically for each ambiguous wordform in text there should be several
> analyses at once, and these should also be exported into XML etc.

Actually, it sounds to me like you're most concerned with exporting
analyses whether they have been approved or not, is that right? It's
not about printing a document that shows them, and it doesn't even
seem to me that having them displayed in the Interlinear view is
necessarily crucial to your task. Is that correct? You just want
them accessible for a post-process that would apply to what you
export. I would think that exporting all guessed analyses would be
easier than finding a good way to display them.

As to whether the CARLA tools would be useful for this, it's true
they were designed in order to transfer texts between two languages,
but in order to get to that step, you have to be able to parse each
language. Many people have set the tools up to parse one language,
and then simply stopped there, because that gave them a lot of value,
in terms of helping them understand the language, getting better
spell-checking than one can from a list, and creating interlinear text.

If you were to use the CARLA tools to parse your texts, the result
would be an SFM database that contained all the possible parses for
each word, no limit on the number of parses. (The (separate) manual
disambiguator tool only takes that database and displays it in a
human-friendly way.) You could then convert that SFM database into
an XML format using some text-transforming tool. I don't know
whether the CARLA tools would or would not be a better solution than
FLEx--FLEx is certainly a very good way to manage your lexical
database. There is also a (not-so-trivial) way to export a FLEx
dictionary so it can be used with CARLAstudio parsing. I think there
are documents about this that ship with FLEx.

But I do agree that it would be good if FLEx could have an export
option that would export all possible guesses.

-Beth


Alexandre Arkhipov

Jul 8, 2009, 3:53:28 PM
to flex...@googlegroups.com
Finally, it looks like it's good to have just some marker for each word that
does have alternative analyses, and only have all the analyses themselves on
exporting (or when going to a special tab like e.g. "Assign analysis").

Oleg, once we're talking about exporting, and especially since you've got
such a large corpus, I guess you'll need an "Export all texts" command, too?
I still don't know if it is on the to-do list (or even already implemented).

Alexandre




Jeff and Peg Shrum

Jul 8, 2009, 4:19:07 PM
to flex...@googlegroups.com
Reading Alexandre's comments below sparked some ideas in my head that I will
throw out for everyone's consideration. I like the idea of having parsed
words that have other possible valid analyses marked in some obvious way. I
am sorry to say that when different colors are used to indicate different
states, I find it hard to remember what they all mean, and we already have 3
or 4 colors in use. So I would recommend not using a color change to
indicate a word with multiple valid parses. I would prefer an interface that
is more iconic. Does anyone remember the old HyperCard program from the
Macintosh? What if each entry that had more than one valid parse appeared as
a stack of index cards? And what if the user could shuffle through the
"stack" by clicking on a specific spot? To save space, I don't think the
apparent "size" of the stack would have to reflect the actual number of
valid analyses. It would be enough just to give the user the visual clue
that there is a stack there, and then he or she could choose to look at it
or not.

Jeff Shrum

Oleg Belyaev

Jul 10, 2009, 1:44:47 PM
to FLEx list
Yes, I saw your post on exporting all texts and I have to say that I
wholly support it. We already have many interlinearized texts and
exporting them all one by one is a pain. Also, exporting metadata
should be more explicit: either all the metadata should be exported,
or there should be a dialog where one could specify which fields
exactly are to be included in the XML.

Also, I agree that the multiple analyses feature is OK to be present
only on export or on a special 'assign analyses' view and not
everywhere. On the other hand, I really liked Jeff's idea about stacks
of cards, I think such a representation would be great: easy to
understand and not cluttering the display at the same time. I don't
know the HyperCard program, but this idea somehow reminds me of how
Gmail handles long threads with many e-mails: the older mails are
displayed as a stack (with minimal info about each) until you click
there, after which all of them expand.