[FLEx] to choose or not to choose the hermit crab parser?


Jon C

Apr 22, 2010, 5:20:55 PM
to flex...@googlegroups.com

The FLEx help file for "Parsing words (Hermit Crab)" reads:
"This parser handles only one word at a time, so the Parse result field
will not show "Successful" for phrases, such as idioms."

Am I understanding correctly that the default parser (XAmple) does
handle phrases, so for the time being this is the main trade-off between
the two parsers: handling phonology vs. handling phrases?

Will Hermit Crab be improved to handle phrases too, eventually replacing
XAmple as the default parser?

-Jon


--
You received this message because you are subscribed to the discussion group "FLEx list". This group is hosted by Google Groups and is open for anyone to browse.
To post to this group, send email to flex...@googlegroups.com
To unsubscribe from this group, send email to flex-list-...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/flex-list

Andy Black

Apr 22, 2010, 5:13:28 PM
to flex...@googlegroups.com
On 4/22/2010 2:20 PM, Jon C wrote:
>
> The FLEx help file for "Parsing words (Hermit Crab)" reads:
> "This parser handles only one word at a time, so the Parse result field
> will not show "Successful" for phrases, such as idioms."

At least with version 3.0.2, both parsers will handle phrases as long as
any affixation occurs at the edges (i.e. at the beginning of the first
word and/or at the end of the last word). I think this statement in the
Helps is not completely correct.

>
> Am I understanding correctly that the default parser (XAmple) does
> handle phrases, so for the time being this is the main trade-off between
> the two parsers: handling phonology vs. handling phrases?

No, the main trade-off is between using phonological rules versus using
allomorphs conditioned by environments. See appendix B.3 in the "Intro
to Parsing" document for the known limitations of the Hermit Crab parser.

>
> Will Hermit Crab be improved to handle phrases too, eventually replacing
> XAmple as the default parser?

We have not yet decided whether the Hermit Crab parser will eventually
become the default parser or not.

Since Jon is asking about this kind of thing (and since his subject line
is about it), if you have tried the Hermit Crab parser, how did it work
for you? Please let us know.

Thanks,

--Andy

Craig Farrow

Apr 23, 2010, 12:37:14 AM
to flex...@googlegroups.com
Hi Andy & co,

I discovered the Hermit Crab parser yesterday when I was browsing the
"Intro to Parsing" document (I'm not sure how I missed hearing about it)
and I was very pleased to start working on phonological rules to account
for inter-morpheme processes like vowel-final deletion on stems. I got a
rule going quickly to handle that.

What I'd like to hear more about (like Jon's questions) is what do we
lose by moving to the Hermit Crab parser? The "Intro" doc says, "This
means that one is supposed to be able to move from an item and
arrangement description to an item and process description as one
determines what these processes are. " Does this mean that we can
gradually transition to the phonological-based rules as we discover/have
time/etc. while the old allomorph-based functions continue to work? I.e.
is the Hermit Crab essentially a super-set of the XAmple one (aside from
the limitations documented in "Intro to Parsing")?

Also, the note about giving it a try on a copy of your project (or test
project) makes me wonder what the dangers are of using the Hermit Crab.
Can I go back to using the XAmple one later (and it ignores the
phonological rule stuff), or am I walking a one-way bridge? I'm not
expecting the danger to be in losing data, but what are my risks?

I'm excited about the potential of the new parser and would like to
understand these issues more before I go too much further.

Thanks Andy & Mike (and everyone else involved)!

Craig.




Andy Black

Apr 23, 2010, 5:23:25 PM
to flex...@googlegroups.com
On 4/22/2010 9:37 PM, Craig Farrow wrote:
> ...
> What I'd like to hear more about (like Jon's questions) is what do we
> lose by moving to the Hermit Crab parser? The "Intro" doc says, "This
> means that one is supposed to be able to move from an item and
> arrangement description to an item and process description as one
> determines what these processes are. " Does this mean that we can
> gradually transition to the phonological-based rules as we
> discover/have time/etc. while the old allomorph-based functions
> continue to work? I.e. is the Hermit Crab essentially a super-set of
> the XAmple one (aside from the limitations documented in "Intro to
> Parsing")?

First, in theory the Hermit Crab parser is supposed to be able to parse
words just like the default parser does (i.e. using item and arrangement).

Second, therefore the hope is that one can indeed move from an item and
arrangement to phonological rule-based parsing piecemeal via Hermit
Crab. For example, one could potentially add phonological rules as one
discovers them, leaving the other entries in an item and arrangement
approach until you figure out the phonological rules for them, too.
Please note, though, that we do not have much experience in actually
doing this.

Perhaps you've noted that we've called Hermit Crab an "experimental"
parser. You may also have noted my hedge words above ("in theory" and
"hope"). The reason is that Hermit Crab does not have the same track
record of success that the default parser has. It is "new," if you
will. Unlike the AMPLE parser that the default parser is based on,
Hermit Crab has not been successfully applied to many different
languages from many different language families; and that with thousands
of word forms being thrown at it. To our knowledge, the original Hermit
Crab parser had only been applied to test data. Our slight revision of
Hermit Crab that we include in FLEx, of course, is even newer. We did
try to "kick it around," but we realize that it is nowhere near as
robust as the default parser. We like its potential and so have chosen
to make it available. Having said all this, I suspect that you'll
understand that we will not be surprised if you try it and discover a bug
or two.


>
> Also, the note about giving it a try on a copy of your project (or
> test project) makes me wonder what the dangers are of using the Hermit
> Crab. Can I go back to using the XAmple one later (and it ignores the
> phonological rule stuff), or am I walking a one-way bridge? I'm not
> expecting the danger to be in losing data, but what are my risks?

We suggest you try on a copy of your project (or even a subset since
that might be more efficient) because we do not know how robust it is.
It may work great. It may not. It may work on some forms, but have
problems with others. As I said, unlike the default parser which has a
history of success, Hermit Crab is new. If your experiments with it
show great results, then you should be able to make the same changes to
the phonemes and phonological rules and affixes in your "real" project
and it should continue to work for you.

As to switching back and forth between XAmple and Hermit Crab in the
same project, in theory you should be able to do this with one major
caveat: any affix forms that are "affix processes" for Hermit Crab will
have to be reverted (back) to allomorphs for XAmple. This, of course,
is because XAmple does not have processes like Hermit Crab does. Note
that I say "in theory." I do not know of anyone who has tried this.
There may be some "gotchas" that I'm not aware of. Hence, our
suggestion that you experiment with Hermit Crab in a separate project first.

>
> I'm excited about the potential of the new parser and would like to
> understand these issues more before I go too much further.
>
> Thanks Andy & Mike (and everyone else involved)!

One of the key players in producing the FLEx version of Hermit Crab is
Damien Daspit. He did the lion's share of the work (to put it mildly).

Andy Black

Apr 23, 2010, 5:36:43 PM
to flex...@googlegroups.com

On 4/23/2010 2:23 PM, Andy Black wrote:
> As to switching back and forth between XAmple and Hermit Crab in the
> same project, in theory you should be able to do this with one major
> caveat: any affix forms that are "affix processes" for Hermit Crab
> will have to be reverted (back) to allomorphs for XAmple. This, of
> course, is because XAmple does not have processes like Hermit Crab
> does. Note that I say "in theory." I do not know of anyone who has
> tried this. There may be some "gotchas" that I'm not aware of.
> Hence, our suggestion that you experiment with Hermit Crab in a
> separate project first.

Another issue I just thought of is that XAmple only uses Natural Classes
defined in terms of phonemes. Hermit Crab uses these and also Natural
Classes defined in terms of phonological features. So if in developing
Hermit Crab you created some Natural Classes defined in terms of
phonological features and then used those Natural Classes in
environments, you'll need to fix these, too.

I may think of some other issue after I send this...

Jon C

Apr 27, 2010, 3:00:23 PM
to flex...@googlegroups.com
Thanks, Andy. I'm really glad to hear that Hermit Crab can handle phrases.


> the main trade-off is between using phonological rules versus using allomorphs conditioned by environments. See appendix B.3 in the "Intro to Parsing" document for the known limitations of the Hermit Crab parser.

That's helpful. I'm appending that content below for those of us who are using older versions of FW or just want quick access. Section B.3.2 says you can "add extra allomorphs", so apparently Hermit Crab does make use of allomorphs (but not their environments?).

If the two parsers are going to coexist for quite some time, it would be helpful to document a concrete example, explaining how each parser would be set up to handle it.

In the language I'm working on, we have hardly any inflection (just n- m- and p- prefixes), and most derivations are phonologically straightforward. The worst we've got is this:

n- aN- ala --> nangala (n- is realis / non-future)
n- aN- kondi --> nangkondi
n- aN- boli --> namboli
n- aN- wei --> nambei
n- aN- silo --> nancilo

p- aN- ala --> pangala (p- is imperative)
p- aN- kondi --> pangkondi
...

p--a aN- kondi --> pangkondia (p--a is a nominalizing circumfix)

p--a aN- kondi -a --> pangkondiaa (this double -a isn't frequent at all; -a is a nominalizing suffix)

nu= kou --> ngkou (nu= is a proclitic usually pronounced N=; it's sort of a preposition meaning 'of')
nu= bengi --> mbengi
nu= watu --> nu watu
nu= sou --> ncou

(As you can see, our naN- prefix behaves quite similarly to the meN- prefix you mention in B.3.3. Our language is distantly related to Bahasa Indonesia.)
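
For readers who want to check the pattern mechanically, here is a minimal Python sketch of the alternations listed above. It is only an illustration, not a FLEx, XAmple, or Hermit Crab configuration. The function names (`fuse`, `with_aN`, `with_nu`) and the lookup tables are hypothetical, and they cover only the root-initial segments that actually appear in the data above; anything else raises an error rather than guessing.

```python
# Hypothetical sketch: reproduces only the forms quoted in this message.
# Root-initial segments that fuse the same way after aN- and after nu= (N=).
COMMON = {"k": "ngk", "b": "mb", "s": "nc"}

# Patterns attested only after aN- in the data above.
AN_ONLY = {"w": "mb", "a": "nga"}


def fuse(root, table):
    """Replace the root-initial segment with its nasal + segment outcome."""
    first = root[0]
    if first not in table:
        raise ValueError(f"no attested pattern for root-initial {first!r}")
    return table[first] + root[1:]


def with_aN(outer, root):
    """outer prefix (n- or p-) + aN- + root, e.g. n- aN- silo."""
    return outer + "a" + fuse(root, {**COMMON, **AN_ONLY})


def with_nu(root):
    """Proclitic nu= + root; before w the data shows the full form nu."""
    if root[0] == "w":
        return "nu " + root
    return fuse(root, COMMON)


for outer, root in [("n", "ala"), ("n", "wei"), ("n", "silo"), ("p", "kondi")]:
    print(outer + "- aN-", root, "-->", with_aN(outer, root))
for root in ["kou", "bengi", "watu", "sou"]:
    print("nu=", root, "-->", with_nu(root))
```

Note that root-initial w comes out differently after aN- (wei) and after nu= (watu), so even this small sketch has to know which morpheme it is attaching, not just the neighbouring sounds.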

We also get strings of pronominal and aspectual enclitics, usually on the verb. They occur in a specific order and only one from each set can occur at a time, so we may tell FLEx that these are suffixes, in order to be able to use an affix template on them in FLEx. (Hopefully those templates can be made to support clitics soon.)

Our teammate's language is basically the same, but its enclitics misbehave a bit:
n- aN- koni =ra =da =pa --> nangkoniradapa
n- aN- koni =da =pa =i --> nangkonidipi

I'd like to handle both languages with the same parser. Overall, which parser do you think would handle this data better?

Thanks,
Jon

P.S. to Marlon. I believe that FLEx's "affix templates" are synonymous with what we called "position class diagrams" in school. It might be good if looking for either term found the right Help pages, especially since this feature is rather buried.

B.3 Known limitations

There are several known limitations of the current implementation of the new experimental phonological rule-based parser.

B.3.1 Seeing what steps the parser takes not implemented yet

The default parser for FieldWorks Language Explorer has a way for you to see what steps the parser took while parsing a word (see the Try a Word tool). This has not been implemented yet for the new experimental phonological rule-based parser. Our apologies for the fact that this new parser is thus a “black box” with no way to see inside. We simply ran out of time to do all that we would have liked to do in this area. A later version of FieldWorks Language Explorer will include this capability.

B.3.2 Affixes are tried only once per word

While the default FieldWorks Language Explorer parser will try a given affix as many times as its form is found within a single word, the new experimental phonological rule-based parser tries a given form (or affix process) only once per word. This is normally not an issue since it is quite rare for an affix to be repeated several times within a word. There are cases, however, where this is an issue. For example, Coward & Coward (2000) note that in Selaru, “It is possible to reduplicate /nini/, /soso/ and others basically without limit. As many as eight reduplication levels have been encountered in natural text.”

If you run into this limitation, a possible work-around is to add extra allomorphs for the affix involved or to add the form as a distinct lexical entry. You also, of course, have the option of just allowing the new parser to fail to parse such words.

B.3.3 Natural classes defined by segments may or may not work as expected

When you define a natural class by listing the segments (as opposed to using phonological features), the new experimental phonological rule-based parser may not treat this natural class exactly as you expect. If you do not have any phonological features defined, then the new parser will treat the class as consisting solely of the segments listed in the class.

If, on the other hand, you have defined phonological features, then the new experimental phonological rule-based parser converts all the segments listed in the natural class into their respective feature sets. It then takes the set intersection of all those features and uses that to determine if a given segment is in that natural class. Normally, this is not an issue. In one case, however, when I was trying to deal with the recalcitrant case of the meN- prefix in Bahasa Indonesia (see section B.1.2.1.4) where a following p, t, k, or s, deletes, I knew that I was not aware of a real natural class that would cover these segments and not also include the other voiceless obstruents that do not delete. So I tried to by-pass this by creating a segment-based natural class that just included these four segments. Since I had also defined phonological features, this approach did not work for me. I had to create a special phonological feature whose value was + for these four segments and - for all other segments.
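
The intersection behaviour described in B.3.3 can be illustrated with a small Python sketch. This is not the parser's actual code. The feature names and values, the seven-segment inventory, and the helper names (`feats`, `class_features`, `members`) are assumptions made for illustration, and the membership test (a segment belongs to the class if it carries every feature in the intersection) is one reading of the description above.

```python
from functools import reduce


def feats(**values):
    """Build a feature set such as {('voice', '-'), ('son', '-'), ...}."""
    return frozenset(values.items())


# Hypothetical inventory with hypothetical feature values.
INVENTORY = {
    "p": feats(voice="-", son="-", cont="-"),
    "t": feats(voice="-", son="-", cont="-"),
    "k": feats(voice="-", son="-", cont="-"),
    "s": feats(voice="-", son="-", cont="+"),
    "c": feats(voice="-", son="-", cont="-"),  # a voiceless obstruent that should NOT be in the class
    "b": feats(voice="+", son="-", cont="-"),
    "m": feats(voice="+", son="+", cont="-"),
}


def class_features(segments, inventory):
    """Intersect the feature sets of the listed segments."""
    return reduce(frozenset.intersection, (inventory[s] for s in segments))


def members(shared, inventory):
    """Every segment whose features include all the shared features."""
    return sorted(s for s, f in inventory.items() if shared <= f)


targets = ["p", "t", "k", "s"]

# Segment-based class: the intersection is just [-voice, -son], so "c" slips in.
print(members(class_features(targets, INVENTORY), INVENTORY))

# Work-around from the text: a special feature that is + only for the targets.
marked = {
    s: f | {("deletes", "+" if s in targets else "-")}
    for s, f in INVENTORY.items()
}
print(members(class_features(targets, marked), marked))
```

With these assumed values, the first call returns the four listed segments plus "c", and the second returns only the four listed segments, which mirrors the need for a special feature described above.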

B.3.4 Affix allomorphs conditioned by features not implemented yet

The kind of allomorphy described in section 3.8 is currently not handled by the new experimental phonological rule-based parser.


Endnotes

[2]

This new parser is an enhanced and updated version of Mike Maxwell's Hermit Crab parser. See http://www.sil.org/computing/HermitCrab/. We are deeply indebted to Mike for his pioneering work on this parser.



Ron Lockwood

Apr 28, 2010, 4:39:05 PM
to flex...@googlegroups.com
I got a small test with Hermit Crab working. I found it a bit tricky to
get the features all correct, both in the phonemes and in the rule I wrote,
and I'm wondering how it is going to work with my practical orthography,
which is not a nice 1-1 phonemic orthography.

Ron

Kevin Warfel

Apr 29, 2010, 12:09:44 PM
to flex...@googlegroups.com

I am using the Hermit Crab parser on Phuien (Puguli), a Gur language from Burkina Faso.  In this language, there are also a few orthographic conventions that deviate from the phonological reality.  My experience to this point has been that, if I write my rules to reflect the orthography instead of the phonology, they work.

 

Here's a simple (I think) example.  The past tense suffix, which is a single vowel, when appended to a verb root, will assimilate to the features of the "rightmost" root vowel in a number of ways.  One of them is nasality, so the suffix vowel becomes nasalized when the root vowel is nasalized.  There is an orthographic convention, however, that nasalization (marked by a tilde over the vowel in the orthography) only needs to be indicated on the first vowel of a sequence, this being unambiguous in the language since Phuien does not have vowel sequences that are of mixed nasality (oral+nasal or nasal+oral), but only oral+oral or nasal+nasal.

 

Thus: [image: nasal assimilation example.jpg]

 

So, I had to make sure that any nasal-assimilation rule that I had for vowels would *not* apply in this particular case, contrary to the phonological reality.

 

Kevin Warfel

image002.jpg

Andy Black

Apr 30, 2010, 5:17:48 PM
to flex...@googlegroups.com
On 4/27/2010 12:00 PM, Jon C wrote:
> Thanks, Andy. I'm really glad to hear that Hermit Crab can handle phrases.
>
> > the main trade-off is between using phonological rules versus using allomorphs conditioned by environments. See appendix B.3 in the "Intro to Parsing" document for the known limitations of the Hermit Crab parser.
>
> That's helpful. I'm appending that content below for those of us who are using older versions of FW or just want quick access. Section B.3.2 says you can "add extra allomorphs", so apparently Hermit Crab does make use of allomorphs (but not their environments?).

It is supposed to make use of both allomorphs and environments.  Currently, though, we have a couple of reports where it appears some allomorphs do not work as expected.  Damien is slated to look into these when he can.



> If the two parsers are going to coexist for quite some time, it would be helpful to document a concrete example, explaining how each parser would be set up to handle it.

Good point.  Now just for lots of round TUITs to be able to do this as well as everything else.  ;-)



> In the language I'm working on, we have hardly any inflection (just n- m- and p- prefixes), and most derivations are phonologically straightforward. ......
>
> Our teammate's language is basically the same, but its enclitics misbehave a bit:
>
> I'd like to handle both languages with the same parser. Overall, which parser do you think would handle this data better?

Looking quickly at the data (which I removed from this copy of the message), I'd say you'd be better off with XAmple, since at least some of it appears to depend not purely on phonology but on the actual identification of particular morphemes.

--Andy

Jon C

May 5, 2010, 8:42:01 PM
to flex...@googlegroups.com
Thanks, Andy. I suspected that XAmple would be good enough but was tempted by the thought that Hermit Crab might simplify the setup. (And on the one hand perhaps it would simplify things to have a rule for replacing root-initial s with c after aN-, rather than having an allomorph for each of those roots.) But really, apart from that we don't have a ton of phonology going on, and some of what we do have is ironed out by the orthography.

Blessings,
Jon