Language identification in LIFT


Jeff Good

Feb 6, 2011, 4:03:39 PM
to lexiconinter...@googlegroups.com
Hello LIFT community,

In doing work on the LEGO project (http://linguistlist.org/projects/lego1.cfm), the research team has encountered some issues with language identification in LIFT that we wanted to take to the LIFT developers. The problems can be roughly classified as:

1. Using coding systems not included in RFC 4646

2. Associating multiple codes with a single form

I'll describe each of these issues in more detail in turn. Before I begin, I should add that we don't have any RFC 4646 experts on our team. So, there is a chance that we've misunderstood something in that standard.

On the issue of using coding systems not included in RFC 4646, the "competing" code set we have in mind is the one in use by the Multitree project (http://multitree.linguistlist.org/). For instance, we have a wordlist which we can easily associate with Multitree code "0ke" (http://multitree.linguistlist.org/codes/0ke), but that we don't have a 639-3 code for. Now, it may be the case that there is an appropriate 639-3 code for this data, but, with the information we have available, we simply don't know what it is. We could use the [und] language tag (for unknown languages), but this is clearly less informative than "0ke". Perhaps someday someone will figure out how "0ke" relates to 639-3, and then it will be easy to assign a 639-3 code to the resource automatically (certainly much easier than if we just used [und]).

So, our key question here is: Is there any possibility of downgrading use of RFC 4646 to something like a best-practice recommendation for LIFT instead of a requirement (which it seems to be right now)? If that is done, we would then also need a way to specify, in the resource, the non-RFC 4646 code system we are using.

On the second issue of associating multiple codes with a form, our need for this comes out of encoding a multi-variety resource (for a group of closely related languages and/or dialects) wherein some entries are associated with more than one variety. (As an example, consider a hypothetical English lexical resource covering Scots English and southern British English where some entries would be marked specifically for Scots, others specifically for the southern variety, and many as being lexical elements for both.)

RFC 5646 (which supersedes 4646 and therefore will be incorporated into LIFT?) actually anticipates the need for multiple codes in section 4.3. (See http://tools.ietf.org/html/rfc5646.)

Our question here is one of implementation. Assuming that LIFT will allow multiple language codes attached to a single form, following RFC 5646, then how should this be done? To make this concrete, here is one suggestion (not one we're endorsing, but just to put something out there):

<form lang="eng,fra"><text>table</text></form>

As you can see, I've separated the two codes in the lang attribute of <form> with a comma.
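To make the suggestion concrete from a tool's point of view, here is a minimal Python sketch of how a reader might split such a multi-valued lang attribute. The helper names split_lang_attr and read_form are hypothetical, and a comma-separated lang value is only the illustration above, not something LIFT currently defines.

```python
import re
import xml.etree.ElementTree as ET


def split_lang_attr(value):
    """Split a multi-valued lang attribute on commas and/or whitespace."""
    return [tag for tag in re.split(r"[,\s]+", value.strip()) if tag]


def read_form(xml_text):
    """Return (list of language codes, form text) for one <form> element."""
    form = ET.fromstring(xml_text)
    codes = split_lang_attr(form.get("lang", ""))
    text = form.findtext("text", default="")
    return codes, text


# The hypothetical multi-code form from the message above.
codes, text = read_form('<form lang="eng,fra"><text>table</text></form>')
print(codes, text)
```

A tool that expects exactly one tag per form would see "eng,fra" as a single, malformed tag, so any such change would have to be agreed on by implementors.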

I hope this is clear, and thanks for your help.

Best,
Jeff

Martin Hosken

Feb 6, 2011, 8:46:10 PM
to lexiconinter...@googlegroups.com, jcg...@gmail.com
Dear Jeff,

>1. On the issue of using coding systems not included in RFC 4646, the "competing" code set we have in mind is the one in use by the Multitree project (http://multitree.linguistlist.org/). For instance, we have a wordlist which we can easily associate with Multitree code "0ke" (http://multitree.linguistlist.org/codes/0ke), but that we don't have an 639-3 code for. Now, it may be the case that there is an appropriate 639-3 code for this data, but, with the information we have available, we simply don't know what it is. We could use the [und] language tag (for unknown languages), but this is clearly less informative than "0ke", and, perhaps someday, someone will figure out how "0ke" relates to 639-3, and then it will be easy to assign a 639-3 code to the resource automatically (certainly much easier than if we just used [und]).


>
> So, our key question here is: Is there any possibility of downgrading use of RFC 4646 to something like a best-practice recommendation for LIFT instead of a requirement (which it seems to be right now)? If that is done, we would then need a way to specify the non RFC 4646 code system we are using in the resource as well.

There is no intention to allow non-standardised language tags since doing so would only increase confusion rather than reduce it. Instead, if you really cannot find a standard language tag then you may want to assemble something that at least means something to you. Thus for your example you might use: und-x-tl-0ke
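As a rough illustration of that suggestion, the following Python sketch wraps an external code in a private-use tag of the shape und-x-tl-0ke and recovers it again. The function names are hypothetical, and the "tl" namespace subtag is simply copied from the example above. The only rule checked is the BCP 47 one that each private-use subtag is 1 to 8 ASCII letters or digits.

```python
import re

# BCP 47 (RFC 5646) private-use subtags: 1 to 8 ASCII letters or digits each.
_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def private_use_tag(code, namespace="tl", primary="und"):
    """Build a tag such as 'und-x-tl-0ke' from an external code (hypothetical helper)."""
    for subtag in (namespace, code):
        if not _SUBTAG.fullmatch(subtag):
            raise ValueError(f"not a valid private-use subtag: {subtag!r}")
    return f"{primary}-x-{namespace}-{code}"


def external_code(tag, namespace="tl"):
    """Recover the external code from a tag built by private_use_tag, else None."""
    subtags = tag.lower().split("-")
    if "x" in subtags:
        private = subtags[subtags.index("x") + 1:]
        if len(private) == 2 and private[0] == namespace:
            return private[1]
    return None


print(private_use_tag("0ke"))
print(external_code("und-x-tl-0ke"))
```

Because the external code survives inside the tag, a later mapping from such codes to 639-3 could still be applied automatically.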

> RFC 5646 (which supersedes 4646 and therefore will be incorporated into LIFT?) actually anticipates the need for multiple codes in section 4.3. (See http://tools.ietf.org/html/rfc5646.)

We'll probably update LIFT to specify BCP47 which is a way of always using the most current RFCs for language tagging without having to update the document every time a new one comes out.

> Our question here is one of implementation. Assuming that LIFT will allow multiple language codes attached to a single form, following RFC 5646, then how should this be done? To make this concrete, here is one suggestion (not one we're endorsing, but just to put something out there):

I'm afraid your assumption is not currently correct. LIFT does not allow multiple language codes attached to a single form at this time. If it were to be added, then I would probably propose a simple space-separated list. In the meantime, you might consider using a hierarchical model to save on the number of repetitions of the form. For example, suppose you were distinguishing Scots English (en-scotland) and southern British English (en-GB, since there is no specific variant for this; if you really needed one you should either register it or use -x-southern or somesuch), and you want to say that a word is the same in both. Then you might label the form en-GB (since Scotland is still part of Great Britain ;), even if Northern Ireland isn't, but GB represents the whole UK, just to keep everyone guessing) or even just en. Of course this approach may not always work, and then it would be necessary to repeat the form.
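One mechanical reading of the hierarchical model is to label a shared form with the longest run of leading subtags that all of its varieties have in common. The Python sketch below does only that; common_tag is a hypothetical helper, and it covers the "or even just en" case, not the geographic argument for en-GB.

```python
def common_tag(tags):
    """Longest shared leading run of subtags, or None if the tags share nothing."""
    split = [tag.lower().split("-") for tag in tags]
    shared = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        shared.append(parts[0])
    # A trailing private-use singleton "x" with nothing after it is not a valid tag.
    while shared and shared[-1] == "x":
        shared.pop()
    return "-".join(shared) or None


print(common_tag(["en-scotland", "en-GB"]))
print(common_tag(["en-GB", "en-GB-x-southern"]))
print(common_tag(["en", "fr"]))
```

When the result is None, or is too general to be useful, the form would have to be repeated once per variety.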

Having said this, I don't think we considered too hard the question of whether people would want to multiply tag forms. It would involve a considerable programming change on the part of implementors and for that reason, would require a more consensual decision to allow such things. Feel free to try to persuade folks :)

HTH,
Yours,
Martin

Jeff Good

Feb 6, 2011, 10:18:27 PM
to lexiconinter...@googlegroups.com
Dear Martin,

Thanks for your quick and clear answers. I'm going to insert some of my own responses to your responses below, but I should start with the caveat that I was writing as a representative of sorts of the LEGO team, and other people on the project will be in a better position to fill in some of the details than me. So, you may be hearing from them as well (especially if I misrepresent them accidentally).


> There is no intention to allow non-standardised language tags since doing so would only increase confusion rather than reduce it. Instead, if you really cannot find a standard language tag then you may want to assemble something that at least means something to you. Thus for your example you might use: und-x-tl-0ke

The idea of extending [und] in some way is a helpful one that I am sure we will seriously consider. One clarification I would like to make is that I did not intend to propose opening up LIFT to non-standardised tags, per se. Rather, the question was whether we could use a different "standard". Multitree codes certainly are not standardized in the way that the 639 family is, but they are at least published, and my understanding is that LINGUIST List is committed to maintaining the relevant namespace. Also, each code is documented to some extent.

I also wonder what kind of confusion you are concerned about in allowing non-IETF tags. Obviously, this can cause confusion about which language is being documented, but, assuming whatever non-IETF tags one might employ don't allow for some sort of disastrous syntax, is there a specific tool implementation issue here? My understanding of Multitree is that it is now the most comprehensive database of languages and subgroups anywhere and, therefore, has potential value for providing a set of tags for language data where, say, 639-3 might not work. I think we'd all agree that the ideal would be to use a 639 code, but it's not clear to me that it's more confusing to allow codes from a different, well-maintained code set than it is to tag data as [und] when you actually have some idea of what the language is, especially if the fault may lie with 639-3 rather than actual knowledge of the world.

I should add, here, that I don't personally have much of a stake in the Multitree codes. Rather, I'm re-packaging some of the internal discussion we had on this. Indeed, at first, I was hesitant to want to use them, but, after the course of some discussion, I started to think it would be good if one could use them in LIFT under certain circumstances, assuming there was a way to stipulate a namespace (or whatever) for the code set one was using.


> I'm afraid your assumption is not currently correct. LIFT does not allow multiple language codes attached to a single form at this time. IF it were to be added then I would probably propose a simple space separated list. In the meantime, you might consider using a hierarchical model to save on the number of repetitions of the form. For example, if you were distinguishing scots English (en-scotland) and southern british english (en-GB since there is no specific variant for this. If you really needed one you should either register it or use -x-southern or somesuch) and you want to say that a word is the same in both, then you might label the form en-GB, since Scotland is still part of Great Britain ;), even if Northern Ireland isn't, but GB represents the whole UK, just to keep everyone guessing) or even just en. Of course this approach may not always work, and then it would be necessary to repeat the form.

I believe for the actual resource we have in mind, the hierarchical approach will not work (at least not straightforwardly). I have not been working directly with the relevant resource, a Tamashek dictionary, but my understanding is that its structure is something like five dialects/languages where arbitrary subsets may share any given form. For one form it may be dialects A, B, and C, while for the next, B, C, and D, and another might be A and C. It's not easy to use hierarchical codes for such cases (or it is certainly harder than in my hypothetical case). Susan Smith, who is on this list I believe, can provide real examples.
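One way to see the difficulty: a single tree of codes can only label groups of dialects that are either nested or disjoint. The Python sketch below checks that condition for the three groups just mentioned. The function name fits_hierarchy is hypothetical, and A to D are the placeholder dialect labels from the paragraph above, not real codes.

```python
from itertools import combinations


def fits_hierarchy(groups):
    """True if every pair of groups is nested or disjoint, so one tree could label them all."""
    sets = [frozenset(group) for group in groups]
    return all(
        a <= b or b <= a or a.isdisjoint(b)
        for a, b in combinations(sets, 2)
    )


# {A, B, C} and {B, C, D} overlap without nesting, so no single hierarchy fits.
print(fits_hierarchy([{"A", "B", "C"}, {"B", "C", "D"}, {"A", "C"}]))

# A nested-or-disjoint arrangement, by contrast, would fit.
print(fits_hierarchy([{"A", "B"}, {"A"}, {"C", "D"}]))
```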

You do imply the possibility of one solution that we considered, which is to repeat the form. In principle, this is possible, of course, but we don't believe it would faithfully represent the claims of the original resource and, of course, it would amount to a kind of denormalization, which could cause maintenance problems down the road. So, we were hoping to avoid this.
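For illustration, repeating the form amounts to something like the following Python sketch, which writes one form element per variety with identical text. The helper repeat_form and the und-x-a style tags are hypothetical placeholders, and where exactly such repeated forms would live in a real LIFT file is left open here.

```python
import xml.etree.ElementTree as ET


def repeat_form(text, tags):
    """Build a <lexical-unit> holding one identical <form> per language tag."""
    unit = ET.Element("lexical-unit")
    for tag in tags:
        form = ET.SubElement(unit, "form", {"lang": tag})
        ET.SubElement(form, "text").text = text
    return unit


# Placeholder private-use tags standing in for three dialects that share a form.
unit = repeat_form("table", ["und-x-a", "und-x-b", "und-x-c"])
print(ET.tostring(unit, encoding="unicode"))
```

The same string now appears three times, so a later correction has to be made in three places, and nothing in the output records that the source treated it as a single shared form.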


> Having said this, I don't think we considered too hard the question of whether people would want to multiply tag forms. It would involve a considerable programming change on the part of implementors and for that reason, would require a more consensual decision to allow such things. Feel free to try to persuade folks :)

This does not seem to me to be a particularly uncommon situation in lexical resources because so many are comparative in nature, and that's where this problem might come up. As a concrete use case, it seems to me that once one is dealing with closely related languages/dialects, this sort of feature would be quite handy for creating "localized" lexicons for a small speaker community and could, perhaps, even assist in semi-automated translation efforts. (For example, if you knew that language A was similar to language B for this set of lexical items but not that set, then you could more intelligently use your lexical resource to help you translate a text from language A to language B.) So, I can see this being useful both for academic uses like ours and applied uses as well.

That being said, for me, the strongest endorsement of allowing for multiple codes is probably the fact that the need for this is articulated in RFC 5646 itself.

Thanks again for your quick and helpful responses!

Jeff
