Report and question about OLAC metadata


Helen

Sep 1, 2010, 10:24:53 AM
to LexiconInterchangeFormat
Hello, Everyone,

Our LEGO project hasn't been very visible on this list recently, but I
thought you might want to know that we just made a presentation on
LIFT in Nijmegen, which resulted in LIFT being chosen as the
interchange format for the RELISH project, a joint project between
LINGUIST List, U. of Frankfurt, and MPI-Nijmegen designed to harmonize
lexicon standards in the US and Europe. If we are successful with
RELISH, LIFT will be written into LEXUS (the MPI lexicon creation
tool) as one of the standard outputs. This will go a long way toward
increasing its use in Europe.

I might add that we also presented on LIFT during the course we taught
at the Comp Ling Summer School at U. of Zadar; and, again, made a few
converts, I think. TEI is probably the front-runner to be chosen as
the CLARIN serialization of Lexical Markup Framework (the ISO standard),
but the relative simplicity of LIFT makes it appealing to the NLP
world, so I believe that LIFT is a contender.

Hope you guys are happy to hear this. It is a tribute to all your
hard work developing a flexible schema, with lucid documentation.

So--now that we hopefully have garnered some brownie points in the
LIFT world--let me ask an elementary question: in the LEGO project,
we are trying to put OLAC metadata into the XML files. Jeff Good, who
is the PI handling the wordlist side of the project, is the furthest
along in working with metadata; and he tells me that there is no
provision for an OLAC container in LIFT. How are you all handling
OLAC metadata, if you are?

Thanks,
-Helen

Cambell Prince

Sep 2, 2010, 11:31:02 AM
to lexiconinter...@googlegroups.com
Hi Helen,

Thanks for your email.  Always happy to hear that our work is useful to others.

Regarding OLAC, LIFT doesn't tie itself to any particular metadata standard.  Unless this discussion prompts it, we have no plans to add support for this any time soon.

However, one method that might be appropriate is 'stand-off markup': you keep another file, say MyLexicon.olac, which contains OLAC data pointing to the entries in the lexicon MyLexicon.lift.  We have recently decided to standardise on GUIDs as IDs for each entry, so that each entry is well identified.  These could be used in the .olac file to refer to an entry.

Regards,
Cambell

John Hatton

Sep 3, 2010, 9:02:12 PM
to lexiconinter...@googlegroups.com
Hi Helen,
Thanks for the update. As Cambell said, we don't currently have any
metadata in LIFT. But I think we probably should, and that OLAC would be
the one to use (with extensions?). My only hesitation is on the cost to
existing applications. I assume it would be done using namespaces. I
assume that apps that aren't ready to create/read that data would just need
to not choke when they came across it.

John Hatton
SIL Papua New Guinea, Palaso, & SIL International Software Development
Chat Google Talk: hattonjohn Skype: hattonjohn Google Wave:
hatto...@googlewave.com

Helen

Sep 5, 2010, 12:07:14 PM
to LexiconInterchangeFormat
Hi, Cambell and John,

Thanks for the clarification. I understand the concern about the cost
to existing applications. But I don't think stand-off metadata is
going to work very well for the LEGO project. Our (LL) side of the
project has only about 20 lexicons, so we could do it; but Jeff is
working with over 4000 word lists, so I think it will be most
practical for him to put metadata into the XML file. I think he's on
this list, so I invite him to comment.

Our situation with regard to LIFT is somewhat complicated. We are
ingesting lexicons automatically and mapping them to the GOLD
ontology. To facilitate automatic processing, we created our own LIFT
schema (LL-LIFT or LEGO LIFT), which is a restricted version of LIFT
which still validates against the official LIFT schema. Jeff is using
a word-list schema which is as close to LL-LIFT as he could make it
and still deal with word lists (which are organized via concepts, not
word senses). I believe that his version also validates against the
official LIFT schema--except for the OLAC wrapper for metadata.

Metadata is probably something you should consider; and OLAC is
certainly the simplest metadata standard for language resources. But
there's probably a lot on your plate right now.

All the best,
-Helen



John Hatton

Sep 5, 2010, 8:55:10 PM
to lexiconinter...@googlegroups.com
Hi Helen,
Could you post the relevant portion of one of your LIFT files with embedded
OLAC?

Thanks

John Hatton
SIL Papua New Guinea, Palaso, & SIL International Software Development
Chat Google Talk: hattonjohn Skype: hattonjohn Google Wave:

hatto...@googlewave.com

Jeff Good

Sep 5, 2010, 6:00:54 PM
to lexiconinter...@googlegroups.com
Hello everyone,

To follow up on Helen's message, right now, as she said, I'm working on a LIFT-based XML schema for wordlist data. Wordlists are different from regular lexicons in a number of important ways, but they can be expressed in a subset of LIFT for the most part. Since the data sources I'm dealing with amount to several thousand word lists and since we want to disseminate them broadly, we need to be careful about keeping track of metadata. This is why, ideally, we'd be able to include metadata with each word list. That way, we wouldn't have to worry about the metadata getting separated from the data.

We can include our metadata in a semi-structured way using a <note>, but we'd rather include proper OLAC metadata, if possible, for all the obvious reasons. We could generate stand-off metadata, of course; it would just add to the data management overhead more than I'd like. I'm not so much worried about my own work--after all, I have all the metadata in my own database--but if someone else wants to download, say, 1000 of the wordlists for some application, I think it would be easier for them if the metadata were packaged with the lists (or, at least, I think they should have that option).

I suppose, at this point, what I'm most interested in is knowing if there are any recommendations about metadata from the LIFT community. If there are not, we'll probably adopt some option using a <note> inside the wordlist and then work on producing an OLAC dump for the metadata for all the wordlists.

Thanks,
Jeff

Jeff Good

Sep 5, 2010, 10:56:32 PM
to lexiconinter...@googlegroups.com
Hello,

I'm attaching a LIFT file with an attempt to embed OLAC in an initial <metadata> tag. I made this file by hand since we don't have an official way to do this yet in the project. So, this should be viewed as an example of a possibility rather than a proposal.

The lexicon (a wordlist) is called Bezhta.xml. I'm also attaching an RNG file called Hacked-Lift.rng which took the constrained LIFT that LEGO has been developing and changed it to get the attached lexicon to validate. The big change is adding a metadata element to the document definition that can have any kind of content. Obviously, we wouldn't actually want it to have any kind of content. I just put this in for testing.

Please let me know if you have any questions,
Jeff

Bezhta.xml
Hacked-Lift.rng

John Hatton

Sep 6, 2010, 12:48:20 AM
to lexiconinter...@googlegroups.com

Hi Jeff,

Thanks for the info.

 

>We can include our metadata in a semi-structured way using a <note>, but we'd rather include proper OLAC metadata, if possible, for all the obvious reasons. We could generate stand-off metadata, of course, it would just add to the data management overhead more than I'd like. I'm not so much worried about my own work--after all, I have all the metadata in my own database--but if someone else wants to download, say, 1000 of the wordlists for some application, I think it would be easier for them if the metadata were packaged with the lists (or, at least, I think they should have that option).

 

So, would it work for you to just use a namespace for OLAC?

<lift producer="some lift thing v2.1" version="0.15">
      <header>
            <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
               xmlns="http://purl.org/dc/elements/1.1/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/
                  http://www.language-archives.org/OLAC/1.1/olac.xsd">
               <creator>Graham, Albert</creator>
                  Etc...
            </olac:olac>

Then all we (WeSay, FLEx authors) have to do is make sure we round trip any data which comes in a namespace like this. And eventually, we'd like to support entering some of this meta data from the editors, as well. We'd also like to embed license information (e.g. ccREL).

 

John Hatton
SIL PNG, Palaso, & SIL International Software Development
Google Talk chat: hattonjohn

Jeff Good

Sep 6, 2010, 6:11:32 PM
to lexiconinter...@googlegroups.com
Hello John (and others),

I don't see any problem on our end with just declaring the namespaces in an <olac:olac> portion of the document. It looks like you're also thinking of putting metadata in the header, which I don't anticipate any problem with, either.

I wonder if it might be worth considering having a special <metadata> block which could contain OLAC metadata, or other kinds of metadata, if needed, and then suggesting that applications support retaining any information in that block even if the application can't do anything with it.

Jeff

John Hatton

Sep 7, 2010, 2:53:40 AM
to lexiconinter...@googlegroups.com
Hi Jeff,

> I wonder if it might be worth considering having a special <metadata> block
> which could contain OLAC metadata, or other kinds of metadata, if needed,
> and then suggesting that applications support retaining any information in
> that block even if the application can't do anything with it.

We batted this around here (I'm visiting Thailand this week) and decided to
leave it up to you to say what the standard should be on this one; you
decide if the rule is

a) apps should round-trip other-namespace data found in the <header>, or
b) We add a <metadata> tag to the header which has no contents, and say that
apps should make sure they round-trip anything found in that tag.
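
Under rule (b), the round-trip requirement can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree and a hypothetical LIFT fragment; the function names and the sample data are illustrative only and are not part of LIFT or of any existing application.

```python
import xml.etree.ElementTree as ET

# Hypothetical LIFT fragment with a <metadata> block in the header.
SOURCE = """\
<lift version="0.15">
  <header>
    <metadata>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Graham, Albert</dc:creator>
    </metadata>
  </header>
  <entry id="e1"/>
</lift>
"""

def canonical(element):
    """Prefix-independent view of a subtree: tag, attributes, text, children."""
    return (
        element.tag,                      # expanded name, e.g. "{uri}creator"
        sorted(element.attrib.items()),
        (element.text or "").strip(),
        [canonical(child) for child in element],
    )

def edit_and_save(xml_text):
    """Change something the app understands; leave <metadata> untouched."""
    root = ET.fromstring(xml_text)
    root.find("entry").set("id", "e1-edited")
    return ET.tostring(root, encoding="unicode")

before = ET.fromstring(SOURCE).find("header/metadata")
after = ET.fromstring(edit_and_save(SOURCE)).find("header/metadata")
preserved = canonical(before) == canonical(after)
print(preserved)
```

The comparison is made on expanded names rather than on raw text because a serializer may choose different namespace prefixes when it writes the file back out.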

Jeff Good

Sep 7, 2010, 9:23:44 AM
to lexiconinter...@googlegroups.com
Hello John,

> a) apps should round-trip other-namespace data found in the <header>, or
> b) We add a <metadata> tag to the header which has no contents, and say that
> apps should make sure they round-trip anything found in that tag.

I think the LEGO team will need to discuss this internally and come back to you with an answer. As it turns out, there is some other information (e.g., mappings to an ontology) that we've been trying to get into the lexicons that might be relevant to this discussion, too, which is why I'd need to confer with them. At present, we have assumed that such mappings could not be done inside of a LIFT document because of things like namespace issues. However, your solution (a) might give us a good solution for that, too.

I've never written an application where I've had to worry about round-tripping data that the application may not be designed to handle. The reason why I thought about having a <metadata> tag is that I thought it would make processing easier if one could say in the specification, for example, that whatever is inside the <metadata> tag is officially "terra incognita". Then special handlers could be written to deal with the content of that tag. Is it similarly easy to write code that would detect any element using an arbitrary outside namespace? (I don't recall this being the sort of thing that would be built into an XML parser, but, again, it's never a problem I had to worry about.)
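
On the question of detecting elements from an arbitrary outside namespace, here is a minimal sketch using Python's standard xml.etree.ElementTree. It assumes, as in the fragment posted earlier in this thread, that LIFT's own elements are unqualified; the sample document and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical LIFT fragment; the namespace URIs are the ones quoted in this thread.
LIFT_SAMPLE = """\
<lift producer="example" version="0.15">
  <header>
    <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>Graham, Albert</dc:creator>
    </olac:olac>
    <ranges/>
  </header>
  <entry id="e1"/>
</lift>
"""

def namespace_of(element):
    """Return the namespace URI of an element, or None if it is unqualified."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None

def foreign_children(parent):
    """Direct children of `parent` that are namespace-qualified.

    Assumption: LIFT elements are unqualified, so any qualified child is
    foreign data that the application should preserve untouched.
    """
    return [child for child in parent if namespace_of(child) is not None]

root = ET.fromstring(LIFT_SAMPLE)
header = root.find("header")
foreign = foreign_children(header)
for element in foreign:
    print(namespace_of(element))
```

With this parser, at least, the namespace is part of each element's expanded name, so detection is a string test on the tag and needs no special parser support.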

If the people working on applications using LIFT don't think detecting arbitrary outside namespaces is a problem, then it's fine with me. I'm just explaining why I thought it might be helpful to "quarantine" any/all metadata using external namespaces in a dedicated (unqualified) tag. One could easily imagine generalizing this to other data types (e.g., copyright, ontological mappings) as well.

In any event, I'll start a conversation with the rest of LEGO about this, and we'll get back to you.

Thanks,
Jeff

Gary Simons

Sep 22, 2010, 11:09:07 AM
to LexiconInterchangeFormat
Hi John and all,

I would vote for (b) for the reason that having a data file include
its own metadata is clearly best practice so making a specific slot
for it in the header (which is where one would expect to find it),
shows that you are in tune with best practice. Solution (a) fails to
demonstrate that LIFT believes in metadata and would seem to put an
unrealistic burden on applications in every part of the file rather
than localizing the issue.

I think it would be fine if you remained agnostic about metadata
schemas and simply declared:

metadata = element metadata { anyElement* }
anyElement =
  element * {
    (attribute * { text }
     | text
     | anyElement)*
  }

That is, any content in any namespace would be accepted, and multiple
metadata containers in different namespaces would be accepted. If a
single container in a specific namespace seems preferable, then I
think OLAC would be the logical choice, but if you were going to
hammer out a metadata standard that LIFT applications were expected to
be able to handle, you would probably want a subset of OLAC rather
than the whole thing. So the easiest thing at this point would be to
accept anything, and then let the community work out over time a
subset that applications would be expected to understand.

Incidentally, CC licenses can be handled in OLAC metadata by putting
the license URL in a Rights element, e.g.:

<dc:rights>http://creativecommons.org/licenses/by-sa/2.5/</dc:rights>

-Gary

Gary Simons

Sep 22, 2010, 2:53:23 PM
to LexiconInterchangeFormat
Hello John and all,

I would vote for option (b) for the reason that including metadata as
part of a language documentation file is clearly best practice and
LIFT needs to align itself with best practice. While option (a) would
allow the metadata container to get in the document, it does not
demonstrate that LIFT is in tune with best practices. And perhaps
even worse, it would probably put an undue burden on applications to
be able to deal with extra stuff that could appear in unanticipated
places. By following (b), the extra stuff is not unanticipated and is
in a predictable place in the header.

Incidentally, I think it would be fine for the <metadata> element to
be agnostic about what kind of metadata it will find. E.g. the
definition could be:

metadata = element metadata { anyElement* }
anyElement =
  element * {
    (attribute * { text }
     | text
     | anyElement)*
  }

This would even permit multiple metadata descriptions following
different schemas. If you did want to settle on one standard, I think
OLAC would be your best bet, but in the end, what you really might
want to do is come up with a subset of the OLAC schema that a
compliant LIFT application would be expected to deal with, rather than
expecting LIFT applications to support the entire schema. So until
such a subset can be agreed on, simply accepting anything would solve
the immediate problem.

Incidentally, you mentioned wanting to record CC licenses in
metadata. That is done in OLAC metadata by putting the URL of the
license in a Rights element, e.g.

<dc:rights>http://creativecommons.org/licenses/by-sa/2.5/</dc:rights>

-Gary





John Hatton

Sep 23, 2010, 4:43:42 AM
to lexiconinter...@googlegroups.com
Hi Gary,
Thanks for weighing in here. What is the implication of this "schema free
zone" on validation of the contents? I would guess that sticking to
namespaces will make it easier to say what is valid, and what isn't, within
the metadata tag. E.g., if my app does understand OLAC, but I can't seem to
parse what your app put in there, I need to be able to know (at runtime)
that what I'm looking at claims to be good OLAC (or something else), and
then run a schema over it to prove where the fault lies. WeSay does this all
the time, as bad LIFT comes to it from other apps. It saves us having to
deal with so many "why won't wesay read my data" emails, when WeSay can say
"wherever you got this here data from, go talk to them, not us".
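
As a rough illustration of the runtime check described above, the sketch below reads what each child of a <metadata> block claims to be: its namespace and, if present, the schema URL given in xsi:schemaLocation. It assumes Python's standard xml.etree.ElementTree; the sample block and function names are hypothetical. It only identifies the claim. Actually validating against the named schema would need a separate validator, which is not shown.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical <metadata> block; the URIs are the ones used in this thread.
METADATA = """\
<metadata>
  <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/
                http://www.language-archives.org/OLAC/1.1/olac.xsd">
    <dc:creator>Graham, Albert</dc:creator>
  </olac:olac>
</metadata>
"""

def claimed_schemas(element):
    """Map namespace URI -> schema URL from the element's xsi:schemaLocation."""
    raw = element.get("{%s}schemaLocation" % XSI, "")
    tokens = raw.split()
    # schemaLocation is a whitespace-separated list of (namespace, location) pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))

def describe_claims(metadata):
    """For each child of <metadata>, return (namespace, claimed schema URL or None)."""
    claims = []
    for child in metadata:
        tag = child.tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else None
        claims.append((namespace, claimed_schemas(child).get(namespace)))
    return claims

claims = describe_claims(ET.fromstring(METADATA))
print(claims)
```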

John Hatton
SIL Papua New Guinea, Palaso, & SIL International Software Development


Gary Simons

Sep 24, 2010, 7:56:55 AM
to LexiconInterchangeFormat
John,

It isn't exactly a schema-free zone, rather it would be a look-
elsewhere-for-the-schema zone. That is, LIFT would be saying, "As far
as my schema goes, I'll accept anything here." However, as soon as an
embedded element declares a namespace, then XML kicks in and says,
"But it still needs to be valid with respect to the schema for the
namespace."

One of the details of the XML family of standards is the xsi (or
XMLSchema-instance) namespace which provides a way of telling a
document where to find the schema for the namespace. Our OLAC
metadata standard tells people to include this with their namespace
declaration. For instance, here is a standalone OLAC record that
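could be dropped into another document:

<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/
http://www.language-archives.org/OLAC/1.1/olac.xsd">
<dc:creator>Bloomfield, Leonard</dc:creator>
<dc:date>1933</dc:date>
<dc:title>Language</dc:title>
<dc:publisher>New York: Holt</dc:publisher>
</olac:olac>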

This is where the xsi namespace is documented in the W3C standards:

http://www.w3.org/TR/xmlschema-1/#Instance_Document_Constructions

-Gary

John Hatton

Sep 24, 2010, 5:47:43 PM
to lexiconinter...@googlegroups.com
Gary,
I think we're now saying the same thing; I must have missed that you wanted
to retain namespaces in the metadata.

thanks

John Hatton
SIL Papua New Guinea, Palaso, & SIL International Software Development

Chat Google Talk: hattonjohn Skype: hattonjohn Google Wave:

hatto...@googlewave.com


Helen

Sep 25, 2010, 11:30:00 AM
to LexiconInterchangeFormat
Hello, Everyone--especially Jeff,

Reading over this discussion, it sounds like LEGO should put a
metadata tag into the LL-LIFT header, and that should enclose the OLAC
namespace and OLAC metadata. So would it look like Gary's example,
but inside a metadata tag, i.e.:

<metadata>
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/
http://www.language-archives.org/OLAC/1.1/olac.xsd">
<dc:creator>Bloomfield, Leonard</dc:creator>
<dc:date>1933</dc:date>
<dc:title>Language</dc:title>
<dc:publisher>New York: Holt</dc:publisher>
</olac:olac>
</metadata>
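
A consumer of such a header could read the Dublin Core fields back out along the following lines. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree and assuming the <metadata> element sits directly inside <header>; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

OLAC = "http://www.language-archives.org/OLAC/1.1/"
DC = "http://purl.org/dc/elements/1.1/"

# The example record from this message, wrapped in a <header> for illustration.
HEADER = """\
<header>
<metadata>
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/
http://www.language-archives.org/OLAC/1.1/olac.xsd">
<dc:creator>Bloomfield, Leonard</dc:creator>
<dc:date>1933</dc:date>
<dc:title>Language</dc:title>
<dc:publisher>New York: Holt</dc:publisher>
</olac:olac>
</metadata>
</header>
"""

def dublin_core_fields(header):
    """Return (field name, value) pairs from the OLAC record in <metadata>."""
    record = header.find("metadata/{%s}olac" % OLAC)
    if record is None:
        return []  # no OLAC record present; nothing to report
    prefix = "{%s}" % DC
    return [
        (child.tag[len(prefix):], child.text)
        for child in record
        if child.tag.startswith(prefix)  # skip anything outside Dublin Core
    ]

fields = dublin_core_fields(ET.fromstring(HEADER))
for name, value in fields:
    print(name, "=", value)
```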

BTW, I don't think we will be putting GOLD mappings in the header,
after all. LEGO must ingest LEXUS-generated XML as part of our
collaboration with MPI-Nijmegen on the RELISH project (Rendering
Endangered Lexicons Interoperable through Standards Harmonization).
The LEXUS-generated XML will have the GOLD URI made explicit in each
lexical entry. Also, the LEGO team leader tells me that the current
plan to output a "GOLD-mapped lexicon" includes specifying the
relevant GOLD concept on each entry, rather than just adding in the
header a listing that specifies which author labels are to be
interpreted as which GOLD concepts.

Thanks,
-Helen

Cambell Prince

Sep 25, 2010, 1:28:42 PM
to lexiconinter...@googlegroups.com
Hi,

One benefit of having the metadata element is that it says that other namespaces are not welcome in the main body of LIFT, which I think is a good statement to make.

Regards,
Cambell

Gary Simons

Sep 27, 2010, 7:21:39 PM
to LexiconInterchangeFormat
Yes, the example you give is what I am proposing the metadata would
look like.

Regarding GOLD mappings in the header, I think that is still the ideal
way to handle it and is what I will recommend when we finally get to
the point of FieldWorks exporting lexicons with mappings to GOLD.
Besides the normalization it achieves by ensuring that each long URI
occurs only once, there will ultimately be the issue of one-to-many
matches (in which case do they represent union or intersection?) and
inexact matches. Thus, the complete solution is going to require a
way to describe the mapping from a range set element to its semantics
as expressed in terms of concepts in one or more external namespaces.
But I don't think we'll be ready to cross that bridge until we have a
few complete examples of fully mapped lexicons that we are ready to
encode.

-Gary

Helen

Sep 28, 2010, 2:23:11 PM
to LexiconInterchangeFormat
Hi, Cambell,

Does your comment apply to the rest of the header as well? That is,
if we put the GOLD mappings in the header, as Gary suggests,
can we put a GOLD namespace in the header? I'm assuming that there
will be some container element, like the <metadata> element for the
OLAC metadata, which would demarcate the GOLD mappings and that we
could put the namespace inside that.

Thanks,
-Helen


Jeff Good

Sep 25, 2010, 11:56:00 AM
to lexiconinter...@googlegroups.com
Hello everyone,

If the LIFT developers agree to this, then what Helen has written here looks good to me.

As Helen knows, my side of the LEGO project, involving wordlists, has not needed to make use of GOLD mappings because of the nature of the data. So, I'm not personally worried too much if the solution we choose now is specific to metadata or more general. This leaves open the issue as to whether or not LIFT may want to incorporate a general provision allowing people to extend LIFT, of course. But, I find Gary's arguments regarding the role of metadata in best practice compelling enough that I think it makes sense to give it a specific place in the schema regardless of other provisions for extending LIFT.

Jeff

Martin Hosken

Sep 28, 2010, 10:56:13 PM
to lexiconinter...@googlegroups.com
Dear Helen,

> BTW, I don't think we will be putting GOLD mappings in the header,
> after all. LEGO must ingest LEXUS-generated XML as part of our
> collaboration with MPI-Nijmegen on the RELISH project (Rendering
> Endangered Lexicons Interoperable through Standards Harmonization).
> The LEXUS-generated XML will have the GOLD URI made explicit in each
> lexical entry. Also, the LEGO team leader tells me that the current
> plan to output a "GOLD-mapped lexicon" includes specifying the
> relevant GOLD concept on each entry, rather than just adding in the
> header a listing that specifies which author labels are to be
> interpreted as which GOLD concepts.

One way to handle GOLD is to autogenerate a range header file for it. You can even put a field in each range-element to give its URL. Then you can use simple traits in the lexicon. This is probably better than defining a field that somewhat hides the information.

Yours,
Martin
