John Hatton
SIL Papua New Guinea, Palaso, & SIL International Software Development
Chat Google Talk: hattonjohn Skype: hattonjohn Google Wave:
hatto...@googlewave.com
Thanks
To follow up on Helen's message, right now, as she said, I'm working on a LIFT-based XML schema for wordlist data. Wordlists are different from regular lexicons in a number of important ways, but for the most part they can be expressed in a subset of LIFT. Since the data sources I'm dealing with amount to several thousand word lists and since we want to disseminate them broadly, we need to be careful about keeping track of metadata. This is why, ideally, we'd be able to include metadata with each word list. That way, we wouldn't have to worry about the metadata getting separated from the data.
We can include our metadata in a semi-structured way using a <note>, but we'd rather include proper OLAC metadata, if possible, for all the obvious reasons. We could generate stand-off metadata, of course, but that would add to the data management overhead more than I'd like. I'm not so much worried about my own work--after all, I have all the metadata in my own database--but if someone else wants to download, say, 1000 of the wordlists for some application, I think it would be easier for them if the metadata were packaged with the lists (or, at least, I think they should have that option).
I suppose, at this point, what I'm most interested in is knowing if there are any recommendations about metadata from the LIFT community. If there are not, we'll probably adopt some option using a <note> inside the wordlist and then work on producing an OLAC dump for the metadata for all the wordlists.
Thanks,
Jeff
I'm attaching a LIFT file with an attempt to embed OLAC in an initial <metadata> tag. I made this file by hand since we don't have an official way to do this yet in the project. So, this should be viewed as an example of a possibility rather than a proposal.
The lexicon (a wordlist) is called Bezhta.xml. I'm also attaching an RNG file called Hacked-Lift.rng which took the constrained LIFT that LEGO has been developing and changed it to get the attached lexicon to validate. The big change is adding a metadata element to the document definition that can have any kind of content. Obviously, we wouldn't actually want it to have any kind of content. I just put this in for testing.
Please let me know if you have any questions,
Jeff
Hi Jeff,
Thanks for the info.
>We can include our metadata in a semi-structured way using a <note>, but we'd rather include proper OLAC metadata, if possible, for all the obvious reasons. We could generate stand-off metadata, of course, but that would add to the data management overhead more than I'd like. I'm not so much worried about my own work--after all, I have all the metadata in my own database--but if someone else wants to download, say, 1000 of the wordlists for some application, I think it would be easier for them if the metadata were packaged with the lists (or, at least, I think they should have that option).
So, would it work for you to just use a namespace for OLAC?
<lift producer="some lift thing v2.1" version="0.15">
  <header>
    <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
               xmlns="http://purl.org/dc/elements/1.1/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/
                                   http://www.language-archives.org/OLAC/1.1/olac.xsd">
      <creator>Graham, Albert</creator>
      Etc...
    </olac:olac>
  </header>
  ...
</lift>
Then all we (WeSay, FLEx authors) have to do is make sure we round-trip any data which comes in a namespace like this. And eventually, we'd like to support entering some of this metadata from the editors, as well. We'd also like to embed license information (e.g. ccREL).
John Hatton
SIL PNG, Palaso, & SIL International Software Development
Google Talk chat: hattonjohn
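As an illustration of the namespace approach suggested above, here is a minimal Python sketch that reads an embedded OLAC block out of a LIFT header using only the standard library. It assumes that LIFT's own elements are unqualified (no namespace) and that the OLAC block sits directly inside <header>; the function name read_olac_metadata and the sample document are made up for this example and are not part of WeSay, FLEx, or any LIFT tooling.

```python
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample, modelled on the snippet in the message above.
LIFT_SAMPLE = """<lift producer="some lift thing v2.1" version="0.15">
  <header>
    <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
               xmlns="http://purl.org/dc/elements/1.1/">
      <creator>Graham, Albert</creator>
    </olac:olac>
  </header>
</lift>"""


def read_olac_metadata(lift_xml):
    """Return (name, value) pairs for Dublin Core elements in the OLAC block."""
    root = ET.fromstring(lift_xml)
    # ElementTree spells qualified names as "{namespace-uri}local-name".
    block = root.find("header/{%s}olac" % OLAC_NS)
    if block is None:
        return []
    pairs = []
    for child in block:
        namespace, _, local_name = child.tag[1:].partition("}")
        if namespace == DC_NS:
            pairs.append((local_name, (child.text or "").strip()))
    return pairs


metadata = read_olac_metadata(LIFT_SAMPLE)
print(metadata)
```

Because the lookup goes by namespace URI rather than by prefix, it would work the same way whichever prefix the producing tool happened to choose.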
I don't see any problem on our end with just declaring the namespaces in an <olac:olac> portion of the document. It looks like you're also thinking of putting metadata in the header, which I don't anticipate any problem with, either.
I wonder if it might be worth considering having a special <metadata> block which could contain OLAC metadata, or other kinds of metadata, if needed, and then suggesting that applications support retaining any information in that block even if the application can't do anything with it.
Jeff
> I wonder if it might be worth considering having a special <metadata> block
> which could contain OLAC metadata, or other kinds of metadata, if needed,
> and then suggesting that applications support retaining any information in
> that block even if the application can't do anything with it.
We batted this around here (I'm visiting Thailand this week) and decided to
leave it up to you to say what the standard should be on this one; you
decide if the rule is
a) apps should round-trip other-namespace data found in the <header>, or
b) We add a <metadata> tag to the header which has no contents, and say that
apps should make sure they round-trip anything found in that tag.
> a) apps should round-trip other-namespace data found in the <header>, or
> b) We add a <metadata> tag to the header which has no contents, and say that
> apps should make sure they round-trip anything found in that tag.
I think the LEGO team will need to discuss this internally and come back to you with an answer. As it turns out, there is some other information (e.g., mappings to an ontology) that we've been trying to get into the lexicons that might be relevant to this discussion, too, which is why I'd need to confer with them. At present, we have assumed that such mappings could not be done inside of a LIFT document because of things like namespace issues. However, your option (a) might give us a good way to handle that, too.
I've never written an application where I've had to worry about round-tripping data that the application may not be designed to handle. The reason why I thought about having a <metadata> tag is that I thought it would make processing easier if one could say in the specification, for example, that whatever is inside the <metadata> tag is officially "terra incognita". Then special handlers could be written to deal with the content of that tag. Is it similarly easy to write code that would detect any element using an arbitrary outside namespace? (I don't recall this being the sort of thing that would be built into an XML parser, but, again, it's never been a problem I've had to worry about.)
If the people working on applications using LIFT don't think detecting arbitrary outside namespaces is a problem, then it's fine with me. I'm just explaining why I thought it might be helpful to "quarantine" any/all metadata using external namespaces in a dedicated (unqualified) tag. One could easily imagine generalizing this to other data types (e.g., copyright, ontological mappings) as well.
In any event, I'll start a conversation with the rest of LEGO about this, and we'll get back to you.
Thanks,
Jeff
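On the question of whether outside-namespace elements are easy to detect: with a namespace-aware parser it can be fairly short. The following Python sketch shows both options side by side, using the standard library. It assumes LIFT's own elements carry no namespace, so any qualified child of <header> counts as foreign; the function names and the sample documents are invented for illustration, and the <metadata> element in the second sample is the proposed tag, not something in the current LIFT schema.

```python
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"


def namespace_of(element):
    """Return the element's namespace URI, or None if it is unqualified."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None


def foreign_header_elements(root):
    """Option (a): children of <header> that belong to any namespace."""
    header = root.find("header")
    if header is None:
        return []
    return [child for child in header if namespace_of(child) is not None]


def metadata_block_contents(root):
    """Option (b): everything inside a dedicated, unqualified <metadata> tag."""
    block = root.find("header/metadata")
    return list(block) if block is not None else []


# Hypothetical documents, one per option.
SAMPLE_A = """<lift version="0.15"><header>
  <ranges/>
  <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:creator>Graham, Albert</dc:creator>
  </olac:olac>
</header></lift>"""

SAMPLE_B = """<lift version="0.15"><header>
  <metadata>
    <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"/>
  </metadata>
</header></lift>"""

found_a = foreign_header_elements(ET.fromstring(SAMPLE_A))
found_b = metadata_block_contents(ET.fromstring(SAMPLE_B))
print([namespace_of(e) for e in found_a], len(found_b))
```

Either way the application ends up with a list of subtrees it can hold on to and write back unchanged. The difference is mainly in what the specification has to promise: option (a) relies on LIFT itself staying unqualified, while option (b) relies on the tag name alone.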
thanks
John Hatton
If the LIFT developers agree to this, then what Helen has written here looks good to me.
As Helen knows, my side of the LEGO project, involving wordlists, has not needed to make use of GOLD mappings because of the nature of the data. So, I'm not personally worried too much if the solution we choose now is specific to metadata or more general. This leaves open the issue as to whether or not LIFT may want to incorporate a general provision allowing people to extend LIFT, of course. But, I find Gary's arguments regarding the role of metadata in best practice compelling enough that I think it makes sense to give it a specific place in the schema regardless of other provisions for extending LIFT.
Jeff
> BTW, I don't think we will be putting GOLD mappings in the header,
> after all. LEGO must ingest LEXUS-generated XML as part of our
> collaboration with MPI-Nijmegen on the RELISH project (Rendering
> Endangered Lexicons Interoperable through Standards Harmonization).
> The LEXUS-generated XML will have the GOLD URI made explicit in each
> lexical entry. Also, the LEGO team leader tells me that the current
> plan to output a "GOLD-mapped lexicon" includes specifying the
> relevant GOLD concept on each entry, rather than just adding in the
> header a listing that specifies which author labels are to be
> interpreted as which GOLD concepts.
One way to handle GOLD is to autogenerate a range header file for it. You can even put a field in each range-element to give its URL. Then you can use simple traits in the lexicon. This is probably better than defining a field that somewhat hides the information.
Yours,
Martin
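The range-file suggestion above could be sketched roughly as follows in Python. This is only an illustration under several assumptions: the element layout (lift-ranges, range, range-element, and a field holding the URL) follows the description in the message and has not been validated against any LIFT schema version; the range id "gold", the field type "url", and the two concept URIs are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: concept id -> URI (placeholders for illustration).
GOLD_CONCEPTS = {
    "Noun": "http://purl.org/linguistics/gold/Noun",
    "Verb": "http://purl.org/linguistics/gold/Verb",
}


def build_gold_range(concepts, range_id="gold"):
    """Build a range document with one range-element per concept."""
    root = ET.Element("lift-ranges")
    rng = ET.SubElement(root, "range", id=range_id)
    for concept_id, uri in sorted(concepts.items()):
        element = ET.SubElement(rng, "range-element", id=concept_id)
        # Assumed layout: a field on each range-element carries the URL.
        field = ET.SubElement(element, "field", type="url")
        form = ET.SubElement(field, "form", lang="en")
        ET.SubElement(form, "text").text = uri
    return root


def gold_trait(concept_id, range_id="gold"):
    """Build the simple trait an entry would carry to point at a concept."""
    return ET.Element("trait", name=range_id, value=concept_id)


ranges = build_gold_range(GOLD_CONCEPTS)
trait = gold_trait("Noun")
print(ET.tostring(ranges, encoding="unicode"))
print(ET.tostring(trait, encoding="unicode"))
```

With this arrangement the lexicon entries stay small (one trait each), and the mapping from concept id to URL lives in a single generated file.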