Loanword Typology data model

Robert Forkel

Dec 4, 2008, 3:08:17 AM
to GOLD Ontology
hi all,
i just put together a basic example of the data for my Loanword
Typology project in GOLD (thus leaving out the loan aspects so far).
the turtle representation looks as follows:

# lwt data model example in turtle (http://www.w3.org/2007/02/turtle/primer/)

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix gold: <http://purl.org/linguistics/gold#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix lwt: <http://www.livingreviews.org/lwt/>.

#
# describe a word:
#
lwt:word/72181920467485626 rdf:type gold:LinguisticSign .
lwt:word/72181920467485626 gold:hasForm lwt:word/72181920467485626#Form .
lwt:word/72181920467485626 gold:hasMeaning lwt:meaning/3.597 .

lwt:word/72181920467485626#Form rdf:type gold:SyntacticWord .
lwt:word/72181920467485626#Form gold:inLanguage lwt:languoid/Dutch .
lwt:word/72181920467485626#Form gold:writtenRealization lwt:word/72181920467485626#LexicalForm .

lwt:word/72181920467485626#LexicalForm rdf:type
gold:OrthographicWord .
lwt:word/72181920467485626#LexicalForm gold:orthographicRep
"aalscholver" .

#
# describe a meaning:
#
lwt:meaning/3.597 rdf:type gold:SemanticUnit .
lwt:meaning/3.597 rdfs:label "the cormorant" .

#
# describe a language:
#
lwt:languoid/Dutch rdf:type gold:Language .
lwt:languoid/Dutch rdfs:label "Dutch" .
lwt:languoid/Dutch dc:identifier "iso-639-3:nld" .
lwt:languoid/Dutch owl:sameAs <http://wals.info/languoid/lect/wals_code_dut> .
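
A note on the syntax above: as far as I can tell, the Turtle grammar does not allow "/" inside the local part of a prefixed name, and "#" starts a comment, so names like lwt:word/72181920467485626#Form may be rejected by strict parsers. One possible workaround is sketched below, assuming a parser that supports @base. It restates only the word and meaning triples with relative IRIs; gold:hasMeaning is an assumption here, following the property name used later in this thread.

```turtle
# Sketch only: relative IRIs resolved against @base, so that "/" and "#"
# never appear inside a prefixed name.
@base <http://www.livingreviews.org/lwt/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gold: <http://purl.org/linguistics/gold#> .

# The word, its form and its meaning.
<word/72181920467485626>
    rdf:type gold:LinguisticSign ;
    gold:hasForm <word/72181920467485626#Form> ;
    gold:hasMeaning <meaning/3.597> .   # assumed property name

<word/72181920467485626#Form>
    rdf:type gold:SyntacticWord ;
    gold:inLanguage <languoid/Dutch> .

<meaning/3.597>
    rdf:type gold:SemanticUnit ;
    rdfs:label "the cormorant" .
```

With this base, <word/72181920467485626> resolves to http://www.livingreviews.org/lwt/word/72181920467485626, which is the IRI the lwt: prefix was meant to produce.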


does that make sense? i'm not a linguist, so bear with me.

Jeff Good

Dec 4, 2008, 11:19:39 AM
to gold-o...@googlegroups.com
Dear Robert (et al.),

I took a look at your sample LWT example and have some comments on
possible changes. I'm not that conversant in how GOLD has chosen to
model certain things. So, some details may not be GOLD-compliant. I
also added some concepts and predicates that aren't actually in GOLD
in the gold namespace (but it looks like you may have too).

Major changes:

1. For wordlists, after much discussion with various people, I think
it's safe to say a consensus has emerged not to code a wordlist
concept like "the cormorant" as a meaning in the GOLD sense of
meaning, but rather as a kind of "comparative concept" which has
"counterparts" in various languages. The GOLD sense of "meaning" would
then be reserved for particular translations of specific words in a
given language. This is kind of abstract, I know, but it has to do
with the special uses of wordlist concepts to try to generalize over
language-particular meanings. For example, we can talk of a concept
"person" which (at least a century ago) would have as its counterpart
in English "man" and in German "Mensch". However, English "man" has
two important senses, "person" and "male person", while German
"Mensch" lacks that ambiguity. In a wordlist context, both would be
linked to LWT's "the person", but if we were to assign each a
language-internal meaning, they would differ.

2. In an academic context, like the LWT project, where attribution is
vital, it is probably inadvisable to ever link any kind of data about
a language directly to a node for that language. Rather, an
intermediating device specifying the source of the data should be
employed. In the RDF below, I've used the notion of Doculect for this
(i.e., "a variety of a language which is documented in some source")
and connected the doculect to a language. I had thought GOLD
"described variety" was used for this concept, but upon reading the
documentation, it seems that is used for something different.

3. (Relatively minor) You had some content coming off of a Form
node which struck me as being better connected to a Word node. So, I
moved some annotation "up" in the tree.

I've pasted revised RDF below. It should be valid if you want to
examine it in detail.

Jeff


----------------


# lwt data model example in turtle (http://www.w3.org/2007/02/turtle/primer/)
# (@prefix declarations as in Robert's message above)

#
# describe a word:
#
<lwt:word/72181920467485626> rdf:type gold:LinguisticSign .
<lwt:word/72181920467485626> gold:hasForm <lwt:word/72181920467485626#Form> .

<lwt:word/72181920467485626> gold:translatesConcept <lwt:meaning/3.597> .
<lwt:word/72181920467485626> rdf:type gold:SyntacticWord .
<lwt:word/72181920467485626> gold:inDoculect <lwt:languoid/vanderSijs2008Dutch> .

<lwt:word/72181920467485626#Form> gold:OrthographicWord <lwt:word/72181920467485626#OrthographicForm> .

<lwt:word/72181920467485626#OrthographicForm> rdf:type
gold:OrthographicWord .
<lwt:word/72181920467485626#OrthographicForm>
gold:orthographicRepresentation "aalscholver" .


#
# describe a meaning:
#

<lwt:meaning/3.597> rdf:type gold:ComparativeSemanticConcept .
<lwt:meaning/3.597> rdfs:label "the cormorant" .

#
# describe a doculect
#
<lwt:languoid/vanderSijs2008Dutch> rdf:type gold:Doculect .
<lwt:languoid/vanderSijs2008Dutch> gold:describesLanguoid
<lwt:languoid/Dutch> .
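
The "person" example from point 1 can be sketched in the same style. This is a minimal illustration, not part of the LWT data: the ex: namespace and every resource in it are hypothetical, ex:translatesConcept and ex:ComparativeSemanticConcept mirror the terms used in the revised RDF above but sit in ex: because they are not in GOLD, and gold:hasMeaning is assumed to be the word-to-meaning property.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gold: <http://purl.org/linguistics/gold#> .
@prefix ex:   <http://example.org/lwt-sketch#> .   # hypothetical namespace

# The wordlist concept, labelled with an English mnemonic.
ex:THE_PERSON rdf:type ex:ComparativeSemanticConcept ;
    rdfs:label "the person" .

# Both words are counterparts of the same comparative concept ...
ex:en_man    ex:translatesConcept ex:THE_PERSON .
ex:de_Mensch ex:translatesConcept ex:THE_PERSON .

# ... but their language-internal meanings differ.
ex:en_man    gold:hasMeaning ex:sense_person , ex:sense_male_person .
ex:de_Mensch gold:hasMeaning ex:sense_person .
```

Keeping the two properties apart means the shared counterpart link survives even though the language-internal senses do not line up.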

Robert Forkel

Dec 4, 2008, 1:45:10 PM
to GOLD Ontology
hm. i'm somewhat at a loss. from a first look, GOLD seems pretty heavy
- in particular compared to ontologies like the one for wordnet
(http://www.w3.org/2006/03/wn/wn20/). but now it seems as if i'd still
have to make up quite a few concepts not (yet) in GOLD:
gold:translatesConcept
gold:ComparativeSemanticConcept
gold:inDoculect
gold:Doculect
gold:describesLanguoid
so maybe i should sort of dumb down my data and go with wordnet after
all, because it would be an idealisation so obvious as to not be
confused with inaccuracy?
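
For reference, one way such made-up terms could be declared without putting them in the gold: namespace is sketched below. Everything here is an assumption for illustration: the lwtv: namespace is hypothetical, and the domains and ranges are guesses based on how the terms are used in the revised example.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix lwtv: <http://example.org/lwt-vocab#> .   # hypothetical project vocabulary

# Classes not (yet) in GOLD.
lwtv:ComparativeSemanticConcept rdf:type owl:Class .
lwtv:Doculect                   rdf:type owl:Class .

# Properties not (yet) in GOLD; ranges and domains are assumptions.
lwtv:translatesConcept rdf:type owl:ObjectProperty ;
    rdfs:range lwtv:ComparativeSemanticConcept .
lwtv:inDoculect rdf:type owl:ObjectProperty ;
    rdfs:range lwtv:Doculect .
lwtv:describesLanguoid rdf:type owl:ObjectProperty ;
    rdfs:domain lwtv:Doculect .
```

If GOLD later adopts equivalent terms, the project terms could be linked to them rather than renamed.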

as to the doculect concept: might gold:AttestedVariety fit?

Jeff Good

Dec 4, 2008, 2:33:06 PM
to gold-o...@googlegroups.com
Hello again,

I think it's fair to say (but Scott can correct me) that while, in
many respects, GOLD far outdoes other linguistic ontologies out there,
there are still lots of concepts to be put in at this point. GOLD is
particularly strong at grammatical categories. But, there are other
prominent areas where it is not well built out. (For example, in the
current version of GOLD, PhonologicalProperty has yet to be built out
at all.)

So, having to add GOLD concepts for a new project, particularly one
like LWT which was not among the core use cases envisioned early on by
GOLD, shouldn't be too surprising.

My sense here is that WordNet might look appropriate but is not really
what we want here because, despite some important similarities to the
vocabulary needed for LWT, WordNet is about modelling relationships
among words within a single language, whereas LWT has a comparative
focus. There are some cases in the LWT databases where statements are
made about how words within a language are related--and for these
WordNet concepts may be appropriate. But, once you step out of
language-internal vocabulary description, I think you'll be in
territory not yet covered by any ontology.

Hopefully Scott can chime in at some point about this. I may be pretty
wrong about the details.

As for AttestedVariety--yes, perhaps this is the same as doculect,
though I'm not completely sure. Based on the definitions in GOLD, it
seems to me AttestedVariety's core use is to distinguish between
languages which we have historical evidence existed as opposed to
those which we reconstruct on the basis of attested languages (e.g.,
German would be an AttestedVariety while Proto-Germanic would be
UnattestedVariety).

If that's the case, then doculect is somewhat different. It's a
variety of a language as described by some linguist. It could even be
a doculect of an unattested, but reconstructed, language. The issue
here is that I might say, for example, that the word for "cormorant"
in Dutch is "aalscholver" but someone else might say it's "vogel".
Now, we might happen to know that one word is a better choice than
another, but in terms of data coding, that's separate from the fact
that the two sources disagree. As linguists, we pretend to talk about
"Dutch" and "English", but what we are really talking about is "Dutch"
and "English" as described by another linguist. That distinction
strikes me as worth modelling because, often, there will be
disagreements about the properties of "Dutch" or "English" and, in
modelling the data, we certainly don't want to have to broker those
disagreements. Associating elements to doculects allows us to do this--
and then different linguists can assign different trust values to
different doculects as desired.
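
The disagreement scenario could look roughly like this in Turtle. It is a sketch under stated assumptions: the ex: namespace, both doculects and all property names are hypothetical stand-ins (the property names echo the terms from the revised RDF earlier in this thread), and no real sources are implied.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/lwt-sketch#> .   # hypothetical namespace

ex:Dutch rdfs:label "Dutch" .

# Two doculects of the same language, one per (made-up) source.
ex:doculectA rdf:type ex:Doculect ; ex:describesLanguoid ex:Dutch .
ex:doculectB rdf:type ex:Doculect ; ex:describesLanguoid ex:Dutch .

# Each source proposes a different word for the same concept.
ex:wordA rdfs:label "aalscholver" ;
    ex:inDoculect ex:doculectA ;
    ex:translatesConcept ex:cormorant .
ex:wordB rdfs:label "vogel" ;
    ex:inDoculect ex:doculectB ;
    ex:translatesConcept ex:cormorant .
```

Neither claim is attached to ex:Dutch directly, so both can coexist in one graph and a consumer can decide which doculect to trust.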

Jeff

Robert Forkel

Dec 4, 2008, 2:57:43 PM
to GOLD Ontology
ok. i think i do understand the issues a bit better now. but i look at
the data modeling from a different aspect (and btw i only think of
modelling of data for exchange). while there certainly may be
distinctions worth modelling, i think what may be lacking for GOLD are
examples of the usefulness of interoperability - even if this means
simplification.
and in particular for the LWT project i'd say one could start out
simple, and provide more accurate data models (possibly in addition)
later.
after having spent some time hanging around semantic web circles, i
came to the conclusion that eventually it's the application built on
top of the data, which will determine the data model.
e.g. there's been a complex big specification for geographical data
(GML) available for some time. but as soon as google came out with
google maps, it was pretty clear that you have to provide your data in
KML - even if that meant dumbing it down.
so before having a clearer idea about what these applications
exploiting the linguistic data may be, i'm not too willing to invest
energy trying to get new concepts into GOLD.
so basically what i'm looking for is a "good enough" data model to
enable some sort of data reuse.
best regards,
robert

Jeff Good

Dec 5, 2008, 9:53:21 AM
to gold-o...@googlegroups.com
Hello Robert (et al.),

I think it's excellent to have someone with your perspective on data
modeling contributing to discussions on GOLD, as we could definitely
use input from people who may want to develop tools based on it.

Regarding LWT and the complications I introduced, I think the issues
regarding "doculect" can be safely put aside for a project like yours
at this point, if you want to keep things simple. In the context of
the LWT data one is (at least overall) only dealing with one primary
doculect--the version of the language of the relevant LWT author.
Where doculect could become more important is in a future situation
where there is so much RDF out there that one finds many
contradictions in the description of a particular language. If such a
scenario arises, I think we would all be quite happy since it means
people really are using Semantic Web services.

I do think, however, it is important to use a different property for
the relationship between a word list "meaning" and a word in a
language than between a word and its "definition". I have been in a
number of discussions with linguists about this (including in the
context of LWT), and there really is consensus that these
relationships are (i) very different and (ii) need to be explicitly
marked as being different for linguistic purposes. Non-linguists may
be less concerned about this, but I think this is a case where the
domain experts (i.e., linguists) are not simply being overly specific
but, rather, are really drawing on their better understanding of the
issues.

In particular, one must keep in mind (and I borrow this formulation
from Michael Cysouw) that comparative word list creation and
dictionary creation are two "opposite" (but not strictly inverse)
endeavors. To make a dictionary one starts with a word in a given
language and then tries to associate it with concepts. To make a
wordlist, one starts with a concept and then tries to associate it with
a word. In simple cases, one will get the same mapping either way, but
in many, many cases one will not.

In the particular case of the LWT project, one actually finds both
cases. The "core" of the project is concept->word mappings, but
authors were allowed to optionally give additional word->concept
mappings for their specific languages when they felt the need to.

For example, in my LWT database, I have the following information for
LWT 1.22 "the mountain or hill":

"the mountain or hill" LWT meaning -> "kúnunu" Saramaccan word
"kúnunu" Saramaccan word -> "hill" general meaning

The purpose of such distinct mappings is to say that the best
counterpart in Saramaccan for the concept of "the mountain or hill" is
a word in Saramaccan whose primary sense is restricted to only "hill"
_not_ "mountain _or_ hill". (In this case, Saramaccan is not
dissimilar to English or German which also do not have words meaning
"mountain or hill" but, rather, separate words for each concept.)

The meaning->word mapping is especially valuable for comparative
linguistic purposes; the word->meaning mapping is the more usual notion of
"definition", which, obviously, has a range of uses. LWT itself allows
for both to be coded in a single database but only requires the
meaning->word mapping since that's the point of the project.
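
The two Saramaccan mappings could be coded along these lines. This is a minimal sketch: the ex: namespace and resource names are hypothetical, ex:translatesConcept stands in for the wordlist link (written from word to concept, as in the revised RDF earlier in this thread), and gold:hasMeaning is assumed for the ordinary definition.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gold: <http://purl.org/linguistics/gold#> .
@prefix ex:   <http://example.org/lwt-sketch#> .   # hypothetical namespace

ex:concept_1_22 rdfs:label "the mountain or hill" .   # LWT 1.22
ex:word_kununu  rdfs:label "kúnunu" .                 # Saramaccan word
ex:sense_hill   rdfs:label "hill" .                   # general meaning

# Core LWT mapping: the word is the best counterpart of the wordlist concept.
ex:word_kununu ex:translatesConcept ex:concept_1_22 .

# Optional author annotation: what the word means language-internally.
ex:word_kununu gold:hasMeaning ex:sense_hill .
```

Because the two links use different properties, a consumer can select only the wordlist mappings or only the definitions.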

Some researchers may choose to collapse these distinct mapping
relations into one "symmetric" relation. But, I think that, in such a
case, they can simply do two "queries" to get all pairs and then put
them all together in one pile, losing the relevant distinction. There
will be enough applications in linguistics where researchers will only
want the concept->word mappings, however, to justify coding them
in a distinct way.

That's my feeling about loanwords at least. I'm not really in a
position to speak generally for GOLD on the balance between getting
the model right and getting the application to work.

Jeff

Robert Forkel

Dec 5, 2008, 10:41:36 AM
to gold-o...@googlegroups.com
hi jeff,
regarding the problem with gold:hasMeaning: aren't the concerns you
raise met by the fact that the relation between word and meaning is a
many-to-many relation? and comparative studies are particularly
interested in the cases where the mapping is not 1-1 or injective? of
course one could also introduce more structure on the set of meanings
like inclusion.
best regards,
robert

Jeff Good

Dec 5, 2008, 11:43:55 AM
to gold-o...@googlegroups.com
Hi Robert,

Your question is a good one, and I think the right way to think about
this isn't so much that a word list concept->word relation isn't a
kind of "meaning" (which it definitely can be) but, rather, that it
is, in principle, a highly-curated kind of meaning with value added.

For example, when I say:
German "Mann" means the same thing as English "man"
I've made a reasonable translation, that will have value for a wide
range of purposes.

But, when I say:
German "Mann" is the best counterpart for the LWT wordlist concept THE
MAN
I'm actually making a stronger statement that says: If your intended
use for the data is to do comparisons of vocabulary items across
languages and you are, in particular, interested in words
corresponding to the concept "THE MAN", then for German, the word you
should look at is "Mann", and you should not consider, for example,
"Mensch" even though "Mensch" could be translated as "man" in some
cases.

I put "THE MAN" in capitals here to explicitly mark that we aren't
concerned with the English word "man" but, rather, some general
concept (say, "human male") for which we use a label based on English
as a convenient mnemonic device. We could have used a picture, as
well, for example, or a twenty-page encyclopedia entry if we wanted to
be very careful.

In other words, we can probably, to some extent, say that
"hasCounterpart" is a sub-property of "hasMeaning". That is, a
"hasCounterpart" relation between A and B implies a "hasMeaning"
relationship between A and B, but we can't go the other way around.
For certain purposes--in particular many of the purposes envisioned
for LWT by the editors--the "hasCounterpart" relationship is superior
to the "hasMeaning" relationship. If LWT's goals were not specifically
to exploit the "hasCounterpart" relationships in the database to come
to general conclusions about borrowability, then the point I'm raising
here would probably be considered an "interesting" detail not worth
implementing. But, since LWT has been specially curated specifically
to create a database of "hasCounterpart" relationships, it seems to me
that it's worth coding in the data. In particular, one needs to
distinguish between the core set of "hasCounterpart" relationships and
the optional additional "hasMeaning" relationships which one will find
as well. The former are the reason the project exists. The latter are
largely incidental, clarifying annotations.
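
The sub-property idea can be stated directly in RDFS. The sketch below is an illustration only: ex:hasCounterpart and the ex: resources are hypothetical, gold:hasMeaning is assumed to be the word-to-meaning property, and the direction (word as subject) is an assumption chosen so that both properties point the same way.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gold: <http://purl.org/linguistics/gold#> .
@prefix ex:   <http://example.org/lwt-sketch#> .   # hypothetical namespace

# Every counterpart link is also a meaning link, but not vice versa.
ex:hasCounterpart rdfs:subPropertyOf gold:hasMeaning .

# Curated wordlist link: the core LWT data.
ex:de_Mann ex:hasCounterpart ex:THE_MAN .

# Plain meaning link: an incidental annotation.
ex:de_Mensch gold:hasMeaning ex:THE_MAN .

# Under RDFS entailment, a reasoner may additionally derive
#   ex:de_Mann gold:hasMeaning ex:THE_MAN .
# Nothing lets it derive ex:de_Mensch ex:hasCounterpart ex:THE_MAN .
```

Anyone who wants the collapsed, "symmetric" view can work with gold:hasMeaning alone, while comparative work can restrict itself to ex:hasCounterpart.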

I hope this is clear,
Jeff

Robert Forkel

Dec 5, 2008, 12:02:12 PM
to gold-o...@googlegroups.com
ok. yes, i think i'm clear on the distinction now. and i do think that
distinct properties make it actually easier to distinguish a "best"
counterpart - i.e. the meaning or definition - than e.g. imposing an
ordering on the set of counterparts.
while this still puts me in the position of having to cook up some
terms on my own, i'll accept that as being part of pioneering work.
i'd still hope to find ideas for actual reuse of the LWT data - maybe
comparison to other wordlists or dictionaries - to help me understand
what level of interoperability we should go for.
best regards,
robert

S. Farrar

Dec 5, 2008, 7:58:15 PM
to GOLD Ontology
Hi All

I mentioned in my previous post that we're putting together a toolkit to
help migrate data to GOLD-aware RDF. We should have an update posted this
weekend. So far, I have implemented migration code for these formats:

-Praat (IGT only)
-Elan (IGT only)
-bibtex (for citing data)
-Leipzig glossed text

These formats are planned in the near future:

-phonetic feature geometries (in a text format)
-WordNet style entries

If there are any other formats that you think would be useful, please let
me know.

Scott

University of Washington
Department of Linguistics
B-201 Padelford Hall
Box 354340
Seattle, WA 98195-4340
Phone: (206) 616 5728
Fax: (206) 685-7978
webpage: http://faculty.washington.edu/farrar

Jeff Good

Dec 5, 2008, 8:02:41 PM
to gold-o...@googlegroups.com
Robert,

> terms on my own, i'll accept that as being part of pioneering work.
> i'd still hope to find ideas for actual reuse of the LWT data - maybe
> comparison to other wordlists or dictionaries - to help me understand
> what level of interoperability we should go for.

This will be part of the LEGO project--there's a plan to start work on
this early next year, if you can wait. We want to put a large number
of wordlists and dictionaries into a broadly interoperable format.
Details to be determined...

Jeff
