Call for Property Definitions

Chris

Jun 4, 2009, 10:17:49 PM
to sswl.linguistics
Dear All,

We have worked hard to get a core set of word order properties in place,
inspired by Greenberg's original article (which focused mostly on word
order).

We would now like to start populating the database with more properties.
I have sent the e-mail below to some people. If anybody has suggestions
on people to ask, or on which properties to work on next, please write
to the group.

Any and all suggestions welcome!!

Chris

----------------------------------------------------------------------------------------------------------

Dear XX,

I would like to invite you to take a look at the SSWL database
of the syntactic structures of the world's languages:

http://sswl.railsplayground.net/

SSWL is a searchable database that allows users to discover which
properties characterize a language (morphological, syntactic, and
semantic), as well as how these properties relate across languages.

We would like to have as much data as possible entered into the
database, on as many different grammatical properties as possible.
I was wondering if you would like to write a set of properties on YY
(YY = some grammatical topic/area/construction).

If so, please read the Guidelines for Properties Authors at:

http://sswl.railsplayground.net/documents/Guidelines_for_Property_Authors.pdf

Then, just sign up (through the login button in the navigation bar at
the top of the home page).


If you have any comments on the database, please let me know.


Chris

Robert Forkel

Jun 15, 2009, 1:08:38 PM
to sswl.linguistics
hi chris,
i'm the developer behind http://wals.info/
and i'm trying to figure out how sswl and wals could fit together or
complement each other.
while i'm not quite sure about this now, i already have some ideas
about how to integrate data from both sources: http://linkeddata.org/
i.e. i think we should work towards a simple ontology to use together
with rdf to make our data mutually interoperable.
i've looked at GOLD (http://www.linguistics-ontology.org/) before, but
it looks too heavy-weight. maybe we could come up with something
simpler, and see whether we can gain something from combining our
data?
what do you think about this? do you already have other plans to make
the sswl data available?
regards,
robert

Chris

Jun 15, 2009, 2:52:50 PM
to sswl.linguistics
Dear Robert,

Thanks for the letter.

Our goal is maximal interoperability.

Also, I am in favor of collaborations with WALS
(which I think is a fantastic tool with great data),
and other databases.

About the technical specifics, I need to discuss that with Dennis
Shasha first,
who is the architect of SSWL. I will get back to you in a few days.

Chris


Chris

Jun 16, 2009, 9:51:41 AM
to sswl.linguistics
Dear Robert,

We are very interested in collaborating.
Your system, WALS, is very nicely designed and has invaluable data.

We expect SSWL to be very dynamic.
New languages and properties will be entered all the time.
So, in order for us to collaborate, we would have to publish
dumps of our data in a form that you can access.

Fortunately, we are already performing dumps every few minutes
to support our browsing pages (see "languages" and "properties"
in the navigation bar). Changing that output to any format you like
would be easy to do (unless you feel like scraping it:).

We note that it is also possible to dump the results of
searches, using the Download Results function that appears at
the top of search results (to the right of Map It). This function
does not however dump the entire contents of the database.

Our data format is very simple.
We have three main tables in our mysql database:

properties (defines properties):

Field           Type          Collation          Null  Default  Extra
id              int(11)                          No             auto_increment
property        varchar(255)  latin1_swedish_ci  Yes   NULL
description     text          latin1_swedish_ci  Yes   NULL
author          varchar(255)  latin1_swedish_ci  Yes   NULL
date            date                             Yes   NULL
time            datetime                         Yes   NULL
created_at      datetime                         Yes   NULL
updated_at      datetime                         Yes   NULL
comments        text          latin1_swedish_ci  Yes   NULL

languages (records the property values of each language):

Field           Type          Collation          Null  Default  Extra
id              int(11)                          No             auto_increment
language        varchar(255)  latin1_swedish_ci  Yes   NULL
property        varchar(255)  latin1_swedish_ci  Yes   NULL
value           varchar(255)  latin1_swedish_ci  Yes   NULL
author          varchar(255)  latin1_swedish_ci  Yes   NULL
date            date                             Yes   NULL
time            datetime                         Yes   NULL
created_at      datetime                         Yes   NULL
updated_at      datetime                         Yes   NULL
comments        text          latin1_swedish_ci  Yes   NULL

and examples (examples for each language illustrating various property-values):

Field           Type          Collation          Null  Default  Extra
id              int(11)                          No             auto_increment
language        varchar(255)  latin1_swedish_ci  Yes   NULL
etype           varchar(255)  latin1_swedish_ci  Yes   NULL
sentenceNumber  varchar(255)  latin1_swedish_ci  Yes   NULL
property        varchar(255)  latin1_swedish_ci  Yes   NULL
value           varchar(511)  latin1_swedish_ci  Yes   NULL
author          varchar(255)  latin1_swedish_ci  Yes   NULL
date            date                             Yes   NULL
time            datetime                         Yes   NULL
created_at      datetime                         Yes   NULL
updated_at      datetime                         Yes   NULL

In every case, we use the schema technique "property-as-value".
Thus if someone adds a new property P into properties
and then wants to apply that to language L, that person would
simply insert:

L, P, Yes into the languages table (there are other fields, but
these are the essential ones).

So, creating new properties does not affect the schema at all.
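
A minimal sketch of the "property-as-value" idea, assuming an in-memory SQLite database in place of the MySQL one and only the essential columns of the two tables. The helper names (add_property, set_value, columns) are made up for illustration and are not part of SSWL.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database; only the essential
# columns of the properties and languages tables are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE properties (id INTEGER PRIMARY KEY, property TEXT, description TEXT)"
)
conn.execute(
    "CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT, property TEXT, value TEXT)"
)


def add_property(name, description):
    """Defining a property is a plain row insert; no ALTER TABLE is involved."""
    conn.execute(
        "INSERT INTO properties (property, description) VALUES (?, ?)",
        (name, description),
    )


def set_value(language, prop, value):
    """Record the row (L, P, value) in the languages table."""
    if value not in ("Yes", "No", "NA"):
        raise ValueError("value must be Yes, No or NA")
    conn.execute(
        "INSERT INTO languages (language, property, value) VALUES (?, ?, ?)",
        (language, prop, value),
    )


def columns(table):
    """Return the column names of a table, to show the schema is unchanged."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


schema_before = columns("languages")
add_property("P", "a hypothetical new property")
set_value("L", "P", "Yes")
schema_after = columns("languages")
rows = conn.execute("SELECT language, property, value FROM languages").fetchall()
print(rows, schema_before == schema_after)
```

The column list of the languages table is identical before and after the new property is added, which is the point of storing the property name as a value rather than as a column.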

You propose an rdf share format.
That would be fine as we said, though we could also give
you the three relational tables.

It will be up to the linguists to see how the semantics
of our properties (that take values yes, no, not applicable)
map to the WALS properties.

Dennis Shasha and
Chris Collins

Robert Forkel

Jun 16, 2009, 12:42:20 PM
to Chris, sswl.linguistics
> We expect SSWL to be very dynamic.
> New languages and properties will be entered all the time.
> So, in order for us to collaborate, we would have to publish
> dumps of our data in a form that you can access.

yes, that's what i understood about sswl. so i could see sswl as a
workbench, where values for a certain feature/property could be
assembled until it reaches a critical mass to be included in wals; so
i'd think sswl could provide the collaboration/dynamic aspect, while
wals is more on the publishing/static side of the spectrum. but i
can't really talk about this, as i'm not one of the editors; it's only
a motivation for me to look for interoperability.

the reason why i think rdf and linkeddata would be a good way to
ensure interoperability is because i could well imagine a scenario
where someone wants to work with data from both sswl and wals. to do
this, one would simply have to pull the desired data as rdf triples
from sswl and wals, put it into a triple store - e.g. the talis
platform [1] - and start analysing it, e.g. by running sparql queries
against it.

[1] http://www.talis.com/platform/
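
As a rough illustration of what "pulling the data as rdf triples" could look like, the sketch below serialises (language, property, value) rows as N-Triples lines and pools the output of two sources. The example.org namespaces, the row contents and the function names are all assumptions; neither project has published such URIs, and a real setup would load the lines into a triple store rather than a Python list.

```python
from urllib.parse import quote

# Hypothetical namespaces: neither project publishes these URIs.
SSWL = "http://example.org/sswl/"
WALS = "http://example.org/wals/"


def iri(base, kind, name):
    """Build an IRI such as <http://example.org/sswl/language/L>."""
    return f"<{base}{kind}/{quote(name, safe='')}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_ntriples(base, rows):
    """Turn (language, property, value) rows into N-Triples lines."""
    return [
        f"{iri(base, 'language', lang)} {iri(base, 'property', prop)} {literal(value)} ."
        for lang, prop, value in rows
    ]


# Made-up rows standing in for dumps from the two databases.
sswl_rows = [("L", "P", "Yes")]
wals_rows = [("L", "feature 112", "Negative affix")]

# A pooled list of lines is the simplest possible stand-in for a triple store.
store = to_ntriples(SSWL, sswl_rows) + to_ntriples(WALS, wals_rows)
for line in store:
    print(line)
```

Note that the two sources still name language L with different IRIs here; stating that they denote the same language would need an extra triple, which this sketch leaves out.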

and yes, to make the data available without putting too much load on
the server, providing regular dumps would be ok. wals is basically
read-only, and sits behind a squid cache, so things are easier for
us.

> Fortunately, we are already performing dumps every few minutes
> to support our browsing pages (see "languages" and "properties"
> in the navigation bar). Changing that output to any format you like
> would
> be easy to do (unless you feel like scraping it:).

i'm not quite sure about the format myself yet. something that is
already sort of established in the field would be nice. but so far, i
haven't found such a format. so we might end up establishing a
standard of our own.

speaking about the data model: for the next edition of wals online, we
will add examples. i wonder, how you model these in sswl. i've seen
that you have these triplets (phrase, gloss, translation). do you also
have some sort of alignment between the parts of phrase and gloss? in
our data i've also found cases of multiple translations for the same
phrase. does your data model allow this?

another thing i noticed: the URLs for your language pages are formed
with the name of the language. are you expecting to keep these URLs
stable over time? with wals we had to change language names quite a
bit, so abstracting the URL from the name may be useful.

>
> Our data format is very simple.
> We have three main tables in our mysql database:
>
> In every case, we use the schema technique "property-as-value".
> Thus if someone adds a new property P into properties
> and then wants to apply that to language L, that person would
> simply insert:
>
> L, P, Yes into the languages table (there are other fields, but
> these are the essential one).

i see. so the properties are basically boolean (with the addition of
"not applicable"). so each single value of a feature (in wals speak)
would be a property in sswl, e.g. the values of
http://wals.info/feature/112 would translate to properties like "has
Negative affix for negative morphemes", ... right?

if that's the case, do you consider allowing grouping of properties?
with wals we still stick to the values of the 2005 edition, which
seems sometimes too coarse. e.g. for feature "consonant inventories"
(http://wals.info/feature/1) it would seem more appropriate to simply
store the number of consonants, and leave the decision about a
grouping ("small", "moderately small", ...) to some output logic. the
case of numeric values for properties doesn't seem to fit well into
your model, though.
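
A small sketch of the "store the number, group at output time" idea. The cut-offs, the labels and the sample rows are illustrative assumptions only (they are not the WALS class boundaries), and the numeric value column is exactly the extension that the yes/no/not-applicable model does not currently have.

```python
from bisect import bisect_right

# Illustrative cut-offs and labels only; these are NOT the WALS boundaries.
CUTOFFS = [15, 26]
LABELS = ["small", "average", "large"]


def group(count):
    """Map a raw consonant count to a display label at output time."""
    return LABELS[bisect_right(CUTOFFS, count)]


# Hypothetical rows: the value column holds a number instead of Yes/No/NA.
ROWS = [
    ("A", "number of consonants", 4),
    ("B", "number of consonants", 22),
    ("C", "number of consonants", 40),
]


def fewer_than(rows, prop, limit):
    """Languages whose numeric value for prop is below limit."""
    return [lang for lang, p, value in rows if p == prop and value < limit]


for lang, prop, value in ROWS:
    print(lang, value, group(value))
print(fewer_than(ROWS, "number of consonants", 5))
```

Because the raw count is kept, the grouping can be changed later by editing CUTOFFS and LABELS without touching the stored data.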

> So, creating new properties does not affect the schema at all.

that seems a good choice. i've done this for the reference database
(which is also part of wals).

> You propose an rdf share format.
> That would be fine as we said, though we could also give
> you the three relational tables.
>
> It will be up to the linguists to see how the semantics
> of our properties (that take values yes, no, not applicable)
> map to the Wals properties.

right. it will be interesting to see the feedback to your call.

anyway, it's interesting to see your project evolve; and it's certainly
good to have people around dealing with the same kind of data :)
best regards,
robert

Dennis Shasha

Jun 16, 2009, 12:56:05 PM
to cc...@nyu.edu, xrot...@googlemail.com, sswllin...@googlegroups.com
Dear Robert,
1. We'll let you drive as far as the format for the rdf is concerned
(but we probably won't deliver it to you before the fall
because of summer jobs and stuff).
You know our table schemas so anything you like that is consistent
with those is good.

2. You ask:

==========


speaking about the data model: for the next edition of wals online, we
will add examples. i wonder, how you model these in sswl. i've seen
that you have these triplets (phrase, gloss, translation). do you also
have some sort of alignment between the parts of phrase and gloss? in
our data i've also found cases of multiple translations for the same
phrase. does your data model allow this?

=========

Each example for language L has a sentence number (also known
as example number).
In the examples table there are multiple rows with the same sentence number,
one for phrase, one for gloss, one for translation, and then zero or
more with property-values that the example illustrates.
So, it's easy to add new property-values to examples.
A property alternatetranslation would be entirely possible (the
data model doesn't care).
Phrase and gloss are aligned because there is a gloss element
for every phrase.
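
The row layout described above can be sketched as follows. The thread does not say which column distinguishes the phrase, gloss and translation rows, so the use of etype for that purpose, the label strings and the placeholder data are all assumptions.

```python
from collections import defaultdict

# (language, etype, sentenceNumber, property, value)
# Assumption: etype tells phrase/gloss/translation rows apart from
# property-value rows. All values are placeholders.
ROWS = [
    ("L", "phrase", "1", None, "w1 w2"),
    ("L", "gloss", "1", None, "g1 g2"),
    ("L", "translation", "1", None, "free translation"),
    ("L", "property", "1", "P", "Yes"),
    ("L", "property", "1", "alternatetranslation", "another translation"),
]


def assemble(rows):
    """Group rows sharing (language, sentence number) into one example."""
    examples = defaultdict(lambda: {"properties": {}})
    for language, etype, number, prop, value in rows:
        example = examples[(language, number)]
        if etype == "property":
            example["properties"][prop] = value
        else:
            example[etype] = value
    return dict(examples)


example = assemble(ROWS)[("L", "1")]
print(example)
```

Adding a further property-value to an example is just one more row with the same sentence number, which is why an alternatetranslation property needs no change to the table.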

3.
=============


another thing i noticed: the URLs for your language pages are formed
with the name of the language. are you expecting to keep these URLs
stable over time? with wals we had to change language names quite a
bit, so abstracting the URL from the name may be useful.

============

We are generating the pages on the fly every few minutes, so the worst
thing that would happen if we changed the name of a language is that
there would still be a page with the old language name.
This doesn't seem to me to be a problem.

4.
=========


>
> Our data format is very simple.
> We have three main tables in our mysql database:
>
> In every case, we use the schema technique "property-as-value".
> Thus if someone adds a new property P into properties
> and then wants to apply that to language L, that person would
> simply insert:
>
> L, P, Yes into the languages table (there are other fields, but
> these are the essential ones).

i see. so the properties are basically boolean (with the addition of
"not applicable"). so each single value of a feature (in wals speak)
would be a property in sswl, e.g. the values of
http://wals.info/feature/112 would translate to properties like "has
Negative affix for negative morphemes", ... right?

============

Yes, I think that is correct, but Chris should confirm.

5.
==============


if that's the case, do you consider allowing grouping of properties?
with wals we still stick to the values of the 2005 edition, which
seems sometimes too coarse. e.g. for feature "consonant inventories"
(http://wals.info/feature/1) it would seem more appropriate to simply
store the number of consonants, and leave the decision about a
grouping ("small", "moderately small", ...) to some output logic. the
case of numeric values for properties doesn't seem to fit well into
your model, though.

===============

We don't know of a good property hierarchy. If you have one, we're interested.


6.
===========


anyway, it's interesting to see your project evolve; and it's certainly
good to have people around dealing with the same kind of data :)
best regards,
robert

=========

Agreed. Warm Regards, Dennis

Chris

Jun 16, 2009, 1:30:22 PM
to sswl.linguistics
Dear Robert,

Here are just a few comments to add to what Dennis has already
said:

> yes, that's what i understood about sswl. so i could see sswl as a
> workbench, where values for a certain feature/property could be
> assembled until it reaches a critical mass to be included in wals; so
> i'd think sswl could provide the collaboration/dynamic aspect, while
> wals is more on the publishing/static side of the spectrum. but i

This is exactly how I have seen things. There is a continuum,
and WALS is on one side of it (more toward a publication) and
we are on the other (more toward Wikipedia). This division corresponds
naturally to the fact that SSWL is language-expert oriented, whereas
WALS is property-author oriented (as far as who is inputting data).
So from that point of view, the two projects complement each other
very nicely.

> the reason why i think rdf and linkeddata would be a good way to
> ensure interoperability is because i could well imagine a scenario
> where someone wants to work with data from both sswl and wals. to do
> this, one would simply have to pull the desired data as rdf triples
> from sswl and wals, put it into a triple store - e.g. the talis
> platform [1] - and start analysing it, e.g. by running sparql queries
> against it.

My recommendation is, if you do this, to use the SSWL query interface,
and allow it to query data in the rdf format. The SSWL query interface
is very powerful, basically allowing naive users (i.e., linguists) to
make full use of the relational database structure.

> speaking about the data model: for the next edition of wals online, we
> will add examples. i wonder, how you model these in sswl. i've seen
> that you have these triplets (phrase, gloss, translation). do you also
> have some sort of alignment between the parts of phrase and gloss? in
> our data i've also found cases of multiple translations for the same
> phrase. does your data model allow this?

Examples and glosses are aligned in the sense that each morpheme/word
in the sentence corresponds to a morpheme/word in the gloss,
as in Leipzig.

We are now trying to figure out how to present them on the page
as aligned. If you have any suggestions, please let us know.
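
One simple way to present a phrase and its gloss as aligned, at least in plain text, is to pad each word and its gloss to a common column width. The sketch below assumes whitespace-separated tokens and a strict one-to-one correspondence between them; the function name is made up, and the Indonesian sentence is used purely as sample data.

```python
def align(phrase, gloss):
    """Return the phrase and gloss as two lines with matching columns."""
    words, glosses = phrase.split(), gloss.split()
    if len(words) != len(glosses):
        raise ValueError("phrase and gloss must have the same number of tokens")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    top = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    bottom = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return top + "\n" + bottom


print(align("Mereka di Jakarta sekarang.", "they in Jakarta now"))
```

For an HTML page the same pairing could instead be emitted as two table rows with one cell per token, so that the browser does the column alignment.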

Multiple translations could be a new property, as Dennis suggests,
or perhaps one could just add them to the already existing translation
line, like (Alternative: ....). This would allow the existing search
interface to work directly on them when searching for elements of the
translation.

>
> another thing i noticed: the URLs for your language pages are formed
> with the name of the language. are you expecting to keep these URLs
> stable over time? with wals we had to change language names quite a
> bit, so abstracting the URL from the name may be useful.
>

We could change the URL names to ISO 639-3, but one thing remains
to work out. We are planning to admit lots of dialect data, and
ISO 639-3 does not encompass that. Do you know of a standard we could use?
Other than that problem, ISO 639-3 would be best for page names,
and as Dennis said, it is trivial to change.

>
> i see. so the properties are basically boolean (with the addition of
> "not applicable"). so each single value of a feature (in wals speak)
> would be a property in sswl, e.g. the values of http://wals.info/feature/112
> would translate to properties like "has Negative affix for negative morphemes", ... right?

I would put it like this:

Sentential negation is a negative affix: Yes, No, NA

(I would define sentential negation as in Pullum and Huddleston's
grammar).

For now, all properties will be binary, since that forces the
properties to be fine grained (and hence more likely to capture
significant linguistic variation), but this is open for debate.
If there is a groundswell against binary properties, we will change it.

>
> if that's the case, do you consider allowing grouping of properties?
> with wals we still stick to the values of the 2005 edition, which
> seems sometimes too coarse. e.g. for feature "consonant inventories"
> (http://wals.info/feature/1) it would seem more appropriate to simply
> store the number of consonants, and leave the decision about a
> grouping ("small", "moderately small", ...) to some output logic. the
> case of numeric values for properties doesn't seem to fit well into
> your model, though.

I believe the best way to handle this problem is to let people create
their own groupings, by allowing people to search for and browse
properties. Both of these features will be coming to SSWL in the Fall.

For example, suppose that I want to see all word order properties
involving verbs. I could say:

Call up word order properties involving verbs.

If I create a property hierarchy, it might not be what you like or
what you have in mind (especially for people from different backgrounds).

>
> > So, creating new properties does not affect the schema at all.
>
> that seems a good choice. i've done this for the reference database
> (which is also part of wals).

Is there a standard reference database set of fields?
We are now looking into that.

Chris

Robert Forkel

Jun 17, 2009, 2:34:12 AM
to Chris, sswl.linguistics
On Tue, Jun 16, 2009 at 7:30 PM, Chris <cc...@nyu.edu> wrote:
>
> Dear Robert,
>
> Here are just a few comments to add to what Dennis has already
> said:
>
> My recommendation is, if you do this, to use the SSWL query interface,
> and allow it to query data in the rdf format. The SSWL query interface
> is very powerful, basically allowing naive users (i.e., linguists) to
> make full use of the relational database structure.

i just played with the query interface and i'm curious: why did you
limit the cross-product search for properties to only two? with wals
we did the same thing, mainly for user interface reasons; but i always
felt it was limiting.

> Examples and glosses are aligned in the sense that each morpheme/word
> in the sentence corresponds to a morpheme/word in the gloss,
> as in Leipzig.
> We are now trying to figure out how to present them on the page
> as aligned. If you have any suggestions, please let us know.

ah i see. i've heard about the leipzig glossing rules before, but have
only looked them up now - i'm not a linguist myself.
the example data i have at hand is in HTML, so the morphemes come as
content of table cells. the examples do also contain markup like
"underline", "bold", "italics", ... so i'll probably go for storing
phrase and gloss as HTML snippets - which will also solve the display
problem: i'll just plug them back into a table.

> Multiple translations could be a new property, as Dennis suggests,
> or perhaps one could just add them to the already existing
> translation
> line, like (Alternative: ....). This would allow the existing search
> interface
> to work directly on them as far as search for elements of the
> translation.

i guess, i'll go for the second option you outlined, although (or
because) i have examples with up to 3 alternative translations.

> We could change the URL names to ISO 639-3, but one thing remains
> to work out. We are planning to admit lots of dialect data, and ISO
> 639-3
> does not encompass that. Do you know of a standard we could use?
> Other than that problem, ISO 639-3 would be best for page names,
> and as Dennis said, it is trivial to change.

with wals we had the same problem: not all of the languages in wals
map one-to-one to iso languages. so there are wals codes now to
identify "our" languages. although this situation is not ideal, it is
one of the problems that may get solved by semantic web mechanisms: if
two languages are the same, you can just say so using "owl:sameAs"
[1]. but clearly, using the same identifier/URL right from the start
would be a better way. that's why we are working towards a language
catalog with a much lower barrier to entry than iso - basically just a
place on the web, where you can mint URLs for
languages/dialects/families. we'd leave out the whole
classification/genealogy thing to not get into these problems and
conflicts. so in case this will happen, i'll let you know :)

my question about the URLs was aiming at something different, though.
while sswl is supposed to be very dynamic, you said it should be a bit
like wikipedia. now wikipedia is something that definitely must care
about stability of URLs, because so many people link to it. if you
expect people to link to individual language pages in sswl, i'd
suggest having a layer of abstraction between language name and URL.

[1] http://en.wikipedia.org/wiki/Web_Ontology_Language
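
A sketch of the "layer of abstraction between language name and URL" suggested above: pages are addressed by a permanent identifier, and the display name is only looked up through it. The class, the identifier format and the URL path are hypothetical and do not describe either site.

```python
class LanguageRegistry:
    """Maps permanent identifiers to language names that may change."""

    def __init__(self):
        self._names = {}  # stable id -> current display name
        self._counter = 0

    def register(self, name):
        """Mint a new identifier for a language and return it."""
        self._counter += 1
        lang_id = f"lang{self._counter:04d}"  # hypothetical id format
        self._names[lang_id] = name
        return lang_id

    def rename(self, lang_id, new_name):
        """Change the display name; the identifier stays the same."""
        if lang_id not in self._names:
            raise KeyError(lang_id)
        self._names[lang_id] = new_name

    def name(self, lang_id):
        return self._names[lang_id]

    def url(self, lang_id):
        """The path layout is an assumption, not the actual URL scheme."""
        if lang_id not in self._names:
            raise KeyError(lang_id)
        return f"/languages/{lang_id}"


registry = LanguageRegistry()
lang_id = registry.register("Old Name")
url_before = registry.url(lang_id)
registry.rename(lang_id, "New Name")
url_after = registry.url(lang_id)
print(url_before, url_after, registry.name(lang_id))
```

Links made before the rename keep working because the URL never contained the name in the first place.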

> For now, all properties will be binary, since that forces the
> properties
> to be fine grained (and hence more likely to capture significant
> linguistic
> variation), but this is open for debate. If there is a groundswell
> against
> binary properties, we will change it.

i understand. and don't consider my opinion in this regard - as i
said, i'm no linguist myself.
>
> I believe the best way to handle this problem is to let people create
> their own groupings, by allowing people to search for and browse
> properties.
> Both of these features will be coming to SSWL in the Fall.

ah ok. yes, that would make sense. we have toyed with the idea of
allowing query functionality like this in wals - picking single
feature values from different features and comparing the results, but
so far i couldn't think of a compelling user interface to visualize
the results.

> For example, suppose that I want to see all word order properties,
> involving verbs, I could say:
>
> Call up word order properties involving verbs.
>
> If I create a property hierarchy, it might not be what you like or
> what
> you have in mind (especially for people from different backgrounds).

you're right. i didn't think of a real hierarchy, though; it's really
more about types of values, like for example the number of consonants
in a language, where properties like "number of consonants is 1",
"number of consonants is 2", ... would feel awkward. but of course,
once you have typed values, you'd need support for queries like
"number of consonants < 5" in your search interface.

> Is there a standard reference database set of fields?
> We are now looking into that.

there are a lot of library standards for this kind of data: MODS,
MARC, .... for wals we reused a database [2] i already built for
Living Reviews [3] which is loosely based on the fields in BibTeX; but
today i'd try to get away with an existing product like RefBase [4].

again, our big plan would be to let the wals reference database become
part of a bigger linguistic reference database, linked to this language
catalog i was speaking of. but these plans are not very concrete yet.

best regards,
robert

[2] https://dev.livingreviews.org/projects/epubtk/wiki/RefDB
[3] http://www.livingreviews.org/
[4] http://www.refbase.net/
Chris

Jun 17, 2009, 9:51:48 AM
to sswl.linguistics
> i just played with the query interface and i'm curious: why did you
> limit the cross-product search for properties to only two? with wals
> we did the same thing, mainly for user interface reasons; but i always
> felt it was limiting.

Tradition -- that is what Greenberg did. "Cross" (and the
corresponding
notion in WALS) models his tetrachoric table. I agree with you,
3-cross and 4-cross would be very useful.
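
A sketch of how the two-property cross generalises to three or more: count the languages for every combination of Yes/No values over any number of properties. The data and the names are hypothetical, and languages lacking a value (or carrying a value outside the listed ones, such as NA) are simply left out of the table.

```python
from collections import Counter
from itertools import product

# Hypothetical data: language -> {property: value}
DATA = {
    "A": {"P1": "Yes", "P2": "Yes", "P3": "No"},
    "B": {"P1": "Yes", "P2": "No", "P3": "No"},
    "C": {"P1": "No", "P2": "No", "P3": "Yes"},
    "D": {"P1": "Yes", "P2": "Yes", "P3": "Yes"},
}


def cross(data, properties, values=("Yes", "No")):
    """Count languages per combination of values; 2 properties give a 2x2 table."""
    counts = Counter()
    for props in data.values():
        if all(p in props for p in properties):
            counts[tuple(props[p] for p in properties)] += 1
    # Emit every cell, including empty ones, in a fixed order.
    return {combo: counts[combo] for combo in product(values, repeat=len(properties))}


two_way = cross(DATA, ["P1", "P2"])
three_way = cross(DATA, ["P1", "P2", "P3"])
for combo, n in three_way.items():
    print(combo, n)
```

With n properties the table has 2 to the power n cells, so the hard part of 3-cross and 4-cross is likely to be the presentation rather than the counting.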