Re: [pedantic-web] Link quality in other issues in Our Airports dataset

Rob Styles

Jul 28, 2010, 3:38:26 PM
to pedant...@googlegroups.com, datain...@googlegroups.com
Richard,

Thanks for the suggestions. The airports data is not actively maintained right now. It was produced from a dump from ourairports.com, but the structure of those files kept changing, and keeping up was taking too much time.

That doesn't excuse the poor handling of null values or the incorrect property mapping.

I'll keep these points in mind if I come back to it.

rob


On Wed, Jul 28, 2010 at 8:15 PM, Richard Cyganiak <ric...@cyganiak.de> wrote:
Hi,

(cc pedantic-web)

To the maintainer of the Our Airports dataset at <http://airports.dataincubator.org/>, with SPARQL endpoint at <http://api.talis.com/stores/airports>: I have a number of suggestions for improvement, mostly related to link quality.

The following query shows that ten different regions are all declared as being owl:sameAs some "null item":

   PREFIX owl: <http://www.w3.org/2002/07/owl#>
   SELECT * WHERE {
       ?s owl:sameAs <http://api.talis.com/stores/airports/items/null> .
   }

The problem seems to be related to non-ASCII Unicode characters: many of the names of the subject regions include the characters “ž”, “ı”, or “Ş”. The object in these triples should be replaced with real identifiers for the regions. Hand-crafted corrected triples below:

<http://airports.dataincubator.org/regions/MD-SV>
  owl:sameAs <http://dbpedia.org/resource/%C5%9Etefan_Vod%C4%83_District> .
<http://airports.dataincubator.org/regions/SI-179>
  owl:sameAs <http://dbpedia.org/resource/Sodra%C5%BEica> .
<http://airports.dataincubator.org/regions/TR-63>
  owl:sameAs <http://dbpedia.org/resource/%C5%9Eanl%C4%B1urfa> .
<http://airports.dataincubator.org/regions/ME-15>
  owl:sameAs <http://dbpedia.org/resource/Plu%C5%BEine> .
<http://airports.dataincubator.org/regions/SI-087>
  owl:sameAs <http://dbpedia.org/resource/Ormo%C5%BE> .
<http://airports.dataincubator.org/regions/TR-73>
  owl:sameAs <http://dbpedia.org/resource/%C5%9E%C4%B1rnak> .
<http://airports.dataincubator.org/regions/SI-176>
  owl:sameAs <http://dbpedia.org/resource/Razkri%C5%BEje> .
<http://airports.dataincubator.org/regions/LY-SS>
  owl:sameAs <http://dbpedia.org/resource/Sabratha_Wa_Surman_District> .
<http://airports.dataincubator.org/regions/SI-111>
  owl:sameAs <http://dbpedia.org/resource/Se%C5%BEana> .
<http://airports.dataincubator.org/regions/ME-17>
  owl:sameAs <http://dbpedia.org/resource/Ro%C5%BEaje> .
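
One possible cleanup, sketched as a SPARQL 1.1 Update request. It assumes the store accepts SPARQL Update, which nothing in this thread confirms. The first operation drops every link to the "null item"; the second re-adds corrected links, shown here for only two of the ten regions listed above.

   PREFIX owl: <http://www.w3.org/2002/07/owl#>

   # Remove every owl:sameAs link that points at the placeholder "null" item.
   DELETE WHERE {
       ?s owl:sameAs <http://api.talis.com/stores/airports/items/null> .
   } ;

   # Re-add corrected links; the remaining regions follow the same pattern.
   INSERT DATA {
       <http://airports.dataincubator.org/regions/TR-63>
           owl:sameAs <http://dbpedia.org/resource/%C5%9Eanl%C4%B1urfa> .
       <http://airports.dataincubator.org/regions/SI-087>
           owl:sameAs <http://dbpedia.org/resource/Ormo%C5%BE> .
   }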

Running the following query shows that a number of airports are owl:sameAs linked to web pages (mostly to non-English Wikipedia pages):

   PREFIX owl: <http://www.w3.org/2002/07/owl#>
   SELECT *
   WHERE {
       ?s owl:sameAs ?o .
       FILTER (!REGEX(STR(?o), "http://dbpedia.org/resource/"))
       FILTER (?o != <http://api.talis.com/stores/airports/items/null>)
   }

The link predicate should be replaced with a property that is appropriate for linking to web pages, such as rdfs:seeAlso, or perhaps better foaf:page.
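
A minimal sketch of that replacement as a SPARQL 1.1 Update request. It assumes the store supports SPARQL Update and that every remaining non-DBpedia target really is a web page, which is worth checking with the SELECT query above before running anything. Two details are additions rather than part of the original query: STRSTARTS is used as a plain prefix test in place of the regular expression, and the filter on ?s skips the statement whose subject is <http://dbpedia.org/property/iata>, which the SELECT also matches but which is not a page link.

   PREFIX owl:  <http://www.w3.org/2002/07/owl#>
   PREFIX foaf: <http://xmlns.com/foaf/0.1/>

   # Swap owl:sameAs for foaf:page on links that do not target DBpedia resources.
   DELETE { ?s owl:sameAs ?o }
   INSERT { ?s foaf:page ?o }
   WHERE {
       ?s owl:sameAs ?o .
       FILTER (!STRSTARTS(STR(?o), "http://dbpedia.org/resource/"))
       FILTER (?o != <http://api.talis.com/stores/airports/items/null>)
       FILTER (?s != <http://dbpedia.org/property/iata>)
   }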

The dataset contains the following statement:

   <http://dbpedia.org/property/iata>
       owl:sameAs <http://www.daml.org/2001/10/html/airport-ont#iataCode>

This statement equates two properties. owl:sameAs should not be used to equate classes or properties. The right property to use here would be owl:equivalentProperty.
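
As a sketch, again assuming the store accepts SPARQL 1.1 Update, the statement could be swapped out like this:

   PREFIX owl: <http://www.w3.org/2002/07/owl#>

   # Drop the owl:sameAs statement between the two properties.
   DELETE DATA {
       <http://dbpedia.org/property/iata>
           owl:sameAs <http://www.daml.org/2001/10/html/airport-ont#iataCode> .
   } ;

   # State the same correspondence with owl:equivalentProperty instead.
   INSERT DATA {
       <http://dbpedia.org/property/iata>
           owl:equivalentProperty <http://www.daml.org/2001/10/html/airport-ont#iataCode> .
   }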

I also couldn't find any information about the license, creator or maintainer of this dataset. This information could be added at <http://airports.dataincubator.org/>, along with a link to the SPARQL endpoint (perhaps via void:sparqlEndpoint).
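
A rough sketch of what such a description might look like, written as a SPARQL 1.1 Update request so that it could be loaded into the store (the same triples could equally be published as a document at the dataset's address). Using the homepage URI as the dataset identifier is an assumption, and the license and creator URIs are placeholders only: the real values are not stated anywhere in this thread.

   PREFIX void:    <http://rdfs.org/ns/void#>
   PREFIX dcterms: <http://purl.org/dc/terms/>

   INSERT DATA {
       <http://airports.dataincubator.org/> a void:Dataset ;
           void:sparqlEndpoint <http://api.talis.com/stores/airports> ;
           # Placeholders: replace with the actual license and creator.
           dcterms:license <http://example.org/PLACEHOLDER-license> ;
           dcterms:creator <http://example.org/PLACEHOLDER-creator> .
   }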

Best,
Richard



--
Rob Styles
http://dynamicorange.com

Richard Cyganiak

Jul 28, 2010, 3:15:33 PM
to datain...@googlegroups.com, pedant...@googlegroups.com

Keith Alexander

Jul 29, 2010, 5:30:55 AM
to pedant...@googlegroups.com, dataincubator
Hi,

> Link quality and character encoding issues are the least of the
> problems with this data set.  The data set is one of the subjects of a

If you can describe the other problems, perhaps something practical
can be done about them. Perhaps I should wait for the fully-written
rant, but I, at least, am not sure what you're alluding to.

> rant that I have half-written on the farce called "triplification."
>

What do you mean by "triplification"? Are you arguing that RDF
datasets created by third parties are in general a bad thing?

> This almost always results in data sets which:
>
> - have no provenance
> - are disconnected from their authors/communities and are thus stale
> and never updated
> - aren't really linked to anything anyway, using an abundance of
> string literals instead of links (because linking is hard)

Isn't it more that provenance metadata and inter-dataset linking are
good things (in fact, very good things), than that datasets without
them are bad?

>
> Additionally, it's a crutch which lets people avoid the hard work of
> convincing publishers to publish their data in RDF format(s).

Can you demonstrate how "triplification" (making RDF from someone
else's data?) hinders rather than helps RDF adoption by data
publishers, preferably with real-world examples?

My impression of your stance is that it's a bad thing to publish a
dataset with flaws in it. I would disagree with that. It's hard to get
everything "right" the first time, and you might not find out what is
"right" until the data starts getting used. Moreover, being told of
flaws in a dataset is an indication to the "triplifier" that the
triples they have created are interesting to others, and worth
spending time on.

Best,

Keith

Christopher Gutteridge

Jul 29, 2010, 6:13:44 AM
to datain...@googlegroups.com, Keith Alexander, pedant...@googlegroups.com
I think that the times they are a changing.

3 or 4 years ago it was useful to create big static one-shot datasets to
use as test cases and examples. Now these are subject to the inevitable
bitrot. That's going to become the norm, as non-pedants knock up their
foaf:depicts of <my cat> type junk.

As a community we're going to have to start differentiating between
curated datasets, static decaying datasets and static-but-that's-OK
datasets (such as an ontology definition document, or a dataset of UK
Members of Parliament 1800-1900.)

I just love modelling things correctly, but most of the triples I create
when doing so are "because I have the data", not "because this will be
useful to someone". When creating RDF for work, we're now keeping at
least one or two usecases in mind while building it. Better still
building it and a service using it at the same time.

Just making a one-time conversion of a scrape or snapshot of someone
else's data is no longer good enough, and is the "404" error of the next
generation web. It's inevitable, but we shouldn't encourage it.

--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/

Rob Styles

Jul 29, 2010, 7:58:33 AM
to pedant...@googlegroups.com, Tom Morris, dataincubator
Well, I have to say I'm a little saddened by this thread — putting aside the fact that I converted the data.

A group like Pedantic Web has a great deal of value to add. We all have expert knowledge and insights, and the initial mail on this described a number of bugs with the data both accurately and helpfully.

The thread since then has been unhelpful. Characterising any linked data as 'bad' or 'not good enough' at this point in time is not a good use of our skills. It is good to see data published and good to see it improve through the helpful critique of those with expertise. It is not useful, or attractive, to merely rant.

If I have misunderstood the purpose of the list, and it is actually a place for you all to rant about how crap everyone else is at doing this, then do please let me know so I can unsubscribe.

rob


2010/7/29 Christophe Guéret <cgu...@few.vu.nl>
On 07/29/2010 01:04 AM, Tom Morris wrote:
> On Wed, Jul 28, 2010 at 3:15 PM, Richard Cyganiak <ric...@cyganiak.de> wrote:
> > To the maintainer of the Our Airports dataset at
> > <http://airports.dataincubator.org/>, with SPARQL endpoint at
> > <http://api.talis.com/stores/airports>: I have a number of suggestions for
> > improvement, mostly related to link quality.
>
> Link quality and character encoding issues are the least of the
> problems with this data set.  The data set is one of the subjects of a
> rant that I have half-written on the farce called "triplification."
>
> This almost always results in data sets which:
>
> - have no provenance
> - are disconnected from their authors/communities and are thus stale
> and never updated
> - aren't really linked to anything anyway, using an abundance of
> string literals instead of links (because linking is hard)
>
> Additionally, it's a crutch which lets people avoid the hard work of
> convincing publishers to publish their data in RDF format(s).
>
> The whole concept of triplification is just a bad, bad idea which
> should go away.
> (OK, there's a mini version of the rant)
>
> Tom

Well, I tend to agree with that but triplification is still important as a way to convince publishers to publish their data.

Most potential publishers already have their data in some format (say, a SQL database) and export it in some other (CSV files).
So why would they bother publishing in RDF? And if they think it could be nice, why would they invest money/effort/... in doing it?
Being able to show a bit of RDF (e.g. a triplified version of a CSV) is a very useful argument as it shows the benefits of RDF on a practical case.

IMHO, triplification should not be considered as evil but as a (necessary) means to an end.

Cheers,
Christophe


--
Dr. Christophe Guéret (cgu...@few.vu.nl)
http://www.few.vu.nl/~cgueret/
Postdoc working on SOKS (http://www.few.vu.nl/soks)
Knowledge Representation & Reasoning Group
Computational Intelligence Group
Department of Computer Science, AI
VU University Amsterdam

Richard Cyganiak

Jul 29, 2010, 9:13:27 AM
to pedant...@googlegroups.com, dataincubator
Tom,

On 29 Jul 2010, at 00:04, Tom Morris wrote:
> The data set is one of the subjects of a
> rant that I have half-written on the farce called "triplification."
>
> This almost always results in data sets which:
>
> - have no provenance
> - are disconnected from their authors/communities and are thus stale
> and never updated
> - aren't really linked to anything anyway, using an abundance of
> string literals instead of links (because linking is hard)
>
> Additionally, it's a crutch which lets people avoid the hard work of
> convincing publishers to publish their data in RDF format(s).
>
> The whole concept of triplification is just a bad, bad idea which
> should go away.

Wow. You must have been burned pretty badly by some low-quality data!

To me this sounds dangerously close to a complaint that the unwashed
masses, who don't have the proper experience, the proper social
connections, and the proper engineering resources, should stop
dabbling in putting stuff on the Web.

I think expecting every dataset to be a masterpiece a la GeoSpecies or
id.loc.gov is unrealistic. Most datasets are apprentice work. This is
how people learn. Ok, they make mistakes in public. Can't we meet this
with some tolerance and patience?

A better question might be, how to direct attention and kudos towards
the publishers that actually get everything right.

Best,
Richard

Ed Summers

Jul 29, 2010, 9:48:31 AM
to pedant...@googlegroups.com, dataincubator
On Wed, Jul 28, 2010 at 7:04 PM, Tom Morris <tfmo...@gmail.com> wrote:
> Additionally, it's a crutch which lets people avoid the hard work of
> convincing publishers to publish their data in RDF format(s).

I've always thought of triplification as an activity that is
ultimately geared towards helping convince publishers to publish RDF
themselves. I agree with Tom that getting "data owners" to assume the
responsibility of publishing RDF is the hard part. But letting
publishers see some of the benefits of RDF publishing before making an
investment themselves is an important part of the convincing process.

I think triplification can also be performed by data owners
themselves, who aren't in a position to gut their existing enterprise,
and convert everything over to using triplestores, sparql, etc. This
is still the case with id.loc.gov where the result of somewhat ancient
data workflows (MARC) is triplified on a routine basis [1].

Also, on the subject of sparsely linked data, I think there is a lot
of value in simply minting URIs for things, and serving up relatively
bare bones RDF. The value is that it allows other people to link to
them in their data.

//Ed

[1] http://id.loc.gov/authorities/loads/
