Hi,
(cc pedantic-web)
To the maintainer of the Our Airports dataset at <http://airports.dataincubator.org/>, with SPARQL endpoint at <http://api.talis.com/stores/airports>: I have a number of suggestions for improvement, mostly related to link quality.
The following query shows that ten different regions are all declared as being owl:sameAs some "null item":
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * WHERE {
?s owl:sameAs <http://api.talis.com/stores/airports/items/null> .
}
The problem seems to be related to exotic Unicode characters; many of the names of the subject regions include the characters “ž”, “ı”, or “Ş”. The object in these triples should be replaced with real identifiers for the regions. Hand-crafted corrected triples are below:
<http://airports.dataincubator.org/regions/MD-SV>
owl:sameAs <http://dbpedia.org/resource/%C5%9Etefan_Vod%C4%83_District> .
<http://airports.dataincubator.org/regions/SI-179>
owl:sameAs <http://dbpedia.org/resource/Sodra%C5%BEica> .
<http://airports.dataincubator.org/regions/TR-63>
owl:sameAs <http://dbpedia.org/resource/%C5%9Eanl%C4%B1urfa> .
<http://airports.dataincubator.org/regions/ME-15>
owl:sameAs <http://dbpedia.org/resource/Plu%C5%BEine> .
<http://airports.dataincubator.org/regions/SI-087>
owl:sameAs <http://dbpedia.org/resource/Ormo%C5%BE> .
<http://airports.dataincubator.org/regions/TR-73>
owl:sameAs <http://dbpedia.org/resource/%C5%9E%C4%B1rnak> .
<http://airports.dataincubator.org/regions/SI-176>
owl:sameAs <http://dbpedia.org/resource/Razkri%C5%BEje> .
<http://airports.dataincubator.org/regions/LY-SS>
owl:sameAs <http://dbpedia.org/resource/Sabratha_Wa_Surman_District> .
<http://airports.dataincubator.org/regions/SI-111>
owl:sameAs <http://dbpedia.org/resource/Se%C5%BEana> .
<http://airports.dataincubator.org/regions/ME-17>
owl:sameAs <http://dbpedia.org/resource/Ro%C5%BEaje> .
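The object URIs in the corrected triples above follow a simple pattern: spaces in the name become underscores, and each non-ASCII character is percent-encoded as its UTF-8 bytes. A minimal Python sketch of that pattern follows. The helper name `dbpedia_uri` is made up for illustration, and whether the original conversion failed at exactly this step is only a guess.

```python
import unicodedata
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

def dbpedia_uri(name: str) -> str:
    """Build a DBpedia-style resource URI from a region name.

    Hypothetical helper: it assumes the URI is derived from the name by
    replacing spaces with underscores and percent-encoding the rest as UTF-8.
    """
    # Normalize to NFC so that e.g. "S" + combining cedilla becomes one code point
    normalized = unicodedata.normalize("NFC", name)
    # quote() percent-encodes non-ASCII characters as UTF-8 bytes by default
    return DBPEDIA_RESOURCE + quote(normalized.replace(" ", "_"))

# Reproduces the object of the TR-63 triple above:
# http://dbpedia.org/resource/%C5%9Eanl%C4%B1urfa
print(dbpedia_uri("Şanlıurfa"))
```

A converter that skips the UTF-8 step, or fails on names it cannot encode, could plausibly end up emitting a placeholder such as the "null item" instead.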
Running the following query shows that a number of airports are owl:sameAs linked to web pages (mostly to non-English Wikipedia pages):
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT *
WHERE {
?s owl:sameAs ?o .
FILTER (!REGEX(STR(?o), "http://dbpedia.org/resource/"))
FILTER (?o != <http://api.talis.com/stores/airports/items/null>)
}
The link predicate should be replaced with a property that is appropriate for linking to web pages, such as rdfs:seeAlso or, perhaps better, foaf:page.
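One way to apply this suggestion during conversion is to pick the predicate from the shape of the target URI. The sketch below is only an illustration: the function name `link_predicate` is hypothetical, and it uses a prefix test where the query above uses an unanchored REGEX. It also assumes that anything outside the DBpedia resource namespace is a web page, which holds for the results of the query above but may not hold in general.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

def link_predicate(object_uri: str) -> str:
    """Choose a link predicate for an outgoing link (hypothetical rule)."""
    # DBpedia resource URIs identify the thing itself, so owl:sameAs is fine
    if object_uri.startswith(DBPEDIA_RESOURCE):
        return "owl:sameAs"
    # Assumption: everything else is a document about the thing
    return "foaf:page"

# The Wikipedia URL here is a made-up example, not taken from the dataset
print(link_predicate("http://de.wikipedia.org/wiki/Example"))
```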
The dataset contains the following statement:
<http://dbpedia.org/property/iata>
owl:sameAs <http://www.daml.org/2001/10/html/airport-ont#iataCode> .
This statement equates two properties. owl:sameAs should not be used to connect classes and properties. The right property to use here would be owl:equivalentProperty.
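As a rough sketch of that repair, the function below rewrites an owl:sameAs triple into owl:equivalentProperty when both ends are known to be properties. The function name and the idea of passing in a set of known property URIs are assumptions made for illustration; a real conversion would need its own way of knowing which URIs denote properties.

```python
OWL = "http://www.w3.org/2002/07/owl#"

def repair_property_link(triple, property_uris):
    """Swap owl:sameAs for owl:equivalentProperty between two properties."""
    subject, predicate, obj = triple
    both_properties = subject in property_uris and obj in property_uris
    if predicate == OWL + "sameAs" and both_properties:
        return (subject, OWL + "equivalentProperty", obj)
    # Leave every other triple untouched
    return triple

IATA_DBPEDIA = "http://dbpedia.org/property/iata"
IATA_DAML = "http://www.daml.org/2001/10/html/airport-ont#iataCode"

fixed = repair_property_link(
    (IATA_DBPEDIA, OWL + "sameAs", IATA_DAML),
    {IATA_DBPEDIA, IATA_DAML},
)
print(fixed[1])
```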
I also couldn't find any information about the license, creator or maintainer of this dataset. This information could be added at <http://airports.dataincubator.org/>. A link to the SPARQL endpoint would also be useful (via void:sparqlEndpoint, perhaps).
Best,
Richard
> Link quality and character encoding issues are the least of the
> problems with this data set. The data set is one of the subjects of a
If you can describe the other problems, perhaps something practical
can be done about them. Perhaps I should wait for the fully-written
rant, but I, at least, am not sure what you're alluding to.
> rant that I have half-written on the farce called "triplification."
>
What do you mean by "triplification"? Are you arguing that RDF
datasets created by third parties are in general a bad thing?
> This almost always results in data sets which:
>
> - have no provenance
> - are disconnected from their authors/communities and are thus stale
> and never updated
> - aren't really linked to anything anyway, using an abundance of
> string literals instead of links (because linking is hard)
Isn't it more that provenance metadata and inter-dataset linking are
good things (in fact, very good things), than that datasets without
them are bad?
>
> Additionally, it's a crutch which lets people avoid the hard work of
> convincing publishers to publish their data in RDF format(s).
Can you demonstrate how "triplification" (making RDF from someone
else's data?) hinders rather than helps RDF adoption by data
publishers, preferably with real-world examples?
My impression of your stance is that it's a bad thing to publish a
dataset with flaws in it. I would disagree with that. It's hard to get
everything "right" the first time, and you might not find out what is
"right" until the data starts getting used. Moreover, being told of
flaws in a dataset is an indication to the "triplifier" that the
triples they have created are interesting to others, and worth
spending time on.
Best,
Keith
3 or 4 years ago it was useful to create big static one-shot datasets to
use as test cases and examples. Now these are subject to the inevitable
bitrot. That's going to become the norm, as non-pedants knock up their
foaf:depicts of <my cat> type junk.
As a community we're going to have to start differentiating between
curated datasets, static decaying datasets and static-but-that's-OK
datasets (such as an ontology definition document, or a dataset of UK
Members of Parliament 1800-1900.)
I just love modelling things correctly, but most of the triples I create
when doing so are "because I have the data", not "because this will be
useful to someone". When creating RDF for work, we're now keeping at
least one or two use cases in mind while building it. Better still,
building it and a service using it at the same time.
Just making a one-time conversion of a scrape or snapshot of someone
else's data is no longer good enough, and is the "404" error of the next
generation web. It's inevitable, but we shouldn't encourage it.
--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/
On Wed, Jul 28, 2010 at 3:15 PM, Richard Cyganiak <ric...@cyganiak.de> wrote:
> To the maintainer of the Our Airports dataset at
> <http://airports.dataincubator.org/>, with SPARQL endpoint at
> <http://api.talis.com/stores/airports>: I have a number of suggestions for
> improvement, mostly related to link quality.
Link quality and character encoding issues are the least of the
problems with this data set. The data set is one of the subjects of a
rant that I have half-written on the farce called "triplification."
This almost always results in data sets which:
- have no provenance
- are disconnected from their authors/communities and are thus stale
and never updated
- aren't really linked to anything anyway, using an abundance of
string literals instead of links (because linking is hard)
Additionally, it's a crutch which lets people avoid the hard work of
convincing publishers to publish their data in RDF format(s).
Well, I tend to agree with that but triplification is still important as a way to convince publishers to publish their data.
The whole concept of triplification is just a bad, bad idea which
should go away.
(OK, there's a mini version of the rant)
Tom
--
Most potential publishers already have their data in some format (say, a SQL database) and export it in some other (CSV files).
So why would they bother publishing in RDF? And if they think it could be nice, why would they invest money/effort/... in doing it?
Being able to show a bit of RDF (e.g. a triplified version of a CSV) is a very useful argument as it shows the benefits of RDF on a practical case.
IMHO, triplification should not be considered as evil but as a (necessary) means to an end.
Cheers,
Christophe
Dr. Christophe Guéret (cgu...@few.vu.nl)
http://www.few.vu.nl/~cgueret/
Postdoc working on SOKS (http://www.few.vu.nl/soks)
Knowledge Representation & Reasoning Group
Computational Intelligence Group
Department of Computer Science, AI
VU University Amsterdam
On 29 Jul 2010, at 00:04, Tom Morris wrote:
> The data set is one of the subjects of a
> rant that I have half-written on the farce called "triplification."
>
> This almost always results in data sets which:
>
> - have no provenance
> - are disconnected from their authors/communities and are thus stale
> and never updated
> - aren't really linked to anything anyway, using an abundance of
> string literals instead of links (because linking is hard)
>
> Additionally, it's a crutch which lets people avoid the hard work of
> convincing publishers to publish their data in RDF format(s).
>
> The whole concept of triplification is just a bad, bad idea which
> should go away.
Wow. You must have been burned pretty badly by some low-quality data!
To me this sounds dangerously close to a complaint that the unwashed
masses, who don't have the proper experience, the proper social
connections, and the proper engineering resources, should stop
dabbling in putting stuff on the Web.
I think expecting every dataset to be a masterpiece a la GeoSpecies or
id.loc.gov is unrealistic. Most datasets are apprentice work. This is
how people learn. Ok, they make mistakes in public. Can't we meet this
with some tolerance and patience?
A better question might be, how to direct attention and kudos towards
the publishers that actually get everything right.
Best,
Richard
I've always thought of triplification as an activity that is
ultimately geared towards helping convince publishers to publish RDF
themselves. I agree with Tom that getting "data owners" to assume the
responsibility of publishing RDF is the hard part. But letting
publishers see some of the benefits of RDF publishing before making an
investment themselves is an important part of the convincing process.
I think triplification can also be performed by data owners
themselves, who aren't in a position to gut their existing enterprise,
and convert everything over to using triplestores, sparql, etc. This
is still the case with id.loc.gov where the result of somewhat ancient
data workflows (MARC) is triplified on a routine basis [1].
Also, on the subject of sparsely linked data, I think there is a lot
of value in simply minting URIs for things, and serving up relatively
bare bones RDF. The value is that it allows other people to link to
them in their data.
//Ed