Re: [Ann] LODStats - Real-time Data Web Statistics


Richard Cyganiak

Feb 2, 2012, 6:32:03 AM
to Sören Auer, Linking Open Data, pedant...@googlegroups.com
Congrats, this is awesome.

So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.

Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps:
http://stats.lod2.eu/rdfdoc/?errors=1
This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).

One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.

I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?

Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?

Again, congrats to all involved, this is great work.

Best,
Richard


On 2 Feb 2012, at 11:04, Sören Auer wrote:

> Dear all,
>
> We are happy to announce the first public *release of LODStats*.
>
> LODStats is a statement-stream-based approach for gathering
> comprehensive statistics about datasets adhering to the Resource
> Description Framework (RDF). LODStats was implemented in Python and
> integrated into the CKAN dataset metadata registry [1]. Thus it helps to
> obtain a comprehensive picture of the current state of the Data Web.
>
> More information about LODStats (including its open-source
> implementation) is available from:
>
> http://aksw.org/projects/LODStats
>
> A demo installation collecting statistics from all LOD datasets
> registered on CKAN is available from:
>
> http://stats.lod2.eu
>
> We would like to thank the AKSW research group [2] and LOD2 project [3]
> members for their suggestions. The development of LODStats was supported by
> the FP7 project LOD2 (GA no. 257943).
>
> On behalf of the LODStats team,
>
> Sören Auer, Jan Demter, Michael Martin, Jens Lehmann
>
> [1] http://ckan.net
> [2] http://aksw.org
> [3] http://lod2.eu
>

Richard Cyganiak

Feb 3, 2012, 4:58:04 AM
to Bernard Vatant, Sören Auer, Linking Open Data, pedant...@googlegroups.com
On 2 Feb 2012, at 23:58, Bernard Vatant wrote:
> More than 60 [vocabularies] are either 404, time out or access denied, which does not come as a surprise, but is nevertheless a big issue. It means that data using those vocabularies are relying on semantics no one can check.
>
> The rest is de-referencable, but to various types of resources more or less close to one or several vocabularies, but not published following good practices, in a word not in a LOV-able state.
>
> All in all, almost half of the vocabularies used in LOD are not meeting a minimal quality requirement : be published at their namespace.

Now, if there was a list of these, annotated with some stats (used in how many datasets? occurring in how many triples?), then we could start at the top of the list, and sort it out with the various publishers involved.
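
A minimal sketch of how such an annotated list could be built, assuming the per-dataset vocabulary usage is already available as (dataset, vocabulary namespace, triple count) records. The function name and the sample records are hypothetical; nothing here is taken from LODStats or LOV.

```python
from collections import defaultdict


def rank_vocabularies(usage):
    """Aggregate (dataset, vocabulary, triple_count) records per vocabulary.

    Returns rows of (vocabulary, number_of_datasets, total_triples),
    most widely used vocabulary first.
    """
    datasets = defaultdict(set)
    triples = defaultdict(int)
    for dataset, vocab, count in usage:
        datasets[vocab].add(dataset)
        triples[vocab] += count
    rows = [(vocab, len(datasets[vocab]), triples[vocab]) for vocab in datasets]
    # Sort by dataset count, then by triple count, then by name for stability.
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


# Hypothetical sample records, for illustration only.
sample = [
    ("dataset-a", "http://example.org/vocab1#", 1200),
    ("dataset-b", "http://example.org/vocab1#", 300),
    ("dataset-b", "http://example.org/vocab2#", 5000),
]

for vocab, n_datasets, n_triples in rank_vocabularies(sample):
    print(vocab, n_datasets, n_triples)
```

Filtering the input down to the vocabularies that fail to dereference would give the "start at the top" worklist.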

Best,
Richard

Bernard Vatant

Feb 3, 2012, 5:44:17 AM
to Richard Cyganiak, Sören Auer, Linking Open Data, pedant...@googlegroups.com
Hello Richard

> All in all, almost half of the vocabularies used in LOD are not meeting a minimal quality requirement : be published at their namespace.

> Now, if there was a list of these, annotated with some stats (used in how many datasets? occurring in how many triples?), then we could start at the top of the list, and sort it out with the various publishers involved.

Indeed! That's the purpose of what I started in the Gdocs ... I just sent you edit rights :)

That is work we have already started with Pierre-Yves inside the LOV ecosystem: ping the vocabulary curators when they rely on not-so-reliable namespaces (either their own, or those of vocabularies they re-use but don't maintain). The objective is to improve the overall quality of the vocabulary ecosystem, one vocabulary at a time :)

It is a patient but important task. You're welcome to participate. It is actually 80% social and 20% technical :)

Best

Bernard

--
Bernard Vatant
Vocabularies & Data Engineering
Tel :  + 33 (0)9 71 48 84 59
Skype : bernard.vatant
Linked Open Vocabularies

--------------------------------------------------------
Mondeca                             
3 cité Nollez 75018 Paris, France
Follow us on Twitter : @mondecanews

Sören Auer

Feb 2, 2012, 7:18:26 AM
to publi...@w3.org, pedant...@googlegroups.com
Am 02.02.2012 12:32, schrieb Richard Cyganiak:
> Congrats, this is awesome.

Thanks Richard, we are happy you like it ;-)

> So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.

Exactly.

> Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps:
> http://stats.lod2.eu/rdfdoc/?errors=1
> This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).

Yes, having an interoperability testbed and a timely view on the current
state was one of the primary reasons for developing LODStats. Some
problems might, however, also be related to incorrect CKAN metadata or
some glitches in LODStats itself - we will try to iron them out as much
as possible in the next weeks.

> One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.

Indeed, that's a great suggestion and will be implemented soon.

> I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?

Not yet, but that's planned. For now it should be relatively easy to
crawl and concat the VoID files, but we will make it more convenient ;-)
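
A rough sketch of the crawl-and-concatenate step, assuming (this is not confirmed anywhere in the thread) that each VoID description can be fetched as N-Triples, where a line-by-line merge is safe. The URL list is left to the caller, and `concat_ntriples` and `fetch` are hypothetical names.

```python
import urllib.request


def fetch(url, timeout=30):
    """Download one document and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def concat_ntriples(urls, fetcher=fetch):
    """Merge several N-Triples documents, dropping duplicate statements."""
    seen = set()
    merged = []
    for url in urls:
        try:
            text = fetcher(url)
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print("skipping %s: %s" % (url, err))
            continue
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines, comments and statements already collected.
            if not line or line.startswith("#") or line in seen:
                continue
            seen.add(line)
            merged.append(line)
    return "".join(line + "\n" for line in merged)
```

Blank node labels are only unique within one file, so a plain merge like this can wrongly join blank nodes from different files. Turtle or RDF/XML input would need a real RDF parser instead of line handling.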

> Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?

We submitted a paper, which you can cite:

Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An
Extensible Framework for High-performance Dataset Analytics, submitted
to ESWC2012

http://svn.aksw.org/papers/2011/RDFStats/public.pdf

Best,

Sören

Rinke Hoekstra

Feb 21, 2012, 9:38:16 AM
to Sören Auer, publi...@w3.org, pedant...@googlegroups.com
Hi Sören, others,

LODStats is certainly great work. Congratulations!

However... is it me, or isn't the 'almost 2B triples' a very
disappointing number? If you go through all datasets advertised on the
Data Hub, the advertised number of triples is over 40B ! This means
that only one out of 20 triples in the linked 'open' data cloud is
publicly accessible.

Another thing... it seems as if LODStats is merely checking whether a
SPARQL endpoint is 'up', and not whether the endpoint actually contains the
data that has been advertised on the Data Hub. For instance, my very
own bubble is listed without problems, but I know for a fact that the
triple store no longer contains the data (sorry!). Do you have any
thoughts/ideas on how to detect such problems?
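
One possible check, sketched under assumptions: ask the endpoint for its triple count and compare it with the number advertised on the Data Hub. The 50% tolerance and all function names are made up for illustration, and this is not how LODStats works as far as the thread says.

```python
import json
import urllib.parse
import urllib.request

# SPARQL 1.1 aggregate query; may be slow or time out on very large stores.
COUNT_QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def count_url(endpoint):
    """Build a SPARQL protocol GET URL (assumes no existing query string)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": COUNT_QUERY})


def parse_count(results_json):
    """Extract ?n from a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return int(bindings[0]["n"]["value"])


def looks_stale(advertised, observed, tolerance=0.5):
    """True if the endpoint holds far fewer triples than advertised."""
    if advertised <= 0:
        return False  # nothing advertised, nothing to compare against
    return observed < advertised * tolerance


def observed_count(endpoint, timeout=60):
    """Run the count query against a live endpoint."""
    request = urllib.request.Request(
        count_url(endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_count(response.read().decode("utf-8"))
```

An endpoint that answers but reports zero triples against an advertised count in the millions would then be flagged even though it is 'up'.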

Cheers,
Rinke

Sören Auer

Feb 21, 2012, 9:51:35 AM
to Rinke Hoekstra, publi...@w3.org, pedant...@googlegroups.com
Am 21.02.2012 15:38, schrieb Rinke Hoekstra:
> However... is it me, or isn't the 'almost 2B triples' a very
> disappointing number? If you go through all datasets advertised on the
> Data Hub, the advertised number of triples is over 40B ! This means
> that only one out of 20 triples in the linked 'open' data cloud is
> publicly accessible.

It certainly is, and this is one of the reasons we developed this tool: to
get a better picture of the LOD cloud. Of course, this difference is
partially caused by invalid links in CKAN and by some issues we still have
with very large datasets, but real users might run into these issues
as well.

> Another thing... it seems as if LODStats is merely checking whether a
> SPARQL endpoint is 'up', and not whether the endpoint actually contains the
> data that has been advertised on the Data Hub. For instance, my very
> own bubble is listed without problems, but I know for a fact that the
> triple store no longer contains the data (sorry!). Do you have any
> thoughts/ideas on how to detect such problems?

We currently don't delete our stats when an endpoint is unavailable
once, but try to check back later. Of course, after a certain number of
check-backs and timeouts the stats should be invalidated. Can you point
me to your endpoint? We will have a look at what the problem is there.
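
The check-back policy described above could look roughly like this. It is a sketch only: the class name and the threshold of three failed checks are assumptions, not the LODStats implementation.

```python
class EndpointStatus:
    """Track failed check-backs and invalidate stats after too many in a row."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # hypothetical threshold
        self.consecutive_failures = 0
        self.stats_valid = True

    def record_check(self, reachable):
        """Record one check-back and return whether the stats still count."""
        if reachable:
            # A successful check resets the counter. It does not revalidate
            # stats that were already dropped; that would need a re-harvest.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                self.stats_valid = False
        return self.stats_valid


status = EndpointStatus(max_failures=3)
for reachable in (False, False, False):
    still_valid = status.record_check(reachable)
print(still_valid)
```

A single outage therefore keeps the stats, while a run of failures eventually marks them as invalid.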

Best,

Sören

miguel...@gmail.com

Jun 21, 2012, 4:26:39 AM
to pedant...@googlegroups.com, Sören Auer, Linking Open Data
Hi everybody,
I am starting to use LODStats and I think it is a very useful tool. Actually, I would be interested in using it over SPARQL endpoints, but I don't know how to do that. Does anybody know whether it is possible?

Thanks in advance

Miguel Tinte

Jun 21, 2012, 6:51:13 AM
to Sören Auer, pedant...@googlegroups.com, Linking Open Data
Hi Sören,
Thanks for your answer. I think my question was not very clear: I am not looking for a SPARQL endpoint for LODStats; what I need is to run LODStats over the datasets' SPARQL endpoints. It seems that this is possible, like so:
(lodstats-env)root@ubuntu:/home/LODStats# lodstats -f sparql http://dbpedia.org/sparql
Basic stats: 235153034 triples, 0 warnings
Results (from custom code):
        propertiesall
                len(distinct): 0
                len(distinct_object): 0
                len(distinct_subject): 0
                len(histogram): 0
        classes
                len(distinct): 0
                len(histogram): 0
        vocabularies
        entities
                count: 0

At this point, my question is: can I also obtain information for classes, properties, etc.? With the -a parameter it is not working for me :-(

Thanks in advance


2012/6/21 Sören Auer <au...@informatik.uni-leipzig.de> wrote:
> > I am starting to use LODStats and I think it is a very useful tool.
> > Actually I would be interested in using it over SPARQL endpoints but I
> > don't know how to do that. Does anybody know whether it is possible?
>
> We don't have a SPARQL endpoint available (yet), but
> you can obtain a complete dump of all VoID descriptions from
>
> http://stats.lod2.eu/rdfdocs/void
>
> Best,
>
> Sören

Sören Auer

Jun 21, 2012, 6:36:54 AM
to miguel...@gmail.com, pedant...@googlegroups.com, Linking Open Data
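> I am starting to use LODStats and I think it is a very useful tool.
> Actually I would be interested in using it over SPARQL endpoints but I
> don't know how to do that. Does anybody know whether it is possible?

We don't have a SPARQL endpoint available (yet), but
you can obtain a complete dump of all VoID descriptions from

http://stats.lod2.eu/rdfdocs/void

Best,

Sören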

matinte

Jun 21, 2012, 8:11:22 AM
to Pedantic Web Group
Hi Sören,
I am afraid I top-posted in my previous message and I am not
sure you read it, sorry for that. My problem now is that I only get
the number of triples when running this:
(lodstats-env)root@ubuntu:/home/LODStats# lodstats -af sparql http://dbpedia.org/sparql

How can I obtain the rest of the statistics?

Thanks

Kingsley Idehen

Jun 21, 2012, 9:32:22 AM
to pedant...@googlegroups.com, publi...@w3.org
Sören,

We've just loaded all the VoID graphs into our LOD cloud cache. Thus, you
can SPARQL at: http://lod.openlinksw.com/sparql, and use named graph
IRI: http://stats.lod2.eu/rdfdocs/void .
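
A small sketch of a query against that named graph. It assumes the VoID descriptions type each dataset as void:Dataset and record its size with void:triples, which this thread does not confirm. The helper names are hypothetical.

```python
import urllib.parse

ENDPOINT = "http://lod.openlinksw.com/sparql"
VOID_GRAPH = "http://stats.lod2.eu/rdfdocs/void"


def largest_datasets_query(graph=VOID_GRAPH, limit=10):
    """SPARQL query listing the datasets with the most triples."""
    return (
        "PREFIX void: <http://rdfs.org/ns/void#>\n"
        "SELECT ?dataset ?triples\n"
        "FROM <%s>\n"
        "WHERE { ?dataset a void:Dataset ; void:triples ?triples }\n"
        "ORDER BY DESC(?triples)\n"
        "LIMIT %d" % (graph, limit)
    )


def request_url(endpoint=ENDPOINT, **query_options):
    """Build a SPARQL protocol GET URL for the query above."""
    query = largest_datasets_query(**query_options)
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


print(largest_datasets_query())
```

The query text can be pasted into the endpoint's web form, or the URL from `request_url()` can be fetched with any HTTP client.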


--

Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen





matinte

Jun 22, 2012, 3:12:49 AM
to Pedantic Web Group

On 21 Jun, 15:32, Kingsley Idehen <kide...@openlinksw.com> wrote:
> On 6/21/12 6:36 AM, Sören Auer wrote:
> >> I am starting to use LODStats and I think it is a very useful tool.
> >> Actually I would be interested in using it over SPARQL endpoints but I
> >> don't know how to do that. Does anybody know whether it is possible?
> > We don't have a SPARQL endpoint available (yet), but
> > you can obtain a complete dump of all VoID descriptions from
>
> > http://stats.lod2.eu/rdfdocs/void
>
> > Best,
>
> > Sören
>
> Sören,
>
> We've just loaded all the VoID graphs into our LOD cloud cache. Thus, you
> can SPARQL at: http://lod.openlinksw.com/sparql, and use named graph
> IRI: http://stats.lod2.eu/rdfdocs/void .

Yes, thanks for that. The problem is that I need to create a new VoID
file from a SPARQL endpoint that has not been published yet. This is
why I am trying to use the lodstats tool to generate it. So I am trying
this: lodstats -vaf sparql http://dbpedia.org/sparql , but it only
returns the number of triples. Any idea how to get the full
statistics?

Thanks again :-)
