Congrats, this is awesome.

So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.
Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps:
http://stats.lod2.eu/rdfdoc/?errors=1
This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).
One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.
I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?
Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?
Again, congrats to all involved, this is great work.
Best,
Richard
On 2 Feb 2012, at 11:04, Sören Auer wrote:
> Dear all,
>
> We are happy to announce the first public *release of LODStats*.
>
> LODStats is a statement-stream-based approach for gathering
> comprehensive statistics about datasets adhering to the Resource
> Description Framework (RDF). LODStats was implemented in Python and
> integrated into the CKAN dataset metadata registry [1]. Thus it helps to
> obtain a comprehensive picture of the current state of the Data Web.
>
> More information about LODStats (including its open-source
> implementation) is available from:
>
> http://aksw.org/projects/LODStats
>
> A demo installation collecting statistics from all LOD datasets
> registered on CKAN is available from:
>
> http://stats.lod2.eu
>
> We would like to thank the AKSW research group [2] and LOD2 project [3]
> members for their suggestions. The development of LODStats was supported
> by the FP7 project LOD2 (GA no. 257943).
>
> On behalf of the LODStats team,
>
> Sören Auer, Jan Demter, Michael Martin, Jens Lehmann
>
> [1] http://ckan.net
> [2] http://aksw.org
> [3] http://lod2.eu
>
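For readers unfamiliar with the statement-stream idea mentioned in the announcement: the gist can be sketched in a few lines of Python (a hypothetical illustration, not the actual LODStats code). Triples are read one at a time, here from N-Triples lines, and running counters are updated, so memory use stays constant however large the dataset is:

```python
from collections import Counter

def stream_stats(ntriples_lines):
    """Single pass over an N-Triples stream: constant memory, running counters."""
    triples = 0
    predicates = set()
    vocabs = Counter()  # namespace -> number of occurrences
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 2)  # subject, predicate, rest of statement
        if len(parts) < 3:
            continue  # not a well-formed statement
        pred = parts[1].strip("<>")
        triples += 1
        predicates.add(pred)
        # Treat everything up to the last '#' or '/' as the vocabulary namespace
        cut = max(pred.rfind("#"), pred.rfind("/"))
        vocabs[pred[:cut + 1]] += 1
    return {"triples": triples,
            "distinct_predicates": len(predicates),
            "top_vocabularies": vocabs.most_common(5)}
```

`most_common(5)` mirrors the top-5 lists on stats.lod2.eu; raising the argument would give the longer drill-down lists requested above.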
> All in all, almost half of the vocabularies used in LOD are not meeting a minimal quality requirement: be published at their namespace.

Now, if there was a list of these, annotated with some stats (used in how many datasets? occurring in how many triples?), then we could start at the top of the list and sort it out with the various publishers involved.

Best,
Richard
Thanks Richard, we are happy you like it ;-)
> So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.
Exactly.
> Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps:
> http://stats.lod2.eu/rdfdoc/?errors=1
> This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).
Yes, having an interoperability testbed and a timely view of the current
state was one of the primary reasons for developing LODStats. Some
problems might, however, also be related to incorrect CKAN metadata or
to glitches in LODStats itself; we will try to iron these out as much
as possible in the coming weeks.
> One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.
Indeed, that's a great suggestion and will be implemented soon.
> I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?
Not yet, but that's planned. For now it should be relatively easy to
crawl and concatenate the VoID files, but we will make it more convenient ;-)
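Crawling and concatenating is indeed straightforward if the summaries are serialized as N-Triples, which can simply be appended (merging Turtle files would need prefix handling). A rough sketch, assuming each dataset's VoID URL is known and served as N-Triples:

```python
import urllib.request

def concat_void(documents):
    """Concatenate several N-Triples documents into one.

    N-Triples can simply be appended: every line is a self-contained
    statement with full IRIs, so no prefix merging is needed (unlike Turtle).
    """
    return "\n".join(doc.strip() for doc in documents if doc.strip()) + "\n"

def fetch_void(urls):
    """Download each per-dataset VoID summary and merge the results.

    Assumes each URL returns UTF-8 N-Triples; the exact URL pattern is
    whatever stats.lod2.eu exposes per dataset.
    """
    return concat_void(
        urllib.request.urlopen(url).read().decode("utf-8") for url in urls
    )
```

The merged file can then be loaded into any triple store and queried with SPARQL.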
> Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?
We submitted a paper, which you can cite:
Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An
Extensible Framework for High-performance Dataset Analytics, submitted
to ESWC2012
http://svn.aksw.org/papers/2011/RDFStats/public.pdf
Best,
Sören
LODStats is certainly great work. Congratulations!
However... is it me, or isn't the 'almost 2B triples' a very
disappointing number? If you go through all datasets advertised on the
Data Hub, the advertised number of triples is over 40B! This means
that only one out of 20 triples in the linked 'open' data cloud is
publicly accessible.
Another thing... it seems as if LODStats is merely checking whether a
SPARQL endpoint is 'up' and whether the endpoint actually contains the
data that has been advertised on the Data Hub. For instance, my very
own bubble is listed without problems, but I know for a fact that the
triple store no longer contains the data (sorry!). Do you have any
thoughts/ideas on how to detect such problems?
Cheers,
Rinke
> However... is it me, or isn't the 'almost 2B triples' a very disappointing number?

It certainly is, and this is one of the reasons we developed this tool:
to get a better picture of the LOD cloud. Of course, this difference is
partially caused by invalid links in CKAN and by some issues we still
have with very large datasets, but real users might run into these
issues as well.
> Another thing... it seems as if LODStats is merely checking whether a
> SPARQL endpoint is 'up' and whether the endpoint actually contains the
> data that has been advertised on the Data Hub. For instance, my very
> own bubble is listed without problems, but I know for a fact that the
> triple store no longer contains the data (sorry!). Do you have any
> thoughts/ideas on how to detect such problems?
We currently don't delete our stats when an endpoint is unavailable
once, but try to check back later. Of course, after a certain number of
failed check-backs and timeouts the stats should be invalidated. Could
you point me to your endpoint? We will have a look at what the problem
is there.
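One way to detect the "endpoint is up but the data is gone" case might be to probe each endpoint with a count query (e.g. `SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }`) and compare the result against the triple count advertised on the Data Hub. A sketch of the decision logic only; the thresholds are arbitrary assumptions, and not every endpoint can answer a full COUNT:

```python
def endpoint_status(advertised, actual, min_ratio=0.5):
    """Classify one probe of a SPARQL endpoint.

    advertised -- triple count claimed on the Data Hub
    actual     -- result of the COUNT probe, or None if the query failed
    """
    if actual is None:
        return "down"    # endpoint unreachable or query failed
    if advertised and actual < min_ratio * advertised:
        return "stale"   # up, but most of the advertised data is missing
    return "ok"

def should_invalidate(consecutive_failures, max_failures=3):
    """Invalidate cached stats only after repeated failed check-backs."""
    return consecutive_failures >= max_failures
```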
Best,
Sören
> I am starting to use LODStats and I think it is a very useful tool.
> Actually, I would be interested in using it over SPARQL endpoints, but I
> don't know how to do that. Does anybody know whether it is possible?
We don't have a SPARQL endpoint available (yet), but
you can obtain a complete dump of all VoID descriptions from
http://stats.lod2.eu/rdfdocs/void
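Until a public endpoint exists, the dump can also be inspected with a few lines of Python, assuming it is serialized as N-Triples. For example, a per-dataset size listing from the `void:triples` statements:

```python
def datasets_by_size(ntriples_lines):
    """Map each dataset to its void:triples count, largest first."""
    VOID_TRIPLES = "<http://rdfs.org/ns/void#triples>"
    sizes = {}
    for line in ntriples_lines:
        parts = line.split(None, 2)  # subject, predicate, rest of statement
        if len(parts) == 3 and parts[1] == VOID_TRIPLES:
            # object is a typed literal such as "12345"^^xsd:integer
            sizes[parts[0].strip("<>")] = int(parts[2].split('"')[1])
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
```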
Best,
Sören