Hello,
Concerning this topic, I am also thinking about the discoverability of related data: an RDF dataset, its SPARQL endpoint, and its statistics.
There is a helpful blog post written by Mr. Leigh Dodds:
http://blog.ldodds.com/2013/02/04/dataset-and-api-discovery-in-linked-data/
For example, the recommended location of a VoID file is
http://www.example.com/.well-known/void , but I don't know how many data providers follow it.
Are there any proposals for data discoverability?
Regards,
Yasunori
On 2013/03/13, at 10:28, Toshiaki Katayama wrote:
> Hi Jerven,
>
>> From the practical goals that Toshiaki put in the agenda I think we should focus only on:
>> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
>> ** Which statistics are needed for which tools.
>
> Agreed.
>
> For this purpose, I think it would be natural to enhance the YummyData project, which was started during the BH12 meeting.
>
> http://yummydata.org/
> https://github.com/dbcls/bh12/wiki/Yummy-data
>
>> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, ViID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
>> Where point 2 should discuss only endpoint statistics, basic meta data for now.
>
> Once we reach a consensus on those common (subset of) vocabularies, probably based on the Bio2RDF code and Scott's summary,
>
> https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics
> https://docs.google.com/spreadsheet/ccc?key=0AvCayBYdTclldEpBSS1wRXNEaU9OeHdWcGRwc09mSmc#gid=0
>
> we can put them on yummydata.org so that YummyData can provide basic statistics with broader coverage, while Bio2RDF continues to provide advanced metrics on its collection (some of which YummyData might incorporate for all supported datasets in the future). Another solution would be to push each dataset provider to provide those metrics themselves, but that might take some time.
>
> In any case,
>
>> I am most interested in how to serve statistics about classes, properties and links per SPARQL endpoint, as well as about licensing, update frequencies, etc.
>
>
>> If we can change these to SPARQL 1.1 compliant construct/insert queries then other endpoint providers can do the same as bio2rdf, with minimal effort.
>
> these two points are important to keep in mind.
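>
> To make the second point concrete, here is a minimal sketch of what such a SPARQL 1.1 CONSTRUCT query could look like. It is not taken from the Bio2RDF scripts; it assumes the statistics are expressed with VoID terms, and the dataset URI <http://www.example.com/dataset> is only a placeholder.
>
> ```sparql
> PREFIX void: <http://rdfs.org/ns/void#>
>
> # Emit a small VoID description with basic counts for the default graph.
> CONSTRUCT {
>   ?dataset a void:Dataset ;
>            void:triples ?triples ;
>            void:properties ?properties ;
>            void:classes ?classes ;
>            void:distinctSubjects ?subjects ;
>            void:distinctObjects ?objects .
> }
> WHERE {
>   # One row with counts over all triples.
>   {
>     SELECT (COUNT(*) AS ?triples)
>            (COUNT(DISTINCT ?p) AS ?properties)
>            (COUNT(DISTINCT ?s) AS ?subjects)
>            (COUNT(DISTINCT ?o) AS ?objects)
>     WHERE { ?s ?p ?o }
>   }
>   # One row with the number of distinct classes that have instances.
>   {
>     SELECT (COUNT(DISTINCT ?class) AS ?classes)
>     WHERE { ?instance a ?class }
>   }
>   # Placeholder URI for the dataset being described (an assumption).
>   BIND (<http://www.example.com/dataset> AS ?dataset)
> }
> ```
>
> On a large store the DISTINCT counts may be expensive, so a provider would probably run this once per release and store the result rather than compute it per request.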
>
>
>> And leave the following out for the moment
>>
>> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
>
> OK. Let's treat it as a long-term goal.
>
> I personally believe identifiers.org is an ideal solution for this purpose because it is open and generic, already provides clean URIs, and can redirect to multiple data sources. The remaining problem is that we would still need to maintain a database of links among instance URIs.
>
> Cheers,
> Toshiaki
>
> On 2013/03/12, at 19:11, Jerven Bolleman wrote:
>
>> Hi All, Mark, Michel,
>>
>> I am most interested in how to serve statistics about classes, properties and links per SPARQL endpoint, as well as about licensing, update frequencies, etc.
>>
>> Bio2RDF has already done quite a lot of work here. See, for example, the SPARQL code inside
>> https://github.com/bio2rdf/bio2rdf-scripts/blob/master/statistics/bio2rdf_stats_virtuoso.php
>>
>> If we can change these to SPARQL 1.1 compliant construct/insert queries then other endpoint providers can do the same as bio2rdf, with minimal effort.
>>
>> We should also look at the overlap between VoID and the bio2rdf dataset description.
>>
>> From the practical goals that Toshiaki put in the agenda I think we should focus only on:
>> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
>> ** Which statistics are needed for which tools.
>> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, ViID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
>> Where point 2 should discuss only endpoint statistics, basic meta data for now.
>>
>> And leave the following out for the moment
>>
>> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
>>
>> I hope this gives a targeted agenda and allows a basic specification to be finished soon, with further iterations and expansions later as needed.
>>
>> For reference, I am interested in providing the following metadata on (beta.)sparql.uniprot.org, and aim to make it accessible both via the SPARQL service description and in the endpoint itself.
>>
>> licensing
>> ontology (re)-use
>> statistics on class use
>> objects
>> properties
>> subjects
>> update frequency
>> last updated date
>> contributors (institutions)
>> which sparql version is supported
>> query timeouts and other constraints
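>>
>> As an illustration only: if the service description were also loaded into the endpoint, a client could check the supported SPARQL version and features from the list above with a query along these lines. It uses the W3C SPARQL 1.1 Service Description vocabulary; whether a given endpoint exposes its description as queryable data is an assumption.
>>
>> ```sparql
>> PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
>>
>> # List each described service with its endpoint, supported languages and features.
>> SELECT ?service ?endpoint ?language ?feature
>> WHERE {
>>   ?service a sd:Service .
>>   OPTIONAL { ?service sd:endpoint ?endpoint }
>>   OPTIONAL { ?service sd:supportedLanguage ?language }   # e.g. sd:SPARQL11Query
>>   OPTIONAL { ?service sd:feature ?feature }
>> }
>> ```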
>>
>>
>> Regards,
>> Jerven
>>
>>
>> On 03/12/2013 10:44 AM, Toshiaki Katayama wrote:
>>> Hi Michel and all,
>>>
>>>>> How about a week from today on a Monday at 3PM CET?
>>>
>>> OK
>>>
>>> I'm still not sure how to join the teleconference.
>>> Will you use Skype?
>>>
>>>> I don't mind that time for one-off calls but I won't be able to make
>>>> it to them regularly.
>>>
>>> +1
>>>
>>>>> (Straw Man) Agenda:
>>>>> * Past: Review of how we exited Biohackathon11 - Scott
>>>>> * Current: Discussion of progress in the meantime (during a few
>>>>> teleconference calls and mail threads, Bio2RDF, other?) - Scott,
>>>>> Michel
>>>>> * Future: Identify work to be done - All
>>>
>>> Sorry for my ignorance but to my understanding from your documents,
>>> this project aims to formalize DB metadata (as a consensus of existing
>>> efforts), right?
>>>
>>> (So that users/agents can easily grasp information about the name,
>>> category, amount, license, contacts, etc. of the dataset,
>>> and hopefully obtain that information via some SPARQL queries?)
>>>
>>> And, practical goals are to define:
>>>
>>> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, ViID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
>>> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
>>> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
>>>
>>> As I need to understand the situation before attending the teleconference,
>>> please correct me if I'm misunderstanding.
>>>
>>> Cheers,
>>> Toshiaki
>>>
>>> On 2013/03/12, at 4:46, Peter Ansell wrote:
>>>
>>>> On 12 March 2013 03:26, M. Scott Marshall <mscottm...@gmail.com> wrote:
>>>>> Hi Toshiaki, All,
>>>>>
>>>>> I decided not to try to have the meeting today in the end because it
>>>>> would be too short of a notice for some people.
>>>>>
>>>>> How about a week from today on a Monday at 3PM CET?
>>>>>
>>>>> Because the U.S. has shifted to Daylight Savings Time already and
>>>>> Europe will only do the same on March 31, it will be somewhat less
>>>>> painful for those in the PST timezone until then (although admittedly
>>>>> still very bright and early at 7AM PST). Any others from the Japanese
>>>>> timezone (Chisato?) or Australia (Peter?) who could make it? And, so
>>>>> we aren't caught by surprise, when is Daylight Savings Time for Japan
>>>>> and Australia?
>>>>
>>>> I am in Queensland, Australia, one of the states that does not have
>>>> Daylight Savings. 3PM CET seems to be Midnight in Brisbane and 11pm in
>>>> Tokyo according to timeanddate.com:
>>>>
>>>> http://www.timeanddate.com/worldclock/meetingtime.html?iso=20130311&p1=47&p2=248&p3=195
>>>>
>>>> I don't mind that time for one-off calls but I won't be able to make
>>>> it to them regularly. There are other times of my working day that
>>>> match Europe and USA individually, but not together (making it
>>>> virtually impossible to do anything with W3C!)
>>>>
>>>>> A tentative agenda would start with a roundup of the issues that we
>>>>> put on the table back in 2011, and identify issues that have been
>>>>> (partially) solved by various partners, and remaining issues that we
>>>>> think are important to consider in the near term.
>>>>>
>>>>> (Straw Man) Agenda:
>>>>> * Past: Review of how we exited Biohackathon11 - Scott
>>>>> * Current: Discussion of progress in the meantime (during a few
>>>>> teleconference calls and mail threads, Bio2RDF, other?) - Scott,
>>>>> Michel
>>>>> * Future: Identify work to be done - All
>>>>>
>>>>> Here are a few documents from our earlier efforts:
>>>>>
>>>>> https://docs.google.com/document/d/1qVSZ1n334fTTchCWS2tOaEL6z-H12pcqty98jQIg6nw/edit?hl=en_US#heading=h.x5e0yirsfi9v
>>>>> https://docs.google.com/spreadsheet/ccc?key=0AvCayBYdTclldEpBSS1wRXNEaU9OeHdWcGRwc09mSmc&usp=sharing
>>>>> https://docs.google.com/document/d/1BZyylrV-NXpCpF23CUj-HSvq4YogXjtCgqN7UtFK5EY/edit
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Scott
>>>>>
>>>>> P.S. Toshiaki - No problem that you didn't know that we had talked
>>>>> about including information about predicates used in an RDF dataset in
>>>>> a Biohackathon11 'working group'. I wouldn't expect you to track and
>>>>> remember every detail. I was actually happy to see you bring up a need
>>>>> for one of the things we had discussed - confirming our suspicions
>>>>> that it was a good idea.
>>>>>
>>>>> On Mon, Mar 11, 2013 at 8:27 AM, Toshiaki Katayama <kt...@dbcls.jp> wrote:
>>>>>> Hi Scott,
>>>>>>
>>>>>> Sorry for my late reply.
>>>>>>
>>>>>>> Several of the ideas that you have mentioned in this thread have been
>>>>>>> a topic of discussion in the Biohackathon 2011 group
>>>>>>
>>>>>> Oops, two years behind. Sorry. ;)
>>>>>> I hope to follow the discussions.
>>>>>>
>>>>>>> If we have a Linked Life Data
>>>>>>> teleconference at 3PM CET on a Monday, could you attend?
>>>>>>
>>>>>> Do you mean it will be today?
>>>>>>
>>>>>>> I would like to form a plan to deliver a
>>>>>>> version of this work, as well as the biohackathon 2011 publication.
>>>>>>
>>>>>> This sounds great.
>>>>>> Do you have any agenda?
>>>>>> What media do you use for the teleconference?
>>>>>> Anyway, I'll try to be online tonight.
>>>>>>
>>>>>> Regards,
>>>>>> Toshiaki
>>>>>>
>>>>>> On 2013/03/07, at 19:20, M. Scott Marshall wrote:
>>>>>>
>>>>>>> Hi Toshiaki,
>>>>>>>
>>>>>>> Several of the ideas that you have mentioned in this thread have been
>>>>>>> a topic of discussion in the Biohackathon 2011 group that we
>>>>>>> eventually called DBCatalog: RDF metadata vocabulary to describe the
>>>>>>> data that is available in an RDF-rendered dataset. We haven't met
>>>>>>> since last year but it seems like the time is right to pick this up
>>>>>>> again. It sounds like you have very practical applications for it as
>>>>>>> well.
>>>>>>>
>>>>>>>> So another possibility might be to extend the VoID specification to indicate a list of predicates and classes in addition to the number of them (void:predicates and void:classes).
>>>>>>>> In this case, each endpoint provider can generate the list by running a SPARQL query only once when importing LOD.
>>>>>>>
>>>>>>> I agree that such a tool would be good and an important way to ease
>>>>>>> adoption of the metadata markup that we will refine in the dbcatalog
>>>>>>> group. In BH11, we discussed how such a tool could gather many
>>>>>>> important graph statistics, including those about predicates and
>>>>>>> classes, and make those available through SPARQL. Last year, Janos
>>>>>>> Hajagos presented a tool in the Linked Life Data task force (LLD) that
>>>>>>> gathered graph statistics:
>>>>>>> https://code.google.com/p/py-triple-simple/
>>>>>>>
>>>>>>>> If we can make an agreement with the major biological LOD providers on providing (a standardized version of) this dataset_vocabulary (DaVo?) in addition to VoID, it would be great as the biological datasets are often very huge.
>>>>>>>
>>>>>>> That is one of the potential outcomes that we were aiming for in the
>>>>>>> dbcatalog. Could we discuss this in the Linked Life Data task force?
>>>>>>> Michel and I have been interested in pursuing these ideas and refining
>>>>>>> them within the context of Linked Life Data. I think that it should
>>>>>>> form the main line of work for that group for at least the next
>>>>>>> several months (1/2 year). Chisato suggested that a time that is
>>>>>>> possible for most timezones is 3PM CET. If we have a Linked Life Data
>>>>>>> teleconference at 3PM CET on a Monday, could you attend? Or would
>>>>>>> another day be better? I would like to form a plan to deliver a
>>>>>>> version of this work, as well as the biohackathon 2011 publication.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Scott
>>>>>>>
>>>>>>> P.S. As you know, I am working at a radiotherapy oncology clinic these
>>>>>>> days. We are just now turning our attention to federation issues
>>>>>>> (for example, query federation across an image database, called a PACS,
>>>>>>> and an electronic health record (EHR)).
>>>>>>>
>>>>>>> On Fri, Feb 15, 2013 at 7:24 PM, Toshiaki Katayama <kt...@dbcls.jp> wrote:
>>>>>>>> Hi Michel,
>>>>>>>>
>>>>>>>> Thank you for your inputs! These Bio2RDF metrics seem to be very useful.
>>>>>>>> If we can make an agreement with the major biological LOD providers on providing (a standardized version of) this dataset_vocabulary (DaVo?) in addition to VoID, it would be great as the biological datasets are often very huge.
>>>>>>>>
>>>>>>>>
>>>>>>>> By the way, for those who might be interested:
>>>>>>>>
>>>>>>>>>> The UniProt license legally thinks what you are doing is absolutely fine. I think its great in any case.
>>>>>>>>
>>>>>>>> Following Jerven's statement, I have temporarily put the UniProt files, split by taxon ID, at
>>>>>>>>
>>>>>>>> http://lod.dbcls.jp/rdf/uniprot_taxon.ttl/
>>>>>>>>
>>>>>>>> I hope this will be useful for setting up species-specific applications.
>>>>>>>>
>>>>>>>> Please use them together with the RDF and OWL files (other than the huge uni*.rdf.gz files) provided at
>>>>>>>>
>>>>>>>> ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/
>>>>>>>>
>>>>>>>> Note that you can generate a list of descendant taxon IDs from a taxonomic node of your choice (tax:42241 in this example) by simply querying with
>>>>>>>>
>>>>>>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>>>>>> PREFIX tax: <http://purl.uniprot.org/taxonomy/>
>>>>>>>> SELECT ?taxon
>>>>>>>> WHERE {
>>>>>>>>   ?taxon rdfs:subClassOf* tax:42241 .
>>>>>>>> }
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Toshiaki
>>>>>>>>
>>>>>>>> On 2013/02/16, at 1:50, Michel Dumontier wrote:
>>>>>>>>
>>>>>>>>> Toshiaki,
>>>>>>>>> Bio2RDF now pre-computes graph properties - you can see our documentation here:
>>>>>>>>>
>>>>>>>>> https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics
>>>>>>>>>
>>>>>>>>> and example pages:
>>>>>>>>>
>>>>>>>>> http://download.bio2rdf.org/release/2/biomodels/biomodels.html
>>>>>>>>> http://download.bio2rdf.org/release/2/gene/gene.html
>>>>>>>>>
>>>>>>>>> While we used our own vocabulary (for ease), we'd be happy to investigate something more standard. VoID is one option, and SPARQLed
>>>>>>>>>
>>>>>>>>> http://sindicetech.com/sindice-suite/sparqled/
>>>>>>>>>
>>>>>>>>> in turn uses the Dataset Analytics Vocabulary ontology
>>>>>>>>>
>>>>>>>>> http://vocab.sindice.net/analytics#
>>>>>>>>>
>>>>>>>>> but I don't find it particularly intuitive.
>>>>>>>>>
>>>>>>>>> m.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: biohac...@googlegroups.com [mailto:biohac...@googlegroups.com] On Behalf Of Toshiaki Katayama
>>>>>>>>> Sent: Friday, February 15, 2013 10:17 AM
>>>>>>>>> To: biohac...@googlegroups.com
>>>>>>>>> Subject: Re: [biohackathon:768] UniProt sparql endpoint at dbcls??
>>>>>>>>>
>>>>>>>>> Hi Jerven,
>>>>>>>>>
>>>>>>>>> Thank you for your clarification.
>>>>>>>>>
>>>>>>>>> As for flint, we are also experiencing the same problem with many endpoints.
>>>>>>>>> In the flint interface, the functionality to generate a list of predicates and classes should be useful for exploring the stored data, but in practice it often times out.
>>>>>>>>>
>>>>>>>>> The situation might be resolved if each endpoint provided meta information as described in the VoID specification.
>>>>>>>>> Although we would still need to hack the flint code, we would be able to make a client application that uses void:vocabulary to fetch properties and classes from the relevant ontologies.
>>>>>>>>> This approach doesn't require a heavy SPARQL load.
>>>>>>>>>
>>>>>>>>> However, in LOD in the wild, many predicates and classes are used without being defined in an ontology.
>>>>>>>>> So another possibility might be to extend the VoID specification to provide a list of predicates and classes in addition to their counts (void:predicates and void:classes).
>>>>>>>>> In this case, each endpoint provider can generate the list by running a SPARQL query only once when importing LOD.
>>>>>>>>> Just a rambling thought..
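>>>>>>>>>
>>>>>>>>> A rough sketch of such a one-off query is given here, under a few assumptions: the dataset URI <http://www.example.com/dataset> is a placeholder, and the list is expressed with VoID's existing void:propertyPartition terms rather than a new extension, which may or may not be sufficient for the flint use case.
>>>>>>>>>
>>>>>>>>> ```sparql
>>>>>>>>> PREFIX void: <http://rdfs.org/ns/void#>
>>>>>>>>>
>>>>>>>>> # List every predicate in use, with the number of triples that use it.
>>>>>>>>> CONSTRUCT {
>>>>>>>>>   <http://www.example.com/dataset> void:propertyPartition [
>>>>>>>>>       void:property ?p ;
>>>>>>>>>       void:triples ?count
>>>>>>>>>   ] .
>>>>>>>>> }
>>>>>>>>> WHERE {
>>>>>>>>>   {
>>>>>>>>>     SELECT ?p (COUNT(*) AS ?count)
>>>>>>>>>     WHERE { ?s ?p ?o }
>>>>>>>>>     GROUP BY ?p
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> The analogous query for classes would group on ?class in { ?s a ?class } and emit void:classPartition with void:class. A client could then read the stored result instead of scanning the whole store.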
>>>>>>>>>
>>>>>>>>> Anyway, I can temporarily remove the UniProt endpoint from our flint configuration, if you prefer.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Toshiaki
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2013/02/15, at 23:13, Jerven Bolleman wrote:
>>>>>>>>>
>>>>>>>>>> Hi Toshiaki,
>>>>>>>>>>
>>>>>>>>>> The UniProt license legally thinks what you are doing is absolutely fine. I think it's great in any case.
>>>>>>>>>>
>>>>>>>>>> I was chasing down a killer query (i.e. one that takes up too much memory).
>>>>>>>>>> Unfortunately it's the flint interface that generates these :( Sesame
>>>>>>>>>> always does the ORDER BY first and then the DISTINCT (as per the SPARQL
>>>>>>>>>> algebra), but for UniProt that generates a sorted list of about 6 billion elements, which means the server runs out of memory and crashes.
>>>>>>>>>> I will ask the sesame/owlim list whether anything is possible there.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Jerven
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Feb 15, 2013, at 2:50 PM, Toshiaki Katayama wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Jerven,
>>>>>>>>>>>
>>>>>>>>>>> I was worried that you meant we would need to make a license agreement to provide a subset of the UniProt database, so thank you for your kind permission.
>>>>>>>>>>>
>>>>>>>>>>> By the way, your skill at digging into our server is amazing. :) Yes, we also
>>>>>>>>>>> put flint in front of some public endpoints and our internal SPARQL endpoints for testing.
>>>>>>>>>>> As I included the official UniProt endpoint, you may have noticed our access in the log.
>>>>>>>>>>> This flint service is also not intended to be widely used, but it might be useful in some cases (e.g. during the hackathon).
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Toshiaki
>>>>>>>>>>>
>>>>>>>>>>> On 2013/02/15, at 22:38, Jerven Bolleman wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Toshiaki,
>>>>>>>>>>>>
>>>>>>>>>>>> Ah great, that's a perfect use case of using our RDF data.
>>>>>>>>>>>> Please continue using it. I just noticed it from lod.dbcls.jp/flint/
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Jerven
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 15, 2013, at 2:35 PM, Toshiaki Katayama wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Jerven,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is mine.
>>>>>>>>>>>>> It is basically a cyanobacterial subset of UniProt for developing our own application.
>>>>>>>>>>>>> The official UniProt endpoint is great, but we needed to run
>>>>>>>>>>>>> many queries by trial and error, and we wanted better performance by using a small subset.
>>>>>>>>>>>>> For now, the endpoint does not have enough capacity to be widely
>>>>>>>>>>>>> used. So please be gentle. :) The server is running on OWLIM Lite 5.3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Toshiaki
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2013/02/15, at 22:15, Jerven Bolleman wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just realized that this sparql endpoint exists at DBCLS.
>>>>>>>>>>>>>> Does anyone know who is maintaining this endpoint? I am wondering what tech they are using, and whether we could formalize a mirroring agreement.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://lod.dbcls.jp/openrdf-sesame5l/repositories/cyano
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Jerven