UniProt sparql endpoint at dbcls??


Jerven Bolleman

Feb 15, 2013, 8:15:50 AM
to biohac...@googlegroups.com
Hi All,

I just realized that this sparql endpoint exists at DBCLS.
Does anyone know who is maintaining this endpoint? Wondering what tech
they are using. And seeing if we could formalize a mirroring agreement.

http://lod.dbcls.jp/openrdf-sesame5l/repositories/cyano

Regards,
Jerven
--
-------------------------------------------------------------------
Jerven Bolleman Jerven....@isb-sib.ch
SIB Swiss Institute of Bioinformatics Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1 Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

Toshiaki Katayama

Feb 15, 2013, 8:35:07 AM
to biohac...@googlegroups.com
Hi Jerven,

This is mine.
It is basically a cyanobacterial subset of UniProt that we use to develop our own application.
The official UniProt endpoint is great, but we needed to run many queries by trial and error,
and we wanted better performance by using a small subset.
For now, the endpoint does not have enough capacity to be widely used. So please be gentle. :)
The server is running on OWLIM Lite 5.3.

Thanks,
Toshiaki
> --
> You received this message because you are subscribed to the Google Groups "BioHackathon" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to biohackathon...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Jerven Bolleman

Feb 15, 2013, 8:38:39 AM
to biohac...@googlegroups.com
Hi Toshiaki,

Ah great, that's a perfect use case for our RDF data.
Please continue using it. I just noticed it from lod.dbcls.jp/flint/

Regards,
Jerven

Toshiaki Katayama

Feb 15, 2013, 8:50:13 AM
to biohac...@googlegroups.com
Hi Jerven,

I was worried that you meant we would need a license agreement to provide a subset of the UniProt database, so thank you for your kind permission.

By the way, your skill at digging into our server is amazing. :)
Yes, we also put flint in front of some public endpoints and our internal SPARQL endpoints for testing.
As I included the official UniProt endpoint, you may have noticed our access in your logs.
This flint service is also not intended to be widely used, but it might be useful in some cases (e.g. during the hackathon).

Cheers,
Toshiaki

Jerven Bolleman

Feb 15, 2013, 9:13:24 AM
to biohac...@googlegroups.com
Hi Toshiaki,

Under the UniProt license, what you are doing is absolutely fine. I think it's great in any case.

I was chasing down a killer query (i.e. one that takes up too much memory).
Unfortunately it's the flint interface that generates these :(
Sesame always does the ORDER BY first and then the DISTINCT (as per the SPARQL algebra),
but for UniProt that generates a sorted list of about 6 billion elements, which means the server runs out of memory and crashes.
I will ask on the Sesame/OWLIM lists whether anything can be done there.
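
As a rough illustration of why the evaluation order matters, here is a toy Python sketch. It is not Sesame or OWLIM code, the data is made up, and it assumes the sort key is the projected value itself, so that both orders give the same answer. It only compares how many rows have to be held and sorted in each case.

```python
# Toy illustration only: not Sesame or OWLIM code.
# Compares how many rows must be sorted when ORDER BY runs before
# DISTINCT versus after it.

def order_then_distinct(rows):
    """Sort every row first, then drop adjacent duplicates."""
    ordered = sorted(rows)              # the whole multiset is sorted
    peak = len(ordered)                 # rows held in the sorted list
    result = []
    for row in ordered:
        if not result or result[-1] != row:
            result.append(row)
    return result, peak


def distinct_then_order(rows):
    """Drop duplicates first, then sort only the distinct rows."""
    unique = set(rows)
    peak = len(unique)                  # rows that actually get sorted
    return sorted(unique), peak


# Hypothetical solution sequence: 5 distinct values, each repeated 20,000 times.
rows = [f"p{i % 5}" for i in range(100_000)]

result_a, peak_a = order_then_distinct(rows)
result_b, peak_b = distinct_then_order(rows)
```

Both functions return the same five values, but the first one sorts the full list while the second sorts only the distinct values. Whether a given store is allowed to reorder the two steps like this depends on the query, so this is a sketch of the cost difference rather than a proposed fix.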

Regards,
Jerven

Toshiaki Katayama

Feb 15, 2013, 10:16:44 AM
to biohac...@googlegroups.com
Hi Jerven,

Thank you for your clarification.

As for flint, we are experiencing the same problem with many endpoints.
In the flint interface, the functionality that generates a list of predicates and classes should be useful for exploring the stored data, but in practice it often times out.

The situation might be resolved if each endpoint provided meta information as described in the VoID specification.
We would still need to hack the flint code, but we could then make a client application that uses void:vocabulary to fetch properties and classes from the relevant ontologies.
This approach doesn't require a heavy SPARQL load.

However, in wild LOD, many predicates and classes are used without being defined in an ontology.
So another possibility might be to extend the VoID specification to provide a list of predicates and classes in addition to their counts (void:properties and void:classes).
In this case, each endpoint provider can generate the list by running a SPARQL query only once, when importing LOD.
Just a rambling thought...
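
A minimal sketch of the "compute once at import time" idea, in Python over an in-memory list of triples. The function name `summarize` and the example.org URIs are made up for illustration; on a real store the same lists would come from queries along the lines of `SELECT DISTINCT ?p WHERE { ?s ?p ?o }` and `SELECT DISTINCT ?c WHERE { ?s a ?c }`, run once and cached.

```python
# Sketch only: summarize() and the example.org data are hypothetical.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def summarize(triples):
    """Collect the distinct predicates and classes used in a set of triples."""
    triples = list(triples)
    properties = sorted({p for _, p, _ in triples})
    classes = sorted({o for _, p, o in triples if p == RDF_TYPE})
    return {"triples": len(triples), "properties": properties, "classes": classes}


EX = "http://example.org/"
toy = [
    (EX + "protein1", RDF_TYPE, EX + "Protein"),
    (EX + "protein1", EX + "organism", EX + "taxon1"),
    (EX + "protein2", RDF_TYPE, EX + "Protein"),
    (EX + "taxon1", RDF_TYPE, EX + "Taxon"),
]
summary = summarize(toy)
```

A client such as flint could then read the cached lists instead of issuing the expensive exploratory queries against the live endpoint.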

Anyway, I can temporarily remove the UniProt endpoint from our flint configuration, if you prefer.

Regards,
Toshiaki

Michel Dumontier

Feb 15, 2013, 11:50:30 AM
to biohac...@googlegroups.com
Toshiaki,
Bio2RDF now pre-computes graph properties - you can see our documentation here:

https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics

and example pages:

http://download.bio2rdf.org/release/2/biomodels/biomodels.html
http://download.bio2rdf.org/release/2/gene/gene.html

while we used our own vocabulary (for ease), we'd be happy to investigate something more standard. VoID is one option, and the SPARQLed

http://sindicetech.com/sindice-suite/sparqled/

too uses the Dataset Analytics Vocabulary ontology

http://vocab.sindice.net/analytics#

but I don't find it particularly intuitive.

m.

Toshiaki Katayama

Feb 15, 2013, 1:24:57 PM
to biohac...@googlegroups.com
Hi Michel,

Thank you for your input! These Bio2RDF metrics seem very useful.
If we could reach an agreement with the major biological LOD providers on providing (a standardized version of) this dataset_vocabulary (DaVo?) in addition to VoID, it would be great, as biological datasets are often huge.


By the way, for those who might be interested:

>> The UniProt license legally thinks what you are doing is absolutely fine. I think its great in any case.

Following Jerven's statement above, I have temporarily put UniProt files split by taxon ID at

http://lod.dbcls.jp/rdf/uniprot_taxon.ttl/

I hope these will be useful for setting up species-specific applications.

Please use them together with the RDF and OWL files (other than the huge uni*.rdf.gz files) provided at

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/

Note that you can generate a list of descendant taxon IDs from a taxonomic node of your choice (tax:42241 in this example) by simply querying with

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tax: <http://purl.uniprot.org/taxonomy/>
SELECT ?taxon
WHERE {
  ?taxon rdfs:subClassOf* tax:42241 .
}
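
For readers who want to see what the `rdfs:subClassOf*` path computes, here is a small Python sketch over made-up edges. The node labels are hypothetical and are not real taxon IDs. Note that the `*` (zero or more steps) form also returns the starting node itself.

```python
from collections import defaultdict, deque


def descendants(root, subclass_of):
    """Return root plus every node that reaches root via subClassOf edges.

    subclass_of is an iterable of (child, parent) pairs.
    """
    children = defaultdict(set)
    for child, parent in subclass_of:
        children[parent].add(child)

    seen = {root}                 # zero-length path: the root matches itself
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical hierarchy, for illustration only.
edges = [
    ("tax:child1", "tax:root"),
    ("tax:grandchild1", "tax:child1"),
    ("tax:child2", "tax:root"),
    ("tax:unrelated", "tax:other"),
]
found = descendants("tax:root", edges)
```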

Regards,
Toshiaki

M. Scott Marshall

Mar 7, 2013, 5:20:25 AM
to toshiaki katayama, Michel Dumontier, biohac...@googlegroups.com, dbca...@googlegroups.com
Hi Toshiaki,

Several of the ideas that you have mentioned in this thread have been
a topic of discussion in the Biohackathon 2011 group that we
eventually called DBCatalog: RDF metadata vocabulary to describe the
data that is available in an RDF-rendered dataset. We haven't met
since last year but it seems like the time is right to pick this up
again. It sounds like you have very practical applications for it as
well.

> So another possibility might be to extend the VoID specification to indicate a list of predicates and classes in addition to the number of them (void:properties and void:classes).
> In this case, each endpoint provider can generate the list by running a SPARQL query only once when importing LOD.

I agree that such a tool would be good and an important way to ease
adoption of the metadata markup that we will refine in the dbcatalog
group. In BH11, we discussed how such a tool could gather many
important graph statistics, including those about predicates and
classes, and make those available through SPARQL. Last year, Janos
Hajagos presented a tool in the Linked Life Data task force (LLD) that
gathered graph statistics: https://code.google.com/p/py-triple-simple/

> If we can make an agreement with the major biological LOD providers on providing (a standardized version of) this dataset_vocabulary (DaVo?) in addition to VoID, it would be great as the biological datasets are often very huge.

That is one of the potential outcomes that we were aiming for in the
dbcatalog. Could we discuss this in the Linked Life Data task force?
Michel and I have been interested in pursuing these ideas and refining
them within the context of Linked Life Data. I think that it should
form the main line of work for that group for at least the next
several months (1/2 year). Chisato suggested that a time that is
possible for most timezones is 3PM CET. If we have a Linked Life Data
teleconference at 3PM CET on a Monday, could you attend? Or would
another day be better? I would like to form a plan to deliver a
version of this work, as well as the biohackathon 2011 publication.

Kind regards,
Scott

P.S. As you know, I am working at a radiotherapy oncology clinic these
days. We are just now turning our attention to federation issues
(example: query federation across an image database (called a PACS) and
an electronic health record (EHR)).
M. Scott Marshall, PhD
MAASTRO clinic, http://www.maastro.nl/en/1/
http://eurecaproject.eu/
https://plus.google.com/u/0/114642613065018821852/posts
http://www.linkedin.com/pub/m-scott-marshall/5/464/a22

Mark

Mar 9, 2013, 1:24:03 AM
to toshiaki katayama, Michel Dumontier, M. Scott Marshall, biohac...@googlegroups.com, dbca...@googlegroups.com

> Several of the ideas that you have mentioned in this thread have been
> a topic of discussion in the Biohackathon 2011 group that we
> eventually called DBCatalog: RDF metadata vocabulary to describe the
> data that is available in an RDF-rendered dataset.


I'm sure Michel will respond to this thread himself, but I am thrilled
about his recent publication of various metadata relating to the Bio2RDF
datasets. **Incredibly** useful!! So much so, that we are now just days
away from exposing all of Bio2RDF as SADI services that create themselves
dynamically based on the Bio2RDF metadata! This will allow you to
automatically discover the data (regardless of endpoint) and chain it
together with analytical services.

:-)

M

Toshiaki Katayama

Mar 11, 2013, 3:27:15 AM
to M. Scott Marshall, Michel Dumontier, biohac...@googlegroups.com, dbca...@googlegroups.com
Hi Scott,

Sorry for my late reply.

> Several of the ideas that you have mentioned in this thread have been
> a topic of discussion in the Biohackathon 2011 group

Oops, I'm two years behind. Sorry. ;)
I hope to follow the discussions.

> If we have a Linked Life Data
> teleconference at 3PM CET on a Monday, could you attend?

Do you mean it will be today?

> I would like to form a plan to deliver a
> version of this work, as well as the biohackathon 2011 publication.

This sounds great.
Do you have an agenda?
What medium do you use for the teleconference?
Anyway, I'll try to be online tonight.

Regards,
Toshiaki

Trish Whetzel

Mar 11, 2013, 12:37:19 PM
to dbca...@googlegroups.com, Mark, toshiaki katayama, M. Scott Marshall, biohac...@googlegroups.com
Hi Michel,

Do you have a link to this vocabulary, or can you send more details on what is included as "dataset statistics"?

Thanks!
Trish


On Sun, Mar 10, 2013 at 7:38 PM, Michel Dumontier <michel.d...@gmail.com> wrote:
Hi all,
  First, thanks for the rousing endorsement Mark :) 

  Second, our work on Bio2RDF provenance and dataset statistics will be presented and published as part of ESWC 2013. We are now working towards consensus with OpenPhacts (a confidence building measure), with an implementation for Bio2RDF Release 3 scheduled in time for Biohackathon. After that, I would be interested in getting all interested parties together to publish a definitive W3C note, which would incorporate even wider interests (as articulated in our biohackathon meeting).

 As far as the dataset statistics go, we used our own vocabulary, so we're open to examining other vocabs for the purpose. 

m.
--
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group

--
You received this message because you are subscribed to the Google Groups "dbcatalog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dbcatalog+...@googlegroups.com.

M. Scott Marshall

Mar 11, 2013, 1:26:19 PM
to dbca...@googlegroups.com, Chisato Yamasaki, Peter Ansell, Michel Dumontier, biohac...@googlegroups.com
Hi Toshiaki, All,

I decided not to try to have the meeting today in the end because it
would be too short of a notice for some people.

How about a week from today on a Monday at 3PM CET?

Because the U.S. has shifted to Daylight Savings Time already and
Europe will only do the same on March 31, it will be somewhat less
painful for those in the PST timezone until then (although admittedly
still very bright and early at 7AM PST). Any others from the Japanese
timezone (Chisato?) or Australia (Peter?) who could make it? And, so
we aren't caught by surprise, when is Daylight Savings Time for Japan
and Australia?

A tentative agenda would start with a roundup of the issues that we
put on the table back in 2011, and identify issues that have been
(partially) solved by various partners, and remaining issues that we
think are important to consider in the near term.

(Straw Man) Agenda:
* Past: Review of how we exited Biohackathon11 - Scott
* Current: Discussion of progress in the meantime (during a few
teleconference calls and mail threads, Bio2RDF, other?) - Scott,
Michel
* Future: Identify work to be done - All

Here are a few documents from our earlier efforts:
https://docs.google.com/document/d/1qVSZ1n334fTTchCWS2tOaEL6z-H12pcqty98jQIg6nw/edit?hl=en_US#heading=h.x5e0yirsfi9v
https://docs.google.com/spreadsheet/ccc?key=0AvCayBYdTclldEpBSS1wRXNEaU9OeHdWcGRwc09mSmc&usp=sharing
https://docs.google.com/document/d/1BZyylrV-NXpCpF23CUj-HSvq4YogXjtCgqN7UtFK5EY/edit


Cheers,
Scott

P.S. Toshiaki - No problem that you didn't know that we had talked
about including information about predicates used in an RDF dataset in
a Biohackathon11 'working group'. I wouldn't expect you to track and
remember every detail. I was actually happy to see you bring up a need
for one of the things we had discussed - confirming our suspicions
that it was a good idea.

Peter Ansell

Mar 11, 2013, 3:46:03 PM
to M. Scott Marshall, dbca...@googlegroups.com, Chisato Yamasaki, Michel Dumontier, biohac...@googlegroups.com
On 12 March 2013 03:26, M. Scott Marshall <mscottm...@gmail.com> wrote:
> Hi Toshiaki, All,
>
> I decided not to try to have the meeting today in the end because it
> would be too short of a notice for some people.
>
> How about a week from today on a Monday at 3PM CET?
>
> Because the U.S. has shifted to Daylight Savings Time already and
> Europe will only do the same on March 31, it will be somewhat less
> painful for those in the PST timezone until then (although admittedly
> still very bright and early at 7AM PST). Any others from the Japanese
> timezone (Chisato?) or Australia (Peter?) who could make it? And, so
> we aren't caught by surprise, when is Daylight Savings Time for Japan
> and Australia?

I am in Queensland, Australia, one of the states that does not have
Daylight Savings. 3PM CET seems to be Midnight in Brisbane and 11pm in
Tokyo according to timeanddate.com:

http://www.timeanddate.com/worldclock/meetingtime.html?iso=20130311&p1=47&p2=248&p3=195

I don't mind that time for one-off calls but I won't be able to make
it to them regularly. There are other times of my working day that
match Europe and USA individually, but not together (making it
virtually impossible to do anything with W3C!)

Trish Whetzel

Mar 11, 2013, 4:11:10 PM
to dbca...@googlegroups.com, M. Scott Marshall, Chisato Yamasaki, Michel Dumontier, biohac...@googlegroups.com
Monday, March 18 at 3:00pm CET works for me.

Trish

Toshiaki Katayama

Mar 12, 2013, 5:44:55 AM
to biohac...@googlegroups.com, M. Scott Marshall, dbca...@googlegroups.com, Chisato Yamasaki, Michel Dumontier
Hi Michel and all,

>> How about a week from today on a Monday at 3PM CET?

OK

I'm still not sure how to join the teleconference.
Will you use Skype?

> I don't mind that time for one-off calls but I won't be able to make
> it to them regularly.

+1

>> (Straw Man) Agenda:
>> * Past: Review of how we exited Biohackathon11 - Scott
>> * Current: Discussion of progress in the meantime (during a few
>> teleconference calls and mail threads, Bio2RDF, other?) - Scott,
>> Michel
>> * Future: Identify work to be done - All

Sorry for my ignorance, but as I understand from your documents,
this project aims to formalize DB metadata (as a consensus of existing
efforts), right?

(So that users/agents can easily grasp information about the name,
category, amount, license, contacts etc. of the dataset,
and hopefully obtain that information via some SPARQL queries?)

And, practical goals are to define:

* common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, VoID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
* common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
* where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)

As I need to understand the situation before attending the teleconference,
please correct me if I'm misunderstanding.

Cheers,
Toshiaki

M. Scott Marshall

Mar 12, 2013, 5:58:23 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com, Chisato Yamasaki, Michel Dumontier
Hi Toshiaki,

> this project aims to formalize DB metadata (as a consensus of existing
> efforts), right?
>
> (So that, users/agents can easily grasp information about the name,
> category, amount, license, contacts etc. of the dataset;
> and hopefully obtain those information via some SPARQL queries?)
>
> And, practical goals are to define:
>
> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, VoID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)

You have written an excellent summary of what it's about. Spot on. Thank you.

To connect: In HCLS, we have been using https://www.fuzebox.com/
lately. Even though it claims to not support Linux, Eric Prud'hommeaux
was able to connect with it from Ubuntu. I will send an announcement
with instructions for how to join in the coming days.

Cheers,
Scott

Jerven Bolleman

Mar 12, 2013, 6:11:51 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com
Hi All, Mark, Michel,

I am most interested in how to serve statistics about classes,
properties and links per SPARQL endpoint, as well as about licensing,
update frequencies, etc.

Bio2RDF has quite a lot of work already done. For example, see the SPARQL
code inside
https://github.com/bio2rdf/bio2rdf-scripts/blob/master/statistics/bio2rdf_stats_virtuoso.php

If we can change these to SPARQL 1.1 compliant construct/insert queries
then other endpoint providers can do the same as bio2rdf, with minimal
effort.

We should also look at the overlap between VoID and the bio2rdf dataset
description.

From the practical goals that Toshiaki put in the agenda I think we
should focus only on:
* where/how to serve SPARQL endpoints (hopefully with statistics on #
of classes/properties/triples/links etc.)
** Which statistics are needed for which tools.
* common vocabularies to be used among the projects (Bio2RDF, BioDBCore,
Biositemap, VoID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
where the second point should cover only endpoint statistics and basic
metadata for now.

And leave the following out for the moment

* common URIs (namespaces) to be linked together (Identifiers.org,
Bio2RDF.org, purl.*.org etc.)

I hope this gives a targeted agenda and allows a basic specification to
be finished soon, with further iterations and expansions later as needed.

For the record, I am interested in providing the following metadata on
(beta.)sparql.uniprot.org, and I aim to make it accessible both via the
SPARQL service description and in the endpoint itself.

licensing
ontology (re)-use
statistics on class use
objects
properties
subjects
update frequency
last updated date
contributors (institutions)
which sparql version is supported
query timeouts and other constraints
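
To make part of the list above concrete, here is a hedged Python sketch that serializes a few of these items as Turtle using existing VoID and Dublin Core terms (void:sparqlEndpoint, void:vocabulary, void:triples, void:classes, void:properties, dcterms:license, dcterms:modified). The function name, all URIs and all numbers are placeholders, not real UniProt values. The supported SPARQL version would fit the SPARQL 1.1 Service Description vocabulary, and query timeouts have no obvious home in these vocabularies, so both are left out here.

```python
def void_description(dataset_uri, endpoint_uri, license_uri, modified,
                     counts, vocabularies=()):
    """Return a Turtle string describing one dataset (sketch, placeholder data)."""
    if not counts:
        raise ValueError("at least one count is needed to close the statement")
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f"    void:sparqlEndpoint <{endpoint_uri}> ;",
        f"    dcterms:license <{license_uri}> ;",
        f'    dcterms:modified "{modified}"^^xsd:date ;',
    ]
    for vocab in vocabularies:
        lines.append(f"    void:vocabulary <{vocab}> ;")
    items = sorted(counts.items())
    for index, (term, value) in enumerate(items):
        end = " ." if index == len(items) - 1 else " ;"
        lines.append(f"    void:{term} {value}{end}")
    return "\n".join(lines)


# Placeholder values for illustration only.
ttl = void_description(
    dataset_uri="http://example.org/dataset",
    endpoint_uri="http://example.org/sparql",
    license_uri="http://example.org/license",
    modified="2013-03-12",
    counts={"triples": 4, "classes": 2, "properties": 2},
    vocabularies=["http://example.org/core/"],
)
```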


Regards,
Jerven

Jerven Bolleman

Mar 12, 2013, 6:14:48 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com
Hi All, Chisato,

I will also be at biocuration2013.

Regards,
Jerven

On 03/12/2013 11:02 AM, Chisato Yamasaki wrote:
> Hi Scott, Michel and all,
> Thank you for arranging the call.
> I can also make call on March 18 at 3:00pm CET,
> 7AM PST, 11PM JST.
>
> Also, I am going to join Biocuration2013 in UK and
> staying around in Cambridge/London on 7-12 April.
> http://www.ebi.ac.uk/biocuration2013/content/home
>
> I will be happy to see if anyone else is also joining Biocuration2013
> or can arrange another meeting ;-)
>
> Thank you,
> Chisato Yamasaki

Susanna Sansone

Mar 12, 2013, 6:20:23 AM
to dbca...@googlegroups.com, <biohackathon@googlegroups.com>, Philippe
Hi Chisato,
Philippe and I will also be at Biocuration 2013; some more work has also been done on BioSharing and BioDBCore. We will catch up with you and Pascale there.
See you soon!
Susanna


chisato yamasaki

Mar 12, 2013, 6:22:11 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com
Hi Jerven and all,
Great, we can also meet there ;-)

Cheers,
Chisato Yamasaki <chisato-...@aist.go.jp>

2013/3/12, Jerven Bolleman <jerven....@isb-sib.ch>:

chisato yamasaki

Mar 12, 2013, 6:32:54 AM
to dbca...@googlegroups.com, <biohackathon@googlegroups.com>, Philippe
Hi Susanna, Philippe and all,
I will be happy to meet you and hear about the recent updates on
biosharing and biodbcore.

See you soon ;-)
Chisato

2013/3/12 Susanna Sansone <sa.sa...@gmail.com>:

Paul Gordon

Mar 12, 2013, 3:39:29 PM
to biohac...@googlegroups.com, toshiaki katayama, Michel Dumontier, dbca...@googlegroups.com
A comment from the peanut gallery, if you will allow me. I'm working
mostly with clinical personal genomes these days... if there is any scope to
efficiently tackle RDF queries where there are between 20,000 (exome best
case) and 3,000,000 (genome worst case) locations/IDs to look up, I'd be
happy to start delving into this side of things again.

Cheers,

Paul

Michel Dumontier

Mar 12, 2013, 3:13:11 PM
to dbca...@googlegroups.com, biohac...@googlegroups.com, toshiaki katayama, Michel Dumontier
Paul,
If all you're doing is asking for a numeric range, then I agree, it's not worthwhile. But if your query involves transitive reasoning over a hierarchy, across mappings and over a number of domains, then I think the solution is clear.

m.

Paul Gordon

Mar 12, 2013, 4:25:51 PM
to biohac...@googlegroups.com, dbca...@googlegroups.com, toshiaki katayama, Michel Dumontier
Hi Michel,

What I envisage is queries that use transitive reasoning to find a
relatively small number of higher level biological concepts (pathways,
diseases, metabolites) that are shared by thousands of sequence variants
across a personal genome cohort, even if the individual variants are not
shared (i.e. genetic heterogeneity). Of the reverse, where the phenotype
implicates a specific metabolite, and I need to see if there are related
variants in the dataset I have ( metabolite -> genes -> pathways -> genes
-> genomic locations).

Cheers,

Paul


Michel Dumontier

Mar 12, 2013, 3:29:36 PM
to Paul Gordon, biohac...@googlegroups.com, dbca...@googlegroups.com, toshiaki katayama
yup, that's pretty interesting - would you be interested in presenting this problem / your work to the HCLS?

m.

Paul Gordon

Mar 12, 2013, 4:35:07 PM
to biohac...@googlegroups.com
That should say "Or the reverse" :-)
The reverse case should be fairly efficient, but often we don't have such
useful phenotypic data, hence the need to go the other way and query with
thousands of locations.

Paul Gordon

Mar 12, 2013, 4:49:10 PM
to biohac...@googlegroups.com
Sure.  This is getting to be a common problem…they say we're in the era of the $1000 genome and the $1,000,000 interpretation :-)

Michel Dumontier

Mar 12, 2013, 3:55:22 PM
to biohac...@googlegroups.com
Exactly why big data should pay attention.

m.

Mark

Mar 12, 2013, 4:55:33 PM
to biohac...@googlegroups.com, Paul Gordon
The same kind of problem exists in metagenomic analyses, I think. The need to 'distill' over (tens or hundreds of) millions of often completely independent homology-predictions (assuming that there was no attempt at genome reconstruction... which would be dubious at best)

My group here is struggling with that (same?) problem right now... trying to find a 'useful' semantic representation of metagenome analyses that could make use of SPARQL+reasoning...

Maybe not the same problem, but it feels the same, given what Paul describes...??

M

Toshiaki Katayama

Mar 12, 2013, 9:28:19 PM
to biohac...@googlegroups.com, dbca...@googlegroups.com
Hi Jerven,

> From the practical goals that Toshiaki put in the agenda I think we should focus only on:
> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
> ** Which statistics are needed for which tools.

Agreed.

For this purpose, I think it would be natural to enhance the YummyData project, which was started during the BH12 meeting.

http://yummydata.org/
https://github.com/dbcls/bh12/wiki/Yummy-data

> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, VoID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
> Where point 2 should discuss only endpoint statistics, basic meta data for now.

Once we reach a consensus on those common (subset of) vocabularies, probably based on the Bio2RDF code and Scott's summary,

https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics
https://docs.google.com/spreadsheet/ccc?key=0AvCayBYdTclldEpBSS1wRXNEaU9OeHdWcGRwc09mSmc#gid=0

we can put them on yummydata.org so that YummyData can be used to provide basic statistics with broader coverage and Bio2RDF can continue to provide advanced metrics on their collection (some of which YummyData might incorporate for all supported datasets in the future). Another solution would be to push each dataset provider to provide those metrics by themselves, but it might require some time.

In any case,

> I am most interested in the how to serve statistics about classes properties and links per SPARQL endpoint. As well as about licensing and update frequencies etc...


> If we can change these to SPARQL 1.1 compliant construct/insert queries then other endpoint providers can do the same as bio2rdf, with minimal effort.

these two points are important to keep in mind.


> And leave the following out for the moment
>
> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)

OK. Let's treat it as a long-term goal.

I personally believe identifiers.org is an ideal solution for this purpose because it is open and generic, already provides clean URIs, and can redirect to multiple data sources. The remaining problem is that we would still need to maintain a database of links among instance URIs.

Cheers,
Toshiaki

Paul Gordon

Mar 12, 2013, 9:41:37 PM
to Mark, biohac...@googlegroups.com
Well, I'm mostly concerned with the efficiency of queries containing thousands of identifiers, whereas it sounds like you might be trying to grapple with the semantic representation of the input data relationships themselves.  Both are legitimate "big data" issues I think, but I'm hoping mine is more tractable, as you need to deal with nebulous concepts such as OTUs...

Mark Wilkinson

Mar 13, 2013, 3:46:54 AM
to Paul Gordon, biohac...@googlegroups.com
On Wed, 13 Mar 2013 02:41:37 +0100, Paul Gordon <gor...@ucalgary.ca> wrote:

Well, I'm mostly concerned with efficiency of queries containing thousands of identifiers

Ah!!  It wasn't until I read this that I finally twigged to what you were saying :-)

Okay, right, different problem.

M

Jerven Bolleman

Mar 13, 2013, 5:06:16 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com
Hi Paul,

I hope you won't mind me putting my "solution architect" hat on... I had
a similar discussion last week with one of the Profs in a different
group investigating what to set up from scratch for NGS plus meta data work.

What this sounds like is a lot of joins between different data sources,
but joins they are. The identifier lookups are a way of executing a
merge join between datasets, right? Then a number of SPARQL 1.1
endpoints could very well be an efficient solution. The native SPARQL
stores are definitely join-optimized architectures.

However, you need to take a few things into account. SPARQL support does
not have to mean an RDF store; you should also look at R2RML, SADI, etc., and
specialized stores. SPARQL is storage agnostic...

You will need SPARQL 1.1 support for the VALUES and SERVICE keywords,
which is the least well supported part at the moment (it was the last change
and the hardest to get right) for most APIs and vendors.
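
As a sketch of how VALUES and SERVICE could be combined for large identifier lists, the Python generator below splits the identifiers into batches and builds one SPARQL 1.1 query per batch. The function name, the `ex:` namespace, the `ex:variantOfGene` predicate, the endpoint URL and the batch size are all assumptions for illustration; a sensible batch size would depend on the endpoint's limits.

```python
def values_queries(ids, service_url, batch_size=1000):
    """Yield one SPARQL 1.1 query string per batch of identifier URIs."""
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        values = " ".join(f"<{uri}>" for uri in batch)
        yield (
            "PREFIX ex: <http://example.org/vocab#>\n"   # hypothetical vocabulary
            "SELECT ?variant ?gene WHERE {\n"
            f"  VALUES ?variant {{ {values} }}\n"
            f"  SERVICE <{service_url}> {{\n"
            "    ?variant ex:variantOfGene ?gene .\n"
            "  }\n"
            "}"
        )


# Hypothetical identifiers and endpoint, for illustration only.
ids = [f"http://example.org/variant/{i}" for i in range(2500)]
queries = list(values_queries(ids, "http://example.org/gene/sparql"))
```

With 2,500 identifiers and the default batch size this produces three queries, the last one carrying the remaining 500 identifiers.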

If you are investigating federation issues, the fact that SPARQL has
federation built in and that it works between different vendors and
solutions means it should be at the top of your list for investigation.

To be honest there is no turnkey solution for your problem field, and
the science you are doing might be cutting edge enough that you don't
have time for cutting edge infrastructure.

Hope I have not been preaching too much ;)

Regards,
Jerven

Mark Wilkinson

Mar 13, 2013, 5:32:56 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com, Jerven Bolleman

> However, you need to take a few things into account. SPARQL support does
> not mean RDF store you should also look at R2RML, SADI etc.


Indeed... it feels somewhat more intuitive to me to batch-POST 3,000,000
URIs to an asynchronous SADI service than to mention them in a SPARQL
query...

M

Jerven Bolleman

Mar 13, 2013, 5:46:01 AM
to Mark Wilkinson, dbca...@googlegroups.com, biohac...@googlegroups.com
The way I would think about doing it is something like this query.

SELECT ?disease WHERE {
  ?idSeqVar a :SequenceVariant . # binds 3 million times
  SERVICE <gene:endpoint> {
    ?idSeqVar :variantOfGene ?idGene . # binds up to 21,000 times
    ?idGene :phenotype ?phenotype . # binds 50,000 times
    ?phenotype a :diseasePhenotypeForm ; # binds up to 12,000 times
      :causesDisease ?disease .
  }
}

The 3 million ids to post are never typed by the user ;) the service
call does the posting for you. And whether that service call goes to a SADI
endpoint, a relational database, an RDF store, a Redis key-value
store or something else _does not really matter_.
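
A rough client-side sketch of what "the service call does the posting for you" could amount to: the identifiers are cut into batches, and each batch is inlined as a VALUES block in the query sent to the remote endpoint. This is Python rather than anything from the mail above; the function names, the batch size and the example.org IRIs are assumptions for illustration, and the `:` prefix is assumed to be declared elsewhere.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def values_query(variant_iris):
    """Build one SPARQL query that binds ?idSeqVar to the given IRIs."""
    rows = " ".join(f"<{iri}>" for iri in variant_iris)
    return (
        "SELECT ?idSeqVar ?disease WHERE {\n"
        f"  VALUES ?idSeqVar {{ {rows} }}\n"
        "  ?idSeqVar :variantOfGene ?idGene .\n"
        "  ?idGene :phenotype ?phenotype .\n"
        "  ?phenotype :causesDisease ?disease .\n"
        "}"
    )


# Hypothetical variant IRIs; a real run would have thousands to millions.
ids = [f"http://example.org/variant/{n}" for n in range(5)]
queries = [values_query(chunk) for chunk in batches(ids, 2)]
print(len(queries))
print(queries[0])
```

Each generated query stays small enough to POST, so the total cost becomes a matter of how many round trips the endpoint will tolerate.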

Regards,
Jerven.
>
> M

Paul Gordon

Mar 13, 2013, 12:35:43 PM
to biohac...@googlegroups.com, dbca...@googlegroups.com
Thanks all for the suggestions... to be clear, I'm trying to avoid having to
build any infrastructure, and query others' datasets... because I'm lazy.
:-)

Jerven: I am aware of the concept of unification in logic programming ;-)
My problem is that I have many thousands of IDs to look up for a single
patient, but the actual set of possible IDs is 3 billion (i.e. every
position in the genome). In reality I can pare this down since this is a
protein coding query, but in other instances I can envisage needing to
look at all the patient's genomic variants (up to 3 million). To be fair,
I have not actually tried such a massive query, but my intuition told me
that such a request would likely fail on most existing public endpoints.
Please correct me if my intuition is wrong!

Right now the common practice in personal genomics is to download 100GB+
datasets for the whole genome as gzip'ed tab delimited flat files (VCF,
BED, GTF, etc.) that are efficiently indexed (tabix) for position-based
lookup, then run batch queries against these files and parse the result
lines for the fields you want. That works for 1:1(ish) lookups like
conservation scores and minor allele frequencies, but not for conceptual
queries like the one I described earlier.

On 2013-03-13 2:46 AM, "Jerven Bolleman" <jerven....@isb-sib.ch>
wrote:

Jerven Bolleman

Mar 13, 2013, 12:07:56 PM
to biohac...@googlegroups.com
Hi Paul,

> Thanks all for the suggestions... to be clear, I'm trying to avoid having to
> build any infrastructure, and query others' datasets... because I'm lazy.
> :-)
We are all lazy...
>
> Jerven: I am aware of the concept of unification in logic programming ;-)
> My problem is that I have many thousands of IDs to look up for a single
> patient, but the actual set of possible IDs is 3 billion (i.e. every
> position in the genome). In reality I can pare this down since this is a
> protein coding query, but in other instances I can envisage needing to
> look at all the patient's genomic variants (up to 3 million). To be fair,
> I have not actually tried such a massive query, but my intuition told me
> that such a request would likely fail on most existing public endpoints.
> Please correct me if my intuition is wrong!
To be honest, if you use a public endpoint then yes, you are going to need to wait.
That is similar to doing 3 billion HTTP REST calls ;) e.g. using the Ensembl APIs.
i.e. you're not going to fail if you post 2 MB at a time etc... but wait you must :(

The question then is what is easier than RDF + SPARQL? i.e. what other approach
are you going to implement?
> Right now the common practice in personal genomics is to download 100GB+
> datasets for the whole genome as gzip'ed tab delimited flat files (VCF,
> BED, GTF, etc.) that are efficiently indexed (tabix) for position-based
> lookup, then run batch queries against these files and parse the result
> lines for the fields you want. That works for 1:1(ish) lookups like
> conservation scores and minor allele frequencies, but not for conceptual
> queries like the one I described earlier.
I think the point that I am trying to make is that just because the data is a BED file etc..., it doesn't mean you can't
run a SPARQL query against it. In the end you just need to implement the TripleSource api in Sesame or Jena to
become SPARQL capable. And at that point you won't be much slower than a basic script. i.e. linear IO bound.
But you just standardized your API between BED, GTF, ArrayExpress and UniProt etc...
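
To make the "triple source" idea concrete without tying it to Sesame or Jena, here is a minimal Python sketch of the same principle: BED lines are turned into triples on the fly, and a single triple pattern is answered by one linear scan. The `http://example.org/bed#` vocabulary and both function names are invented for illustration; a real implementation would plug an equivalent iterator into the query engine's own interface.

```python
import io

EX = "http://example.org/bed#"  # hypothetical vocabulary, for illustration only


def bed_triples(lines):
    """Stream (subject, predicate, object) tuples from BED-like lines."""
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        feature = f"{EX}feature/{n}"
        yield (feature, EX + "chromosome", chrom)
        yield (feature, EX + "begin", int(start))
        yield (feature, EX + "end", int(end))


def match(triples, predicate=None, obj=None):
    """Answer a single triple pattern by scanning: linear and IO bound."""
    for s, p, o in triples:
        if (predicate is None or p == predicate) and (obj is None or o == obj):
            yield (s, p, o)


bed = io.StringIO("1\t10927\t10927\n1\t10433\t10433\n1\t10439\t10440\n")
hits = list(match(bed_triples(bed), predicate=EX + "begin", obj=10433))
print(hits)
```

Nothing is loaded into a store here; the file is read once per pattern, which is where the "not much slower than a basic script" expectation comes from.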

Regards,
Jerven

Michel Dumontier

Mar 13, 2013, 12:26:03 PM
to biohac...@googlegroups.com
 
To be honest, if you use a public endpoint then yes, you are going to need to wait.
That is similar to doing 3 billion HTTP REST calls ;) e.g. using the Ensembl APIs.
i.e. you're not going to fail if you post 2 MB at a time etc... but wait you must :(


Indeed, that's why Bio2RDF makes both the RDF and the full-text-indexed Virtuoso databases available for download. Feel free to download them and bring them up on your own resources!

m.
 

Toshiaki Katayama

Mar 14, 2013, 1:04:12 AM
to biohac...@googlegroups.com
Hi Jerven,

> I think the point that I am trying to make is that just because the data is a BED file etc..., it doesn't mean you can't
> run a SPARQL query against it. In the end you just need to implement the TripleSource api in Sesame or Jena to
> become SPARQL capable.

This is a very interesting point. I also found your tweet this morning:

https://twitter.com/jervenbolleman/status/311856468582363136
jervenbolleman: Just ran my first #SPARQL query against a SAM file by using #picard/#samools as triple source and #sesame.

Have you already implemented the API for BED/GFF/SAM etc.?

Cheers,
Toshiaki

Francesco Strozzi

Mar 14, 2013, 2:06:52 AM
to biohac...@googlegroups.com, biohac...@googlegroups.com
Hi all,
this is a very interesting topic, we are currently working on a way to store VCF files in some database to query them using genomics coordinates and alleles. One kind of query we would like to perform is something like: given 20 samples, I want all the SNPs that are present in 10 samples but not in the other 10 and give me back just the alternative alleles. This is something that to our knowledge is difficult to obtain using the current command line tools...

Do you think this would be possible using the approach described by Jerven?
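
To pin down the kind of result being asked for, here is a small Python sketch with plain sets. It reads "present in 10 samples but not in the other 10" as "carried by every sample in the first group and by no sample in the second", which is an assumption, as are the function name and the toy data layout (one dict of alternative alleles per sample).

```python
def group_specific_alleles(calls, group_a, group_b):
    """Return {(chrom, pos): alt} for variants carried by every sample in
    group_a and by no sample in group_b.

    `calls` maps sample name -> {(chrom, pos): alt_allele}.
    """
    shared = set(calls[group_a[0]].items())
    for sample in group_a[1:]:
        shared &= set(calls[sample].items())
    excluded = set()
    for sample in group_b:
        excluded |= set(calls[sample].items())
    return dict(sorted(shared - excluded))


# Toy data with 2 + 2 samples instead of 10 + 10.
calls = {
    "s1": {("1", 100): "G", ("1", 200): "T"},
    "s2": {("1", 100): "G", ("1", 200): "T", ("2", 50): "A"},
    "s3": {("1", 200): "T"},
    "s4": {("2", 50): "A"},
}
result = group_specific_alleles(calls, ["s1", "s2"], ["s3", "s4"])
print(result)
```

The comparison is on the (position, allele) pair, so a different alternative allele at the same position in the second group does not exclude a variant.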

Cheers!
Francesco

----
Francesco Strozzi
Bioinformatics

Parco Tecnologico Padano
Via Einstein
Loc. Cascina Codazza
26900 Lodi, ITALY
Phone: +39 03714662333
Skype: francescostrozzi

Pjotr Prins

Mar 14, 2013, 3:12:34 AM
to biohac...@googlegroups.com
Hi Jerven,

Would it be possible to get a GSoC student involved in such a project?
We are writing project proposals for the OBF - some Ruby ones are here

http://bioruby.open-bio.org/wiki/Google_Summer_of_Code#Create_a_D3_based_graphics_package_for_Bioinformatics

Pj.
> >mostly with clinical personal genomes these days...if there is any scope to
> >efficiently tackle RDF queries where there are between 20,000 (exome best
> >case) and 3,000,000 (genome worst case) locations/IDs to look up, I'd be
> >happy to start delving into this side of things again.
> >
> >Cheers,
> >
> >Paul
> >
> >>
> >>P.S. As you know, I am working at a radiotherapy oncology clinic these
> >>days. We are just now turning our attention to federation issues
> >>(example: query federation across image database (called a PACS) and
> >>electronic health record (EHR).
> >
> >
>
>

Jerven Bolleman

Mar 14, 2013, 4:42:16 AM
to biohac...@googlegroups.com, LINDENBAUM pierre, Andy Jenkinson
Hi Toshiaki, All,

I started with a SAM file; it is the origin of this plan. It only has about 10 hours of work in it, so it's not fast and certainly not complete. I have not done anything for related file formats. I work on protein data by day; NGS is a hobby project only (currently).

What I have currently implemented is the simple triple source API approach (which is slow), whose purpose is to validate unit tests for a faster (SPARQL to Java code compiler) approach. This triple source approach was inspired by an example written by Pierre Lindenbaum at http://plindenbaum.blogspot.ch/2012/11/creating-virtual-rdf-graph-describing.html except I use the Sesame API.

What Mark Wilkinson with SADI, Blueprints for graph stores, and D2RQ/R2RML have shown is that because SPARQL is so generic you can really use any data source at all to answer SPARQL queries. All you need is a translator/compiler. The really nice thing is that with SPARQL, data storage and data access have become separated, meaning developers can innovate on data storage approaches without breaking data access.

In the future I expect a lot of specialized SPARQL stores for specific kinds of data, e.g. a database optimized for NGS read storage, next to generic triple stores that can handle any data.

Regards,
Jerven

P.S. If anyone knows of a nice grant call where we could propose to work on this full time, please let me know.

Joachim Baran

Mar 14, 2013, 9:14:14 AM
to biohac...@googlegroups.com
Hello,

On 2013-03-14, at 2:06 AM, Francesco Strozzi <francesc...@tecnoparco.org> wrote:
> [...] to store VCF files in some database to query them using genomics coordinates and alleles.
I think that is a very good point: getting the data into the database is easy, but enabling the stored data to be analyzed in an efficient way can be challenging.

It would be beneficial to have a demo set of SPARQL queries on genomic data that show how to do some tasks current command line tools can carry out. For example, calculating allele frequencies, case/control comparison tests, return genomic features in spatially close proximity to exploit genetic linkage, etc.

Otherwise, it might be too easy for critics to argue that SQL databases provide much better solutions, or simply point out that we already have command line tools that deal with standardized file formats.

Joachim

Jerven Bolleman

Mar 14, 2013, 9:21:07 AM
to biohac...@googlegroups.com
On Mar 14, 2013, at 2:14 PM, Joachim Baran wrote:

> Hello,
>
> On 2013-03-14, at 2:06 AM, Francesco Strozzi <francesc...@tecnoparco.org> wrote:
>> [...] to store VCF files in some database to query them using genomics coordinates and alleles.
> I think that is a very good point: getting the data into the database is easy, but enabling the stored data to be analyzed in an efficient way can be challenging.
>
> It would be beneficial to have a demo set of SPARQL queries on genomic data that show how to do some tasks current command line tools can carry out. For example, calculating allele frequencies, case/control comparison tests, return genomic features in spatially close proximity to exploit genetic linkage, etc.
Sounds like a good idea. Does anyone have an example of how, say, allele frequencies are calculated today? A script etc. that someone not in the field can look at to see what it would look like as a SPARQL query?
>
> Otherwise, it might be too easy for critics to argue that SQL databases provide much better solutions, or simply point out that we already have command line tools that deal with standardized file formats.
Absolutely, and we should also show where the CLI/SQL approach starts to become difficult, i.e. where SPARQL needs to be better.

Regards,
Jerven
>
> Joachim

Joachim Baran

Mar 14, 2013, 11:25:06 AM
to biohac...@googlegroups.com

On 2013-03-14, at 9:21 AM, Jerven Bolleman <jerven....@isb-sib.ch> wrote:
> Sounds like a good idea. Does anyone have an example of how for example allele frequencies are calculated today?
Imagine you do genotyping and you have an SQL database with your samples in one column called "snp", where values in "snp" are A, B, and H to denote the major allele, minor allele and heterozygous SNPs.

The major allele frequency via SQL is (I hope, because I have not reached caffeine saturation yet):

SELECT COUNT(snp) FROM samples WHERE snp = 'A'; -- table name "samples" assumed

But that already assumes that you have rewritten your genotypes as "A", "B" ("C" and possibly "D"). If you have them still as "GC", "AT", "TT", etc., then the query condition just becomes a bit longer. That would mostly apply to microarray data.

If you sequence and have the sequences in your database, but you want to talk about SNPs and genotypes, then it becomes trickier because of the strandedness of the sequences.

Anyway, I think my point is: if we could show SPARQL examples for all these examples, then that would be fantastic! No more need of rewriting TSV files to massage the data into the right format...
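
As a runnable reference point for the SQL side, the sketch below uses Python's built-in sqlite3 module and turns the count into an actual frequency. It assumes diploid samples at a single SNP site, with "A" meaning two major alleles, "H" one of each and "B" two minor alleles; the `genotypes` table and its contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per sample, genotype coded as A, B or H.
conn.execute("CREATE TABLE genotypes (sample TEXT, snp TEXT)")
conn.executemany(
    "INSERT INTO genotypes VALUES (?, ?)",
    [("s1", "A"), ("s2", "A"), ("s3", "H"), ("s4", "B")],
)

# Each sample contributes two alleles: "A" adds two major alleles, "H" adds one.
row = conn.execute(
    """
    SELECT (2.0 * SUM(snp = 'A') + SUM(snp = 'H')) / (2.0 * COUNT(*))
    FROM genotypes
    """
).fetchone()
major_allele_frequency = row[0]
print(major_allele_frequency)
```

With the four toy samples this counts 5 major alleles out of 8, which is the number a SPARQL version over the same data would need to reproduce.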

Joachim

Jerven Bolleman

Mar 14, 2013, 12:22:18 PM
to biohac...@googlegroups.com
Hi Joachim,
On Mar 14, 2013, at 4:25 PM, Joachim Baran wrote:

>
> On 2013-03-14, at 9:21 AM, Jerven Bolleman <jerven....@isb-sib.ch> wrote:
>> Sounds like a good idea. Does anyone have an example of how for example allele frequencies are calculated today?
> Imagine you do genotyping and you have an SQL database with your samples in one column called "snp", where values in "snp" are A, B, and H to denote the major allele, minor allele and heterozygous SNPs.
>
> The major allele frequency via SQL is (I hope, because I have not reached caffeine saturation yet):
>
> SELECT COUNT(snp) FROM samples WHERE snp = 'A'; -- table name "samples" assumed
>
> But that already assumes that you have rewritten your genotypes as "A", "B" ("C" and possibly "D"). If you have them still as "GC", "AT", "TT", etc., then the query condition just becomes a bit longer. That would mostly apply to microarray data.
Something like this in SPARQL?

SELECT (COUNT(?snp) AS ?noOfSnpA) WHERE {?snp a :snp . ?snp rdf:value ?value . FILTER(?value = "A")}

Or if the data is encoded more semantically

SELECT (COUNT(?snp) AS ?noOfMajorAlleleSnp) WHERE {?snp a :snp . ?snp a :MajorAllele}


>
> If you sequence and have the sequences in your database, but you want to talk about SNPs and genotypes, then it becomes trickier because of the strandedness of the sequences.
>
> Anyway, I think my point is: if we could show SPARQL examples for all these examples, then that would be fantastic! No more need of rewriting TSV files to massage the data into the right format...
I think we should be able to show SPARQL examples for these kinds of questions. Especially for the simple TSV files, we should be able to have an RDF (pseudo) model for these without too much work (I think).

Paul Gordon

Mar 14, 2013, 1:54:53 PM
to biohac...@googlegroups.com

>Sounds like a good idea. Does anyone have an example of how for example
>allele frequencies are calculated today? script etc... that someone not
>in the field can look at and see what that would look like as a SPARQL
>query?

Still a little early for good kung foo, but...

If you put your patient's genotypes that differ from the reference in a
file called target_locations.bed like so (minus the comments):

1 10927 10927 G # SNP
1 10433 10433 AC # insertion of C
1 10439 10440 A # deletion of reference base chr1:10440


You could get the dbSNP rsid, reference allele, and Minor Allele Frequency
like so:

# latest dbSNP
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
# the index
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz.tbi
tabix 00-All.vcf.gz -B target_locations.bed | perl -ane '
  BEGIN { %patient = split /(\S+\n)/s, `cat target_locations.bed` }
  $alt_bases = $patient{"$F[0]\t$F[1]\t" . ($F[1] + length($F[3]) - 1) . "\t"};
  chomp $alt_bases;
  print join("\t", @F[0..4], $1), "\n" if $F[4] eq $alt_bases and /MAF=(\d\.\d+)/'

Of course, this assumes you used the same reference genome as dbSNP.
There are a million other considerations that I could expand upon at
length, but that should give you the gist.

Jerven Bolleman

Mar 14, 2013, 1:30:56 PM
to biohac...@googlegroups.com
Hi Paul,

I started to look at this and the BED to RDF stuff is easy.
However, I am totally confused by what your script actually does,
starting with the tabix -B option, which does not seem to be documented
at http://samtools.sourceforge.net/tabix.shtml.

Could you comment your Perl one-liner a little for those of us who
are not used to Perl, please? The inline idea below assumes that
you are looking for known SNPs in dbSNP that your patient has as well.

Regards,
Jerven


On Mar 14, 2013, at 6:54 PM, Paul Gordon wrote:

> If you put your patient's genotypes that differ from the reference in a
> file called target_locations.bed like so (minus the comments):
>
> 1 10927 10927 G # SNP
> 1 10433 10433 AC # insertion of C
> 1 10439 10440 A # deletion of reference base chr1:10440
RDF in pseudo turtle in local database/file/sparql endpoint.

snp:1 :chromosome chromosome:1 ;
  a :mutation ; # mutation in relation to the reference
  a :SNP ;
  :begin "10927"^^xsd:int ;
  :end "10927"^^xsd:int ;
  rdf:value "G" .
snp:2 :chromosome chromosome:1 ;
  a :mutation ;
  a :insertion ;
  :begin "10433"^^xsd:int ;
  :end "10433"^^xsd:int ;
  rdf:value "AC" .
snp:3 :chromosome chromosome:1 ;
  a :mutation ;
  a :deletion ;
  :begin "10439"^^xsd:int ;
  :end "10440"^^xsd:int ;
  rdf:value "A" .

SELECT ?patientSnp ?dbSnp {
  VALUES (?patientSnp ?patientBegin ?patientEnd ?patientValue) { (snp:1 10927 10927 "G") (snp:2 10433 10433 "AC") (snp:3 10439 10440 "A") }
  SERVICE <http://sparql.ncbi.nih.gov/snp/organisms/human_9606/sparql> {
    ?dbSnp a :mutation ;
      :chromosome chromosome:1 ; # all our samples are on chromosome 1, so
      :begin ?patientBegin ; # begin, end and value must be the same
      :end ?patientEnd ;
      rdf:value ?patientValue .
  }
}
OR
SELECT ?patientSnp ?dbSnp {
  SERVICE <file:///target_locations.bed> { # Using Jerven's idea of SPARQL against existing files
    ?patientSnp a ?mutationType ;
      :begin ?patientBegin ;
      :end ?patientEnd ;
      rdf:value ?patientValue .
    ?mutationType rdfs:subClassOf :mutation .
    SERVICE <file:///ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz> {
      ?dbSnp a ?mutationType ;
        :begin ?patientBegin ;
        :end ?patientEnd ;
        rdf:value ?patientValue .
    }
  }
}

Joachim Baran

Mar 14, 2013, 1:41:24 PM
to biohac...@googlegroups.com

On 2013-03-14, at 12:22 PM, Jerven Bolleman <jerven....@isb-sib.ch> wrote:
> Something like this in SPARQL ?
>
> SELECT (COUNT(?snp) AS ?noOfSnpA) WHERE {?snp a :snp . ?snp rdf:value ?value . FILTER(?value = "A")}
>
> Or if the data is encoded more semantically
>
> SELECT (COUNT(?snp) AS ?noOfMajorAlleleSnp) WHERE {?snp a :snp . ?snp a :MajorAllele}
That looks about right.

The more semantic example raises some interesting problems though:
- What if the allele frequencies in a population are unknown (so, you don't know which allele is major/minor)?
- What happens when tri-allelic/quad-allelic data is mixed (so, you get A-D as values, no longer major/minor)?

Is there a document where we can jot down all these quirks? That would be very useful when it comes to modelling ontologies for these problems.

Joachim


Jerven Bolleman

Mar 14, 2013, 2:00:02 PM
to biohac...@googlegroups.com
>> Or if the data is encoded more semantically
>>
>> SELECT (COUNT(?snp) AS ?noOfMajorAlleleSnp) WHERE {?snp a :snp . ?snp a :MajorAllele}
> That looks about right.
>
> The more semantic example raises some interesting problems though:
> - What if the allele frequencies in a population are unknown (so, you don't know which allele is major/minor)?
Depends, you always need to model what you know/measured.
I am not in this field enough to tell you what makes sense or not.
But in that case you would probably leave out the additional type, and fall back to the
non-semantically-enriched variant of the idea.



> - What happens when tri-allelic/quad-allelic data is mixed (so, you get A-D as values, no longer major/minor)?
Your types would change... maybe mark the population frequencies somehow, e.g.

snp:A a :SNP ;
  :allele allele:A .
allele:A :populationFreqMeasurement pfm:1 .
allele:A :populationFreqMeasurement pfm:2 .
pfm:1 a :Measurement ;
  :date "2009-12-01" ;
  rdf:value 21 .
pfm:2 a :Measurement ;
  :date "2013-01-01" ;
  rdf:value 121 .

snp:B a :SNP ;
  :allele allele:B .
allele:B :populationFreqMeasurement pfm:3 .
allele:B :populationFreqMeasurement pfm:4 .
pfm:3 a :Measurement ;
  :date "2009-12-01" ;
  rdf:value 7 .
pfm:4 a :Measurement ;
  :date "2013-01-01" ;
  rdf:value 800 .

In this case we had two frequency measurements at different times. The major/minor allele has switched in the known population, i.e. our original conclusions
were premature and may need to be revised.
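
A short Python sketch of the reasoning in the example above: given dated measurements per allele, the major allele is simply whichever allele has the highest value on a given date, so it can be derived rather than asserted as a fixed type. The tuple layout and the function name are assumptions; the numbers are the ones from the pseudo Turtle.

```python
from collections import defaultdict

# (allele, measurement date, value), copied from the pseudo Turtle example.
measurements = [
    ("A", "2009-12-01", 21),
    ("A", "2013-01-01", 121),
    ("B", "2009-12-01", 7),
    ("B", "2013-01-01", 800),
]


def major_allele_by_date(rows):
    """Return {date: allele with the highest measured value on that date}."""
    by_date = defaultdict(dict)
    for allele, date, value in rows:
        by_date[date][allele] = value
    return {d: max(v, key=v.get) for d, v in sorted(by_date.items())}


majors = major_allele_by_date(measurements)
print(majors)
```

The same derivation extends to tri- or quad-allelic sites without changing the model, since nothing is hard-coded as "major" or "minor".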

>
> Is there a document where we can jot down all these quirks? That would be very useful when it comes to modelling ontologies for these problems.
>
> Joachim
>
>

Paul Gordon

Mar 14, 2013, 4:26:44 PM
to biohac...@googlegroups.com


On 2013-03-14 10:41 AM, "Joachim Baran" <joachi...@gmail.com> wrote:

>
>On 2013-03-14, at 12:22 PM, Jerven Bolleman <jerven....@isb-sib.ch>
>wrote:
>> Something like this in SPARQL ?
>>
>> SELECT (COUNT(?snp) AS ?noOfSnpA) WHERE {?snp a :snp . ?snp rdf:value
>>?value . FILTER(?value = "A")}
>>
>> Or if the data is encoded more semantically
>>
>> SELECT (COUNT(?snp) AS ?noOfMajorAlleleSnp) WHERE {?snp a :snp . ?snp a
>>:MajorAllele}
> That looks about right.
>
> The more semantic example raises some interesting problems though:
> - What if the allele frequencies in a population are unknown (so, you
>don't know which allele is major/minor)?
> - What happens when tri-allelic/quad-allelic data is mixed (so, you get
>A-D as values, no longer major/minor)?
>
> Is there a document where we can jot down all these quirks?
Good idea, I could write a whole book on these quirks!
What is the appropriate context for this documentation: Biohackathon, W3C
SIG, etc.?


Paul Gordon

Mar 14, 2013, 5:19:04 PM
to biohac...@googlegroups.com

>Hi Paul,
>
>I started to look at this and the bed to rdf stuff is easy.
>However, I am totally confused by what your script actually does.
>Starting with the tabix -B option which does not seem to be documented
>at http://samtools.sourceforge.net/tabix.shtml.
Hmm, maybe it's been removed in a recent version. It grabs all the dbSNP VCF
records within the regions specified by the BED file that follows.

>Could you give a slight comment on your perl one liner for those of us
>which
>are not used to perl please? The inline idea below is with the
>understanding that
>you are looking for known snp's in dbSNP that your patient has as well.
See below.

>># for each VCF line returned by tabix, autosplit it on whitespace into
>>array @F

>>tabix 00-All.vcf.gz -B target_locations.bed | perl -ane

>># before we do anything else, load up the patient variant BED file into
>>a hash table { "chr\tbegin\tend\t" => "variant_sequence\n"}
>>
>> 'BEGIN{%patient=split /(\S+\n)/s, `cat target_locations.bed`}

>># see if the current VCF line corresponds to the same reference genome
>>span as one of the patient variants in the hash (i.e. a key match)
>>
>>$alt_bases = $patient{"$F[0]\t$F[1]\t".($F[1]+length($F[3])-1)."\t"};

>># get rid of the new line char at the end of the patient variant sequence
>>chomp $alt_bases;

>># print the first five columns of the VCF file, and the MAF value (from
>>the regex condition capture group that follows) on a single line
>> print join("\t", @F[0..4], $1), "\n"

>># only print if the VCF variant sequence column (#5) refers to the same
>>variant sequence as the patient, and a MAF value is found in the VCF
>>record
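
For readers who do not speak Perl, the commented steps above translate roughly into the Python sketch below. It is a paraphrase under a few assumptions, not a drop-in replacement: fields are split on tabs rather than arbitrary whitespace, the two VCF lines are made-up records, and the function names are invented.

```python
import re

MAF_RE = re.compile(r"MAF=(\d\.\d+)")


def load_patient(bed_lines):
    """Map (chrom, begin, end) -> variant sequence from tab-delimited BED lines."""
    patient = {}
    for line in bed_lines:
        chrom, begin, end, seq = line.rstrip("\n").split("\t")[:4]
        patient[(chrom, int(begin), int(end))] = seq
    return patient


def matching_records(vcf_lines, patient):
    """Yield (chrom, pos, id, ref, alt, maf) for VCF lines matching a patient variant."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, rsid, ref, alt = fields[:5]
        # Same key as the Perl hash: the reference span covered by this record.
        span = (chrom, int(pos), int(pos) + len(ref) - 1)
        maf = MAF_RE.search(line)
        # Keep the record only if the variant sequence matches and a MAF is present.
        if patient.get(span) == alt and maf:
            yield (chrom, pos, rsid, ref, alt, maf.group(1))


patient = load_patient(["1\t10927\t10927\tG\n", "1\t10439\t10440\tA\n"])
vcf = [
    "1\t10927\trs0000001\tA\tG\t.\t.\tMAF=0.1234\n",  # made-up record
    "1\t10439\trs0000002\tAC\tA\t.\t.\tVC=DIV\n",     # matches, but has no MAF
]
rows = list(matching_records(vcf, patient))
print(rows)
```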

Mark

Mar 14, 2013, 4:58:11 PM
to biohac...@googlegroups.com, Paul Gordon

>> Could you give a slight comment on your perl one liner for those of us
>> which
>> are not used to perl please?


Perl Gordon strikes again!!

LOL! His earlier reference "Still a little early for good kung foo" is
quite accurate... Paul's Perl is like a boot to the head sometimes :-)
:-)

M

Jerven Bolleman

Mar 14, 2013, 5:45:35 PM
to biohac...@googlegroups.com
OK, thanks, I had not caught on to the requirement for a MAF.
So we are looking for known SNPs in dbSNP that match your patient's SNPs and for which there is a global minor allele frequency.

SELECT ?patientSnp ?dbSnp ?maf {
SERVICE<file:///target_locations.bed>{#Using Jerven's idea of SPARQL against existing files with existing API's
?patientSnp a ?mutationType ;
:begin ?patientBegin ;
:end ?patientEnd ;
rdf:value ?patientValue .
?mutationType rdfs:subClassOf :mutation .
SERVICE<ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz>{
?dbSnp a ?mutationType ;
:begin ?patientBegin ;
:end ?patientEnd ;
rdf:value ?patientValue ;
:MinorAlleleFrequency ?maf . #New to this query
}
}
}

The important thing to note is that Paul currently uses Perl one-liners, and I think that SPARQL using known/existing APIs can do better performance-wise.

Then the command line would look something like this.
./sparql_against_bed.sh --file target_locations.bed --query "SELECT ?patientSnp ?dbSnp ?maf {
?patientSnp a ?mutationType ;
:begin ?patientBegin ;
:end ?patientEnd ;
rdf:value ?patientValue .
?mutationType rdfs:subClassOf :mutation .
SERVICE<ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz>{
?dbSnp a ?mutationType ;
:begin ?patientBegin ;
:end ?patientEnd ;
rdf:value ?patientValue ;
:MinorAlleleFrequency ?maf . #New to this query
}
}"

What is absolutely true is that the tabix->perl infrastructure exists and my SPARQL against BED is unwritten code.
However, SPARQL becomes interesting when you add more SPARQL endpoints to your query.

./sparql_against_bed.sh --file target_locations.bed --query "SELECT ?patientSnp ?dbSnp ?maf ?diseaseComment {
?patientSnp a ?mutationType ;
:begin ?patientBegin ;
:end ?patientEnd ;
rdf:value ?patientValue .
?mutationType rdfs:subClassOf :mutation .
SERVICE<ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz>{
?dbSnp a ?mutationType ;
:begin ?patientBegin ;
:end ?patientEnd ;
rdf:value ?patientValue ;
:MinorAlleleFrequency ?maf . #New to this query
}
SERVICE<http://beta.sparql.uniprot.org>{ #Use another SPARQL endpoint and look up if this variant is related to a known disease.
?variantAnnotation rdfs:seeAlso ?dbSnp .
?protein up:annotation ?variantAnnotation ;
up:annotation ?diseaseAnnotation .
?diseaseAnnotation a up:DiseaseAnnotation ;
rdfs:comment ?diseaseComment .

Matthias Samwald

Mar 14, 2013, 6:56:36 PM
to biohac...@googlegroups.com, Michel_Dumontier
>> Is there a document where we can jot down all these quirks?
> Good idea, I could write a whole book on these quirks!
> What is the appropriate context for this documentation: Biohackathon, W3C
> SIG, etc.?

Well, at the W3C we have a task force on Clinical Pharmacogenomics [1] where
we think a lot about modelling clinically relevant genetic information (as
you can tell by the name, mostly focused on pharmacogenomics at the moment).
Collecting different ideas for RDF/OWL representations and SPARQL queries
such as discussed in this thread would be a useful exercise!

[1] http://www.w3.org/wiki/HCLSIG/Pharmacogenomics (wiki needs to be tidied
up)

Cheers,
Matthias

Jerven Bolleman

Mar 15, 2013, 11:02:42 AM
to biohac...@googlegroups.com
Hi everyone


I made a very simple program that runs SPARQL against BED files.

To try it out:

git clone https://github.com/JervenBolleman/sparql-bed
cd sparql-bed/sparql-bed
mvn assembly:assembly
./sparql-bed.sh src/test/resources/example.bed "SELECT ?s WHERE {?s <http://biohackathon.org/resource/faldo#position> ?p}"

It's not fast, well documented, or brilliant code, but it demonstrates the basic idea.

Regards,
Jerven


On Mar 14, 2013, at 10:19 PM, Paul Gordon wrote:


M. Scott Marshall

Mar 15, 2013, 11:43:50 AM
to biohac...@googlegroups.com
[Love this thread!]

In the HCLS Pharmacogenomics task force a little over a year ago, we
spent some time looking at how to create a SPARQL endpoint to dbSNP.
Helena Deus (DERI) did some nice work here. We had some tips from
colleagues at Mayo and Michel hosted the mySQL:
https://docs.google.com/document/d/1UJK9GPqnd3z7fl38s19UPyP3wblMoTlcmlw261L1ETc/edit

I really like the way Jerven wrote this:
> I think the point that I am trying to make is that just because the data is a BED file etc..., it doesn't mean you can't
> run a SPARQL query against it. In the end you just need to implement the TripleSource api in Sesame or Jena to
> become SPARQL capable. And at that point you won't be much slower than a basic script. i.e. linear IO bound.
> But you just standardized your API between BED, GTF, ArrayExpress and UniProt etc...

I think that reducing the number and varieties of APIs cannot be
emphasized enough!
Programmable Web is fantastic but it is made up of thousands of APIs.
A SPARQL layer provides a single, unifying API that is also more flexible,
because fewer changes require refactoring and recompilation.

Cheers,
Scott

--
M. Scott Marshall, PhD
MAASTRO clinic, http://www.maastro.nl/en/1/
http://eurecaproject.eu/
https://plus.google.com/u/0/114642613065018821852/posts
http://www.linkedin.com/pub/m-scott-marshall/5/464/a22

Peter Cock

Mar 15, 2013, 12:02:45 PM
to biohac...@googlegroups.com
On Fri, Mar 15, 2013 at 3:43 PM, M. Scott Marshall
<mscottm...@gmail.com> wrote:
> [Love this thread!]
>

Taking SPARQL beyond triple stores :)

> In the HCLS Pharmacogenomics task force a little over a year ago, we
> spent some time looking at how to create a SPARQL endpoint to dbSNP.
> Helena Deus (DERI) did some nice work here. We had some tips from
> colleagues at Mayo and Michel hosted the mySQL:
> https://docs.google.com/document/d/1UJK9GPqnd3z7fl38s19UPyP3wblMoTlcmlw261L1ETc/edit
>
> I really like the way Jerven wrote this:
>> I think the point that I am trying to make is that just because the data is a BED file etc..., it doesn't mean you can't
>> run a SPARQL query against it. In the end you just need to implement the TripleSource api in Sesame or Jena to
>> become SPARQL capable. And at that point you won't be much slower than a basic script. i.e. linear IO bound.
>> But you just standardized your API between BED, GTF, ArrayExpress and UniProt etc...
>
> I think that reducing the number and varieties of APIs cannot be
> emphasized enough!
> Programmable Web is fantastic but it is made up of thousands of APIs.
> A SPARQL layer provides a single, unifying API that is also more flexible,
> because fewer changes require refactoring and recompilation.

I'm starting to see how a SPARQL API on top of niche file formats
can make sense - especially for SNPs (where there are numerous
reference sets, e.g. dbSNP for humans) and genes (where there
is an official annotation at the NCBI/ENSEMBL/etc). This needn't
be restricted to model organisms either - provided a lab or community
can set up their own (local) URIs for their own annotation.

I guess for local gene annotations in GFF/GTF/GenBank/EMBL,
we could use the FALDO work (from BioHackathon 2012) to make
this available as RDF for exposing via SPARQL?

The big appeal is you can then broaden a query beyond your local
SNP or differentially expressed gene list (from RNA-Seq or microarray
data) for example, to pull in things like literature or published orthology
links to a sister organism - the sort of data integration which is currently
quite labour intensive - mixing assorted REST APIs and scripts, or 3rd
party data warehousing.

[I'm not convinced something for SAM/BAM style mapping or
assembly files operating at the read level makes sense yet (although
again we have things like the SRA and ENA allocating identifiers
to each read) due to the data files being at least an order of
magnitude larger on disk - but that's more an engineering issue?]

Regards,

Peter

Jerven Bolleman

Mar 15, 2013, 12:18:21 PM
to biohac...@googlegroups.com

> [I'm not convinced something for SAM/BAM style mapping or
> assembly files operating at the read level makes sense yet (although
> again we have things like the SRA and ENA allocating identifiers
> to each read) due to the data files being at least an order of
> magnitude larger on disk - but that's more an engineering issue?]
I think that if we can show that we can efficiently work with SPARQL queries at BAM/SAM/CRAM scale, then
we can show that semweb approaches deal very nicely with the biggest of big data, i.e. climbing the tallest mountain ;)
So it is more a demonstration than a major practical use case. But once it's coded I am sure practical effects will come as well...

Regards,
Jerven

Peter Cock

Mar 15, 2013, 12:27:22 PM
to biohac...@googlegroups.com
On Fri, Mar 15, 2013 at 4:18 PM, Jerven Bolleman
<jerven....@isb-sib.ch> wrote:
>
>> [I'm not convinced something for SAM/BAM style mapping or
>> assembly files operating at the read level makes sense yet (although
>> again we have things like the SRA and ENA allocating identifiers
>> to each read) due to the data files being at least an order of
>> magnitude larger on disk - but that's more an engineering issue?]
>
> I think that if we can show that we can efficiently work with SPARQL queries on BAM/SAM/CRAM scale then
> we can show that semweb approaches deal very nicely with the biggest of bigdata. i.e. climbing the tallest mountain ;)
> So more demonstration than a major practical use case. But once its coded I am sure practical effects will come as well...
>

That is a good reason to try :)

I think there are a number of derived values you can get from
SAM/BAM files which are more interesting to use in queries than
the 'raw data' itself. Things like coverage, support for alternative
bases (i.e. SNPs), insertions, deletions, RNA-Seq coverage by
gene or region - so exposing that sort of information via SPARQL
from a BAM file on disk seems to me more exciting than exposing
the details of individual reads and how they mapped.
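To make "derived values" concrete, here is a small Python sketch that computes per-base coverage from alignment intervals. It assumes the intervals have already been extracted from a BAM file by some reader (not shown) as 0-based, half-open `(start, end)` pairs; the function name and signature are made up for this illustration. Values like these, rather than the individual reads, are what would then be exposed as RDF.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def coverage(
    alignments: Iterable[Tuple[int, int]],
    region_start: int,
    region_end: int,
) -> List[int]:
    """Per-base read depth over [region_start, region_end).

    Alignments are 0-based, half-open (start, end) intervals.
    """
    # Record +1 where an alignment begins and -1 where it ends,
    # clipped to the region of interest.
    deltas: Counter = Counter()
    for start, end in alignments:
        s = max(start, region_start)
        e = min(end, region_end)
        if s < e:
            deltas[s] += 1
            deltas[e] -= 1

    # A running sum of the deltas gives the depth at each position.
    depth = []
    current = 0
    for pos in range(region_start, region_end):
        current += deltas[pos]
        depth.append(current)
    return depth
```

The result is one integer per base, so a region of a few kilobases stays small no matter how many reads cover it.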

My 2 cents/yen.

Peter

Paul Gordon

Mar 15, 2013, 1:47:45 PM
to biohac...@googlegroups.com
I was thinking the same thing, but didn't want to rain on Jerven's
parade :-)
I almost always (except maybe for counts used in determining expression
FPKM values) have to massage the raw reads through samtools mpileup, phase
or at the very least calmd to get something meaningful. For these cases,
we should adopt a set of (non-samtools-specific) predicates...

Francesco Strozzi

Mar 15, 2013, 12:55:22 PM
to biohac...@googlegroups.com
If I remember correctly, Rob Buels worked last year on a SPARQL query system for JBrowse, to display reads and other data on the fly in the genome browser, so some work in that direction has already been done.

I agree with you and I think the benefits will be higher on data interpolation rather than (simple) data retrieval or display of raw data. For that, existing command line tools are doing quite a good job (for BAMs or VCFs) but they are very limited when you want to get more information or to work on multiple datasets, e.g. to extract all the SNPs in common between 200 VCFs coming from 200 individuals and query them against dbSNP to see how many are known or novel.

That seems more interesting to address :-)
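For comparison, the multi-VCF example above can be sketched in plain Python as a set intersection. This is only an illustration under simplifying assumptions: each VCF is read as plain tab-separated text, a variant is identified by `(chrom, pos, ref, alt)`, no filtering for SNPs versus indels is done, and the set of "known" variants stands in for a lookup against dbSNP. All function names are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def read_variants(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect the variants of one VCF, one entry per alternative allele."""
    variants: Set[Variant] = set()
    for line in vcf_lines:
        # Skip blank lines, meta-information lines and the column header.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def shared_variants(samples: Iterable[Iterable[str]]) -> Set[Variant]:
    """Variants present in every sample (empty set if there are no samples)."""
    shared: Optional[Set[Variant]] = None
    for vcf_lines in samples:
        variants = read_variants(vcf_lines)
        shared = variants if shared is None else shared & variants
    return shared if shared is not None else set()


def split_known_novel(
    shared: Set[Variant], known: Set[Variant]
) -> Tuple[Set[Variant], Set[Variant]]:
    """Split shared variants into those already known and those that are novel."""
    return shared & known, shared - known
```

This works for one fixed question, but every new question needs new code, which is the limitation of script-based approaches that the paragraph above points at.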

Francesco

----
Francesco Strozzi
Head of Bioinformatics Core Facility


Parco Tecnologico Padano
Via Einstein
Loc. Cascina Codazza
26900 Lodi, ITALY
Phone: +39 03714662333
Skype: francescostrozzi

Peter Cock

Mar 15, 2013, 1:04:22 PM
to biohac...@googlegroups.com, Robert Buels
On Fri, Mar 15, 2013 at 4:55 PM, Francesco Strozzi
<francesc...@tecnoparco.org> wrote:
> If I remember correctly, Rob Buels last year worked on a SPARQL query system
> for the JBrowse, to display reads and other stuff on the fly in the genome
> browser, so some work in that direction has already been done.

As I recall the demo was showing annotation using FALDO RDF pulled
from a triple store (derived from a GFF file or similar as a one-off
conversion). See also:
https://github.com/dbcls/bh12/wiki/JBrowse-on-SPARQL
https://github.com/dbcls/bh12/wiki/Feature-annotation-locations-in-RDF

I don't recall that it also showed mapped reads as well - Rob?

Peter

Yasunori YAMAMOTO

Mar 17, 2013, 11:09:35 PM
to biohac...@googlegroups.com, dbca...@googlegroups.com
Hello,

Concerning this topic, I am also thinking about the discoverability of related data: an RDF dataset, its SPARQL endpoint, and its statistics.
There's a helpful blog post on this by Leigh Dodds:
http://blog.ldodds.com/2013/02/04/dataset-and-api-discovery-in-linked-data/

For example, the recommended path for a VoID file is http://www.example.com/.well-known/void , but I don't know how many data providers follow it.

Are there any proposals for data discoverability?
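As a small illustration of the well-known path mentioned above, the following Python sketch derives the candidate VoID location from any URL on the same host. The function name is hypothetical, and the sketch only builds the URL; it does not check whether the provider actually serves a VoID file there.

```python
from urllib.parse import urlsplit, urlunsplit


def well_known_void_url(dataset_url: str) -> str:
    """Return the /.well-known/void URL for the host serving dataset_url."""
    parts = urlsplit(dataset_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {dataset_url!r}")
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/void", "", ""))
```

A client could try this location first and fall back to asking the SPARQL endpoint for its metadata if nothing is found.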

Regards,
Yasunori

On H.25/03/13, at 10:28, Toshiaki Katayama wrote:

> Hi Jerven,
>
>> From the practical goals that Toshiaki put in the agenda I think we should focus only on:
>> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
>> ** Which statistics are needed for which tools.
>
> Agreed.
>
> For this purpose, I think it would be natural to enhance the YummyData project which has started during the BH12 meeting.
>
> http://yummydata.org/
> https://github.com/dbcls/bh12/wiki/Yummy-data
>
>> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, ViID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
>> Where point 2 should discuss only endpoint statistics, basic meta data for now.
>
> Once we have reached a consensus on those common (subset of) vocabularies, probably based on the Bio2RDF code and Scott's summary,
>
> https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics
> https://docs.google.com/spreadsheet/ccc?key=0AvCayBYdTclldEpBSS1wRXNEaU9OeHdWcGRwc09mSmc#gid=0
>
> we can put them on yummydata.org so that YummyData can be used to provide basic statistics with broader coverage and Bio2RDF can continue to provide advanced metrics on their collection (some of which YummyData might incorporate for all supported datasets in the future). Another solution would be to push each dataset provider to provide those metrics by themselves, but it might require some time.
>
> In any case,
>
>> I am most interested in the how to serve statistics about classes properties and links per SPARQL endpoint. As well as about licensing and update frequencies etc...
>
>
>> If we can change these to SPARQL 1.1 compliant construct/insert queries then other endpoint providers can do the same as bio2rdf, with minimal effort.
>
> these two points are important to be kept in mind.
>
>
>> And leave the following out for the moment
>>
>> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
>
> OK. Let's treat it as a long-term goal.
>
> I personally believe identifiers.org is an ideal solution for this purpose because it is open and generic, already provides clean URIs, and can be redirected to multiple data sources. The remaining problem is that we still need to maintain a database of links among instance URIs.
>
> Cheers,
> Toshiaki
>
> On 2013/03/12, at 19:11, Jerven Bolleman wrote:
>
>> Hi All, Mark, Michel,
>>
>> I am most interested in the how to serve statistics about classes properties and links per SPARQL endpoint. As well as about licensing and update frequencies etc...
>>
>> Bio2rdf has quite a lot of work already done. For example the sparql code inside https://github.com/bio2rdf/bio2rdf-scripts/blob/master/statistics/bio2rdf_stats_virtuoso.php.
>>
>> If we can change these to SPARQL 1.1 compliant construct/insert queries then other endpoint providers can do the same as bio2rdf, with minimal effort.
>>
>> We should also look at the overlap between VoID and the bio2rdf dataset description.
>>
>> From the practical goals that Toshiaki put in the agenda I think we should focus only on:
>> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
>> ** Which statistics are needed for which tools.
>> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, ViID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
>> Where point 2 should discuss only endpoint statistics, basic meta data for now.
>>
>> And leave the following out for the moment
>>
>> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
>>
>> I hope this gives a targeted agenda and allows a basic specification to be finished soon, for further iterations expansions later as needed.
>>
>> For my part, I am interested in providing the following metadata on (beta.)sparql.uniprot.org, and I aim to make it accessible both via the SPARQL service description and in the endpoint itself.
>>
>> licensing
>> ontology (re)-use
>> statistics on class use
>> objects
>> properties
>> subjects
>> update frequency
>> last updated date
>> contributors (institutions)
>> which sparql version is supported
>> query timeouts and other constraints
>>
>>
>> Regards,
>> Jerven
>>
>>
>> On 03/12/2013 10:44 AM, Toshiaki Katayama wrote:
>>> Hi Michel and all,
>>>
>>>>> How about a week from today on a Monday at 3PM CET?
>>>
>>> OK
>>>
>>> I'm still not sure how to join the teleconference.
>>> Will you use Skype?
>>>
>>>> I don't mind that time for one-off calls but I won't be able to make
>>>> it to them regularly.
>>>
>>> +1
>>>
>>>>> (Straw Man) Agenda:
>>>>> * Past: Review of how we exited Biohackathon11 - Scott
>>>>> * Current: Discussion of progress in the meantime (during a few
>>>>> teleconference calls and mail threads, Bio2RDF, other?) - Scott,
>>>>> Michel
>>>>> * Future: Identify work to be done - All
>>>
>>> Sorry for my ignorance but to my understanding from your documents,
>>> this project aims to formalize DB metadata (as a consensus of existing
>>> efforts), right?
>>>
>>> (So that, users/agents can easily grasp information about the name,
>>> category, amount, license, contacts etc. of the dataset;
>>> and hopefully obtain those information via some SPARQL queries?)
>>>
>>> And, practical goals are to define:
>>>
>>> * common vocabularies to be used among the projects (Bio2RDF, BioDBCore, Biositemap, ViID, SERV, SIO, MEDALS and NBDC/DBCLS + some others?)
>>> * common URIs (namespaces) to be linked together (Identifiers.org, Bio2RDF.org, purl.*.org etc.)
>>> * where/how to serve SPARQL endpoints (hopefully with statistics on # of classes/properties/triples/links etc.)
>>>
>>> As I need to understand the situation before attending the teleconference,
>>> please correct me if I'm misunderstanding.
>>>
>>> Cheers,
>>> Toshiaki
>>>
>>> On 2013/03/12, at 4:46, Peter Ansell wrote:
>>>
>>>> On 12 March 2013 03:26, M. Scott Marshall <mscottm...@gmail.com> wrote:
>>>>> Hi Toshiaki, All,
>>>>>
>>>>> I decided not to try to have the meeting today in the end because it
>>>>> would be too short of a notice for some people.
>>>>>
>>>>> How about a week from today on a Monday at 3PM CET?
>>>>>
>>>>> Because the U.S. has shifted to Daylight Savings Time already and
>>>>> Europe will only do the same on March 31, it will be somewhat less
>>>>> painful for those in the PST timezone until then (although admittedly
>>>>> still very bright and early at 7AM PST). Any others from the Japanese
>>>>> timezone (Chisato?) or Australia (Peter?) who could make it? And, so
>>>>> we aren't caught by surprise, when is Daylight Savings Time for Japan
>>>>> and Australia?
>>>>
>>>> I am in Queensland, Australia, one of the states that does not have
>>>> Daylight Savings. 3PM CET seems to be Midnight in Brisbane and 11pm in
>>>> Tokyo according to timeanddate.com:
>>>>
>>>> http://www.timeanddate.com/worldclock/meetingtime.html?iso=20130311&p1=47&p2=248&p3=195
>>>>
>>>> I don't mind that time for one-off calls but I won't be able to make
>>>> it to them regularly. There are other times of my working day that
>>>> match Europe and USA individually, but not together (making it
>>>> virtually impossible to do anything with W3C!)
>>>>
>>>>> A tentative agenda would start with a roundup of the issues that we
>>>>> put on the table back in 2011, and identify issues that have been
>>>>> (partially) solved by various partners, and remaining issues that we
>>>>> think are important to consider in the near term.
>>>>>
>>>>> (Straw Man) Agenda:
>>>>> * Past: Review of how we exited Biohackathon11 - Scott
>>>>> * Current: Discussion of progress in the meantime (during a few
>>>>> teleconference calls and mail threads, Bio2RDF, other?) - Scott,
>>>>> Michel
>>>>> * Future: Identify work to be done - All
>>>>>
>>>>> Here are a few documents from our earlier efforts:
>>>>> https://docs.google.com/document/d/1qVSZ1n334fTTchCWS2tOaEL6z-H12pcqty98jQIg6nw/edit?hl=en_US#heading=h.x5e0yirsfi9v
>>>>> https://docs.google.com/spreadsheet/ccc?key=0AvCayBYdTclldEpBSS1wRXNEaU9OeHdWcGRwc09mSmc&usp=sharing
>>>>> https://docs.google.com/document/d/1BZyylrV-NXpCpF23CUj-HSvq4YogXjtCgqN7UtFK5EY/edit
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Scott
>>>>>
>>>>> P.S. Toshiaki - No problem that you didn't know that we had talked
>>>>> about including information about predicates used in an RDF dataset in
>>>>> a Biohackathon11 'working group'. I wouldn't expect you to track and
>>>>> remember every detail. I was actually happy to see you bring up a need
>>>>> for one of the things we had discussed - confirming our suspicions
>>>>> that it was a good idea.
>>>>>
>>>>> On Mon, Mar 11, 2013 at 8:27 AM, Toshiaki Katayama <kt...@dbcls.jp> wrote:
>>>>>> Hi Scott,
>>>>>>
>>>>>> Sorry for my late reply.
>>>>>>
>>>>>>> Several of the ideas that you have mentioned in this thread have been
>>>>>>> a topic of discussion in the Biohackathon 2011 group
>>>>>>
>>>>>> Oops, two years behind. Sorry. ;)
>>>>>> I hope to follow the discussions.
>>>>>>
>>>>>>> If we have a Linked Life Data
>>>>>>> teleconference at 3PM CET on a Monday, could you attend?
>>>>>>
>>>>>> Do you mean it will be today?
>>>>>>
>>>>>>> I would like to form a plan to deliver a
>>>>>>> version of this work, as well as the biohackathon 2011 publication.
>>>>>>
>>>>>> This sounds great.
>>>>>> Do you have any agenda?
>>>>>> What media do you use for the teleconference?
>>>>>> Anyway, I'll try to be online tonight.
>>>>>>
>>>>>> Regards,
>>>>>> Toshiaki
>>>>>>
>>>>>> On 2013/03/07, at 19:20, M. Scott Marshall wrote:
>>>>>>
>>>>>>> Hi Toshiaki,
>>>>>>>
>>>>>>> Several of the ideas that you have mentioned in this thread have been
>>>>>>> a topic of discussion in the Biohackathon 2011 group that we
>>>>>>> eventually called DBCatalog: RDF metadata vocabulary to describe the
>>>>>>> data that is available in an RDF-rendered dataset. We haven't met
>>>>>>> since last year but it seems like the time is right to pick this up
>>>>>>> again. It sounds like you have very practical applications for it as
>>>>>>> well.
>>>>>>>
>>>>>>>> So another possibility might be to extend the VoID specification to indicate a list of predicates and classes in addition to the number of them (void:predicates and void:classes).
>>>>>>>> In this case, each endpoint provider can generate the list by running a SPARQL query only once when importing LOD.
>>>>>>>
>>>>>>> I agree that such a tool would be good and an important way to ease
>>>>>>> adoption of the metadata markup that we will refine in the dbcatalog
>>>>>>> group. In BH11, we discussed how such a tool could gather many
>>>>>>> important graph statistics, including those about predicates and
>>>>>>> classes, and make those available through SPARQL. Last year, Janos
>>>>>>> Hajagos presented a tool in the Linked Life Data task force (LLD) that
>>>>>>> gathered graph statistics: https://code.google.com/p/py-triple-simple/
>>>>>>> .
>>>>>>>
>>>>>>>> If we can make an agreement with the major biological LOD providers on providing (a standardized version of) this dataset_vocabulary (DaVo?) in addition to VoID, it would be great as the biological datasets are often very huge.
>>>>>>>
>>>>>>> That is one of the potential outcomes that we were aiming for in the
>>>>>>> dbcatalog. Could we discuss this in the Linked Life Data task force?
>>>>>>> Michel and I have been interested in pursuing these ideas and refining
>>>>>>> them within the context of Linked Life Data. I think that it should
>>>>>>> form the main line of work for that group for at least the next
>>>>>>> several months (1/2 year). Chisato suggested that a time that is
>>>>>>> possible for most timezones is 3PM CET. If we have a Linked Life Data
>>>>>>> teleconference at 3PM CET on a Monday, could you attend? Or would
>>>>>>> another day be better? I would like to form a plan to deliver a
>>>>>>> version of this work, as well as the biohackathon 2011 publication.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Scott
>>>>>>>
>>>>>>> P.S. As you know, I am working at a radiotherapy oncology clinic these
>>>>>>> days. We are just now turning our attention to federation issues
>>>>>>> (example: query federation across an image database (called a PACS) and
>>>>>>> an electronic health record (EHR)).
>>>>>>>
>>>>>>> On Fri, Feb 15, 2013 at 7:24 PM, Toshiaki Katayama <kt...@dbcls.jp> wrote:
>>>>>>>> Hi Michel,
>>>>>>>>
>>>>>>>> Thank you for your inputs! These Bio2RDF metrics seem to be very useful.
>>>>>>>> If we can make an agreement with the major biological LOD providers on providing (a standardized version of) this dataset_vocabulary (DaVo?) in addition to VoID, it would be great as the biological datasets are often very huge.
>>>>>>>>
>>>>>>>>
>>>>>>>> By the way, for those who might be interested:
>>>>>>>>
>>>>>>>>>> The UniProt license legally thinks what you are doing is absolutely fine. I think its great in any case.
>>>>>>>>
>>>>>>>> Following Jerven's statement, I have temporarily put UniProt files split by taxon ID at
>>>>>>>>
>>>>>>>> http://lod.dbcls.jp/rdf/uniprot_taxon.ttl/
>>>>>>>>
>>>>>>>> I hope this will be useful for setting up species-specific applications.
>>>>>>>>
>>>>>>>> Please use with RDF and OWL files (other than huge uni*.rdf.gz files) provided at
>>>>>>>>
>>>>>>>> ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/
>>>>>>>>
>>>>>>>> Note that you can generate a list of descendant taxon IDs from a taxonomic node of your choice (tax:42241 in this example) by simply querying with
>>>>>>>>
>>>>>>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>>>>>> PREFIX tax: <http://purl.uniprot.org/taxonomy/>
>>>>>>>> SELECT ?taxon
>>>>>>>> WHERE {
>>>>>>>>   ?taxon rdfs:subClassOf* tax:42241 .
>>>>>>>> }
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Toshiaki
>>>>>>>>
>>>>>>>> On 2013/02/16, at 1:50, Michel Dumontier wrote:
>>>>>>>>
>>>>>>>>> Toshiaki,
>>>>>>>>> Bio2RDF now pre-computes graph properties - you can see our documentation here:
>>>>>>>>>
>>>>>>>>> https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics
>>>>>>>>>
>>>>>>>>> and example pages:
>>>>>>>>>
>>>>>>>>> http://download.bio2rdf.org/release/2/biomodels/biomodels.html
>>>>>>>>> http://download.bio2rdf.org/release/2/gene/gene.html
>>>>>>>>>
>>>>>>>>> while we used our own vocabulary (for ease), we'd be happy to investigate something more standard. VoID is one option, and the SPARQLed
>>>>>>>>>
>>>>>>>>> http://sindicetech.com/sindice-suite/sparqled/
>>>>>>>>>
>>>>>>>>> too uses the Dataset Analytics Vocabulary ontology
>>>>>>>>>
>>>>>>>>> http://vocab.sindice.net/analytics#
>>>>>>>>>
>>>>>>>>> but I don't find it particularly intuitive.
>>>>>>>>>
>>>>>>>>> m.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: biohac...@googlegroups.com [mailto:biohac...@googlegroups.com] On Behalf Of Toshiaki Katayama
>>>>>>>>> Sent: Friday, February 15, 2013 10:17 AM
>>>>>>>>> To: biohac...@googlegroups.com
>>>>>>>>> Subject: Re: [biohackathon:768] UniProt sparql endpoint at dbcls??
>>>>>>>>>
>>>>>>>>> Hi Jerven,
>>>>>>>>>
>>>>>>>>> Thank you for your clarification.
>>>>>>>>>
>>>>>>>>> As for flint, we are also experiencing the same problem with many endpoints.
>>>>>>>>> In the flint interface, the functionality to generate a list of predicates and classes should be useful for exploring the stored data, but in practice it often times out.
>>>>>>>>>
>>>>>>>>> The situation might be resolved if each endpoint provided meta information as described in the VoID specification.
>>>>>>>>> Although we would still need to hack the flint code, we would then be able to make a client application that uses void:vocabulary to fetch properties and classes from the relevant ontologies.
>>>>>>>>> This approach doesn't require a heavy SPARQL load.
>>>>>>>>>
>>>>>>>>> However, in wild LOD, many predicates and classes are used without being defined in an ontology.
>>>>>>>>> So another possibility might be to extend the VoID specification to indicate a list of predicates and classes in addition to the number of them (void:predicates and void:classes).
>>>>>>>>> In this case, each endpoint provider can generate the list by running a SPARQL query only once when importing LOD.
>>>>>>>>> Just a rambling thought..
>>>>>>>>>
>>>>>>>>> Anyway, I can temporarily remove the UniProt endpoint from our flint configuration, if you prefer.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Toshiaki
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2013/02/15, at 23:13, Jerven Bolleman wrote:
>>>>>>>>>
>>>>>>>>>> Hi Toshiaki,
>>>>>>>>>>
>>>>>>>>>> The UniProt license legally thinks what you are doing is absolutely fine. I think its great in any case.
>>>>>>>>>>
>>>>>>>>>> I was chasing down a killer query (i.e. one that takes up too much memory).
>>>>>>>>>> Unfortunately it's the flint interface that generates these :( Sesame
>>>>>>>>>> always does the ORDER BY first and then the DISTINCT (as per the SPARQL
>>>>>>>>>> algebra), but for UniProt that generates a sorted list of about 6 billion elements, which means the server runs out of memory and crashes.
>>>>>>>>>> I will ask the sesame/owlim list whether anything is possible there.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Jerven
>>>>>>>>>>
>>>>>>>>>>
>>
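The endpoint statistics discussed in the quoted thread above (numbers of triples, classes, properties, subjects and objects) map onto existing VoID terms. The Python sketch below shows one way this could look: a few SPARQL 1.1 aggregate queries an endpoint provider might run once at load time, and a function that writes the resulting counts as a VoID description in Turtle. The dataset and endpoint URIs passed in, the dictionary and the function name are assumptions for illustration. Running the queries against a real endpoint is left out, and on very large datasets they may be as expensive as the timeouts described in the thread suggest.

```python
# SPARQL 1.1 aggregate queries that yield the basic counts (run once at load time).
STATISTICS_QUERIES = {
    "triples": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "classes": "SELECT (COUNT(DISTINCT ?c) AS ?n) WHERE { ?s a ?c }",
    "properties": "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { ?s ?p ?o }",
    "distinctSubjects": "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }",
    "distinctObjects": "SELECT (COUNT(DISTINCT ?o) AS ?n) WHERE { ?s ?p ?o }",
}


def void_description(
    dataset_uri: str,
    endpoint_uri: str,
    triples: int,
    classes: int,
    properties: int,
    distinct_subjects: int,
    distinct_objects: int,
) -> str:
    """Render pre-computed counts as a VoID dataset description in Turtle."""
    return "\n".join(
        [
            "@prefix void: <http://rdfs.org/ns/void#> .",
            "",
            f"<{dataset_uri}> a void:Dataset ;",
            f"    void:sparqlEndpoint <{endpoint_uri}> ;",
            f"    void:triples {triples} ;",
            f"    void:classes {classes} ;",
            f"    void:properties {properties} ;",
            f"    void:distinctSubjects {distinct_subjects} ;",
            f"    void:distinctObjects {distinct_objects} .",
        ]
    )
```

Licensing, update frequency and similar items from the list above would need terms from other vocabularies and are not covered here.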

Jerven Bolleman

Mar 18, 2013, 4:23:10 PM
to biohac...@googlegroups.com
Hi All,

I slightly improved the code here, using the Tribble/Picard code for reading BED files, which means Tabix support.
I started work on avoiding joins when the query pattern is about elements on a single BED line. All in all, it shows you can implement
a SPARQL endpoint on flat-file/TSV-based files without too much work.

The code uses the FALDO position ontology from last year's BioHackathon.

Still to do: use the tabix indexes when position data is given (where appropriate), and
allow embedding this code into an open-rdf/sesame workbench.

Otherwise the basic idea still works.

As we will be dealing with quite large BED files, there should be some memory optimizations as well, including fall-back-to-disk options
for some of the collections that can be generated during SPARQL query evaluation.

For some users, latency can be reduced and throughput improved by introducing some threading. Users of BED files, please add issues
here https://github.com/JervenBolleman/sparql-bed/issues
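To illustrate why using a positional index matters when a query gives position data, here is a deliberately simple in-memory Python sketch. It is not tabix (which uses a binning scheme plus a linear index over a compressed file) and it is not taken from the sparql-bed code; the class and method names are hypothetical. It only shows the principle that features starting after the queried region can be skipped without being read.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple

Feature = Tuple[int, int, str]  # (start, end, name), 0-based half-open


class ChromIndex:
    """Features of one chromosome, sorted by start for fast region lookup."""

    def __init__(self, features: Iterable[Feature]) -> None:
        self.features: List[Feature] = sorted(features)
        self.starts: List[int] = [f[0] for f in self.features]

    def overlapping(self, query_start: int, query_end: int) -> List[Feature]:
        """Return features overlapping [query_start, query_end)."""
        # Every feature from this index on starts at or after query_end,
        # so none of them can overlap the query.
        stop = bisect_left(self.starts, query_end)
        # Among the remaining candidates, keep those ending after query_start.
        return [f for f in self.features[:stop] if f[1] > query_start]
```

A query that constrains begin and end positions could be answered from the returned features alone, instead of generating triples for every line in the file.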


Regards,
Jerven

Chisato Yamasaki

Apr 5, 2013, 6:54:09 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com, Philippe, Susanna Sansone, chi...@nibio.go.jp
Dear Jerven, Susanna, Philippe and all,
I just talked with Scott on Skype, and he said that
the next meeting related to bio-databases will be held on
Monday Apr 8 at 5PM CET, during the Biocuration meeting.

How about having lunch together on Monday Apr 8, to share
the related information? Are you available, Jerven?
We can also join the teleconference at 5PM.

I am looking forward to meeting you at Biocuration2013.

Best regards,
Chisato Yamasaki

*** I moved to a new institution on 01/04/2013 ***

Email : chi...@nibio.go.jp
National Institute of Biomedical Innovation (NIBIO)
Bioinformatics Project : http://mizuguchilab.org/
---

> -----Original Message-----
> From: dbca...@googlegroups.com [mailto:dbca...@googlegroups.com]
> On Behalf Of Jerven Bolleman
> Sent: Tuesday, March 12, 2013 7:15 PM
> To: dbca...@googlegroups.com; biohac...@googlegroups.com
> Subject: Re: [biohackathon:782] Biocuration2013 database statics meetup
>
> Hi All,Chisato,
>
> I will also be at biocuration2013.
>
> Regards,
> Jerven
>
> On 03/12/2013 11:02 AM, Chisato Yamasaki wrote:
> > Hi Scott, Michel and all,
> > Thank you for arranging the call.
> > I can also make call on March 18 at 3:00pm CET, 7AM PST, 11PM JST.
> >
> > Also, I am going to join Biocuration2013 in UK and staying around in
> > Cambridge/London on 7-12 April.
> > http://www.ebi.ac.uk/biocuration2013/content/home
> >
> > I would be happy to meet anyone else attending Biocuration2013, or to
> > arrange another meeting ;-)
> >
> > Thank you,
> > Chisato Yamasaki
> > ----- Original Message ----- From: "Toshiaki Katayama"
> > <toshiaki...@gmail.com>
> > To: <biohac...@googlegroups.com>
> > Cc: "M. Scott Marshall" <mscottm...@gmail.com>;
> > <dbca...@googlegroups.com>; "Chisato Yamasaki"
> > <chisato-...@aist.go.jp>; "Michel Dumontier"
> > <michel.d...@gmail.com>
> > Sent: Tuesday, March 12, 2013 6:44 PM
> > Subject: Re: [biohackathon:782] UniProt sparql endpoint at dbcls??

Daniel Jamieson

Apr 5, 2013, 7:08:22 AM
to <biohackathon@googlegroups.com>
Hi all,

I appear to have picked up this thread rather late, so I'm missing the focus of this meeting, but I will be at Biocuration and would love to join and meet any of you at lunch.

Thanks,

Dan

Pascale Gaudet

Apr 5, 2013, 7:12:11 AM
to biohac...@googlegroups.com
I'm also interested to attend.

Best,

Pascale

Philippe

Apr 5, 2013, 7:31:13 AM
to dbca...@googlegroups.com, Chisato Yamasaki, biohac...@googlegroups.com, Susanna Sansone, chi...@nibio.go.jp
Dear Chisato,

Thank you for the heads-up; lunchtime sounds good.
However, I won't be able to attend the teleconference (I will be on
another call at that time, if wifi access permits).

Looking forward to meeting you all in Cambridge.

Best

Philippe

Susanna-Assunta Sansone

Apr 5, 2013, 8:14:40 AM
to biohac...@googlegroups.com
Chisato, sure lunch is fine!
Susanna

--
Susanna-Assunta Sansone, PhD
skype: susanna-a.sansone
uk.linkedin.com/in/sasansone

University of Oxford e-Research Centre
Associate Director, Principal Investigator
www.isacommons.org|www.biosharing.org

Nature Publishing Group
Consultant, Scientific Data
www.nature.com/scientificdata
--

Kevin B. Cohen

Apr 5, 2013, 11:36:42 AM
to biohac...@googlegroups.com, dbca...@googlegroups.com, Philippe, Susanna Sansone, chi...@nibio.go.jp
I will also be at the Biocuration meeting.

Kev
Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Computational Bioscience Program,
U. Colorado School of Medicine
303-916-2417 (cell) 303-377-9194 (home)
http://compbio.ucdenver.edu/Hunter_lab/Cohen

Susanna-Assunta Sansone

Apr 5, 2013, 11:49:45 AM
to biohac...@googlegroups.com, dbca...@googlegroups.com, Kevin B. Cohen, Philippe, chi...@nibio.go.jp
Hi all,
It seems we will have a mini meeting at ISB!
Btw, I take this opportunity for some shameless self-promotion... but in context. BioSharing* (and so BioDBCore) is now supporting the Nature Publishing Group's data publication platform, Scientific Data**, announced yesterday.
One of my tasks with NPG is to identify established, community-recognized data repositories, community standards, etc.; hence having a way to catalogue repositories, data, standards, etc. is key for NPG too.
See you at ISB,
Susanna

* http://biosharing.org/ and http://biosharing.org/biodbcore
** http://www.nature.com/scientificdata/

N Juty

Apr 5, 2013, 1:20:34 PM
to biohac...@googlegroups.com, Susanna-Assunta Sansone, dbca...@googlegroups.com, Kevin B. Cohen, Philippe, chi...@nibio.go.jp
I'll be there too, so please save me a spot at the table ;p

see you soon,


cheers

Nick

--
--------------------------------------------------------
Nick Juty
Database Curator
European Bioinformatics Institute
Cambridge, United Kingdom
--------------------------------------------------------

Chisato Yamasaki

Apr 8, 2013, 6:18:14 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, chi...@nibio.go.jp
Dear all concerned,
For lunch today, I am trying to reserve a table for us,
so please find me and gather there in the meeting dining room.

See you later,
Chisato Yamasaki

cyama...@gmail.com

Apr 8, 2013, 8:38:45 AM
to <dbcatalog@googlegroups.com>, <biohackathon@googlegroups.com>, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Hi Scott and all concerned,
We had a very quick lunch to discuss bio-database catalog related topics.

I am going to send a brief summary later, but before that we have a
request regarding the next teleconference.
Could you please move the teleconference to next Monday or later? We all found it difficult to join because we will have a poster session at the same time.

Thank you in advance for your help,
Chisato Yamasaki

Sent from my iPhone

On 2013/04/08 at 11:18, "Chisato Yamasaki" <cyama...@gmail.com> wrote:

Toshiaki Katayama

Apr 8, 2013, 9:04:17 AM
to biohac...@googlegroups.com, <dbcatalog@googlegroups.com>, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Hi Chisato,

Great to know that you could enjoy the group lunch.

I'm fine with postponing the teleconference.

By the way, I wish to confirm two relevant issues for the next time:

* The "Fuze" iPad app worked very well for the past two calls, but I'm not sure whether I can continue to use the Fuze system for free after the 14-day trial. Do I need to pay for future meetings? If so, is there any other option?

* It seems that the Fuze meeting number of the teleconference changes every time. Could one of the conference members post the number to be used and a brief agenda for the day to this mailing list before we start the meeting?

Cheers,
Toshiaki

Andrea Splendiani

Apr 8, 2013, 9:11:11 AM
to dbca...@googlegroups.com, biohac...@googlegroups.com, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Hi,

I'm assuming this thread crosses the W3C calls we had about the "community-driven specification for dataset and service descriptions".
If so, next Monday would be better for me as well, as today I'm locked in an appointment at the doctor.

Fuze also exists for the desktop, I think with no restrictions there.

ciao,
Andrea

Toshiaki Katayama

Apr 8, 2013, 9:28:24 AM
to biohac...@googlegroups.com, dbca...@googlegroups.com, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Hi Andrea,

Both the desktop and mobile (incl. iPad) applications seem to be free to download.

I was talking about their pricing options:

https://www.fuzebox.com/products/pricing

If those plans do not apply to our meetings (or apply only to whoever makes the call), we can continue with Fuze.

Cheers,
Toshiaki

M. Scott Marshall

Apr 8, 2013, 9:35:50 AM
to dbca...@googlegroups.com, Michel Dumontier, <biohackathon@googlegroups.com>, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Hi Chisato, All,

It is good to hear that many of you were able to meet at lunch. It is
great that there is so much interest in continuing this work!

In case it wasn't clear, we've already met a few times in the HCLS
Linked Life Data task force and reviewed the issues involved with
(bio)dbcataloging. We have been taking inventory of existing
approaches to particular types of dataset description here:
https://docs.google.com/spreadsheet/ccc?key=0Aoy0zfdRviKsdFJWTDFpblNXc3BtelhrdEpNYTdvbXc&usp=sharing

I hope you don't mind if we don't cancel this week's HCLS meeting so
that those who have planned for it for the last two weeks are not
inconvenienced - I think that there is a handful of people who are
expecting a meeting and have planned for it. It has been weekly to
bi-weekly until now. In any case, we can have another one next week
that will presumably have more participants than this week's meeting.
I look forward to talking to you then!

I am still waiting to hear back from Michel about the fuzebox access
code before announcing to HCLS. I like Toshiaki's idea - I was just
wondering myself whether we could fix the code so it didn't depend on
a new action for every meeting.

Also, I believe that Michel has been recording the meetings so you
should be able to get a URL to the video recording from him. Michel -
is that possible?

Kind regards,
Scott
MAASTRO clinic, http://www.maastro.nl/en/1/

Michel Dumontier

Apr 8, 2013, 9:41:18 AM
to M. Scott Marshall, dbca...@googlegroups.com, <biohackathon@googlegroups.com>, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Hi Everybody,
  Let us meet next week when all can make it.

  You shouldn't need a subscription to Fuze to simply join the meeting (I have the subscription to host meetings). When I send out the notices, you need only to click on the link to automatically join the meeting.

Best,

m.

--
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group

M. Scott Marshall

Apr 8, 2013, 9:47:04 AM
to Michel Dumontier, dbca...@googlegroups.com, <biohackathon@googlegroups.com>, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, <chisato@nibio.go.jp>
Ok. I'll send out a cancellation notice to HCLS.

On Mon, Apr 8, 2013 at 3:41 PM, Michel Dumontier
<michel.d...@gmail.com> wrote:
> Hi Everybody,
> Let us meet next week when all can make it.
>
> You shouldn't need a subscription to Fuze to simply join the meeting (I
> have the subscription to host meetings). When I send out the notices, you
> need only to click on the link to automatically join the meeting.

OK, but as Toshiaki suggested: would it be possible to issue a link
that also works for the same timeslot one week later, with the same
link and access code? A recurring fuzebox meeting?

Also, would you please send us a link to the recordings of previous
meetings when you get a chance, assuming that you kept the recording.

Cheers,

Chisato Yamasaki

Apr 8, 2013, 10:27:02 AM
to biohac...@googlegroups.com, Michel Dumontier, dbca...@googlegroups.com, Susanna-Assunta Sansone, Kevin B. Cohen, Philippe, chi...@nibio.go.jp
Dear all,
Thank you very much for rearranging the meeting.
I am looking forward to meeting you next Monday ;-)

Chisato Yamasaki

Chisato Yamasaki

Apr 15, 2013, 4:54:11 AM
to Chisato Yamasaki, biohac...@googlegroups.com, michel.d...@gmail.com, sa.sa...@gmail.com, procc...@gmail.com
Dear all,
It was nice meeting with some of you at Biocuration2013.
Here are some notes on the lunch meeting on 8th April.
======================================================
# date & time: 12:10-12:50, 8th April 2013
# members: Pascale, Susanna, Jerven, Philippe, Daniel,
Yasunori, Takeru, Gaku, Rie, Chisato
# topics (from the previous teleconference, and BioSharing/BioDBCore):
* standard data description vocabularies
* initial preparation of the database catalog after several strategic meetings:
submitting, in Excel format, the DB information of each database published
in the NAR DB issue 2011
* update frequency: depends on the item (maybe every two weeks for
site updates, every 5 years for license policy, etc.)
* an efficient way to update the database catalog
(e.g. DB providers supplying data files in a certain format)
* not all curators can submit a paper to NAR every year
* we have to offer some benefit to curators, e.g. by including grant information
* we have to show how the community supports the database cataloging activity
======================================================

Please add to the above if my notes missed anything important.


I am sorry to have missed some people on the day:
Kevin, Nick and some others.

Best regards,
Chisato Yamasaki

Email : chi...@nibio.go.jp
National Institute of Biomedical Innovation (NIBIO)
Bioinformatics Project : http://mizuguchilab.org/
---
