Updates of endpoints


Joerg Kurt Wegner

Aug 15, 2011, 2:19:15 PM
to bio2rdf
Dear all,

can someone please provide an overview of the update mechanisms for
the specific Bio2RDF data sub-sources and their last update dates?

Examples:
There was a discussion about PharmGKB; the update script is
available, but how current is Bio2RDF for this source?
http://groups.google.com/group/bio2rdf/browse_thread/thread/a390e411a2c8dd94

This question was also asked in more general terms on the Bio2RDF
SourceForge site and forwarded to the W3C mailing list. Still, the
feedback so far does not really address either of the two questions.
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2011Aug/0002.html

Is anyone on this list able to help, or to forward this to someone
who might know the answers?
1. Update mechanisms for Source_i to Bio2RDF
2. Last update date for Source_i

Thanks!

Best regards,
Joerg Kurt Wegner
https://plus.google.com/116731043002877336055

Michel Dumontier

Aug 17, 2011, 6:22:38 AM
to bio...@googlegroups.com
Hi Joerg,
  Marc-Alexandre is working on a more automated way to pull down and process the source files. We also have a SPARQL endpoint that serves as a registry for our datasets, and this information is stored there. I'll let him answer your question in more detail.

Cheers,

m.


--
Michel Dumontier
Associate Professor of Bioinformatics
Carleton University
http://dumontierlab.com

Marc-Alexandre Nolin

Aug 24, 2011, 9:31:45 AM
to bio...@googlegroups.com
Hello,

Sorry for the delay in answering; I was on vacation without any
Internet access for the last two weeks.

For all endpoints except OMIM and HGNC, the update process is manual.
That means that I have to fetch the new data myself, convert it to RDF
or N-Quads, and load it into the corresponding Virtuoso server. Once it
is done, I send a copy of the updated virtuoso.db file to the various
Bio2RDF mirrors.

For OMIM and HGNC, I use a pipeline script that does the whole update
process in one shell command. However, I still need to prepare the
virtuoso.db manually to update the mirrors.

I also have an endpoint where I publish information about releases:
http://release.bio2rdf.org/sparql . You can see which releases have
information with this query:

SELECT DISTINCT ?release
WHERE { ?release a <http://bio2rdf.org/release_resource:release> }

Only a few releases are listed there, so you can assume that the ones
not listed have not been updated in a while.
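
As a follow-up sketch, everything recorded about a single release can be
inspected on the same endpoint (http://bio2rdf.org/release:hgnc below is
purely an example subject; substitute any URI returned by the query above):

# Sketch: list all statements about one release resource at
# http://release.bio2rdf.org/sparql.
SELECT ?property ?value
WHERE { <http://bio2rdf.org/release:hgnc> ?property ?value }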

Bye !!

Marc-Alexandre


Joerg Kurt Wegner

Aug 25, 2011, 4:15:18 PM
to bio2rdf, lo...@ieee.org, Michel Dumontier, Jörg Kurt Wegner
Hi Michel,
Hi Marc-Alexandre,

thanks for providing the details.

Would it be possible to summarize the pipeline in more detail?
For example:
OMIM - automatic - which script, which input/output?
HGNC - automatic - which script, which input/output?
Other sources - manual process - what does this look like? Is there a
tutorial or how-to?

Following your SPARQL query
http://goo.gl/2s6Oc
the following sources have a release entry:
http://bio2rdf.org/release:genbank
http://bio2rdf.org/release:pubmed.bak
http://bio2rdf.org/release:refseq
http://bio2rdf.org/release:pubmed
http://bio2rdf.org/release:hgnc
though none of them provides a version or date, correct?

Just to put it in perspective: at this point I consider Bio2RDF
non-maintainable and not integrable in an organizational context.
Unless we get a clearer process for this, I will hold that position
and cannot recommend it as a data source. At this point it looks
easier to integrate the data from scratch without making any use of
Bio2RDF, but I might be too naive on this ;-)

Finally, any further details are highly appreciated. I understand the
maintenance challenge, but I must be clear that this is perceived as a
roadblock for me.

Cheers,
/.Joerg

http://www.joergkurtwegner.eu

Peter Ansell

Aug 25, 2011, 6:59:55 PM
to bio...@googlegroups.com, lo...@ieee.org, Michel Dumontier, Jörg Kurt Wegner
On 26 August 2011 06:15, Joerg Kurt Wegner <joergku...@gmail.com> wrote:
> Hi Michel,
> Hi Marc-Alexandre,
>
> thanks for providing the details.
>
> Would it be possible to summarize the pipeline in more detail?
> For example:
> OMIM - automatic - which script, which input/output?
> HGNC - automatic - which script, which input/output?
> Other sources - manual process - what does this look like? Is there a
> tutorial or how-to?
>
> Following your SPARQL query
> http://goo.gl/2s6Oc
> the following sources have a release entry:
> http://bio2rdf.org/release:genbank
> http://bio2rdf.org/release:pubmed.bak
> http://bio2rdf.org/release:refseq
> http://bio2rdf.org/release:pubmed
> http://bio2rdf.org/release:hgnc
> though none of them provides a version or date, correct?

They *do* actually all have a version and a date encoded using Dublin
Core, although the version may simply be the date if the dataset
doesn't have a numbered versioning scheme. Do you have an alternative
strategy for encoding this information?
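
A rough way to check this (a sketch only; the exact Dublin Core
predicates are not stated here, so the query simply filters on the
Dublin Core namespaces):

# Sketch against http://release.bio2rdf.org/sparql: list any Dublin Core
# properties (dates, versions, etc.) attached to the release resources.
SELECT DISTINCT ?release ?property ?value
WHERE {
  ?release a <http://bio2rdf.org/release_resource:release> .
  ?release ?property ?value .
  FILTER regex(str(?property), "^http://purl.org/dc/")
}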

> Just to put it in perspective: at this point I consider Bio2RDF
> non-maintainable and not integrable in an organizational context.
> Unless we get a clearer process for this, I will hold that position
> and cannot recommend it as a data source. At this point it looks
> easier to integrate the data from scratch without making any use of
> Bio2RDF, but I might be too naive on this ;-)

There are a number of challenges related to integrated data that you
would want to think about before embarking on a solo or new effort.

The first one is that if you want to provide third-party Linked Data,
where the data is published by someone else but converted to RDF by
you (as opposed to just a SPARQL endpoint or some other RDF resource),
you need a clear strategy for resolving the HTTP requests against
either your own infrastructure or some federated infrastructure. This
is necessary even before you think about how to easily maintain the
resulting datasets and maintain uptime on the Linked Data endpoints.
In Bio2RDF we have had a single host for this from the very beginning,
although we have branched from a single physical server into multiple
physical servers at different locations. The important thing was that
the URI design never hampered us in this goal. In addition, we were
lucky with the URI design in that we are now able to offer different
REST services at http://bio2rdf.org/* without interfering with Linked
Data resolution. For example, you can perform searches using
http://bio2rdf.org/searchns/hgnc/abcc4 without interfering with the
hgnc namespace:identifier pattern.

Once you get past that challenge, you need to decide on the limit to
the number of providers you are going to support. For example, do your
clients have the money necessary to maintain an entire mirror of NCBI
in RDF form? Have you talked to the UniProt RDF team about their
experiences? They embarked on the same endeavour at about the same
time Bio2RDF started.

We would be more than happy for you to develop maintainable RDFisers
for us, as we (Marc and I) have so far been driven by our personal
PhDs, which only require prototypes to back up our assertions. In
doing so, you could avoid both of the challenges I just posed and
start on the actual RDFisation straight away.

Peter

Jyoti Pathak

Sep 20, 2011, 12:25:54 AM
to bio2rdf
Hi,

Is this link still working? http://release.bio2rdf.org/sparql

I am unable to access it as of Sept 19th.

Thanks,
- Jyoti


Jose Miguel Cruz Toledo

Sep 20, 2011, 9:14:02 AM
to bio...@googlegroups.com
Hi Jyoti,

The endpoint at http://release.bio2rdf.org/sparql is back online. The Carleton server had to be rebooted over the weekend due to maintenance. I will be checking all endpoints to ensure that there are no other surprises.

Cheers,
Jose

Jyotishman Pathak

Sep 22, 2011, 4:02:03 PM
to bio...@googlegroups.com
Sorry Jose -- that does not seem to be the case. The endpoint still does not seem to be working. Same story with: http://omim.bio2rdf.org/sparql

Can you please assist?

Thanks,
- Jyoti

Jyotishman Pathak

Sep 27, 2011, 3:29:56 PM
to bio...@googlegroups.com, josemiguel...@gmail.com, Rick Kiefer
Hi Jose,

I have another question: where are the schemas for all the Bio2RDF datasets? While the OMIM SPARQL endpoint is working for me now, without the schema I'm having a hard time modeling my queries.

Will appreciate any advice from you. Thanks!

- Jyoti

Sent from my iPhone.

Mark

Sep 27, 2011, 3:35:33 PM
to bio...@googlegroups.com, Jyotishman Pathak, Ben Vandervalk, josemiguel...@gmail.com, Rick Kiefer
If it helps, Benjamin Vandervalk from my lab has created a schema document
for (all?) of the Bio2RDF datasets based on exploration/traversal... He's
cc'd on this message.

Mark

Peter Ansell

Sep 27, 2011, 6:59:39 PM
to bio...@googlegroups.com, Jyotishman Pathak, Ben Vandervalk, josemiguel...@gmail.com, Rick Kiefer
I would be very interested in having those schemas in a SPARQL
endpoint somewhere so that we can make them dynamically queryable (or
we could make it work using RDF/XML files if that is easier).
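
In the meantime, a first approximation of a schema can be pulled
straight out of any endpoint with plain SPARQL (a minimal sketch; the
OMIM endpoint is only an example, and the LIMITs just keep the queries
cheap on a large store):

# Sketch for e.g. http://omim.bio2rdf.org/sparql: list the classes in use...
SELECT DISTINCT ?class
WHERE { ?s a ?class }
LIMIT 100

# ...and, as a separate query, the predicates in use.
SELECT DISTINCT ?predicate
WHERE { ?s ?predicate ?o }
LIMIT 100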

Jose Miguel Cruz Toledo

Sep 28, 2011, 9:18:41 AM
to Peter Ansell, bio...@googlegroups.com, Jyotishman Pathak, Ben Vandervalk, Rick Kiefer
In the meantime you can always use the faceted browser that comes with Virtuoso.

It allows you to navigate the linked data stored in the endpoint. To access it, simply append /fct to the base URL of your endpoint of choice. For example, to get to the faceted browser for the OMIM endpoint, visit the following URL: http://omim.bio2rdf.org/fct

Jose

Jyotishman Pathak

Sep 28, 2011, 11:21:26 AM
to Jose Miguel Cruz Toledo, Peter Ansell, bio...@googlegroups.com, Ben Vandervalk, Rick Kiefer
Thank you Jose -- very helpful.

I do have a related question about OMIM data in Bio2RDF, although I am not 100% sure if this is the appropriate forum to ask.

1.) As part of the OMIM download, one can get access to the Morbid Map, which basically provides the gene-phenotype relationships (ftp://grcf.jhmi.edu/OMIM/morbidmap). For example, the text below describes the morbid map entries for Type 2 Diabetes.

Diabetes mellitus, type 2, 125853 (3)|PAX4, MODY9, KPD|167413|7q32.1 

Diabetes mellitus, type II, 125853 (3)|AKT2|164731|19q13.2 

This same information is also available on the OMIM web site under the "phenotype gene relationships" table: http://omim.org/entry/125853

However, when I browse the RDF for OMIM ID 125853 (http://bio2rdf.org/omim:125853), I cannot find this data. Am I doing something incorrectly?


2.) OMIM also has links to SNOMED and ICD codes for the phenotypes (please see the attached screenshot: right-hand corner). As an example, Type 2 Diabetes (OMIM ID 125853) is mapped to SNOMED CT code 44054006. I am not able to find this information either.

Could someone please advise?

Thanks, Jyoti  

Screen shot 2011-09-28 at 10.19.02 AM.png
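
A minimal sketch (assuming the http://omim.bio2rdf.org/sparql endpoint discussed earlier) that lists everything currently asserted about OMIM 125853, which makes it easy to see whether gene links or terminology mappings are present:

# Sketch: show all statements about the Type 2 Diabetes entry; if no
# gene or SNOMED/ICD links appear here, they are not in the loaded data.
SELECT ?property ?value
WHERE { <http://bio2rdf.org/omim:125853> ?property ?value }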

Michel Dumontier

Oct 3, 2011, 4:41:41 PM
to bio...@googlegroups.com, Jose Miguel Cruz Toledo, Peter Ansell, Ben Vandervalk, Rick Kiefer
Hi Jyoti,

1. morbidmap is not loaded, only the main OMIM file.

2. This data is not in the OMIM distribution, but I see from the image that it may be a mapping from ICD.

m.




Jyotishman Pathak

Oct 3, 2011, 5:02:58 PM
to bio...@googlegroups.com, Jose Miguel Cruz Toledo, Peter Ansell, Ben Vandervalk, Rick Kiefer
Thanks for your reply Michel. Are there any plans to load the Morbid Map?

Regards,
- Jyoti
Jyotishman Pathak, PhD
Assistant Professor of Biomedical Informatics
Division of Biomedical Statistics and Informatics
Department of Health Sciences Research
Mayo Clinic College of Medicine
Rochester, MN 55905, USA
Phone: +1-507-538-8384
Fax: +1-507-284-0360
Email: pathak.j...@mayo.edu
