drugbank


Michel Dumontier

Jul 4, 2012, 3:10:48 PM
to bio2rdf
Hi,
 We are now hosting Drugbank - can we make sure that we get the data from our SPARQL endpoint? Recall that we do not use the drugbank_drugs namespace - everything is in the drugbank:, drugbank_resource: and drugbank_vocabulary: namespaces


Thanks!

m.

--
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group

Egon Willighagen

Jul 4, 2012, 3:19:13 PM
to bio...@googlegroups.com
On Wed, Jul 4, 2012 at 9:10 PM, Michel Dumontier
<michel.d...@gmail.com> wrote:
>  We are now hosting Drugbank - can we make sure that we get the data from
> our SPARQL endpoint? Recall that we do not use the drugbank_drugs namespace
> - everything is in drugbank:, drugbank_resource: and drugbank_vocabulary:
> namespaces
>
> http://bio2rdf.org/drugbank:DB00001

I'm testing a semantic chemistry crawler (see my blog :), and one
source of owl:sameAs links pointing to drugbank_drugs is:

http://rdf.freebase.com/ns/m/0qkc

Your server should have seen some hits on
http://bio2rdf.org/drugbank_drugs:DB00945

(I'll manually blacklist these now then...)
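
A crawler-side blacklist of this kind could be sketched roughly as below. This is only an illustration: the function name and the contents of the blacklist are assumptions, not code from the actual crawler.

```python
# Hypothetical sketch: skip owl:sameAs targets in blacklisted namespaces.
BLACKLISTED_PREFIXES = (
    # Namespace not used by the Bio2RDF-hosted Drugbank data.
    "http://bio2rdf.org/drugbank_drugs:",
)


def should_follow(target_uri: str) -> bool:
    """Return False for sameAs targets the crawler should not dereference."""
    # str.startswith accepts a tuple of prefixes.
    return not target_uri.startswith(BLACKLISTED_PREFIXES)


links = [
    "http://bio2rdf.org/drugbank_drugs:DB00945",
    "http://bio2rdf.org/drugbank:DB00945",
]
to_crawl = [uri for uri in links if should_follow(uri)]
```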

BTW, I also note that many Bio2RDF resources have statements like
this one, where the pointed-to resource does not return RDF:

http://bio2rdf.org/cas:50-78-2 http://www.w3.org/2002/07/owl#sameAs
http://cas.bio2rdf.org/cas:50-78-2

Egon

--
Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers

Michel Dumontier

Jul 4, 2012, 3:51:45 PM
to bio...@googlegroups.com
On Wed, Jul 4, 2012 at 3:19 PM, Egon Willighagen <egon.wil...@gmail.com> wrote:
> BTW, I also note that many Bio2RDF resources have statements like
> this one, where the pointed-to resource does not return RDF:
>
> http://bio2rdf.org/cas:50-78-2 http://www.w3.org/2002/07/owl#sameAs
> http://cas.bio2rdf.org/cas:50-78-2

so the first entry http://bio2rdf.org/cas:50-78-2 does give back triples



m.
 

--
You received this message because you are subscribed to the Google Groups "bio2rdf" group.
To post to this group, send email to bio...@googlegroups.com.
To unsubscribe from this group, send email to bio2rdf+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/bio2rdf?hl=en.

Egon Willighagen

Jul 4, 2012, 4:01:21 PM
to bio...@googlegroups.com
On Wed, Jul 4, 2012 at 9:51 PM, Michel Dumontier
<michel.d...@gmail.com> wrote:
> so the first entry http://bio2rdf.org/cas:50-78-2 does give back triples
>
> http://bio2rdf.org/cas:50-78-2
>    http://bio2rdf.org/bio2rdf_resource:linkedToFrom
>       http://bio2rdf.org/cpd:C01405
>       http://bio2rdf.org/dr:D00109

And when I open that link, it also links from
http://bio2rdf.org/drugbank_drugs:DB00945 :)

BTW, how should I interpret
http://bio2rdf.org/bio2rdf_resource:linkedToFrom and
http://bio2rdf.org/bio2rdf_resource:xRef ?

Are those meant as owl:sameAs or as skos:exactMatch? Because from the
linking, I seem to see database entries supposed to be about the same
chemical linked...

That is, should my crawler follow those links when looking for info on ?

Michel Dumontier

Jul 4, 2012, 4:04:39 PM
to bio...@googlegroups.com
LinkedToFrom results from a federated query over Bio2RDF resources.  xRef is an assertion in the dataset.  

Yes, you get results with drugbank_drugs, because that's used by the German group.

m.


Peter Ansell

Jul 4, 2012, 6:32:35 PM
to bio...@googlegroups.com
Do we want to support retrieval from the D2RQ FU Berlin Drugbank
endpoint anymore?

I can switch that method off by default in the meantime without
deleting it, as it is not compatible with the new method.

Cheers,

Peter

Peter Ansell

Jul 4, 2012, 7:11:33 PM
to bio...@googlegroups.com
Could you change the rdf:type URI for drug drug interactions from
http://bio2rdf.org/drugbank_resource:Drug-Drug-Interaction to
http://bio2rdf.org/drugbank_vocabulary:Drug-Drug-Interaction to match
the other rdf:types?

Thanks,

Peter

Peter Ansell

Jul 4, 2012, 7:29:07 PM
to bio...@googlegroups.com
There are some links to the ndc namespace in the drugbank dataset, but
the URL that I previously used for the NDC SPARQL endpoint is no
longer operational [1].

Is there a replacement available?

Cheers,

Peter

[1] http://bio2rdf.semanticscience.org:8012/sparql/


Peter Ansell

Jul 4, 2012, 7:43:29 PM
to bio...@googlegroups.com
In addition, the previous NDC dataset did not have a plain "ndc:"
namespace, so there may need to be a translation there.

For reference, the list of namespaces that were in the previous NDC
sparql endpoint were:

queryall_provider:handlesNamespace bio2rdf_ns:ndc_resource ,
bio2rdf_ns:ndc_nda , bio2rdf_ns:ndc_doseform , bio2rdf_ns:ndc_dosage ,
bio2rdf_ns:ndc_listing , bio2rdf_ns:ndc_firm ,
bio2rdf_ns:ndc_formulation , bio2rdf_ns:ndc_facility ,
bio2rdf_ns:ndc_schedule , bio2rdf_ns:ndc_unit , bio2rdf_ns:ndc_route ,
bio2rdf_ns:ndc_product ;

Peter

Peter Ansell

Jul 4, 2012, 8:17:35 PM
to bio...@googlegroups.com
There are URIs in the drugbank: namespace that do not start with DB
and are simple numeric identifiers, for example [1]. In my opinion
they should either be prefixed with drugbank_resource:target_ or, if
they have stable identifiers, be placed in a dedicated drugbank_targets
(or similar) namespace, in a similar way to the FU Berlin Drugbank
dataset. I recommend using a different namespace from "drugbank:", even
though a single namespace may not cause ambiguity in practice *if*
there is only one set of numeric identifiers and all of the drug
identifiers start with letters, either DB or the old drug identifier
prefixes ("BIOD", "BTD", etc.). Switching from drugbank_drugs to
drugbank: for DB identifiers is okay, as in practice that is likely
what people will want to reference in a "drug" dataset, but doing the
same for targets etc. is very risky if they are not prefixed in any
way, and will cause more pain than just creating a new namespace from
the first revision.

Peter

[1] http://bio2rdf.org/drugbank:54


Michel Dumontier

Jul 5, 2012, 10:15:54 AM
to bio...@googlegroups.com
Good catch. I'll regenerate the dataset and we'll update the endpoint.

Michel Dumontier

Jul 5, 2012, 10:22:56 AM
to bio...@googlegroups.com
On Wed, Jul 4, 2012 at 8:17 PM, Peter Ansell <ansell...@gmail.com> wrote:
> There are URIs in the drugbank: namespace that do not start with DB
> and are simple numeric identifiers, for example [1].

this identifier *is* assigned by DrugBank and is stable.

m.

Michel Dumontier

Jul 5, 2012, 10:26:01 AM
to bio...@googlegroups.com
Right, so in this case the identifiers in each dataset overlap lexically - they are all assigned integers.  So we have to make the distinction somehow, and we do so by assigning each dataset its own namespace. For that reason, nothing is exclusively in the ndc namespace, even though they are all part of it.   So a query to ndc:1234 should be a query to all the subnamespaces...
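
As a rough sketch of what "a query to all the subnamespaces" could mean, the snippet below expands ndc:1234 into one candidate URI per ndc_* namespace. The namespace list is the one posted earlier in this thread for the previous NDC endpoint; the function name and the http://bio2rdf.org/ base are illustrative assumptions, not the resolver's actual code.

```python
from typing import List

# Subnamespaces listed earlier in this thread for the previous NDC endpoint.
NDC_SUBNAMESPACES = [
    "ndc_resource", "ndc_nda", "ndc_doseform", "ndc_dosage",
    "ndc_listing", "ndc_firm", "ndc_formulation", "ndc_facility",
    "ndc_schedule", "ndc_unit", "ndc_route", "ndc_product",
]

BASE = "http://bio2rdf.org/"  # assumed URI base


def expand_umbrella_query(curie: str) -> List[str]:
    """Expand 'ndc:1234' into one candidate URI per ndc_* subnamespace."""
    prefix, identifier = curie.split(":", 1)
    if prefix != "ndc":
        # Anything else is passed through as a single URI.
        return [f"{BASE}{curie}"]
    return [f"{BASE}{ns}:{identifier}" for ns in NDC_SUBNAMESPACES]
```

Each candidate would still have to be looked up, so a bare ndc: link costs one query per subnamespace.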

m.

Michel Dumontier

Jul 5, 2012, 11:30:59 AM
to bio...@googlegroups.com
The endpoint is now at 

I'm going to commit the code so we can get it into bio2rdf properly.

m.

Peter Ansell

Jul 5, 2012, 3:49:17 PM
to bio...@googlegroups.com
I think we have a different understanding about what namespaces
represent. I see namespaces as containers for identifiers, however, I
don't view them as being containers for other containers. Links in
Bio2RDF documents should be directly pointed at a single Bio2RDF
namespace, or else they are bound to eventually either be accidentally
misused or become ambiguous, which would make it impossible to query
over them.

Is it difficult to determine which of the ndc_* namespaces a link from
drugbank to ndc actually points to?

Cheers,

Peter

Peter Ansell

Jul 5, 2012, 4:27:20 PM
to bio...@googlegroups.com
Are there any other simple numeric identifiers assigned by drugbank
other than the target identifiers? It is difficult to guarantee that
there will never be any other simple numeric identifiers assigned by
drugbank that may make it ambiguous for people referencing the dataset
in their RDF documents.

One difficulty for me maintaining the resolver is that mixing
different types of entities into the drugbank namespace in this case
makes it difficult to automatically generate the HTML links, as
drugbank sets aside a different HTTP URL prefix for the
molecules/targets [1] [2], which may be why they do not mind using
simple numeric identifiers.

Mostly I don't understand what is gained by mashing namespaces
together. If we are really trying to make it easier in the long term
for people to reuse URIs then they should be as self-explanatory as
possible. The current URI for the molecule in [2], while
self-explanatory on the drugbank.ca scheme, is ambiguous at best in
the future if someone looked at it using the current Bio2RDF scheme
[3], as they would need to know that anything that didn't start with
letters was not actually a drug, and was actually a molecule.

I do like that made-up identifiers, where there are no original
provider identifiers and hence no long term contract, are now being
put in a single separate drugbank_resource: namespace, as they are
generated by scripts and as such could have their current uniqueness
guaranteed by the script without ever clashing with an original
provider identifier (although the scripts probably don't actually
perform uniqueness checks).

The usefulness of a single namespace in that context however does not
pass to the realm of stable identifiers in my opinion. We should do
all that we can to make it easy for people to reuse stable identifiers
in the long term, and eliminating the reference to "molecule/target"
from the original drugbank reference just because we can by reason of
the current lexical uniqueness quirk in the drugbank design doesn't
encourage reuse as it is hiding information that would be very
valuable. There is little cost, as it is excessively cheap and useful
to create a new namespace, however there is potentially a high cost in
the future if there is another numeric identifier set in drugbank that
clashes with the molecule numbers and someone assumes that there is
only one namespace for drugbank and naively creates a URI that links
to either a molecule or both a molecule and something else. People
should be free to naively create URIs without having to double check
each reference if there is a well planned URI scheme that they can
understand without having to look through the quirks in each dataset.

If Drugbank eventually produces another set of simple numeric
identifiers, which alternative would you encourage, by the way:
adding a prefix to the identifier [4] or creating a new namespace [5]?
If the answer is [5] then there is no real reason not to do it now. If
the answer is [4] then it looks like [3] (which could not be changed
at that point or backwards compatibility/long term stability would be
broken) is more of a hack than [4], as it is the only set of
identifiers in the multi-use drugbank namespace that can't be naively
interpreted without documentation.

[1] http://www.drugbank.ca/drugs/DB00544
[2] http://www.drugbank.ca/molecules/359
[3] http://bio2rdf.org/drugbank:359
[4] http://bio2rdf.org/drugbank:reaction_455
[5] http://bio2rdf.org/drugbank_reaction:455

Michel Dumontier

Jul 5, 2012, 5:08:59 PM
to bio...@googlegroups.com
Ok,
  This is an important discussion.

  Let's take the GenBank + RefSeq accessions as an extreme case of identifier diversity, based on i) source and ii) types:


The advantage of putting them all in one namespace is that you don't need to figure out which sub-namespace the identifier belongs to before you create the URI (which would be challenging without further lexical analysis).

You can only do this when the identifiers are lexically unique in that namespace (there must be no clashes).

Thus, when faced with a relational database that uses numeric identifiers as keys for each table, you *must* create a namespace for each table.  If, however, the data provider also provides a globally unique accession, you can easily put this in the same namespace.
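
The "no clashes" condition can be checked mechanically. The sketch below tests whether several identifier sets could share one namespace; the function name and the sample identifiers are made up for illustration and are not taken from any Bio2RDF script.

```python
from itertools import combinations


def can_share_namespace(identifier_sets: dict) -> bool:
    """Return True if no identifier occurs in more than one source set."""
    for (_, first), (_, second) in combinations(identifier_sets.items(), 2):
        if first & second:  # non-empty intersection means a lexical clash
            return False
    return True


# Made-up identifiers, for illustration only.
numeric_tables = {"targets": {"54", "359"}, "reactions": {"359", "455"}}
with_accessions = {"drugs": {"DB00001", "DB00945"}, "targets": {"54", "359"}}

tables_ok = can_share_namespace(numeric_tables)        # two numeric key sets
accessions_ok = can_share_namespace(with_accessions)   # prefixed + numeric
```

A check like this only covers the identifiers seen so far; it cannot show that a future release of the source stays clash-free.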

Other databases, like Reactome, use lexically unique (numeric) identifiers for *all* of their different types of data.  So, in this case, it is easy to put them all in one namespace.  But if you are advocating that we should distinguish between the types, then it would be difficult to determine the right namespace for the things that we must link to (often we must do this from an "xref" field).


m.





Peter Ansell

Jul 5, 2012, 7:28:09 PM
to bio...@googlegroups.com
Hi Michel,

The most significant part I am worried about is if we start
aggregating namespaces when upstream providers do not treat them the
same. In the case of drugbank, there are no prefixes on any of the
targets/molecules so they are bare numeric identifiers if you strip
the /molecule/ from the path that they have been given by
www.drugbank.ca, and hence they are not future proof if there is
another prefix with another set of overlapping numeric identifiers. In
the case of NCBI, everyone uses the entire identifier including the
prefix, knowing that NCBI minted the identifier to be unique across
their institution. In a similar way Reactome has a clear strategy for
their entire dataset, so we don't need to worry about multiple
namespaces as they have already filled their social contract by
guaranteeing the uniqueness across all of their datasets, even though
there is no textual prefix or other piece of information to further
narrow down the dataset that contains the intended target.

I led the discussion off on a tangent by referring to types, as it is
not necessarily type based distinctions that are important, there
could be multiple types in a single namespace or a single type
distributed across multiple namespaces. It might be more appropriate
to focus on determining whether the provider inserted enough
information into the identifier to reasonably rely on it on its own in
the long term without us adding more information by splitting the
identifier set out into its own namespace in Bio2RDF. In the case of
drugbank with their targets, I don't think they intended the bare
numeric identifier to be used on its own to identify that record
across all of the drugbank datasets. The current web interface at
www.drugbank.ca relies on the /molecule/ prefix to disambiguate it, so
I think it would be wise if we included that information ourselves. I
don't mind using drugbank: for the drug identifiers, as that will be
the most common usage in the Bio2RDF version. I used
"drugbank_drugs:" with the FU Berlin dataset as it contains many other
dimensions as well so it was easiest to disambiguate "drugbank:"
further. The other dimensions that FU Berlin include in their version
are based on the type of the record, and all are based on plain
numeric identifiers, so there was no choice but to use subnamespaces
for them.

If people are using bare xref's then in most cases they will know
exactly what they should be pointing at, as otherwise there will be
people who will have the same problem interpreting the data as us, or
some will have an algorithm for disambiguating the reference. In the
irefindex case I brought up yesterday they do not seem to disambiguate
between textual gene symbols and numeric ncbi entrez gene identifiers,
other than relying on the field name to make it obvious to a human
what is happening, so there are cases where we need to encode some
logic into the rdfisers to fix up ambiguous references. However,
ideally we should be able to naively use the reference against a small
number of namespaces exposed by the target dataset (ie,
drugbank_drugs:/drugbank: versus
drugbank_molecules:/drugbank_targets).

If the field indicates in any way that it semantically links to a
particular type of record in the target dataset (ie, not the
semantics-less "xref" field), then the list of candidate namespaces
should be even smaller. In the long term we can't avoid having rdfiser
authors search a registry using textual queries to identify a suitable
target namespace while they are making up their scripts. The list for
any given provider should be so small that it should be possible to
identify the intended destination namespace quickly, and if not we
would need to discuss the data dump with the provider so they can
disambiguate the links in future before we get to rdfising their
dataset. Data dumps are not useful if authors are that sloppy with
their xref conventions that you can't as a human determine which of
the namespaces in the destination they intended and convert the
observed patterns to an algorithm if necessary, as it is not an issue
with RDF or URIs at that stage, it is an issue with usage by any
scientist.

Peter

Michel Dumontier

Jul 6, 2012, 12:07:21 PM
to bio...@googlegroups.com
Peter,
 If we go the route of assigning different subnamespaces, then I also want to be able to aggregate on those namespaces, e.g.

if
drugbank_drug:DB00175
drugbank_target:4119

then, 
drugbank:DB00175
and drugbank:4119 are valid queries to the system, where we map our queries to those two namespaces. If we have documented the syntax of the identifier (e.g. drugbank_drug = DB[0-9]{5}), then we should be able to restrict these accordingly. Otherwise, it can query all the subnamespaces.
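
A minimal sketch of that mapping, assuming a hypothetical registry of identifier patterns per subnamespace. The drug pattern uses five digits to match identifiers such as DB00175; the purely numeric target pattern is an assumption. A subnamespace with no documented syntax (None) is always queried.

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical registry: subnamespace -> documented identifier syntax.
SUBNAMESPACE_PATTERNS: Dict[str, Optional[Pattern[str]]] = {
    "drugbank_drug": re.compile(r"DB[0-9]{5}"),
    "drugbank_target": re.compile(r"[0-9]+"),  # assumed syntax
}


def route(curie: str, patterns=SUBNAMESPACE_PATTERNS) -> List[str]:
    """Map a 'drugbank:ID' query onto the subnamespaces it could belong to."""
    prefix, identifier = curie.split(":", 1)
    if prefix != "drugbank":
        return [curie]
    return [
        f"{namespace}:{identifier}"
        for namespace, pattern in patterns.items()
        # Undocumented syntax (None) cannot be ruled out, so always include it.
        if pattern is None or pattern.fullmatch(identifier)
    ]
```

With the patterns above, drugbank:DB00175 maps only to drugbank_drug and drugbank:4119 only to drugbank_target; if no syntax were documented, both subnamespaces would be queried.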

and BTW, why aren't we getting results on 
when 

m.

Michel Dumontier

Jul 7, 2012, 12:51:26 PM
to bio...@googlegroups.com
Peter,
  what do you think about keeping drugbank: for DBXXXXX identifiers and drugbank_target for the target identifiers?

m.

Peter Ansell

Jul 7, 2012, 5:29:13 PM
to bio...@googlegroups.com
That would be a good route I think. I will make the changes, including
the ndc changes, to the webapp in the next few days then we can
release version 1.3.2 of the webapp and get the changes out to others
who want to use the webapp directly on their own sites.

I will make drugbank_drugs an alias for drugbank in the same way that
swissprot is an alias for uniprot, and drugbank_drugs: will then
automatically be resolved as if they queried for drugbank: so that the
URIs that have been floating around for the past few years will still
resolve.
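
Conceptually, the alias works like the sketch below. It is not the queryall implementation, just an illustration; the function name and the table layout are assumptions, and the two alias pairs are the ones mentioned above.

```python
# Alias namespace -> canonical namespace (pairs named in this thread).
NAMESPACE_ALIASES = {
    "drugbank_drugs": "drugbank",
    "swissprot": "uniprot",
}


def canonicalize(uri: str, base: str = "http://bio2rdf.org/") -> str:
    """Rewrite a Bio2RDF URI so that an aliased namespace becomes canonical."""
    if not uri.startswith(base):
        return uri
    namespace, separator, identifier = uri[len(base):].partition(":")
    if not separator:
        return uri  # not of the form namespace:identifier
    canonical = NAMESPACE_ALIASES.get(namespace, namespace)
    return f"{base}{canonical}:{identifier}"
```

An old link such as http://bio2rdf.org/drugbank_drugs:DB00945 would then be answered as http://bio2rdf.org/drugbank:DB00945.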

Peter

Peter Ansell

Jul 12, 2012, 9:18:30 PM
to bio...@googlegroups.com
I made the changes to the ndc and drugbank/drugbank_drugs namespaces
and released bio2rdf-webapp-1.3.2 today [1]. It has a matching 1.3.2
release of the queryall library [2], which should be built using
"mvn clean install" before installing bio2rdf-webapp.

Peter

[1] https://github.com/bio2rdf/bio2rdf-webapp/tree/v1.3.2
[2] https://github.com/queryall/queryall/tree/v1.3.2