Hi Michel,
My main worry is what happens if we start aggregating namespaces when
upstream providers do not treat them the same way. In the case of
drugbank, none of the targets/molecules carry a prefix, so they are
bare numeric identifiers once you strip the /molecule/ from the path
that www.drugbank.ca gives them. Hence they are not future proof if
another prefix appears with an overlapping set of numeric identifiers. In
the case of NCBI, everyone uses the entire identifier including the
prefix, knowing that NCBI minted the identifier to be unique across
their institution. In a similar way, Reactome has a clear strategy for
their entire dataset, so we don't need to worry about multiple
namespaces: they have already fulfilled their social contract by
guaranteeing uniqueness across all of their datasets, even though
there is no textual prefix or other piece of information to further
narrow down the dataset that contains the intended target.
I led the discussion off on a tangent by referring to types. Type-based
distinctions are not necessarily the important ones: there could be
multiple types in a single namespace, or a single type distributed
across multiple namespaces. It might be more appropriate to focus on
determining whether the provider put enough information into the
identifier for us to reasonably rely on it on its own in the long
term, without us adding more information by splitting the identifier
set out into its own namespace in Bio2RDF. In the case of
drugbank with their targets, I don't think they intended the bare
numeric identifier to be used on its own to identify that record
across all of the drugbank datasets. The current web interface at
www.drugbank.ca relies on the /molecule/ prefix to disambiguate it, so
I think it would be wise if we included that information ourselves. I
don't mind using drugbank: for the drug identifiers, as that will be
the most common usage in the Bio2RDF version. I used
"drugbank_drugs:" with the FU Berlin dataset because it contains many
other dimensions as well, so it was easiest to disambiguate
"drugbank:" further. The other dimensions that FU Berlin includes in
its version are based on the type of the record, and all use plain
numeric identifiers, so there was no choice but to use subnamespaces
for them.
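To make the point about keeping the path segment concrete, here is a
rough Python sketch. It is only an illustration under assumptions: the
function name, the mapping table, the /drugs/ path segment and the
example URLs are hypothetical, and "drugbank_molecules" is just one of
the candidate namespace names. Only the /molecule/ segment comes from
what www.drugbank.ca currently does.

```python
from urllib.parse import urlparse

# Hypothetical mapping from the provider's path segment to a namespace
# prefix. Only "molecule" is taken from the drugbank.ca URLs discussed
# here; "drugs" is an assumed segment used for illustration.
PATH_TO_NAMESPACE = {
    "molecule": "drugbank_molecules",
    "drugs": "drugbank",
}


def namespaced_id(url):
    """Keep the disambiguating path segment instead of stripping it."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 2:
        raise ValueError("unexpected path layout: " + url)
    segment, local_id = parts
    if segment not in PATH_TO_NAMESPACE:
        raise ValueError("no namespace known for segment: " + segment)
    return PATH_TO_NAMESPACE[segment] + ":" + local_id


# The same bare number under two segments stays distinguishable.
print(namespaced_id("http://www.drugbank.ca/molecule/54"))
print(namespaced_id("http://www.drugbank.ca/drugs/54"))
```

If the segment were stripped, both URLs would collapse to the bare
identifier 54, which is exactly the overlap I am worried about.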
If people are using bare xrefs then in most cases they will know
exactly what they should be pointing at; otherwise other people will
have the same problem interpreting the data as we do, unless they have
an algorithm for disambiguating the reference. In the irefindex case I
brought up yesterday, they do not seem to distinguish between textual
gene symbols and numeric NCBI Entrez Gene identifiers, other than
relying on the field name to make it obvious to a human what is
happening, so there are cases where we need to encode some logic into
the rdfisers to fix up ambiguous references. However, ideally we
should be able to naively use the reference against a small number of
namespaces exposed by the target dataset (i.e.,
drugbank_drugs:/drugbank: versus
drugbank_molecules:/drugbank_targets:).
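The kind of fix-up logic I mean for the irefindex case could look
roughly like the Python sketch below. This is an assumption-laden
illustration, not what any existing rdfiser does: the function name is
made up, the "geneid" and "symbol" namespace names are placeholders,
and the rule (all digits means an Entrez Gene identifier, anything
else is treated as a symbol) is the simplest pattern I could imagine
rather than one verified against their data.

```python
import re

# Assumed rule: a value made only of ASCII digits is a numeric gene
# identifier; anything else is treated as a textual gene symbol.
NUMERIC_ID = re.compile(r"[0-9]+")


def resolve_gene_ref(value):
    """Assign an ambiguous gene reference to a (hypothetical) namespace."""
    value = value.strip()
    if not value:
        raise ValueError("empty reference")
    if NUMERIC_ID.fullmatch(value):
        return "geneid:" + value
    return "symbol:" + value


print(resolve_gene_ref("7157"))
print(resolve_gene_ref("TP53"))
```

A rule like this only works when the two identifier sets cannot
overlap, which is why bare numeric identifiers shared by several
record types cannot be fixed up this way.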
If the field indicates in any way that it semantically links to a
particular type of record in the target dataset (i.e., not the
semantics-less "xref" field), then the list of candidate namespaces
should be even smaller. In the long term we can't avoid having rdfiser
authors search a registry using textual queries to identify a suitable
target namespace while they are writing their scripts. The list for
any given provider should be small enough that the intended
destination namespace can be identified quickly. If not, we would
need to discuss the data dump with the provider so they can
disambiguate the links in future, before we get to rdfising their
dataset. Data dumps are not useful if authors are so sloppy with
their xref conventions that a human can't determine which of the
destination namespaces they intended and, if necessary, convert the
observed patterns into an algorithm. At that stage it is not an issue
with RDF or URIs; it is an issue for any scientist using the data.
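As a rough sketch of how a registry lookup could narrow the
candidates, assuming a registry that records a provider and a record
type for each namespace (the registry layout, the entries and the
function name are all hypothetical, and the namespace names are just
the candidates mentioned above):

```python
# Hypothetical registry entries: (provider, namespace, record type).
REGISTRY = [
    ("drugbank", "drugbank", "drug"),
    ("drugbank", "drugbank_molecules", "target"),
]


def candidate_namespaces(provider, record_type=None):
    """List the namespaces a reference to this provider could mean.

    A bare "xref" field gives no record type, so every namespace of the
    provider is a candidate. A field with semantics narrows the list.
    """
    return [
        namespace
        for prov, namespace, rtype in REGISTRY
        if prov == provider and (record_type is None or rtype == record_type)
    ]


print(candidate_namespaces("drugbank"))
print(candidate_namespaces("drugbank", "target"))
```

If a lookup like this still returns more than a handful of candidates
for a provider, that is the point at which I would go back to the
provider rather than guess.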
Peter