should we have redundant identifiers for the same record (answer should be no)


Alan Ruttenberg

Apr 26, 2009, 8:28:15 PM
to shared...@googlegroups.com
But consider the following entries in LSRN:

HGCN, HGNC_name (both redirects are broken, btw, but that's beside the point).

These would be redundant ways to get at the same record.
I propose we do not (knowingly) have such cases as they harm the
ability to do data integration.

Another one for the agenda.

-Alan

Peter Ansell

Apr 26, 2009, 9:02:49 PM
to shared...@googlegroups.com
2009/4/27 Alan Ruttenberg <alanrut...@gmail.com>:

HGNC and either HGNC_symbol or HGNC_hugo might be more appropriate
names for the two distinct HGNC database namespaces. In Bio2RDF
somehow we have three namespaces for the two, through no fault of my own
(hgnc, hugo, and symbol). I would prefer HGNC and HUGO as the two
namespaces, as they seem to be the ones used historically. Incidentally,
the "symbol" namespace in Bio2RDF has been mashed up across a few
different databases by the current rdfisers for some reason, which needs
to be fixed in the future.

The difference is in whether a numeric (HGNC) [1] or a textual name
(HUGO) [2] is being used. At least with the Bio2RDF /html/ redirects,
things are still working for both namespaces. [3] and [4]

On the agenda should also be the issue of case-sensitivity. I have had
a large amount of trouble with the current assumption in Bio2RDF that
lowercasing everything is likely to be more consistent than keeping
the case of identifiers that are given by the provider. Obviously, I
would prefer that people just didn't modify the case that they find on
the official provider database interface or database dumps, but people
do inevitably modify them in some circumstances it seems. If you are
going to reference dbpedia/wikipedia for example, you can't go around
trying to normalise the case to an arbitrary standard like
lowercasing, as the identifiers are very case-sensitive with different
articles being referenced in some cases if you change the case.

Hopefully not giving too many ideas, but you might also want to
discuss what best practice preference you want to give to full percent
encoding of identifiers as opposed to either percent and plus encoding
for spaces, or no encoding at all. I would prefer a full percent
encoding within identifiers for any potentially reserved character,
and UTF-8 encoding prior to percent encoding for non-ASCII characters.
I don't like the typical urlencoding scheme that encodes spaces as
"+", because it creates ambiguities if people really have + symbols in
their identifiers; %20 is more consistent for space encoding.
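
To make the difference concrete, here is a minimal Python sketch of the two encodings. The helper name `encode_identifier` is made up for illustration, and the choice of `safe=""` (escape everything outside the unreserved set) is an assumption about what "full percent encoding" would mean in practice.

```python
from urllib.parse import quote, quote_plus


def encode_identifier(identifier: str) -> str:
    """Percent-encode every character outside the unreserved set.

    safe="" stops quote() from leaving "/" unescaped; non-ASCII characters
    are converted to UTF-8 bytes before being percent-encoded.
    """
    return quote(identifier, safe="", encoding="utf-8")


print(encode_identifier("Example Symbol"))  # Example%20Symbol
print(quote_plus("Example Symbol"))         # Example+Symbol (form-style encoding)
print(encode_identifier("Example+Symbol"))  # Example%2BSymbol
print(encode_identifier("caf\u00e9"))       # caf%C3%A9
```

With full percent encoding a literal "+" and a space never collide, whereas a reader of `Example+Symbol` cannot tell which of the two was meant without knowing which encoder produced it.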

As with you, I don't like the idea of redundant identifiers, except
for cases like HGNC/HUGO where both distinct namespaces are primary
keys on the database, and both useful for people trying to reference
the record. I definitely don't like the idea of redundant identifiers
within namespaces: allowing hugo:Example+Symbol, hugo:Example%20Symbol
and hugo:example%20symbol side by side would be a less than useful best
practice, in my opinion, from my experience working with Bio2RDF.

Cheers,

Peter

[1] http://qut.bio2rdf.org/hgnc:11813
[2] http://qut.bio2rdf.org/hugo:TIMELESS
[3] http://qut.bio2rdf.org/html/hgnc:11813
[4] http://qut.bio2rdf.org/html/hugo:TIMELESS

Alan Ruttenberg

Apr 26, 2009, 9:14:10 PM
to shared...@googlegroups.com
Hi Peter,

I think the issue is what the urls denote. For shared names we've said
the urls denote *records*, rather than *identifiers*.

What you suggest seems more along the lines that we should be naming
primary keys, i.e. identifiers.

If we have two URLs for the same *record* then we have a problem that
1/2 the people can use one, and 1/2 the people can use the other, and
then we are into trouble.

I don't have a principled reason to choose the numbers over the names
other than the fact that the names are unlikely to be as stable, and
therefore seemingly less suitable for a project such as ours.

On the issue of case sensitivity, I think we need to think of our
use-case - RDF. RDF identifiers are case sensitive (and indeed
canonicalization sensitive). So I think we can't tolerate different
case spelling of our identifiers and will need to figure out a
consistent policy.
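
A small Python sketch of why this matters. It models RDF subjects as plain strings, which is how they are compared; the `example.org` base, the `ex:` predicates and the three statements are invented for illustration only.

```python
from collections import defaultdict

# Three statements from hypothetical sources; two of them disagree on case.
triples = [
    ("http://example.org/hugo:TIMELESS", "ex:label", "from source A"),
    ("http://example.org/hugo:timeless", "ex:label", "from source B"),
    ("http://example.org/hugo:TIMELESS", "ex:seeAlso", "from source C"),
]

# Group statements by subject using exact string equality, as RDF does.
by_subject = defaultdict(list)
for subject, predicate, obj in triples:
    by_subject[subject].append((predicate, obj))

print(len(by_subject))  # 2
```

The statements from source B end up attached to a different node, so a query against the first spelling never sees them.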

There's no reason we can't include such auxiliary information as
alternative case spelling or other primary keys as metadata.

-Alan

Alan Ruttenberg

Apr 26, 2009, 9:34:26 PM
to shared...@googlegroups.com
On Sun, Apr 26, 2009 at 9:02 PM, Peter Ansell <ansell...@gmail.com> wrote:
> Hopefully not giving too many ideas, but you might also want to
> discuss what best practice preference you want to give to full percent
> encoding of identifiers as opposed to either percent and plus encoding
> for spaces, or no encoding at all. I would prefer a full percent
> encoding within identifiers for any potentially reserved character,
> and UTF-8 encoding prior to percent encoding for non-ASCII characters.
> I don't like the typical urlencoding scheme with encoding spaces as
> "+" because it creates ambiguities if people really have + symbols in
> their identifiers, %20 is more consistent for space encoding).

On this issue, I think the easier choice is to use the provider's
accession, and then use the http URL encoding rules. My understanding
is "+" isn't a standard replacement for space, and %20 is, so I think
you are right. It may be the case that "+" is application specific...

-Alan

Peter Ansell

Apr 26, 2009, 10:33:44 PM
to shared...@googlegroups.com
2009/4/27 Alan Ruttenberg <alanrut...@gmail.com>:

>
> Hi Peter,
>
> I think the issue is what the urls denote. For shared names we've said
> the urls denote *records*, rather than *identifiers*.
>
> What you suggest seems more along the lines that we should be naming
> primary keys, i.e. identifiers.
>
> If we have two URLs for the same *record* then we have a problem that
> 1/2 the people can use one, and 1/2 the people can use the other, and
> then we are into trouble.

Mmmm... I understand the dilemma but in real cases where someone makes
up two primary identifiers for the same record it may get complicated
deciding which one to use. Maybe there should just be a resolution
service that tells people the best identifier to use and clearly shows
that the other identifier shouldn't be used because we want to keep a
one-to-one URI-to-Record system for simplicity.

HGNC is the only case I have found that does this btw so it happens
that we have stumbled upon the outlier first. For what it is worth
most (but not all) people do actually use the numerical HGNC
identifier even though the symbol is nominally unique to a given
record. In most cases determining the identifier for a record won't be
an issue at all so you shouldn't put too much effort into it.

> I don't have a principled reason to choose the names over the numbers
> other than the fact that the names are unlikely to be as stable, and
> therefore seemingly less suitable for a project such as ours.

And HGNC do indeed change the textual symbols on occasion so you would
be more inclined to follow the numerical identifiers where possible
because of that.

> On the issue of case sensitivity, I think we need to think of our
> use-case - RDF. RDF identifiers are case sensitive (and indeed
> canonicalization sensitive). So I think we can't tolerate different
> case spelling of our identifiers and will need to figure out a
> consistent policy.
>
> There's no reason we can't include such auxiliary information as
> alternative case spelling or other primary keys as metadata.

Maybe there should be a flag on a namespace that says whether its
record identifiers are case sensitive... Not completely sure what the
purpose would be but it could have something to do with querying the
records in the namespace in the future.
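
One possible use of such a flag, sketched in Python under stated assumptions: the `Namespace` class, its `lookup_key` method and the `exampledb` namespace are all hypothetical, and only the case sensitivity of DBpedia identifiers comes from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    name: str
    case_sensitive: bool

    def lookup_key(self, identifier: str) -> str:
        """Return the key used to match a query against stored records."""
        return identifier if self.case_sensitive else identifier.casefold()


dbpedia = Namespace("dbpedia", case_sensitive=True)
exampledb = Namespace("exampledb", case_sensitive=False)  # hypothetical namespace

print(dbpedia.lookup_key("Java") == dbpedia.lookup_key("java"))      # False
print(exampledb.lookup_key("ABC1") == exampledb.lookup_key("abc1"))  # True
```

The flag only affects how queries are matched; the identifier stored in the URI can still be kept exactly as the provider spells it.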

Cheers,

Peter

Alan Ruttenberg

Apr 26, 2009, 10:53:03 PM
to shared...@googlegroups.com
On Sun, Apr 26, 2009 at 10:33 PM, Peter Ansell <ansell...@gmail.com> wrote:
>
> 2009/4/27 Alan Ruttenberg <alanrut...@gmail.com>:

> Mmmm... I understand the dilemma but in real cases where someone makes
> up two primary identifiers for the same record it may get complicated
> deciding which one to use. Maybe there should just be a resolution
> service that tells people the best identifier to use and clearly shows
> that the other identifier shouldn't be used because we want to keep a
> one-to-one URI-to-Record system for simplicity.

FWIW, I think this is the right line of thinking: what services
can/should be built *on top* of the shared names stuff? By stripping
away all but the role of uniquely identifying the record and providing
some minimal, predictable metadata, we keep the project within a scope
where we're more likely to get broader consensus.

I've heard more than one useful-sounding service mentioned in the
context of these discussions, things that I think should be done, but
not by this project, in the interest of staying focused. For example,
the question of how one finds the URI for the thing one wants to refer
to is an important question, and a service that helps do this would be
great. I don't think it's within scope of sn to do that, but it is
within scope to listen for what requirements might impinge on what
we're doing and try to support such activities as best possible.

-Alan

Eric Prud'hommeaux

Apr 26, 2009, 11:35:11 PM
to shared...@googlegroups.com

iirc, RFC3986 (URI spec) specifies only %-encoding, and HTML adds the
'+' shorthand only for spaces in CGI parms. Actual '+'s in CGI parms
or anywhere in IRIs must be encoded as %2B, so "foo bar"="bar+baz"
looks like <http://a.example/?foo+bar=bar%2Bbaz> .

Encoding functions like xpath's encode-for-uri s/ /%20/ so if we want
people to access shared names without standing on their heads, then
yes, %20 is the way to go.
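
The same behaviour can be checked with Python's standard library, used here only as a stand-in for the HTML form encoding and for a generic URI-encoding function (it is not the XPath function itself):

```python
from urllib.parse import parse_qs, quote, urlencode

# Form-style (CGI parameter) encoding: space becomes "+", a real "+" becomes %2B.
query = urlencode({"foo bar": "bar+baz"})
print(query)            # foo+bar=bar%2Bbaz

# Decoding the query string recovers the original name and value.
print(parse_qs(query))  # {'foo bar': ['bar+baz']}

# Generic URI percent-encoding: space becomes %20.
print(quote("foo bar"))  # foo%20bar
```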

> -Alan

--
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(er...@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.


Sergei Egorov

Apr 26, 2009, 11:32:44 PM
to shared...@googlegroups.com

To ensure that an ID-derived URI is a unique character string, as required by the RDF URI
specification, we can do something like this:

- first, the identifier is converted to the canonical octet sequence by lowercasing it if the
namespace is not case sensitive, and converting the non-ASCII characters to utf-8
sequences. Sequential numerical identifiers should have their leading zeroes removed,
unless the namespace specifies fixed-length format.

- the set of ID characters which must not be urlencoded should be fixed. My
understanding of rfc2396 is that this set is [-_.!~*\'()a-zA-Z0-9] (so called "unreserved" characters).

- all other characters must be urlencoded using %xx scheme where both hex
digits are lowercase.

- the case of the database name part should be fixed

- the rest of the URI should be in lowercase and allow no spelling variations.
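
The steps above can be sketched in Python as follows. This is only an illustration of the proposal, not an agreed scheme: the function names, the `base` URL and the `database:identifier` layout (borrowed from the Bio2RDF links earlier in the thread) are assumptions.

```python
import re
from urllib.parse import quote

# Unreserved punctuation from RFC 2396 that should stay unescaped.
# quote() already leaves letters, digits, "_", "." and "-" alone.
EXTRA_SAFE = "!~*'()"


def canonical_id(identifier: str, case_sensitive: bool, fixed_length: bool = False) -> str:
    # Lowercase only when the namespace is declared case-insensitive.
    if not case_sensitive:
        identifier = identifier.lower()
    # Strip leading zeroes from purely numeric IDs unless the width is fixed.
    if identifier.isascii() and identifier.isdigit() and not fixed_length:
        identifier = identifier.lstrip("0") or "0"
    # UTF-8 encode, then percent-encode everything outside the unreserved set.
    encoded = quote(identifier, safe=EXTRA_SAFE, encoding="utf-8")
    # quote() emits uppercase hex digits; the proposal asks for lowercase.
    return re.sub(r"%[0-9A-F]{2}", lambda m: m.group(0).lower(), encoded)


def canonical_uri(base: str, database: str, identifier: str,
                  case_sensitive: bool, fixed_length: bool = False) -> str:
    # The database name is used exactly as registered; the rest is lowercased.
    return f"{base.lower()}{database}:{canonical_id(identifier, case_sensitive, fixed_length)}"


print(canonical_id("007157", case_sensitive=False))          # 7157
print(canonical_id("Example Symbol", case_sensitive=False))  # example%20symbol
print(canonical_uri("http://example.org/", "hgnc", "11813", case_sensitive=False))
```

Whether a namespace is case sensitive or fixed-length has to be supplied by the caller, which is exactly the per-namespace metadata the proposal depends on.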

Sergei

Alan Ruttenberg

Apr 27, 2009, 5:55:05 AM
to shared...@googlegroups.com
On Sun, Apr 26, 2009 at 11:32 PM, Sergei Egorov <e...@acm.org> wrote:
>
>
> To ensure that ID-derived URI is a unique character string as required by RDF URI
> specification we can do something like this:
>
> - first, the identifier is converted to the canonical octet sequence by lowercasing it if the
> namespace is not case sensitive, and converting the non-ASCII characters to utf-8
> sequences. Sequential numerical identifiers should have their leading zeroes removed,
> unless the namespace specifies fixed-length format.
>
> - the set of ID characters which must not be urlencoded should be fixed. My
> understanding of  rfc2396 is that this set is  [-_.!~*\'()a-zA-Z0-9]  (so called "unreserved" characters).
>
> - all other characters must be urlencoded  using %xx scheme where both hex
> digits are lowercase.

Almost. The so-called reserved characters do not need to be url
encoded in all schemes.

"Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding."

So even though "+", for example, is in the "reserved" set, it doesn't
need to be escaped in an http URI because it plays no special role in
that scheme.

As oracle, I've been using the following constructor from
http://java.sun.com/j2se/1.5.0/docs/api/java/net/URI.html

URI(String scheme, String authority, String path, String query, String
fragment)
Constructs a hierarchical URI from the given components.
>
> - the case of the database name part should be fixed

Here the question will be whether we adopt an always-lower-case
convention or not. Common usage has mixed case so one way to go is to
continue this practice. OTOH it's pretty arbitrary how the casing is
done and predictability would suggest going with a single case
uniformly would result in fewer errors.

Peter Ansell

Apr 27, 2009, 6:19:13 AM
to shared...@googlegroups.com
2009/4/27 Sergei Egorov <e...@acm.org>:

>
>
> To ensure that ID-derived URI is a unique character string as required by RDF URI
> specification we can do something like this:
>
> - first, the identifier is converted to the canonical octet sequence by lowercasing it if the
> namespace is not case sensitive, and converting the non-ASCII characters to utf-8
> sequences. Sequential numerical identifiers should have their leading zeroes removed,
> unless the namespace specifies fixed-length format.

You can't lowercase the identifier and expect it to be unique. DBpedia
for a single example would break if you did that. Removing leading
zeroes seems to be just as dramatic as lowercasing the whole thing and
will likely cause a lot more issues than it will fix. If the
identifier is seen as an opaque set of characters we wouldn't have to
suggest changes to it. There is nothing special about numerical
strings with a large number of zeroes that makes them easier sets of
characters to change in my opinion.

> - the set of ID characters which must not be urlencoded should be fixed. My
> understanding of  rfc2396 is that this set is  [-_.!~*\'()a-zA-Z0-9]  (so called "unreserved" characters).

What do you mean by "fixed"?

> - all other characters must be urlencoded  using %xx scheme where both hex
> digits are lowercase.

Keep in mind that other people have suggested that when people percent
encode things they should keep the hex digits as uppercase... [1] for
a random example. I haven't actually heard it suggested before that
people lowercase hex digits but it is likely to have been done
somewhere I guess.
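
Whichever case is chosen, the two spellings are different strings until someone normalises them. A minimal Python sketch, assuming the uppercase convention recommended in RFC 3986; the function name and the example URIs are made up:

```python
import re

# Matches one percent-escape: "%" followed by two hex digits of either case.
_PCT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_escapes(uri: str) -> str:
    """Uppercase the hex digits of each percent-escape, leaving all else alone."""
    return _PCT_ESCAPE.sub(lambda m: m.group(0).upper(), uri)


a = "http://example.org/hugo:caf%c3%a9"
b = "http://example.org/hugo:caf%C3%A9"
print(a == b)                                        # False
print(uppercase_escapes(a) == uppercase_escapes(b))  # True
```

Only the escapes are touched, so a case-sensitive identifier elsewhere in the URI keeps the provider's spelling.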

> - the case of the database name part should be fixed

I fail to understand why you insist on lowercase for the identifier
which is much more brittle than the namespace, and not insist on
lowercasing the database name part which is not brittle as we were the
ones who created it.

I am not sure that I completely understand you though. Do you mean
fixed at all uppercase or all lowercase? Would the namespace be the
only part of the URI that could have uppercase characters for some
reason?

If the database name is used as an identifier for a databank
description, would it perhaps be lowercased? Would it be the exception
to the rule (other than dbpedia and other wikis, which are case
sensitive)?

> - the rest of the URI should be in lowercase and allow no spelling variations.

I agree that the scheme and authority should be lowercased, as per
common convention.

Trying to create a new convention will be harder than you realise, I
think, and avoiding issues with case-sensitive identifiers from the
start would be prudent. If databases currently use uppercase
for their local and foreign identifiers what is the rationale behind
making them lowercase for a new common naming scheme? I think it would
be better just to publish specifications about each database and
people would know for a particular database what the common convention
was, independent of this scheme, so that when they eventually come to
use the scheme after they see how useful RDF is they won't have to
battle with ambiguities if they ever reference a database that relies
on its private identifiers not being changed in order to access it
right.

(Sorry if I come across a little terse... I have had to figure out how
to make up hacks to fix issues related to experimental lowercasing of
identifiers, where the database doesn't even use lowercase itself)

Cheers,

Peter

[1] http://tools.ietf.org/html/rfc3986

Sergei Egorov

Apr 27, 2009, 10:02:43 AM
to shared...@googlegroups.com
From: "Peter Ansell" <ansell...@gmail.com>

>You can't lowercase the identifier and expect it to be unique. DBpedia
>for a single example would break if you did that.

This means that DBpedia identifiers are case-sensitive and should
not be lowercased. The lowercasing rule is only applicable to situations
where all case variations work equally well and refer to the same record.

>Removing leading
>zeroes seems to be just as dramatic as lowercasing the whole thing and
>will likely cause a lot more issues than it will fix. If the
>identifier is seen as an opaque set of characters we wouldn't have to
>suggest changes to it. There is nothing special about numerical
>strings with a large number of zeroes that makes them easier sets of
>characters to change in my opinion.

There are many databases which ignore leading zeroes in identifiers. For
example, in Entrez Gene both 7157 and 007157 work and mean the same.
We have to choose a single representative ID from this equivalence class,
and the "no leading zeroes" rule fixes the choice.

>> - the set of ID characters which must not be urlencoded should be fixed. My
>> understanding of rfc2396 is that this set is [-_.!~*\'()a-zA-Z0-9] (so called "unreserved" characters).
>
>What do you mean by "fixed"?

I mean that we should agree on this set and everybody should use the same set.


>> - all other characters must be urlencoded using %xx scheme where both hex
>> digits are lowercase.
>
>Keep in mind that other people have suggested that when people percent
>encode things they should keep the hex digits as uppercase... [1]

Any case works, as long as everybody uses the same case. I proposed lower
case because it is the case usually used in RFC examples.


>> - the case of the database name part should be fixed
>
>I fail to understand why you insist on lowercase for the identifier
>which is much more brittle than the namespace, and not insist on
>lowercasing the database name part which is not brittle as we were the
>ones who created it.
>
> I am not sure that I completely understand you though. Do you mean
> fixed at all uppercase or all lowercase?

I meant to say that database names should be spelled exactly as specified, with
no allowance for case variation. We could enforce either all-lower or all-upper
case for database names, but it is not strictly necessary as long as everybody
uses the same case-sensitive form (which may be problematic - see below).


>Trying to create a new convention will be harder than you realise I
>think and avoiding issues with case sensitive identifiers from the
>start would be prudent I think. If databases currently use uppercase
>for their local and foreign identifiers what is the rationale behind
>making them lowercase for a new common naming scheme? I think it would
>be better just to publish specifications about each database and
>people would know for a particular database what the common convention
>was, independent of this scheme, so that when they eventually come to
>use the scheme after they see how useful RDF is they won't have to
>battle with ambiguities if they ever reference a database that relies
>on its private identifiers not being changed in order to access it
>right.

The downside to treating all databases as having case-sensitive
identifiers is that people frequently ignore case differences and many
databases don't enforce their "common conventions" even when they
exist. Lowercasing is harsh, but has a big benefit of being easy to
remember and enforce. Nice, culturally sensitive but hard to remember
conventions tend to be ignored by a significant part of the population,
and if you do not have any strong early detection and enforcing
mechanism, the variations will proliferate. In my experience, in
situations when the outcome depends on a high degree of consistency
in identifier spelling, strict and simple rules work much better than
complex and culturally sensitive ones.

Regards,
Sergei


Michel_Dumontier

Apr 27, 2009, 10:34:11 AM
to shared...@googlegroups.com
+1 on having a normalized dataset namespace that is upper or lower-cased.

When it comes to individual identifiers, it gets hard. There should be a regex specification for the identifier, so that we can determine whether Entrez Gene's 7157 is canonically with or without the zeroes (and, conversely, that the OBO identifiers are each x-character identifiers including zeroes).

Whether the expression is enforced to be case insensitive is a matter of debate, but it makes things easier, internally, if we expect identifiers to be case sensitive and normalized to one case or another. It should be noted that the normal convention for bioinformatics identifiers is that they are in upper case.
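
A per-namespace regex specification could look roughly like this in Python. The registry, its keys and the two patterns are assumptions for illustration: one reading of "no leading zeroes" for Entrez Gene, and a seven-digit fixed-width pattern as an example of an OBO-style identifier.

```python
import re

# Hypothetical registry: one pattern per namespace describing the canonical form.
ID_PATTERNS = {
    "geneid": re.compile(r"[1-9][0-9]*"),  # numeric, no leading zeroes
    "go": re.compile(r"[0-9]{7}"),         # fixed width, zeroes are significant
}


def is_canonical(namespace: str, identifier: str) -> bool:
    """True if the whole identifier matches the namespace's canonical pattern."""
    return ID_PATTERNS[namespace].fullmatch(identifier) is not None


print(is_canonical("geneid", "7157"))    # True
print(is_canonical("geneid", "007157"))  # False
print(is_canonical("go", "0008150"))     # True
print(is_canonical("go", "8150"))        # False
```

Because the patterns are compiled without `re.IGNORECASE`, matching is case sensitive by default, which fits the suggestion of normalising each namespace to one case.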


-=Michel=-
