Fwd: IDs for citations

Peter Sefton

May 12, 2007, 5:18:46 PM

to zoter...@googlegroups.com

In discussion over at the forums we've been talking about the plugin.

<http://forums.zotero.org/discussion/804/word-plugin-status-update/#Item_12>

I posted some thoughts and questions regarding citation IDs to the
end of that thread, but as there has been no response over there I
thought I'd cross-post here:

Returning to IDs.

There is a real problem with the current model. I have tried exporting
references, getting someone else to import them and trying to format a
bibliography with the plugin, but it fails because of the arbitrary
IDs assigned by Zotero.

It looks like this will be a problem even for a single user. What if
I move computers and import my references? Unless you happen to be
working with a portable Zotero installation from the start, it looks
like the bibliography plugin is not going to work long term.

Bruce has posted here before about using a better identifier. I'll
quote what he wrote to me, with minor changes here.

"The details need to be worked out, but I think we need rules; e.g.:

- if an online resource, use the perma-URL
- else if item has a DOI, use it in info URI form (info:doi/xxxxxxxx)
[To which I add that handles in general should be treated this way,
not just DOIs which are a special case of handles]
- if book, use worldcat.org accession number URI; else use urn:isbn"

Now almost all Zotero resources will have a URL, so why can't we use
that as the ID, with the other strategies if that fails, and in the
worst case at least try to use something like a version of the title.
This is not quite the utopian situation Bruce describes as people can
still use dodgy URLs but it would at least solve the short term
problem that the plugin is severely limited in its utility at present
by the use of the current IDs.
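
To make the proposed fallback order concrete, here is a minimal sketch of
how an identifier might be chosen. It is not Zotero code: the dict keys
(url, doi, oclc, isbn, title) and the "title:" prefix for the last-resort
case are assumptions made up for illustration, and the DOI is written in
the registered info URI form (info:doi/...).

```python
import re


def choose_identifier(item: dict) -> str:
    """Pick the most global identifier available for an item.

    The field names are hypothetical; a real item model may differ.
    """
    if item.get("url"):
        # Online resource: use its (ideally permanent) URL.
        return item["url"]
    if item.get("doi"):
        # The info URI scheme separates namespace and value with a slash.
        return "info:doi/" + item["doi"]
    if item.get("oclc"):
        # WorldCat accession (OCLC) number expressed as a URI.
        return "http://www.worldcat.org/oclc/" + item["oclc"]
    if item.get("isbn"):
        return "urn:isbn:" + item["isbn"]
    # Worst case: a normalised version of the title (made-up "title:" prefix).
    slug = re.sub(r"[^a-z0-9]+", "-", item.get("title", "").lower()).strip("-")
    return "title:" + slug
```

A title-based identifier is only as stable as the title itself, so it
would still break if the title is edited later.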

It would also allow a new workflow where the plugin could find links,
and ask Zotero to make new records for them if they don't exist.

There may be some issues with multiple citations pointing to the same
URL, but the user should be warned about that situation anyway.

Does Zotero have the APIs to do this at the moment? Where would one
look to change the code?

--

Peter Sefton
Senior Research Fellow / RUBRIC Technical Manager
RUBRIC Project, DeC
University of Southern Queensland
Toowoomba Queensland 4350 AUSTRALIA

Work: sef...@usq.edu.au
Private: p...@ptsefton.com

p: +61 (0)7 4631 1640
m: +61 (0)410 326 955

RUBRIC Website: http://www.rubric.edu.au
USQ Website: http://www.usq.edu.au
Personal Website: http://ptsefton.com

RUBRIC is supported by the Systemic Infrastructure Initiative as part of
the Commonwealth Government's Backing Australia's Ability - An
Innovative Action Plan for the Future
(http://backingaus.innovation.gov.au)

The University of Southern Queensland is a registered provider of
education with the Australian Government.

(CRICOS Codes: QLD 00244B | NSW 02225M | VIC 02387D | WA 02521C)

Bruce D'Arcus

May 12, 2007, 8:21:58 PM

to zotero-dev

On May 12, 5:18 pm, "Peter Sefton" <ptsef...@gmail.com> wrote:

> There is a real problem with the current model. I have tried exporting
> references, getting someone else to import them and trying to format a
> bibliography with the plugin, but it fails because of the arbitrary
> IDs assigned by Zotero.
>
> Looks like this will also be a problem even for a single user. What if
> I move computers and import my references? Unless you happen to be
> working with a portable Zotero installation from the start then looks
> like the bibliography plugin is not going to work long term.

Right. Using local IDs is at best a short-term expedient.

This is one of those high-level details that really needs to be
addressed sooner rather than later, and certainly before the server
stuff gets even designed, much less built.

> Bruce has posted here before about using a better identifier. I'll
> quote what he wrote to me, with minor changes here.
>
> "The details need to be worked out, but I think we need rules; e.g.:
>
> - if an online resource, use the perma-URL
> - else if item has a DOI, use it in info URI form (info:doi/xxxxxxxx)
> [To which I add that handles in general should be treated this way,
> not just DOIs which are a special case of handles]
> - if book, use worldcat.org accession number URI; else use urn:isbn"
>
> Now almost all Zotero resources will have a URL, so why can't we use
> that as the ID, with the other strategies if that fails, and in the
> worst case at least try to use something like a version of the title.

In ODF, citations almost certainly will be implemented using RDF,
where the identity of citation sources will always be established
using URIs. I think this is a good practice in any case.

I think the way to solve the cases where you don't know what URI to
use is to assign it one anyway (a URN UUID, or even some zotero.org
identifier), and also include the raw metadata (or at least a minimal
form of it).
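
A rough sketch of that idea, assuming nothing about Zotero's internals:
mint a URN UUID when no better URI is known and keep a copy of the raw
metadata next to it. The function name and the dict layout are
hypothetical.

```python
import uuid


def mint_citation(metadata: dict) -> dict:
    """Assign a fresh urn:uuid identifier and embed the raw metadata.

    The returned layout ({"id": ..., "metadata": ...}) is illustrative only.
    """
    return {
        "id": uuid.uuid4().urn,        # e.g. "urn:uuid:<36-character UUID>"
        "metadata": dict(metadata),    # shallow copy, so later edits do not leak in
    }
```

The embedded metadata is what lets another tool make sense of the
citation even when it cannot resolve the identifier.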

> This is not quite the utopian situation Bruce describes as people can
> still use dodgy URLs but it would at least solve the short term
> problem that the plugin is severely limited in its utility at present
> by the use of the current IDs.

Right. Standardizing on URIs does not get us immediately to nirvana,
but it does get us a lot closer, and moving towards the notion of
distributed resources identified by URI is also an approach that can
be refined and improved over time.

> It would also allow a new workflow where the plugin could find links,
> and ask Zotero to make new records for them if they don't exist.
>
> There may be some issues with multiple citations pointing to the same
> URL ...

How so? If they use the same URI (not URL), they ought to refer to the
same thing.

Bruce

Peter Sefton

May 12, 2007, 8:33:56 PM

to zoter...@googlegroups.com

Bruce asks what I meant by this:

> > There may be some issues with multiple citations pointing to the same
> > URL ...
>
> How so? If they use the same URI (not URL), they ought to refer to the
> same thing.
>
> Bruce

Yep - they should, but if you have more than one record for the same
thing then which one should Zotero use? The first one it finds? How
will it go about warning you that you have two? Just interface issues,
and the best way to find a solution is to get a prototype up and
running.
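
As a starting point for such a prototype, duplicate detection could be as
simple as grouping records by URI. This is only a sketch; the record
layout (a dict with a "uri" key) is an assumption, not Zotero's data
model.

```python
from collections import defaultdict


def find_duplicates(records: list) -> dict:
    """Return {uri: [records...]} for every URI held by more than one record."""
    by_uri = defaultdict(list)
    for record in records:
        by_uri[record["uri"]].append(record)
    return {uri: recs for uri, recs in by_uri.items() if len(recs) > 1}
```

Whatever the interface ends up looking like, the result of a check like
this is what the user would need to be warned about.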

The ICE team is currently working on the word-processor plugin side;
would anyone care to tackle the Zotero side of this?

Peter

Matthias Steffens

May 12, 2007, 9:43:38 PM

to zoter...@googlegroups.com

On 13-May-07 at 10:33 +1000 Peter Sefton wrote:

> Bruce asks what I meant by this:
>
> > > There may be some issues with multiple citations pointing to
> > > the same URL ...
> >
> > How so? If they use the same URI (not URL), they ought to refer
> > to the same thing.

> Yep - they should but if you have more than one record for the same
> thing then which one should Zotero use? The first one it finds? How
> will it go about warning you that you have two?

This is a very crucial point. We discussed this at length on the
xbiblio mailing list some time ago, but to no real avail.

Please don't forget real-world issues. There won't necessarily be a
single record pointing to a resource. There will be duplicate
records (sometimes for unavoidable reasons). And there will be
differences between variants of the same resource which are not
negligible.

> Just interface issues

I'd like to stress that these are *not* only interface issues. It's
crucial that (whenever possible) an identifier will resolve to a
user's personal & local (i.e. a trusted/approved) resource, and not
to an arbitrary resource somewhere on the net. Only if that local
resource cannot be resolved, then a global identifier should be used.

It's important that *both* local (identified by a full URL) and
global identifiers are stored.

Matthias

Bruce D'Arcus

May 12, 2007, 11:01:34 PM

to zoter...@googlegroups.com

On 5/12/07, Matthias Steffens <matthias...@googlemail.com> wrote:

...

> > > How so? If they use the same URI (not URL), they ought to refer
> > > to the same thing.
>
> > Yep - they should but if you have more than one record for the same
> > thing then which one should Zotero use? The first one it finds? How
> > will it go about warning you that you have two?
>
> This is a very crucial point. We've discussed this at length earlier
> on the xbiblio mailing list some time ago but to no real avail.
>
> Please don't forget real-world issues. There won't be necessarily a
> single record pointing to a resource. There will be duplicate
> records (sometimes for unavoidable reasons). And there will be
> differences between variants of the same resource which are not
> negligible.
>
> > Just interface issues
>
> I'd like to stress that these are *not* only interface issues. It's
> crucial that (whenever possible) an identifier will resolve to a
> user's personal & local (i.e. a trusted/approved) resource, and not
> to an arbitrary resource somewhere on the net.

OK, but this (trust) is not the same as identification.

> Only if that local
> resource cannot be resolved, then a global identifier should be used.
>
> It's important that *both*, local (identified by a full URL) as well
> as global identifiers are stored.

As above, I don't see how what you say up top results in this conclusion.

If I have a citation source identified in my document as
"http://ex.net/1" it seems reasonable enough that the formatter will
look first (or maybe even only) in my local and/or user data. That
someone else has metadata about the same resource in their database is
no problem at all.

You don't need any local ID, and it's counter-productive to use them.
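
To illustrate the lookup order I have in mind, here is a minimal sketch.
The same URI is used throughout; only the place we look changes. The
function and parameter names are invented for the example.

```python
from typing import Callable, Optional


def resolve(
    uri: str,
    local_records: dict,
    remote_lookup: Optional[Callable[[str], Optional[dict]]] = None,
) -> Optional[dict]:
    """Look a URI up in the user's own data first, then (optionally) elsewhere."""
    record = local_records.get(uri)
    if record is not None:
        return record              # the user's own, trusted copy wins
    if remote_lookup is not None:
        return remote_lookup(uri)  # fall back to some other source
    return None
```

Trust decides where to look; it does not change the name of the thing
being looked up.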

Bruce

Matthias Steffens

May 13, 2007, 6:29:57 AM

to zoter...@googlegroups.com

On 12-May-07 at 23:01 -0400 Bruce D'Arcus wrote:

> > It's crucial that (whenever possible) an identifier will resolve
> > to a user's personal & local (i.e. a trusted/approved) resource,
> > and not to an arbitrary resource somewhere on the net.
>
> OK, but this (trust) is not the same as identification.

Yes, that may be true. Please don't get me wrong, I fully agree with
you that global identifiers are key, but want to point out that a
global identifier should resolve to a preferred resource whenever
possible.

Correct formatting of a reference list may in fact depend on a local
record, which may have accurate retrieved-date information for
online articles, contain translated titles to be used, or have
author names corrected with accents/umlauts -- all information that
may not be available from another representation of this resource.

> If I have a citation source identified in my document as
> "http://ex.net/1" it seems reasonable enough that the formatter
> will look first (or maybe even only) in my local and/or user data.

This must be possible, that's my point.

If a record contains only the global identifier (such as a DOI
number), then the above will only work if the Word/OOo document
includes the base URL of my local bibliographic database as well as
my user account name. Otherwise it won't be able to query my local
database.

So, yes, if my database/account info is transferred with the
document, global-only identifiers might work. However, my personal
experience tells me that a system is more robust if the individual
pieces (i.e. the single citations/references) are fully
self-describing. As a simple example, what would happen if a user
copies a citation from one document to another? Local access
information would be lost if not included within the actual citation
element. As a result, formatting in the new document may differ.
People will complain about this, rightly so IMHO.
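
A small sketch of what I mean by a self-describing citation, with
made-up field names: the global URI, a copy of the metadata, and the
access info for the preferred database all live in the citation itself,
so copying the citation copies everything.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Citation:
    uri: str                                       # global identifier
    metadata: dict = field(default_factory=dict)   # embedded raw metadata
    source: str = ""                               # preferred db base URL + account (hypothetical)


def copy_citation(citation: Citation) -> Citation:
    """Copy a citation into another document without losing any of its parts."""
    return copy.deepcopy(citation)
```

Because the copy is deep, editing the metadata in the new document does
not silently change the original.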

Matthias

Bruce D'Arcus

May 13, 2007, 7:57:27 AM

to zotero-dev

On May 13, 6:29 am, Matthias Steffens
<matthias.steff...@googlemail.com> wrote:
> On 12-May-07 at 23:01 -0400 Bruce D'Arcus wrote:
>
> > > It's crucial that (whenever possible) an identifier will resolve
> > > to a user's personal & local (i.e. a trusted/approved) resource,
> > > and not to an arbitrary resource somewhere on the net.
>
> > OK, but this (trust) is not the same as identification.
>
> Yes, that may be true. Please don't get me wrong, I fully agree with
> you that global identifiers are key, but want to point out that a
> global identifier should resolve to a preferred resource whenever
> possible.
>
> Correct formatting of a reference list may in fact depend on a local
> record, which may have accurate retrieved-date information for
> online articles, contain translated titles to be used, or have
> author names corrected with accents/umlauts -- all information that
> may not be available from another representation of this resource.

If you look at the examples you've given, these are situations where
the user is forced to fix bugs (in, for example, original data, or
Zotero translators). None of that data (save for one) is in any way
user-specific.

Example: worldcat.org is an *excellent* source of data (and good
URIs!). But it seems Zotero sources what is ultimately MARC data,
where they have this dumb convention of doing titles like "The title :
the subtitle". So we have a sentence-cased title and the colon delimiter
has an extra space preceding it. So naturally I have to manually change
these titles to remove the spurious space and to make it full title
case; e.g. "The Title: The Subtitle".

First, we can all agree Zotero ought to strip the extra space. Some
might say it should not mess with the title casing, perhaps because in
their locale or with their journals, titles use the same simple
casing. Well, OK, but it could be a user-level config flag for display.
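
Roughly what I mean, as a sketch rather than anything Zotero actually
does: always drop the space before the colon, and only change the casing
when a (hypothetical) user-level flag asks for it. The title casing here
is deliberately naive; it capitalises every word, including short ones.

```python
import re


def normalize_title(title: str, title_case: bool = False) -> str:
    """Clean up a MARC-style title such as "The title : the subtitle"."""
    # Remove the spurious whitespace that precedes the colon delimiter.
    cleaned = re.sub(r"\s+:", ":", title.strip())
    if title_case:
        # Naive title casing: upper-case the first letter of every word.
        cleaned = " ".join(w[:1].upper() + w[1:] for w in cleaned.split(" "))
    return cleaned
```

With the flag off, "The title : the subtitle" becomes "The title: the
subtitle"; with it on, "The Title: The Subtitle".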

Retrieved dates are not really that important to be tied to users. All
a retrieved date says is "this resource was available at this URL, on
this date". If I'm collaborating with someone who accessed the same
record on a different date, just use the latest; that's all that
matters.

> > If I have a citation source identified in my document as
> > "http://ex.net/1" it seems reasonable enough that the formatter
> > will look first (or maybe even only) in my local and/or user data.
>
> This must be possible, that's my point.
>
> If a record contains only the global identifier (such as a DOI
> number), then the above will only work if the Word/OOo document
> includes the base URL of my local bibliographic database as well as
> my user account name. Otherwise it won't be able to query my local
> database.

Yes, but that's not tied to the citations per se. You can just have it
as a configuration parameter which you store somewhere in the
document.

> So, yes, if my database/account info is transferred with the
> document, global-only identifiers might work. However, my personal
> experience tells me that a system is more robust if the individual
> pieces (i.e. the single citations/references) are fully
> self-describing.

That's fair; it's helpful to include the full metadata that
corresponds to these IDs. I agree.

> As a simple example, what would happen if a user
> copies a citation from one document to another? Local access
> information would be lost if not included within the actual citation
> element. As a result, formatting in the new document may differ.

I can only speak of how we are talking about it in ODF and OOo: if you
copy a citation to another document, all the citation metadata --
including the URI -- gets copied as well. Copying a trusted source
property should be easy too.

> People will complain about this, rightly so IMHO.

These details are solvable problems. But the current situation is
this:

Citations in documents are tied not only to specific applications and
users, but to specific instances of a database. E.g. this is not a
robust system if I work on multiple machines, or collaborate with
other authors, or in particular if those authors use different
bibliographic databases. I consider that unacceptable. Don't you?

The only way I can see to solve this is to use identifiers that are
application/database-independent. URIs fit those criteria.

Bruce

Matthias Steffens

May 13, 2007, 8:46:19 AM

to zoter...@googlegroups.com

On 13-May-07 at 11:57 -0000 Bruce D'Arcus wrote:

> > Correct formatting of a reference list may in fact depend on a
> > local record, which may have accurate retrieved-date information
> > for online articles, contain translated titles to be used, or
> > have author names corrected with accents/umlauts -- all
> > information that may not be available from another
> > representation of this resource.
>
> If you look at the examples you've given, this are situations
> where the user is forced to fix bugs (in, for example, original
> data, or Zotero translators). None of that data (save for one) is
> in any way user-specific.

I agree, but it's unrealistic to think that bibliographic software
will automatically fix *all* "bugs" after online record retrieval
(after all, one's bug may be another one's "feature"). In case of
translated titles and spelling of author names, automatic addition
or correction may not even be possible at all. Also, it's
unrealistic that software will provide options for *all* user-/
application-specific formatting cases (title case in titles being
one example). If a user regards his own (local) record as a perfect
fit, why not use it if available? To do so, a user's access info (db
base URL & user account) has to travel at least with the document,
and even better with the citation itself. (sorry, I'm repeating
myself, so I'll stop here.)

> > As a simple example, what would happen if a user copies a
> > citation from one document to another? Local access information
> > would be lost if not included within the actual citation
> > element. As a result, formatting in the new document may differ.
>
> I can only speak of how we are talking about it in ODF and OOo: if
> you copy a citation to another document, all the citation metadata
> -- including the URI -- gets copied as well. Copying a trusted
> source property should be easy too.

*If* metadata (including local access info) are stored alongside
the citation, then I agree that my issue is independent of the
issue of global vs local identifiers. But *solely* passing around
global identifiers will not work in many cases. This was my point.

Matthias

Bruce D'Arcus

May 14, 2007, 8:29:42 AM

to zotero-dev

On May 13, 8:46 am, Matthias Steffens
<matthias.steff...@googlemail.com> wrote:

...

> > If you look at the examples you've given, this are situations
> > where the user is forced to fix bugs (in, for example, original
> > data, or Zotero translators). None of that data (save for one) is
> > in any way user-specific.
>
> I agree, but it's unrealistic to think that bibliographic software
> will automatically fix *all* "bugs" after online record retrieval
> (after all, one's bug may be another one's "feature"). In case of
> translated titles and spelling of author names, automatic addition
> or correction may not even be possible at all. Also, it's
> unrealistic that software will provide options for *all* user-/
> application-specific formatting cases (title case in titles being
> one example). If a user regards his own (local) record as a perfect
> fit, why not use it if available?

I recognize where you're coming from on this, but to turn things
around and underline a point ....

The only reason a user has to maintain their own local data is because
our tools -- and the infrastructure in which they're enmeshed --
suck.

Does any user *really* want to maintain their own bibliographic
database? Or do they do it because they feel they must?

Whether it takes us 2, 5, 10, or 20 years to get beyond that is an
open question, but I prefer sooner rather than later ;-)

Bruce

Elena Razlogova

May 14, 2007, 8:53:58 AM

to zoter...@googlegroups.com

Bruce, Matt, Peter--

Just to remind you, this is not really true:

> Now almost all Zotero resources will have a URL, so why can't we use
> that as the ID, with the other strategies if that fails, and in the
> worst case at least try to use something like a version of the title.

and neither is this:

> The only reason a user has to maintain their own local data is because
> our tools -- and the infrastructure in which they're enmeshed --
> suck.

This is only true of online and catalogued sources. Many unpublished
sources, and most letters or interviews, have no online records, and it
is unrealistic to expect that every interview or letter be published on
the web. Many of these sources will remain local. The solution you're
debating here would only work for online material, not for archival
sources. Perhaps you must use URL, DOI, or ISBN when they exist, but
for other sources I don't see how one could avoid using something
more old-fashioned, like an auto-generated short reference, such as:
Author Last[, et al.]; Short Title [Item Type if no Title], Date; and
perhaps creation date/time to ensure a unique ID.
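
To show what such a short reference could look like, here is a rough
sketch. The exact separators, the "Anon." placeholder for items without
an author, and the timestamp format are my own assumptions.

```python
from datetime import datetime


def short_reference(authors: list, title: str, item_type: str,
                    date: str, created: datetime) -> str:
    """Build "Author Last[, et al.]; Short Title [or Item Type], Date; timestamp"."""
    if authors:
        who = authors[0] + (", et al." if len(authors) > 1 else "")
    else:
        who = "Anon."  # placeholder, an assumption for author-less items
    what = title or item_type  # fall back to the item type if there is no title
    # The creation timestamp is appended to make collisions unlikely.
    return f"{who}; {what}, {date}; {created:%Y%m%dT%H%M%S}"
```

The timestamp makes the ID unique within one database, but not
necessarily across two users who created items in the same second.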

Best,
Elena

Bruce D'Arcus

May 14, 2007, 11:45:37 AM

to zotero-dev

Hi Elena,

On May 14, 8:53 am, Elena Razlogova <elena.razlog...@gmail.com> wrote:

> This is only true of online and catalogued source--many unpublished
> sources, and most letters or interviews have no online records and it
> unrealistic to expect that every interview or letter be published on
> the web. Many of these sources will remain local. The solution you're
> debating here would only work for online material, not for archival
> sources.

I've done a lot of archival work, so I haven't forgotten about this
issue ;-)

To use a URI to identify something does not mean that resource has to
be "on the web." URIs are just global names. That someone puts
something at an online location that maps to a URI is just good
practice, but not necessary.

See:

<http://norman.walsh.name/2006/07/25/namesAndAddresses>

> Perhaps you must use URL, doi, or ISBN when they exist, but
> for other sources I don't see how one could avoid using something
> more old-fashioned, like an auto-generated short reference, such as:
> Author Last[, et. al]; Short Title [Item Type if no Title], Date; and
> perhaps creation date/time to ensure a unique ID.

There are different ways to deal with this, and these details bear
further discussion because this is indeed a little tricky.

It could be as simple as auto-generated URIs like:

http://zotero.org/resources/1234

... or:

urn:uuid:224ab023-77b8-4396-a75a-8cecd85b81e3

... or something else. I like the first one, BTW, since it's a smart
way to exploit the zotero.org infrastructure moving forward.

Bruce

Elena Razlogova

May 17, 2007, 7:17:09 AM

to zoter...@googlegroups.com

Hi Bruce--

I'm new to this so I'm still not sure I understand how you can
generate global URIs for objects you don't publish online.

> To use a URI to identify something does not mean that resource has to
> be "on the web." URIs are just global names. That someone puts
> something at an online location that maps to a URI is just good
> practice, but not necessary.

> It could be as simple as auto-generated URIs like:
>
> http://zotero.org/resources/1234

But how can you be sure that your URI is global if it is generated
and stored only in your local db? What are the chances that someone
using Zotero in China won't generate the same URI by accident? What
you're doing here is just autogenerating a unique ID (unique in your
local context, not globally) with a web address appended to it--a
fake URL that doesn't lead anywhere. What are the advantages of this
vs. just a local unique number ID?

Thanks,
Elena

Bruce D'Arcus

May 17, 2007, 11:02:12 AM

to zotero-dev

Hi Elena,

On May 17, 7:17 am, Elena Razlogova <elena.razlog...@gmail.com> wrote:

> I'm new to this so I'm still not sure I understand how you can
> generate global URIs for objects you don't publish online.

No problem.

Let's say you own a domain name: elena.net. You control that domain.
Therefore, you can invent any URIs you want. You can decide to put
some representation of those resources (say HTML pages) at those
locations, or not; it's up to you. A URI is just a name.

Obviously it's good to have something that one can access at the
location that the URI resolves to. But even more important is that you
create a context in which a large community of users will use the same
URI (name) to refer to the same thing (bibliographic resource).

> > To use a URI to identify something does not mean that resource has to
> > be "on the web." URIs are just global names. That someone puts
> > something at an online location that maps to a URI is just good
> > practice, but not necessary.
> >
> > It could be as simple as auto-generated URIs like:
> >
> > http://zotero.org/resources/1234
>
> But how can you be sure that your URI is global if it is generated
> and stored only in your local db? What are the chances that someone
> using Zotero in China won't generate the same URI by accident? What
> you're doing here is just autogenerating a unique ID (unique in your
> local context, not globally) with a web address appended to it--a
> fake URL that doesn't lead anywhere. What are the advantages of this
> vs. just a local unique number ID?

You're asking good questions about the practical details. There's no
doubt that if a resource has no globally unique ID, it becomes a
little more difficult to assign one automatically that is useful beyond
the local context.

In my example above, I was assuming a fairly smart server which assigns
the URI; not that it gets generated locally. You could also imagine
merging a local ID with a zotero.org based user URI; like:

http://zotero.org/users/doej/resources/1234

That would avoid the case of two users assigning the same URI, though
you then can have multiple URIs to refer to the same thing. So a
machine can ping the server for further information if it wants, or
perhaps there's also an HTML page there that gives user-readable
information about the resource.

That example is essentially a URI representation of the local ID, but
it has a number of benefits:

1. you tie the ID to the user (not the unidentified local database); I
use the same URI/ID on whatever machine I am working on
2. as I said, you can allow others to access information about the
resource
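
As a sketch of how that mapping from local ID to URI might look: the URL
pattern is just the one from my example above, not an existing
zotero.org service, and the function name is made up.

```python
from urllib.parse import quote

# Assumed base; the /users/<name>/resources/<id> layout is only the
# pattern from the example in this thread.
BASE = "http://zotero.org"


def user_resource_uri(username: str, local_id: int) -> str:
    """Combine a user name and a local item ID into a user-scoped URI."""
    return f"{BASE}/users/{quote(username, safe='')}/resources/{local_id}"
```

For user "doej" and local ID 1234 this gives
http://zotero.org/users/doej/resources/1234, which stays the same on any
machine as long as the user and the local ID do.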

Obviously when you have resources that do already have global IDs
(URIs, DOIs, ISBNs, etc.), the advantages over local IDs are much
greater, but even for those that don't, I think they are still there.

Another alternative is to locally generate a unique URI using a UUID.
That would not be resolvable, and you'd also have to deal with that
problem of merging potentially multiple URIs.
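
One possible way to handle the merging, purely as a sketch: keep a small
alias table that maps each URI to the one currently preferred for the
same resource. The class and method names are invented for the example.

```python
class UriAliases:
    """Track which URIs are known to name the same resource."""

    def __init__(self) -> None:
        self._preferred = {}  # alias URI -> URI preferred over it

    def canonical(self, uri: str) -> str:
        """Follow alias links until reaching a URI with no preferred form."""
        while uri in self._preferred:
            uri = self._preferred[uri]
        return uri

    def merge(self, alias: str, preferred: str) -> None:
        """Record that `alias` and `preferred` name the same thing."""
        a, p = self.canonical(alias), self.canonical(preferred)
        if a != p:  # skip no-op merges, which would otherwise create a loop
            self._preferred[a] = p
```

A citation can then keep whatever URI it was created with, and the
formatter asks for the canonical one at lookup time.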

I'm not saying I have all the answers to these questions, but I am
saying that they are important questions for us to collectively
answer, and that local IDs only are bad practice.

I want to repeat the goal here so we don't get tripped up on the
details:

If three users collaborate on a document, possibly using different
editing applications and bibliographic databases, the citations
should not break; they should remain live and updateable as the
document is passed around.