A proposal for fixing plugin problems with unstable IDs

Peter Sefton

May 17, 2007, 4:06:57 PM
to zoter...@googlegroups.com
Hi all,

I asked here about IDs particularly in relation to the word processor
plugin. As usual I trivialised some of the issues and I think the
ensuing discussion pointed out where I was oversimplifying. My
question now is what can we do to fix the worst of the holes without
waiting for large-scale changes to Zotero or indulging in continued
philosophizing.

The biggest issue I can see is that currently, if you use the plugin,
you can only format your doc from the same Zotero installation as
you used to create the refs - so this rules out sharing the document
with collaborators and makes the plugin a short-term proposition. That
is, you can use it to format a bibliography now, as long as your machine
doesn't crash and need to have things restored from backups, but not
in a few years from a different machine.

I'm going to put forward a concrete proposal for how we might fix
these issues in the short term. I'm looking at things we can do
without having to change large slabs of Zotero. Minimal changes that
will do the minimum amount of harm. This does not solve all the
problems or deal with the issue very elegantly, but I think it would
represent a step forward.

I propose that we re-engineer the plugin to use URLs as citation IDs
(with code to migrate the existing citations using the old unstable ID
system). For items that already have a URL this would be a simple
process.

For items that do not have a URL, create a URL which is a reference to
the local Zotero installation. Something like
http://localhost:zotero-port/id/{GUID} where {GUID} is a globally
unique identifier. That URL could be made to resolve to a web page
describing the item if you happen to have it in Zotero. Obviously when
making a bibliography we'd need to suppress this URL. Or it could be
stored in the notes field instead.
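
A minimal sketch of the ID rule described above, assuming an item is a
plain dict with an optional "url" key. The function name, the "guid" key
and the port argument are hypothetical and are not part of Zotero or the
plugin API; "zotero-port" above is a placeholder, so the port is left to
the caller.

```python
import uuid


def citation_id(item: dict, port: int) -> str:
    """Return a citation ID for an item: its URL, or a localhost GUID URL.

    `item` is a hypothetical dict-shaped record; `port` stands in for the
    unspecified "zotero-port" in the proposal.
    """
    url = (item.get("url") or "").strip()
    if url:
        # Items that already have a URL simply use it as their ID.
        return url
    # No URL: mint a GUID once and keep it on the record, so the same
    # item always yields the same ID on later calls.
    if "guid" not in item:
        item["guid"] = str(uuid.uuid4())
    return f"http://localhost:{port}/id/{item['guid']}"
```

The GUID is stored on the record rather than regenerated, since an ID
that changes between calls would bring back the instability this is
meant to fix.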

For the case where there's more than one record with the same URL,
while I understand that there might be use cases where this makes
sense, for most users this would be regarded as an error, and it would
be good to let the user know that the problem exists. One solution is
to encourage people to delete the 'spare' copies, but if they decide
to keep the duplicates, how about using a tag to help resolve the
conflict? If there are multiple records with the same ID, ask the user
to either pick a tag that specifies which ones are needed or just
accept that the result will be essentially random.
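
The duplicate handling could look roughly like this. It is a sketch
only: records are assumed to be dicts with hypothetical "id" and "tags"
keys, and the returned flag is just one way of letting the caller warn
the user.

```python
def resolve_citation(citation_id, records, preferred_tag=None):
    """Pick the record for a citation ID, using a tag to break ties.

    Returns (record, ambiguous). `ambiguous` is True when several records
    share the ID and the tag did not narrow them down to exactly one.
    """
    matches = [r for r in records if r.get("id") == citation_id]
    if not matches:
        return None, False
    if len(matches) == 1:
        return matches[0], False
    if preferred_tag is not None:
        tagged = [r for r in matches if preferred_tag in r.get("tags", ())]
        if len(tagged) == 1:
            return tagged[0], False
    # Still ambiguous: fall back to the first match and flag it, which is
    # the "essentially random" outcome the user has to accept.
    return matches[0], True
```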

If we had the above changes which are only to the plugin and plugin
API then I'd be able to share references with colleagues, resolve
problems caused by duplicate references and have some confidence that
I will be able to revisit my document in the future.

I've got enough Australian Government funding to cover these changes
or to help with more elaborate solutions, but we need to act NOW for
our project. Can anyone from the Zotero team comment on this proposal?
And the rest of this list, of course.

pt

--

Peter Sefton
Senior Research Fellow / RUBRIC Technical Manager
RUBRIC Project, DeC
University of Southern Queensland
Toowoomba Queensland 4350 AUSTRALIA


Work: sef...@usq.edu.au
Private: p...@ptsefton.com

p: +61 (0)7 4631 1640
m: +61 (0)410 326 955

RUBRIC Website: http://www.rubric.edu.au
USQ Website: http://www.usq.edu.au
Personal Website: http://ptsefton.com

RUBRIC is supported by the Systemic Infrastructure Initiative as part of
the Commonwealth Government's Backing Australia's Ability - An
Innovative Action Plan for the Future
(http://backingaus.innovation.gov.au)

The University of Southern Queensland is a registered provider of
education with the Australian Government.

(CRICOS Codes: QLD 00244B | NSW 02225M | VIC 02387D | WA 02521C)


Bruce D'Arcus

May 17, 2007, 4:46:19 PM
to zotero-dev

On May 17, 4:06 pm, "Peter Sefton" <ptsef...@gmail.com> wrote:

> I asked here about IDs particularly in relation to the word processor
> plugin. As usual I trivialised some of the issues and I think the
> ensuing discussion pointed out where I was oversimplifying. My
> question now is what can we do to fix the worst of the holes without
> waiting for large-scale changes to Zotero or indulging in continued
> philosophizing.

We can't avoid philosophizing a little bit in trying to come to a
practical and concrete solution though ;-)

...

> I propose that we re-engineer the plugin to use URLs as citation IDs
> (with code to migrate the existing citations using the old unstable ID
> system). For items that already have a URL this would be a simple
> process.

Here's my first practical philosophical question: what does it mean
for an item to "have a URL"?

This is an issue that has importance across the whole spectrum of what
Zotero does (translators, UI, citation formatting, data export), so
it's an important one.

Let me be concrete: if I save journal article metadata from a
publisher webpage, does that item then "have a URL" of that source
page?

Or a book: is it the URL for the Amazon web page the metadata was
scraped from? What happens if I get my metadata from Amazon and my
collaborator gets it from worldcat.org?

I still think, as my questions are hinting at, that we need
conventions that can accommodate some of the more widely deployed
global identifiers (particularly DOIs).

My preference would be to be fairly precise about this and say "a
resolvable online resource." This includes pretty much any document on
the web. For those, use its URL.

For books and journal articles that have global identifiers of their
own and where Zotero is scraping in essence second-hand metadata, we
just need some rules for that (basically, what to do with ISBNs and
DOIs).

> For items that do not have a URL, create a URL which is a reference to
> the local zotero installation. Something like
> http://localhost:zotero-port/id/{GUID} where {GUID} is a globally
> unique Identifier. That URL could be made to resolve to a web page
> describing the item if you happen to have it Zotero. Obviously when
> making a bibliography we'd need to suppress this URL. Or it could be
> stored in the notes field instead.

My second, more practical question is this: what value does the
localhost URL have here, as opposed, say, to a URN (urn:uuid:[GUID]),
or even to using the zotero.org domain
(http://zotero.org/resources/id/[GUID])? Just that you could rework
Zotero to directly resolve it?

Bruce

Peter Sefton

May 17, 2007, 7:40:11 PM
to zoter...@googlegroups.com


On 5/18/07, Bruce D'Arcus <bdarcu...@gmail.com> wrote:

> We can't avoid philosophizing a little bit in trying to come to a
> practical and concrete solution though ;-)

I knew you would anyway :-)



> Here's my first practical philosophical question: what does it mean
> for an item to "have a URL"?

The practical approach? Use whatever happens to be in the URL field in
a Zotero record. At this stage the best we can do is to write a 'how
to' guide for users who care about doing things well; that could point
them to the best services and show them how to use things like DOIs.



>
> My second more practical question is this: what value does the
> localhost URL have here, as opposed, say, to a urn (urn:uuid:[GUID]),
> or even to use the zotero.org domain (http://zotero.org/resources/id/
> [GUID]? Just that you could rework Zotero to directly resolve it?

No value at all. Just a stab in the dark. urn:uuid:[GUID] is better I
think, and having something at zotero.org might be better still, but do
the Zotero people have such ambitions? Would the URL resolve to
something at Zotero? Who would maintain the record?

Bottom line is if there are GUIDs involved and a standard way of
expressing them then you can change how they are referenced later.

pt


> Bruce

Dan Stillman

May 17, 2007, 8:01:53 PM
to zoter...@googlegroups.com
On 5/17/07 7:40 PM, Peter Sefton wrote:
> The practical approach? Use whatever happens to be in the URL field in
> a Zotero record. At this stage the best we can do is to write a 'how
> to' guide for users who care about doing things well; that could point
> them to the best services and show them how to use things like DOIs

Well, that's not really the best we can do. We just need to decide what
exactly it means to have a value in the URL field and then change which
translators save a URL in the URL field (and/or update accessDate) and
which attach a web link. For example, we changed the Amazon translator
several months ago to create web link attachments rather than to use the
URL field, since Amazon basically just has metadata. The NYTimes
translator, on the other hand, uses the URL field.

Trickier are sites like JSTOR that don't represent any sort of canonical
source but still offer retrievable versions of the reference. Bruce, are
you saying you think an item from JSTOR should use the URL field or
should not? The latter, I assume, since presumably all references there
have global identifiers of their own and can be found in other places?
At the moment for some reason we store both a URL field value and a web
link attachment, which is certainly wrong.

I'll comment on some of the other issues later.

Bruce D'Arcus

May 17, 2007, 8:29:26 PM
to zotero-dev
On May 17, 7:40 pm, "Peter Sefton" <ptsef...@gmail.com> wrote:

> On 5/18/07, Bruce D'Arcus <bdarcus.li...@gmail.com> wrote:

> > Here's my first practical philosophical question: what does it mean
> > for an item to "have a URL"?
>
> The practical approach? Use whatever happens to be in the URL field in
> a Zotero record.

But this field is already an existing source of problems that needs
some fixing. You already see users complaining on the forums when URLs
incorrectly show up in their bibliographic entries.

If the translators are tweaked to only store resolvable (HTTP) URIs
for the document per se (in short, online resources; newspaper and
magazine articles, blog posts, interview transcripts, etc.) in that
field and to grab DOIs wherever possible, that would help resolve both
issues: correct citation formatting, and clear identification.

Bruce

Bruce D'Arcus

May 17, 2007, 8:35:02 PM
to zotero-dev
Oops, seems we were posting at the same time ...

On May 17, 8:01 pm, Dan Stillman <dstill...@zotero.org> wrote:
> On 5/17/07 7:40 PM, Peter Sefton wrote:
>
> > The practical approach? Use whatever happens to be in the URL field in
> > a Zotero record. At this stage the best we can do is to write a 'how
> > to' guide for users who care about doing things well; that could point
> > them to the best services and show them how to use things like DOIs
>
> Well, that's not really the best we can do. We just need to decide what
> exactly it means to have a value in the URL field and then change which
> translators save a URL in the URL field (and/or update accessData) and
> which attach a web link. For example, we changed the Amazon translator
> several months ago to create web link attachments rather than to use the
> URL field, since Amazon basically just has metadata. The NYTimes
> translator, on the other hand, uses the URL field.

Bingo; that's what I was meaning.

> Trickier are sites like JSTOR that don't represent any sort of canonical
> source but still offer retrievable versions of the reference. Bruce, are
> you saying you think an item from JSTOR should use the URL field or
> should not? The latter, I assume, since presumably all references there
> have global identifiers of their own and can be found in other places?
> At the moment for some reason we store both a URL field value and a web
> link attachment, which is certainly wrong.

The JSTOR example is a little tricky, but if we partly look at it from
the standpoint of citation formatting (at least as it exists now as it
slowly accommodates the web better), then I guess it should be a web-
link, and so not stored in the URL field.

I seem to notice a lot of translators miss the DOIs for articles, BTW.
That's not good.

But this is the kind of discussion of details I was wanting. It might
be worth creating a wiki page with different examples at some point?

Bruce

Elena Razlogova

May 17, 2007, 10:46:59 PM
to zoter...@googlegroups.com
Hi Dan et al.--

> For example, we changed the Amazon translator
> several months ago to create web link attachments rather than to
> use the
> URL field, since Amazon basically just has metadata.

From a user perspective basically what this means is that in order
to get to an Amazon record, I have to double-click on a reference to
expand it, then click again on the web link to see the page. That's
two extra clicks I didn't have to do before. It had been much more
convenient to just click on "URL:" tag to go to the website. By the
way, it's not just metadata--Amazon includes reader comments on the
book that change over time so it is completely reasonable to want to
keep a live link to the page.

My question is this: In this new system of unique identifiers, will I
be allowed to paste the Amazon link into the URL field to make life
more convenient for me again (this is what I do now)? It seems to me
there should be some field where a user can paste whatever URL
happens to be convenient to access their document--and one can
reasonably have a preference for one specific instance of multiple
copies of the same document available online (i.e. a bartleby.com or
a gutenberg.org online copy of a book)--without the inconvenience of
two extra clicks and an extra "web link" child item that takes up
extra space better used for a research note.

It seems from the discussion that eventually this identifier field
will be automatically generated by a central Zotero server (although
I think the plan was also to give libraries options to create their
own servers which then complicates things). Would it be possible to
create a URI or whatever field that generates these automatic URI IDs
and leave the URL field for the user to control?

Thanks,
Elena

Josh Greenberg

May 17, 2007, 11:02:20 PM
to zoter...@googlegroups.com
> My question is this: In this new system of unique identifiers, will I
> be allowed to paste the Amazon link into the URL field to make life
> more convenient for me again (this is what I do now)? It seems to me
> there should be some field where a user can paste whatever URL
> happens to be convenient to access their document--and one can
> reasonably have a preference for one specific instance of multiple
> copies of the same document available online (i.e. a bartleby.com or
> a gutenberg.org online copy of a book)--without the inconvenience of
> two extra clicks and an extra "web link" child item that takes up
> extra space better used for a research note.

I don't know if I'm overthinking this (I do love the philosophizin'),
but isn't what Elena's describing here a question that deals not with
properties of the item itself, but rather with resolving the abstract
metadata record for an item (say, a book) to a particular incarnation
of it (an Amazon page for that book)?

The deeper problem is that Elena's having to hack the URL field
because there isn't an openURL resolver for Amazon (which presumably
is what she'd actually want, so that the "Locate" button did exactly
what she's describing).

Now, if only we could easily stitch together custom, personalized
openURL resolvers ("I want to find books at NYPL, and if they don't
exist there from Amazon..."), we could leave the URL field alone to
be no more than what it is, a pointer to the thing itself.

(However, if what you're interested in is the Amazon page itself,
with its reviews, etc., then what you'd actually want is a webpage
item with that page's metadata and its URL, 'cause you're interested
in that page as an independent thing, rather than as a proxy for the
book it's selling. There's probably a relationship to be drawn
between that page and its related book item, but that's a whole other
conversation)...

- Josh

Dan Stillman

May 18, 2007, 3:32:55 AM
to zoter...@googlegroups.com
On 5/17/07 11:02 PM, Josh Greenberg wrote:
> The deeper problem is that Elena's having to hack the URL field
> because there isn't an openURL resolver for Amazon (which presumably
> is what she'd actually want, so that the "Locate" button did exactly
> what she's describing).
>
> Now, if only we could easily stitch together custom, personalized
> openURL resolvers ("I want to find books at NYPL, and if they don't
> exist there from Amazon..."), we could leave the URL field alone to
> be no more than what it is, a pointer to the thing itself.

This is a pretty great idea. At the risk of getting pretty far
off-topic, what it suggests is that, in addition to site translators,
we need site *search* translators that would generate a site-specific
search URL for a given item for sites that didn't support OpenURL.
(Alternatively, they could convert an OpenURL URL into a site-specific
search, but that's probably not necessary. The main benefit would be the
potential for external reuse of the search translators.) This feature
would be roughly equivalent to Firefox's Keyword Search mechanism,
though it might need to be a bit more flexible and use small JS
functions rather than just strings with placeholders.
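
As a rough illustration of the simpler "strings with placeholders"
variant (not the small-JS-function variant), a site search translator
could be little more than a URL template filled from item fields. The
site names and template URLs below are made-up examples, not real
translators or real search endpoints.

```python
from urllib.parse import quote_plus

# Hypothetical templates: one per site, with {field} placeholders.
SEARCH_TEMPLATES = {
    "example-library": "https://catalog.example.org/search?isbn={isbn}",
    "example-shop": "https://shop.example.com/s?q={title}+{creator}",
}


class _BlankIfMissing(dict):
    """Dict that yields '' for fields the item does not have."""

    def __missing__(self, key):
        return ""


def search_url(site: str, item: dict) -> str:
    """Build a site-specific search URL for an item."""
    template = SEARCH_TEMPLATES[site]
    # URL-encode every field value before substituting it into the template.
    fields = _BlankIfMissing({k: quote_plus(str(v)) for k, v in item.items()})
    return template.format_map(fields)
```

A template-only design cannot do things like pick the best identifier
available or reformat a date, which is presumably where small functions
would be needed instead.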

Does OpenURL provide a standard mechanism for the client to determine if
any results were found? It seems that would be necessary for Josh's
"stitch together" request, though user-created site search translators
could certainly do this by running a detectWeb()-like function (or even
detectWeb() itself) on the response.

I don't know that chaining would even be necessary, though. With the
ability to create custom resolvers, you could just have a drop-down menu
from the Locate button that let you find the item in the site of your
choosing.

(Elena, as a quick response, while I understand the practical issues, I
don't think that limitations of the current UI should influence these
sorts of core design questions. Ultimately, a NYT URL isn't the same as
an Amazon URL, and while nothing will prevent you from pasting an Amazon
URL into the field, various parts of Zotero (the Word plugin,
collaborative features, etc.) might make assumptions based on the nature
of the URL it expects to find in the field. But the point is taken, and
as Josh's post suggests, I think we can come up with an even better
solution.)

Matthias Steffens

May 18, 2007, 4:24:18 AM
to zoter...@googlegroups.com
On 18-May-07 at 03:32 -0400 Dan Stillman wrote:

> > Now, if only we could easily stitch together custom, personalized
> > openURL resolvers ("I want to find books at NYPL, and if they don't
> > exist there from Amazon..."), we could leave the URL field alone to
> > be no more than what it is, a pointer to the thing itself.
>
> This is a pretty great idea. At the risk of getting pretty far
> off-topic, what it suggests is that, in addition to site translators,
> we need site *search* translators that would generate a site-specific
> search URL for a given item for sites that didn't support OpenURL.

AFAIK, OpenURL is designed and meant to resolve to a *single*
resource, while standard search protocols such as SRU and OpenSearch
are meant to retrieve multiple hits.

<http://www.loc.gov/standards/sru/>
<http://www.opensearch.org/>

From what you're describing, I think that OpenSearch would be the tool
that can describe site-specific search URLs in a standard way.
AFAIK, there are even Firefox plugins that can generate OpenSearch
description documents (which define the site-specific search URLs)
from a site that doesn't provide OpenSearch functionality itself.

> Does OpenURL provide a standard mechanism for the client to
> determine if any results were found? It seems that would be
> necessary for Josh's "stitch together" request

For sites that support OpenSearch, this is provided in the
OpenSearch response (which is a RSS 2.0 or Atom 1.0 feed).
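
A sketch of how a client might read the hit count from such a response,
assuming the feed carries the optional totalResults element from the
OpenSearch 1.1 response namespace. The function name is made up, and
feeds that omit the element simply give no answer.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch 1.1 response elements.
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def total_results(feed_xml: str):
    """Return the reported number of hits, or None if the feed does not say."""
    root = ET.fromstring(feed_xml)
    tag = "{" + OPENSEARCH_NS + "}totalResults"
    # iter() searches at any depth, so this works whether the element sits
    # under <channel> (RSS 2.0) or directly under <feed> (Atom 1.0).
    for element in root.iter(tag):
        text = (element.text or "").strip()
        return int(text) if text.isdigit() else None
    return None
```

A chained lookup could then treat `total_results(...) == 0` as "not
found here, try the next site".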

Matthias

Matthias Steffens

May 18, 2007, 4:40:47 AM
to zoter...@googlegroups.com
On 17-May-07 at 22:46 -0400 Elena Razlogova wrote:

> > For example, we changed the Amazon translator several months ago
> > to create web link attachments rather than to use the URL field,
> > since Amazon basically just has metadata.
>
> From a user perspective basically what this means is that in order
> to get to an Amazon record, I have to double-click on a reference to
> expand it, then click again on the web link to see the page. That's
> two extra clicks I didn't have to do before. It had been much more
> convenient to just click on "URL:" tag to go to the website. By the
> way, it's not just metadata--Amazon includes reader comments on the
> book that change over time so it is completely reasonable to want to
> keep a live link to the page.

> It seems to me there should be some field where a user can paste
> whatever URL happens to be convenient to access their document

I'm with Elena here. From a user perspective it would be very
confusing to me if I'm not allowed to paste into the URL field
whatever URL I'd like to choose. This would mean that the URL field
becomes essentially useless as a general purpose field, which, IMHO,
is not a desirable design.

There are many cases where a URL to a constantly updated web page
is more useful than a locally stored version of it. And this URL
should be easily accessible from within the interface.

> Would it be possible to create a URI or whatever field that
> generates these automatic URI IDs and leave the URL field for the
> user to control?

I also think that eventually a separate URI field would be more
appropriate for storing & dealing with URLs that are meant to be
unique identifiers. This would allow users to paste, say, an Amazon
URL into the URL field, while the URI field could still contain a
WorldCat identifier (or whatever appropriate).

Matthias

Dan Stillman

May 18, 2007, 4:58:37 AM
to zoter...@googlegroups.com
On 5/18/07 4:24 AM, Matthias Steffens wrote:
> From what you're describing, I think that OpenSearch would be tool
> that can describe site-specific search URLs in a standard way.
> AFAIK, there are even Firefox plugins that can generate OpenSearch
> description documents (which define the site-specific search URLs)
> from a site that doesn't provide OpenSearch functionality itself.

OK, so, clearly a good enough idea that it already exists, built in to
Firefox, with a large number of sites already supported...

So, perhaps the best way to do this would be to just use the existing
Mozilla functionality, with users adding search engines via the "Manage
Search Engines..." interface rather than through Zotero and then simply
selecting within the Zotero prefs which installed engines should show up
under Locate.

While this wouldn't be exactly analogous to OpenURL, most smart search
engines (like Amazon) will go directly to the item page itself if passed
a unique identifier.

Dan Stillman

May 18, 2007, 5:40:17 AM
to zoter...@googlegroups.com
On 5/18/07 4:40 AM, Matthias Steffens wrote:
> I also think that eventually a separate URI field would be more
> appropriate for storing & dealing with URLs that are meant to be
> unique identifiers. This would allow users to paste, say, an Amazon
> URL into the URL field, while the URI field could still contain a
> WorldCat identifier (or whatever appropriate).

No one is proposing stopping the user from being able to type or paste
what they like into the URL field, but that doesn't mean it's good
practice or wouldn't cause unwanted behavior elsewhere in Zotero.
Pasting in an Amazon URL shouldn't be necessary if we use something like
OpenSearch to provide a generalized way of looking up resources. For
everything else there are web links, and while the interface for web
links perhaps isn't ideal for everyone at the moment (feel free to offer
suggestions), conceptually they seem more appropriate for what we're
talking about--links to arbitrary pages on the web that are relevant but
not distinct enough to be separate top-level items.

This isn't just a question of having a URI. (I don't think that Zotero
needs to or even should expose the URI it uses for items, since that
will presumably be based on some algorithm derived from the other
fields.) One place the current inconsistent behavior is problematic is
in citations. At the moment, if there's a value in the URL field and an
access date, Zotero generates a bibliography containing "Retrieved from
http://www.example.com" or the like. This is generally desirable and
necessary with a web document, but probably annoying and unwanted with
journal articles pulled from a database. If the URL field was only used
for web documents and not for links to secondary database catalogs, this
would be far less of an issue.
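
The rule described above boils down to something like the following
sketch. The field names and the entry layout are simplified
assumptions, not Zotero's actual citation code.

```python
def format_entry(item: dict) -> str:
    """Format a very simplified bibliography entry for an item."""
    parts = [f"{item['creator']}. {item['title']}."]
    url = item.get("url")
    accessed = item.get("accessDate")
    # The URL is appended only when both a URL and an access date exist,
    # regardless of what kind of page the URL actually points to.
    if url and accessed:
        parts.append(f"Retrieved from {url}")
    return " ".join(parts)
```

Because the rule looks only at whether the fields are filled, a journal
article carrying a database URL gets the same "Retrieved from" suffix
as a genuine web document.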

(Of course, you could also argue that it's often unnecessary to include
a URL for something like an article from a magazine website and that the
citation behavior needs to offer some sort of preference or
item-specific toggle anyway, but at least the default assumption would
be much more frequently correct if the URL field was used consistently...)

Bruce D'Arcus

May 18, 2007, 9:27:12 AM
to zotero-dev
On May 18, 4:58 am, Dan Stillman <dstill...@zotero.org> wrote:

...

> So, perhaps the best way to do this would be to just use the existing
> Mozilla functionality, with users adding search engines via the "Manage
> Search Engines..." interface rather than through Zotero and then simply
> selecting within the Zotero prefs which installed engines should show up
> under Locate.
>
> While this wouldn't be exactly analogous to OpenURL, most smart search
> engines (like Amazon) will go directly to the item page itself if passed
> a unique identifier.

I like the direction of this discussion.

Just a few random things to keep sight of:

First, I wonder how this fits with the discussion Matthias and I were
earlier having about in essence trusted (metadata) sources? Alf Eaton
has also thought about this idea (of not always having to have *one*
centralized database).

Maybe it's just that the "search translators" (to use Dan's new term)
might be classified, with certain kinds of sources being metadata
sources, and others perhaps just providing access to the items per
se.

I've not thought about this in any detail; just noting it for
consideration.

Second, the mention of an RSS/Atom based search service reminds me
again that I'd like to use a similar syndication model for CSL
stores, where perhaps users could subscribe to them as they would a
news feed. As in Atom, CSL files are identified by URI.

Again, not something worth digressing on, but did want to note it.

But to get back on topic, where does this leave us WRT Peter's
proposal?

Do we agree:

1) Citations in document should be identified by URI?

2) We should constrain the URL field to dereferenceable resources?
Where present, this is the URI.

3) If a DOI is present, use the DOI in INFO URI form?

4) If an ISBN is present, use it in URN form?

5) If no standard ID, generate a UUID and use it in URN form?

This list is a strawman of course, but it ought to at least give us
something to work with.
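
Read as a precedence order, the strawman might be sketched like this.
The dict keys and function name are assumptions for illustration; the
info:doi/ and urn:isbn: prefixes follow the INFO URI and URN forms
named in the list.

```python
import uuid


def item_uri(item: dict) -> str:
    """Choose a URI for an item following the strawman's order."""
    url = (item.get("url") or "").strip()
    if url:
        # 2) A dereferenceable URL, where present, is the URI.
        return url
    doi = (item.get("doi") or "").strip()
    if doi:
        # 3) DOI in INFO URI form.
        return "info:doi/" + doi
    isbn = (item.get("isbn") or "").strip()
    if isbn:
        # 4) ISBN in URN form.
        return "urn:isbn:" + isbn
    # 5) No standard ID: generate a UUID once, keep it with the item,
    # and use it in URN form.
    if "uuid" not in item:
        item["uuid"] = str(uuid.uuid4())
    return "urn:uuid:" + item["uuid"]
```

One consequence of this order is that the same article can get
different URIs in two libraries, depending on whether the URL field was
filled in; whether that is acceptable is part of what the list is
asking.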

BTW, ultimately whatever we come up with ought to be documented at and
in consultation with the new biblio ontology project, so that it's
reflected in the RDF.

Bruce

Elena Razlogova

May 18, 2007, 4:18:46 PM
to zoter...@googlegroups.com
Dan et al.--

Josh's OpenSearch idea, as modified during discussion, sounds great.
On this:

> At the moment, if there's a value in the URL field and an
> access date, Zotero generates a bibliography containing "Retrieved
> from
> http://www.example.com" or the like. This is generally desirable and
> necessary with a web document, but probably annoying and unwanted with
> journal articles pulled from a database. If the URL field was only
> used
> for web documents and not for links to secondary database catalogs,
> this
> would be far less of an issue.

> (Of course, you could also argue that it's often unnecessary to
> include
> a URL for something like an article from a magazine website and
> that the
> citation behavior needs to offer some sort of preference or
> item-specific toggle anyway, but at least the default assumption would
> be much more frequently correct if the URL field was used
> consistently...)

Perhaps Zotero could have a preference for including URLs in
citations but, as a default, set the preference to include them in
web documents only (that's how Scribe is set up).

On this:

> No one is proposing stopping the user from being able to type or paste
> what they like into the URL field, but that doesn't mean it's good
> practice or wouldn't cause unwanted behavior elsewhere in Zotero.

I assume you're already planning this, but if Zotero will act on the
data in URL field in particular ways, perhaps there should be some
kind of validation on user data entry. Presumably most of the time
entry in URL field would be automatic, but people make all sorts of
mistakes that are hard to predict--for example, the other day a user
entered "Boston" in a Date field instead of Publisher field and as a
result his Zotero RDF import didn't work.

Best,
Elena

Dan Stillman

May 18, 2007, 7:31:49 PM
to zoter...@googlegroups.com
On 5/18/07 4:18 PM, Elena Razlogova wrote:

...

> I assume you're already planning this, but if Zotero will act on the
> data in URL field in particular ways, perhaps there should be some
> kind of validation on user data entry. Presumably most of the time
> entry in URL field would be automatic, but people make all sorts of
> mistakes that are hard to predict--for example, the other day a user
> entered "Boston" in a Date field instead of Publisher field and as a
> result his Zotero RDF import didn't work.

I thought we determined that the "Boston" wasn't the problem--I'm able
to put "Boston" into dc:date in an RDF file and import it fine--but
rather "1997-98", which was failing due to a Zotero date-parsing bug...

Point taken, though. The URL field probably should make sure a valid URL
is entered. In addition to input validation, I've wanted for a while a
sort of preflight check when generating a bibliography that would flag
potential errors with using the selected items with the selected style,
check to make sure all necessary fields had values, etc.
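
A small sketch of both ideas, input validation for the URL field and a
preflight pass over the selected items. The function names, the
http/https-only rule and the shape of the problem list are assumptions
made for illustration.

```python
from urllib.parse import urlsplit


def is_valid_url(value: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs (e.g. a broken IPv6 host).
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def preflight(items, required_fields):
    """Return (item index, problem) pairs found before generating output."""
    problems = []
    for index, item in enumerate(items):
        for field in required_fields:
            if not item.get(field):
                problems.append((index, f"missing {field}"))
        url = item.get("url")
        if url and not is_valid_url(url):
            problems.append((index, "invalid url"))
    return problems
```

In practice the list of required fields would depend on the selected
style; here it is simply passed in by the caller.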

Dan Stillman

May 18, 2007, 9:25:17 PM
to zoter...@googlegroups.com
On 5/18/07 9:27 AM, Bruce D'Arcus wrote:

> Do we agree:
>
> 1) Citations in document should be identified by URI?
>
> 2) We should constrain the URL field to dereferenceable resources?
> Where present, this is the URI.
>
> 3) If a DOI is present, use the DOI in INFO URI form?
>
> 4) If an ISBN is present, use it in URN form?
>
> 5) If no standard ID, generate a UUID and use it in URN form?
>

There may be value in just using zotero.org URLs for the URIs. That'd be
my inclination, though I haven't fully thought through the pros and
cons. But since the goal would be for every item the server knows about
to have its own page with metadata anyway, there might not be any great
reason not to.

To get us down to specifics, here's one possible (and probably deeply
flawed) scenario:

1) I add an item to my library with a DOI.

2) My Z client passes the DOI to Z server. Z server looks the DOI up in
its tables and gives back a zotero.org URI with a GUID, creating a new
one if necessary. My Z client stores the URI with the item.

4) I cite the item in Word, and the Word plugin saves the zotero.org URI
with the field.

5) I send the document to Bruce for edits.

6) Bruce's Zotero OOo plugin goes to update fields (OK, bookmarks) and
asks Zotero for the metadata for the item with the specified zotero.org
URI. Bruce's Z client checks for any existing items with the z.org URI.
Just to make things more difficult, Bruce's Z client doesn't have the
item, so it passes the URI to the Z server.

7) Z server looks up the GUID from the zotero.org URI, finds the DOI and
any other unique identifiers associated with the URI, and passes them
back to Bruce's Z client.

8) Bruce's Z client now has some unique identifiers and queries CrossRef
or some other source for the rest of the metadata (or maybe the Z server
has already done this and passes it back). The Z client creates a new item.

9) Bruce now automatically has the item in his library just from being
passed a Word document with a citation.
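
The scenario above can be sketched with an in-memory stand-in for the Z server. Everything here is an assumption for illustration: the class names, the `http://zotero.org/id/` prefix, and the `fetch_metadata` callback that stands in for a CrossRef-style lookup. None of it describes an existing Zotero API.

```python
import uuid


class ZServer:
    """In-memory stand-in for the hypothetical Z server."""

    def __init__(self):
        self.uri_by_identifier = {}   # e.g. "doi:10.x/y" -> zotero.org URI
        self.identifiers_by_uri = {}  # zotero.org URI -> set of identifiers

    def uri_for(self, identifier: str) -> str:
        """Step 2: look up the identifier, minting a URI if necessary."""
        uri = self.uri_by_identifier.get(identifier)
        if uri is None:
            uri = "http://zotero.org/id/" + str(uuid.uuid4())
            self.uri_by_identifier[identifier] = uri
            self.identifiers_by_uri[uri] = {identifier}
        return uri

    def identifiers_for(self, uri: str) -> set:
        """Step 7: return the identifiers associated with a URI."""
        return set(self.identifiers_by_uri.get(uri, ()))


class ZClient:
    """Stand-in for a Z client with a local library keyed by URI."""

    def __init__(self, server, fetch_metadata):
        self.server = server
        self.fetch_metadata = fetch_metadata  # identifier -> dict or None
        self.items = {}

    def add_item(self, identifier: str, metadata: dict) -> str:
        """Steps 1-2: add an item and store the server's URI with it."""
        uri = self.server.uri_for(identifier)
        self.items[uri] = metadata
        return uri

    def resolve_citation(self, uri: str):
        """Steps 6-9: use the local item, else ask the server and fetch."""
        if uri in self.items:
            return self.items[uri]
        for identifier in sorted(self.server.identifiers_for(uri)):
            metadata = self.fetch_metadata(identifier)
            if metadata is not None:
                self.items[uri] = metadata  # step 9: item joins the library
                return metadata
        return None
```

In this sketch the second client ends up with the item in its library after a single `resolve_citation` call, which is the effect described in step 9.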


Some problems/questions with the above scenario:

A) I'm not sure where the data for #7, linking various unique
identifiers, would come from. Is there a good source for this? This
dataset might emerge naturally as users begin to sync data with the
server, but we're not quite at that stage yet.

B) As mentioned above, using Zotero URIs may not be necessary. One
benefit is that, if Bruce had the same item in his library but with only
an ISBN and not a DOI, the item should already have a Zotero URI and the
Z client wouldn't need to contact the server in #6. One downside, of
course, is that the Z client would need to contact the Z server when
first adding the item. (To handle cases where there was no network
connection or the Z server was unavailable, Zotero and the Word plugin
could still support using a URL, DOI URN, etc., and simply look up a
zotero.org URI later when there was a connection.) Now, the Zotero
server could still keep track of associated unique IDs even without
assigning a URI to them, but it would need to generate a GUID internally
anyway, and that plus zotero.org/something/ is really all the URI would be.

C) If the item has no unique identifier, Zotero might try to query
CrossRef or some other source for metadata and a unique identifier. It
also might try to query the Zotero server, which perhaps has stored
metadata from previous queries on unique identifiers. But assuming it
doesn't find an existing record, does it store metadata from the item on
the Z server, associated with the GUID? It could then link up with an
identical or very similar item from another user, but what happens if
one of the fields is changed? Ultimately, anything involving syncing
metadata from the client is problematic until the Z server has accounts,
authentication, permissions, etc., and the Z client has mechanisms for
controlling access to synced items. Then a user could "publish" a new
item and have administrative control over it, and changes would be
handled via the permissions system. But until then, perhaps items
without authoritative unique identifiers would just need to be exchanged
manually, which would at least make it possible to share a document
across multiple computers or with a colleague...

Thoughts on any of the above appreciated...

- Dan

Matthias Steffens

May 19, 2007, 6:23:24 AM5/19/07
to zoter...@googlegroups.com
Dan, thanks for the detailed outline of a possible Zotero server
workflow. I think your scenario sounds reasonable. However, I'd
prefer if this scenario would be a bit more open, i.e. not entirely
designed around (and thus dependent on) a central Zotero server.
IMHO, it would be beneficial for everybody if this system
optionally allowed for decentralized record retrieval. This would also
solve my issues of trusted server sources discussed earlier on this
list.

That said, I wonder whether multiple OpenSearch URL templates
defined in a user's Firefox/Zotero client could also be used to
establish a more distributed system of resolving identifiers. It
would also allow Zotero to query several servers in the user's
individual order of preference.

For each citation extracted from a Word/OOo document, Zotero could
walk a list of defined OpenSearch URL templates[1]:

http://zotero.org/?q={searchTerms}
http://server2.net/?query={searchTerms}
http://server3.com/search.php?q={searchTerms}

As an example, for a record with a DOI identifier (say
info:doi:10.1029/95JC02188), Zotero could resolve this to:
(URL encoding omitted)

http://zotero.org/?q=info:doi:10.1029/95JC02188
http://server2.net/?query=info:doi:10.1029/95JC02188
http://server3.com/search.php?q=info:doi:10.1029/95JC02188

and query each server in the user's given order of preference. As in
the example above, Zotero's own server could be listed first by
default. As soon as a URL request resolves to a single record, the
Zotero client would stop and use the fetched IDs/metadata/etc for
further processing. I.e., in most cases, the Zotero client would
just query the Zotero server and be done with it.
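
A rough sketch of walking the template list, this time with the URL encoding applied. The `fetch` callback is a hypothetical stand-in for the HTTP request plus response parsing; it is assumed to return a list of matching records.

```python
from urllib.parse import quote

# Templates in the user's order of preference (same as listed above).
TEMPLATES = [
    "http://zotero.org/?q={searchTerms}",
    "http://server2.net/?query={searchTerms}",
    "http://server3.com/search.php?q={searchTerms}",
]


def expand(template: str, search_terms: str) -> str:
    """Fill {searchTerms} with the percent-encoded identifier."""
    return template.replace("{searchTerms}", quote(search_terms, safe=""))


def resolve(identifier: str, templates, fetch):
    """Query each server in order; stop at the first single-record hit.

    `fetch` is a hypothetical callable taking a URL and returning a
    list of records (empty if nothing was found).
    """
    for template in templates:
        records = fetch(expand(template, identifier))
        if len(records) == 1:
            return records[0]
    return None
```

With the Zotero server listed first, most lookups would end after the first request, and the later servers are only contacted when it returns nothing or an ambiguous result.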

For other records with ISBN or Zotero GUID identifiers, the above
queries could look like:

http://zotero.org/?q=urn:isbn:0415948738
http://server2.net/?query=urn:isbn:0415948738
http://server3.com/search.php?q=urn:isbn:0415948738

http://zotero.org/?q=http://zotero.org/whatever/8e077da7-8f98-49cc-841b-e95632d05414
http://server2.net/?query=http://zotero.org/whatever/8e077da7-8f98-49cc-841b-e95632d05414
http://server3.com/search.php?q=http://zotero.org/whatever/8e077da7-8f98-49cc-841b-e95632d05414

As a last resort (if nothing was found by the above), the Zotero
client could use the record's metadata (such as the title) to
perform similar OpenSearch searches.

Another possibility would be to agree on using CQL[2] search syntax
within the query string, like:
(URL encoding again omitted)

http://zotero.org/?q=rec.identifier any info:doi:10.1029/95JC02188 urn:isbn:0415948738 ...
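
A small sketch of building such a query with the URL encoding included. It assumes the `rec.identifier` index from the example above and wraps the space-separated identifiers in double quotes, since CQL treats an unquoted space as the end of a term. The function names are illustrative only.

```python
from urllib.parse import urlencode


def cql_identifier_query(identifiers) -> str:
    """Build a CQL 'any' query matching any of the given identifiers."""
    for identifier in identifiers:
        # Keep the sketch simple: refuse values that would need escaping.
        if '"' in identifier or any(ch.isspace() for ch in identifier):
            raise ValueError("unsupported identifier: " + repr(identifier))
    return 'rec.identifier any "' + " ".join(identifiers) + '"'


def search_url(base_url: str, identifiers) -> str:
    """Append the encoded CQL query as the q parameter of base_url."""
    return base_url + "?" + urlencode({"q": cql_identifier_query(identifiers)})
```

One request then covers every known identifier for the record, instead of one request per identifier.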

So, in other words, I understand that it's tempting to build the
Zotero sharing system entirely around a centralized Zotero server
using Zotero GUIDs as the *main* identifiers. However, this would
not allow other servers to be included as sources for retrieving
record IDs or bibliographic metadata.

I'd like to stress that my above scenario does not exclude a central
Zotero server and Zotero GUIDs -- both would still be needed as
outlined in your previous post. It would just introduce an
additional level of flexibility when resolving records.

Matthias

[1]: <http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_URL_template_syntax>
[2]: <http://www.loc.gov/standards/sru/cql/index.html>

Bruce D'Arcus

May 19, 2007, 7:33:28 AM5/19/07
to zotero-dev
On May 19, 6:23 am, Matthias Steffens
<matthias.steff...@googlemail.com> wrote:

> Dan, thanks for the detailed outline of a possible Zotero server
> workflow. I think your scenario sounds reasonable. However, I'd
> prefer if this scenario would be a bit more open, i.e. not entirely
> designed around (and thus dependent on) a central Zotero server.

Yeah, I agree.

If I understand your proposal right, Dan, you'd basically be using
Zotero (the server) to create new stable global URIs, and then
associating them with legacy identifiers on the server.

In other words, if we have a resource with a doi of
"10.1029/95JC02188", you ultimately assign it a URI of something like
"http://zotero.org/id/982273847334". That URI would then be used as
the citation identifier in documents, and the relevant part of the RDF
encoding might look like:

<b:Citation rdf:about="http://zotero.org/id/982273847334">
<b:doi>10.1029/95JC02188</b:doi>
</b:Citation>

... or even:

<b:Citation rdf:about="http://zotero.org/id/982273847334">
<owl:sameAs rdf:resource="info:doi:10.1029/95JC02188"/>
</b:Citation>

This would be in contrast to using "info:doi:10.1029/95JC02188" as the
URI identifier.

<b:Citation rdf:about="info:doi:10.1029/95JC02188">
...
</b:Citation>

I think in principle that's not such a bad thing. Indeed, I have been
saying all along that with the server, you have the ability to mint
new meaningful URIs and should absolutely exploit it to the hilt.

But in practice I get to the single point-of-failure problem. In
short, what happens if Zotero fails? Perhaps your funding dries up.
What then?*

Or even more simply and practically, what happens if a user is working
off-line? Or they use a different bibliographic tool?

That said, I don't think it's totally obvious that using an info URI
is ideally such a great thing (because it is not dereferenceable).

From an RDF perspective, BTW, the question of which URI to use is
ultimately a social question: whose authority do you most trust, and
whose do you expect *others* to similarly trust such that when
different people want to refer to the same thing, they use the same
URI?

If you want to refer to some edition of a Mark Twain book, for
example, do you use the isbn encoded as an effectively non-
dereferenceable URN?

urn:isbn:0895772175

Or do you say instead that ISBNs are unreliable (they are sometimes
not unique, and one has two different IDs for hardcover and softcover,
which is irrelevant for our purposes), and instead use the better OCLC
IDs as INFO URI?

info:oclcnum:12940137

Or do you use the cool URIs from worldcat.org as your trusted source
because they are dereferenceable?

http://www.worldcat.org/isbn/0895772175
http://www.worldcat.org/oclc/12940137
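
To make the options concrete, here is a sketch that lists the candidate URIs for one book and flags which of them can be dereferenced. The function name and the tuple layout are assumptions for illustration; the URI patterns are the ones shown above.

```python
def candidate_uris(isbn=None, oclc=None):
    """Return (uri, dereferenceable) pairs for the identifiers given."""
    candidates = []
    if isbn:
        candidates.append(("urn:isbn:" + isbn, False))
        candidates.append(("http://www.worldcat.org/isbn/" + isbn, True))
    if oclc:
        candidates.append(("info:oclcnum:" + oclc, False))
        candidates.append(("http://www.worldcat.org/oclc/" + oclc, True))
    return candidates
```

Which of these to prefer is still the social question raised above; the code only enumerates the choices.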

Or do we forget about all that and essentially only use zotero URIs?

We just have to make some choices among imperfect options, hopefully
based on some clear view of the requirements.

In general, I think every piece of this system needs to avoid assuming
a centralized model, even if perhaps it might be implemented as one.

On that count, perhaps (?) rewriting DOIs in particular would not be
good practice, because when confronted with such a URI in a document,
getting back to the DOI and the resource would require the zotero
server?

Bruce

* this of course is an important question of long-term sustainability.
I'd like to hear more about how you guys are thinking of this, but
we'll skip it here.

Richard Karnesky

May 19, 2007, 9:49:52 AM5/19/07
to zotero-dev
> Or even more simply and practically, what happens if a user is working
> off-line?

If the client has already connected with the server to negotiate the
full citation metadata (including the "Zotero URI" which is used as a
globally unique ID), this won't be a problem. Bibliographic metadata
rarely changes. If it has been corrected, the client can easily get
it during the next sync.

If this transaction hasn't been done, but some other trusted URI (DOI/
OCLC/ISBN/PMID/arXiv #/etc.) is available from the local reference
manager, I don't think it is a problem either. During the next
connection, it can easily retrieve and start to use the zotero GUID.

If it doesn't have any such global ID or trusted/predictable local ID,
I suppose that a local ID would have to be generated & there should be
some mechanism (OpenSearch or whatever) to sync that local ID with the
Zotero GUID.
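
A rough sketch of the three off-line cases above. The item is assumed to be a dict with hypothetical keys (`zotero_uri`, `local_id`, and one key per trusted ID type), and `lookup_zotero_uri` is a stand-in for whatever mechanism would eventually map an ID to the Zotero GUID.

```python
import uuid

# Trusted ID types, in an arbitrary illustrative order of preference.
TRUSTED_KEYS = ("doi", "oclc", "isbn", "pmid", "arxiv")


def working_id(item: dict) -> str:
    """Return the best identifier available without a network connection."""
    if item.get("zotero_uri"):
        # Case 1: the client has already negotiated a Zotero URI.
        return item["zotero_uri"]
    for key in TRUSTED_KEYS:
        if item.get(key):
            # Case 2: some other trusted ID is available locally.
            return key + ":" + item[key]
    if not item.get("local_id"):
        # Case 3: generate a local ID once and keep it with the item.
        item["local_id"] = "urn:uuid:" + str(uuid.uuid4())
    return item["local_id"]


def sync(item: dict, lookup_zotero_uri) -> str:
    """At the next connection, upgrade the item to its Zotero URI."""
    if not item.get("zotero_uri"):
        item["zotero_uri"] = lookup_zotero_uri(working_id(item))
    return item["zotero_uri"]
```

Documents written off-line would carry the interim ID, so the plugin would also need to recognize it after the upgrade, for example by keeping it as an alias of the Zotero URI.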

> Or they use a different bibliographic tool?

Ideally, by allowing OpenSearch & other standards, the responsibility
would be on (and the ability would be with) the other bibliographic
tool to perform this same sync.

> whose do you expect *others* to similarly trust such that when
> different people want to refer to the same thing, they use the same
> URI?

Yes! As I've said before, it is important to include other URIs for
the same resource in the metadata that aren't used as the global
unique identifier. At the very least, this will make it easier to
retrieve/replace a GUID from another trusted server in the event the
Zotero server isn't adequate.

--Rick
