Zotero as a datasource for BibLaTeX/Biber


philkime

Feb 9, 2011, 4:33:22 AM
to zotero-dev
As the current main Biber developer, I am looking into using Zotero as
a data source for BibLaTeX so that users can access Zotero data
directly when processing documents with BibLaTeX. It is not clear to
me what the best way to do this is, for the following reasons:

* LaTeX users need to cite by an identifying key which is not visible
in the Zotero interface
* It's not clear which API to use - a direct SQLite connection or the
ATOM/REST interface which looks rather underdocumented.
* An interim solution would be a URL accessible BibTeX format export
file but it's not clear if this is possible to retrieve via a URL and
again there is the issue of the citation key which needs to be known
in advance.

If anybody has any comments on these things, I'd be glad to hear them.
BibLaTeX/Biber is developing rapidly now and already has features far
more advanced than any other bibliography systems I can think of
(particularly in the areas of Unicode, sorting and cross-entry
inheritance) and it would be a shame not to be able to pull the data
from Zotero.

Bruce D'Arcus

Feb 10, 2011, 12:49:08 PM
to zoter...@googlegroups.com

Maybe a bit overblown ("far more advanced"?), but I'd just add that
I'm thinking about a similar workflow for pandoc (which recently added
full CSL 1.0 support), with similar needs.

Bruce

Louis-Dominique Dubeau

Feb 10, 2011, 2:19:26 PM
to zoter...@googlegroups.com
On Wed, 2011-02-09 at 01:33 -0800, philkime wrote:
> As the current main Biber developer, I am looking into using Zotero as
> a data source for BibLaTeX so that users can access Zotero data
> directly when processing documents with BibLaTeX.

I'd sure like to see such support. One or two years ago, I checked the
state of Zotero and the state of the LaTeX tools and I decided that the
best way to integrate the two was to come up with my own homegrown
solution.

> It is not clear to
> me what the best way to do this is, for the following reasons:
>
> * LaTeX users need to cite by an identifying key which is not visible
> in the Zotero interface

Right. As I recall, that was one of the major issues I had. The keys are
not readily accessible. They are also not generated correctly. Things
may have changed but last I checked, some of the tools in the latex
toolchain won't work with accented characters in the keys. Unfortunately
Zotero does not strip them out. Here's an actual example:

dharmaśrī_pi_1978

The name of the author has diacritics. (And the title is one word
because the title is in Pinyin with each syllable demarcated by spaces.
I have not spent any time trying to figure a solution for that.) Here's
another fun one:

mdo-sṅags-bstan-paʼi-ñi-ma_dbu_1978

It does not handle articles (as in: the, a, an, un, une, des, les, le,
la, ein, die, etc.) well:

bareau_les_1955

I'd prefer the next word of the title to a single article
as the abbreviated title.

Besides problems like above, I just did not like the algorithm. For
instance, years are always put in the key. I'd rather have the year
added ONLY if it helps disambiguate an otherwise ambiguous key. This
means that adding a new entry can make an old key change. In practice, I
think it is a rare instance which can be readily fixed by the user. It
is just that if I know the author and title, I can figure out the key
fairly easily. It is rare that I know the year of publication.
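
A minimal sketch of the key scheme described above, assuming a simple
rule set: strip diacritics, skip articles in the title, and append the
year only when two entries would otherwise share a key. The function
names and the stop list are illustrative assumptions; this is neither
the attached rekey.py nor Zotero's own algorithm.

```python
import unicodedata

# Hypothetical stop list; the articles are the ones named in this post.
ARTICLES = {"the", "a", "an", "un", "une", "des", "les", "le", "la", "ein", "die"}


def to_ascii(text):
    """Decompose accented letters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def base_key(author, title):
    """Build 'author_word' from the author and the first non-article title word."""
    words = [to_ascii(w).lower() for w in title.split()]
    words = ["".join(c for c in w if c.isalnum()) for w in words]
    significant = [w for w in words if w and w not in ARTICLES]
    name = "".join(c for c in to_ascii(author).lower() if c.isalnum())
    return f"{name}_{significant[0]}" if significant else name


def assign_keys(entries):
    """entries: list of (author, title, year). The year is added only on collision."""
    bases = [base_key(author, title) for author, title, _ in entries]
    keys = []
    for base, (_, _, year) in zip(bases, entries):
        keys.append(f"{base}_{year}" if bases.count(base) > 1 else base)
    return keys
```

Two entries with the same author, first title word and year would still
collide under this rule, so a real tool would need a further
disambiguation token for that case.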

I was also concerned by changes to Zotero mucking up the keys. Work on
your dissertation, Zotero changes the key generation algorithm and then
all the keys cease to work. Yay!

So I decided the best course of action would be to work manually:
export, regenerate keys and cleanup, run the latex toolchain. The export
bit is manual. All the rest is launched by "make". By generating my own
keys, I isolate myself from whatever changes are made to Zotero.

> If anybody has any comments on these things, I'd be glad to hear them.
> BibLaTeX/Biber is developing rapidly now and already has features far
> more advanced than any other bibliography systems I can think of
> (particularly in the areas of Unicode, sorting and cross-entry
> inheritance) and it would be a shame not to be able to pull the data
> from Zotero.

Indeed it would and maybe the time is ripe for something like this. For
what it is worth, I'm attaching my rekeying code to this email. It is an
ugly beast but it may give some ideas. Besides regenerating the keys, it
does some ad hoc cleanup which is probably of interest to no one else.
This code is used on a daily basis but only by me, and only for one
project so far. So things are hardcoded including the name of the input
(biblio.bib) and output (normalized.bib) files. Both input and output
are bibtex files. You need pybliographer to run it.

Ciao,
Louis

rekey.py

philkime

Feb 11, 2011, 11:21:32 AM
to zotero-dev
Hmm, yes, it was a bit overblown, sorry.

philkime

Feb 11, 2011, 11:25:36 AM
to zotero-dev
I don't mind so much the accents in keys as biber has full UTF-8
support for keys too but the issue is more that when actually
typesetting, you need to be able to uniquely refer to an entry in
order to cite it. I really don't like the idea of calculating the key
from fields on the fly - far too much room for breakage there. There
is a unique key in the SQLite backend but it's not exposed in any
way. Say a user cites "xyz" in a latex document and as a data source
points to a Zotero db or Atom feed. I need to be able to look for that
key in the data source, get the entry and then parse it into the
internal biblatex data model format. Also, users need to be able to
have easy access to the keys of the entries or they can't cite them by
key in the first place ...

Richard Karnesky

Feb 11, 2011, 11:28:12 AM
to zotero-dev
I don't know if much has changed since your post to the zotero forums
a few weeks ago:
http://forums.zotero.org/discussion/15941/accessing-zotero-data-from-biber/

> * LaTeX users need to cite by an identifying key which is not visible
> in the Zotero interface

The developers have said in the past that a local identifier field
will be added in the future (which would be useful as a human-writable
key for BibTeX and as MODS XML IDs for pandoc+citeproc-hs).

Until this happens, my suggestion for short-term integration remains
the same: have Zotero export the entire database to BibTeX, sorted by
date added to the database (so that the auto-generated keys will
remain fairly stable). Longer term improvements to this could be to
eventually use a richer output format (biber has added basic RIS file
support, and I hope that supporting richer XML/RDF-based formats is on
your roadmap), to access the data directly without an intermediate
format, and to allow the use of Zotero (or some other citeproc
implementation) to produce the formatted references (allowing
CSL-based styles to be used in LaTeX documents).

If you don't rely on the Zotero client for your BibTeX file, you can
automatically generate the keys using the same method the client
uses. As a LaTeX user, I find this preferable to using the much more
obscure UIDs used by Zotero.


> * It's not clear which API to use - a direct SQLite connection or the
> ATOM/REST interface which looks rather underdocumented.

You can also use the local client API:
http://www.zotero.org/support/dev/interacting_with_zotero_from_within_firefox
(this has the benefit of being able to use Zotero-produced export
files & it should be relatively easy to implement; it also means the
data will be very safe and the connection should be reliable).

If you use a direct SQLite connection, note that it must be read-only
and that it may still fail on some NFS- or Windows-based systems.
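
A minimal sketch of a read-only SQLite connection, using Python's
sqlite3 module and an SQLite URI with mode=ro. The 'items' table and
its 'key' and 'title' columns are a hypothetical schema for
illustration only, not Zotero's actual layout.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an SQLite file through a URI with mode=ro so writes are refused."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def find_by_key(conn, key):
    """Return the (key, title) row from the hypothetical 'items' table, or None."""
    query = "SELECT key, title FROM items WHERE key = ?"
    return conn.execute(query, (key,)).fetchone()
```

Opening the file this way guards against accidental writes; it does
not by itself address the NFS or Windows locking problems mentioned
above.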

I'm not sure if I'd agree that the ATOM/REST interface is
undocumented; most existing features are described at:
http://www.zotero.org/support/dev/server_api
But the functions available are still limited. You would not have
human-readable keys & would probably need to write a parser for the
Atom feed. There are plans to support the standard bibliographic
export formats found in the Zotero client in the future. Having these
would make the REST API more compelling.
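
A minimal sketch of such an Atom parser, using Python's standard
xml.etree.ElementTree. The sample feed is invented for illustration
and is not actual Zotero server output; the only assumption is that
each entry's <id> ends in the item key, as in the item URIs discussed
in this thread.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample feed; a real feed from the server API will differ.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example library</title>
  <entry>
    <id>http://www.zotero.org/bdarcus/items/2VDXDIMR</id>
    <title>Example Item Title</title>
  </entry>
</feed>"""


def entries_by_item_key(feed_xml):
    """Map the last path segment of each entry <id> to the entry title."""
    root = ET.fromstring(feed_xml)
    result = {}
    for entry in root.findall(f"{ATOM}entry"):
        item_id = entry.findtext(f"{ATOM}id", default="")
        title = entry.findtext(f"{ATOM}title", default="")
        result[item_id.rstrip("/").rsplit("/", 1)[-1]] = title
    return result
```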

--Rick

Bruce D'Arcus

Feb 11, 2011, 11:43:51 AM
to zoter...@googlegroups.com
On Fri, Feb 11, 2011 at 11:25 AM, philkime <Phi...@kime.org.uk> wrote:

> I don't mind so much the accents in keys as biber has full UTF-8
> support for keys too but the issue is more that when actually
> typesetting, you need to be able to uniquely refer to an entry in
> order to cite it. I really don't like the idea of calculating the key
> from fields on the fly - far too much room for breakage there. There
> is a unique key in the SQLite backend but it's not exposed in any
> way. Say a user cites "xyz" in a latex document and as a data source
> points to a Zotero db or Atom feed. I need to be able to look for that
> key in the data source, get the entry and then parse it into the
> internal biblatex data model format. Also, users need to be able to
> have easy access to the keys of the entries or they can't cite them by
> key in the first place ...

The difficulty with the citation key issue is just that it's by
definition a local (to a file, or a database, or in this case, a user)
identifier. This assures a processor can find the correct item (and
requires the processor have access to those particular records, which
is a big limitation in many contexts).

But also, the tradition in the BibTeX world is that this identifier is
human-readable, and that it can be ideally recalled from memory.

Finally, it should be stable.

So we have three requirements in the context of zotero:

1. unique to a library
2. human-readable
3. stable

... and I would add a fourth requirement:

4. that these user-based identifiers can be associated with global
identifiers (URIs, including DOIs as URI). E.g. we need to stop
privileging local identifiers.

So those are the requirements.

Right now, we have URIs for Zotero items that look like:

<http://www.zotero.org/bdarcus/items/2VDXDIMR>

So we have a key like "2VDXDIMR". This solves requirements #1 and #3,
but fails on #2.

We could add a field for the user data where we might have RDF that looks like:

<http://www.zotero.org/bdarcus/items/2VDXDIMR> a rl:Item ;
bibtex:key "elden-2009-territory" ;
rl:source <info:isbn13:9780816654833> .

E.g. people can add the key themselves in their user data, and can edit them.

But that fails on requirement #3.

Alternately, we could just say the id itself should be a
human-readable slug, such that the item URI becomes:

<http://www.zotero.org/bdarcus/items/elden-2009-territory>

While that introduces other issues, it does solve all of the
requirements I note.
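
A minimal sketch of how a processor might hold these associations,
assuming each record carries an item URI, a user-assigned local key
and a global source identifier. The class and field names are
illustrative, not part of Zotero or any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRecord:
    item_uri: str    # library-local item URI
    local_key: str   # human-readable citation key
    source_uri: str  # global identifier for the cited work


class KeyIndex:
    """Look up records by local key or by global source URI."""

    def __init__(self):
        self._by_key = {}
        self._by_source = {}

    def add(self, record):
        # Requirement 1: a key must be unique within the library.
        if record.local_key in self._by_key:
            raise ValueError(f"duplicate key: {record.local_key}")
        self._by_key[record.local_key] = record
        self._by_source[record.source_uri] = record

    def by_key(self, key):
        return self._by_key[key]

    def by_source(self, uri):
        return self._by_source[uri]
```

Uniqueness is enforced when a record is added, and the same record is
reachable from either the local key or the global identifier
(requirement 4). Stability is not something this structure can
guarantee on its own.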

Bruce

Richard Karnesky

Feb 11, 2011, 11:50:52 AM
to zotero-dev
If Philip does not make a Zotero plugin using the client API and does
not wait for server-side BibTeX support, he doesn't have to worry
about any limitations in Zotero's BibTeX export (that being said, we
should fix what we can).

> Right. As I recall, that was one of the major issues I had. The keys are
> not readily accessible. They are also not generated correctly. Things
> may have changed but last I checked, some of the tools in the latex
> toolchain won't work with accented characters in the keys. Unfortunately
> Zotero does not strip them out. Here's an actual example:
>
> dharmaśrī_pi_1978

As previously noted, there are multiple toolchains that support
accented keys. But there were enough complaints about toolchains still
in use that do not support them that accented characters are now
transliterated or dropped from keys (so the above would probably be
'dharmar_pi_1978' at present).


> It does not handle articles (as in: the, a, an, un, une, des, les, le,
> la, ein, die, etc.) well:

It strips all of these except for 'les' now (although other multilingual
articles will probably also need to be added).


> Besides problems like above, I just did not like the algorithm. For
> instance, years are always put in the key. I'd rather have the year
> added ONLY if it helps disambiguate an otherwise ambiguous key.

This is a matter of personal preference. I personally use just the
lead author, the year, and a disambiguation token. These are fast to
type (though perhaps somewhat more difficult to remember which article
it is you're citing until you get a better feel for the literature),
do not have the inherent problems you listed with needing to parse
titles intelligently, and I'd argue that most other tools use a
similar format for auto-generated keys. (You'll also be able to
impress your colleagues when you remind them that 'Smith' wrote that
particular paper in 1973.)

(And, because it is a matter of personal preference, it seems like the
local ID field is one of the best ways to address it.)


> I was also concerned by changes to Zotero mucking up the keys. Work on
> your dissertation, Zotero changes the key generation algorithm and then
> all the keys cease to work. Yay!

I don't know if that's a significant concern: one would presumably be
able to map between different key algorithms fairly easily. And in
your present environment, you always have a copy of the older BibTeX
file anyway.

--Rick

Richard Karnesky

Feb 11, 2011, 12:17:43 PM
to zotero-dev
> So we have three requirements in the context of zotero:
>
> 1. unique to a library
> 2. human-readable
> 3. stable

I agree with all of this: all BibTeX tools attempt this (but automatic
key generators often fail on point 3).

> ... and I would add a fourth requirement:
>
> 4. that these user-based identifiers can be associated with global
> identifiers (URIs, including DOIs as URI). E.g. we need to stop
> privileging local identifiers.

I'm not completely opposed to this, as:
Local ID + context = global ID
(so if you know the key and the library it came from, you don't
need much more)


> <http://www.zotero.org/bdarcus/items/2VDXDIMR> a rl:Item ;
>     bibtex:key "elden-2009-territory" ;
>     rl:source <info:isbn13:9780816654833> .
>
> E.g. people can add the key themselves in their user data, and can edit them.
>
> But that fails on requirement #3.

How? The URI <http://www.zotero.org/bdarcus/items/2VDXDIMR> is still
stable. The BibTeX key (like any user-editable data) is not
permanent, but I'd consider that it too could be at least as stable as
automatically generated keys: most authors do not need to spend a lot
of time re-keying their IDs.


> Alternately, we could just say the id itself should be a
> human-readable slug, such that the item URI becomes:
>
> <http://www.zotero.org/bdarcus/items/elden-2009-territory>
>
> While that introduces other issues, it does solve all of the
> requirements I note.

This is what I'd advocate for (as I said: local ID+context). But I
don't see how that is "stable". If you allow people to rewrite the
local ID, they may want to use 'Elden-2009' instead. If you
automatically generate the human-readable slug (as is done now), you'd
still have the problem of disambiguating multiple papers by Elden that
came out in 2009 that had 'territory' as the first non-particle word
in the title.

This is where automatic key generation gets a bit sloppy: how do you
order those references? I've previously suggested ordering it by the
date added based on the user's entire library (and I still think
that's probably the best way). But what happens when you have a
multi-part paper & you add part 2 to your database before part 1? Are
you then stuck forever using
'http://www.zotero.org/bdarcus/items/elden-2009-territory-2' to refer
to part 1? And key stability could
break down when you'd remove earlier items in the database that had
been the reason for the disambiguation in the first place.
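
A minimal sketch of this ordering problem, assuming disambiguation by
date added with a numeric suffix. The function and the tuple layout
are illustrative, not how Zotero assigns keys.

```python
def disambiguate(items):
    """items: list of (item_id, base_key, date_added). Returns {item_id: key}.

    The earliest-added item keeps the bare key; later ones get -2, -3, ...
    Dates are ISO strings, so plain string sorting orders them correctly.
    """
    keys = {}
    seen = {}
    for item_id, base, _ in sorted(items, key=lambda item: item[2]):
        seen[base] = seen.get(base, 0) + 1
        count = seen[base]
        keys[item_id] = base if count == 1 else f"{base}-{count}"
    return keys


# Part 2 was added to the library before part 1.
library = [
    ("part2", "elden-2009-territory", "2011-01-05"),
    ("part1", "elden-2009-territory", "2011-02-01"),
]
with_both = disambiguate(library)
after_removal = disambiguate([item for item in library if item[0] != "part2"])
```

Here part 1 is keyed 'elden-2009-territory-2' while both items exist,
and silently becomes 'elden-2009-territory' once part 2 is removed, so
any document citing the old key breaks.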

I think the benefits of human-generated local IDs to resolve issues
like this cleanly outweigh the downsides. If we're hyper-sensitive to
stability of these as global IDs, we need only add additional context
(namely the date/time the local ID went into effect).

--Rick

Bruce D'Arcus

Feb 11, 2011, 12:31:53 PM
to zoter...@googlegroups.com
Just a quick clarification:

On Fri, Feb 11, 2011 at 12:17 PM, Richard Karnesky <karn...@gmail.com> wrote:
>> So we have three requirements in the context of zotero:

>> 4. that these user-based identifiers can be associated with global
>> identifiers (URIs, including DOIs as URI). E.g. we need to stop
>> privileging local identifiers.
>
> I'm not completely opposed to this, as:
>  Local ID + context = global ID
> (so if you know that the key and the library it came from, you don't
> need much more)

Right.

But there is a subtle thing I keep mentioning, which is that the
zotero item should not be the source data itself, and so they should
have different identifiers.

So there's user data about the item (notes, tags, creator, updated
date-time, maybe key) and then there's item metadata itself (title,
authors, etc.). They should be distinct in both the data layer and the
UI.

That allows you to get to the source data in different ways:

- you can get to the user data for the source either with the user +
key, or user + global URI

- you can get to some representation of the data with only the global URI

So I want to decouple processing from specific zotero item ids.

Bruce

skornblith

Feb 11, 2011, 4:14:24 PM
to zotero-dev
On Feb 11, 12:31 pm, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> Just a quick clarification:
>
> On Fri, Feb 11, 2011 at 12:17 PM, Richard Karnesky <karne...@gmail.com> wrote:
>
> >> So we have three requirements in the context of zotero:
> >> 4. that these user-based identifiers can be associated with global
> >> identifiers (URIs, including DOIs as URI). E.g. we need to stop
> >> privileging local identifiers.
>
> > I'm not completely opposed to this, as:
> >  Local ID + context = global ID
> > (so if you know that the key and the library it came from, you don't
> > need much more)
>
> Right.
>
> But there is a subtle thing I keep mentioning, which is that the
> zotero item should not be the source data itself, and so they should
> have different identifiers.

This will work if the item already has a DOI or ISBN. Zotero doesn't
currently do large-scale record linkage to assign URIs to resources
that don't already have them. I'm not opposed to this functionality,
but it's not exactly easy to implement. There are many issues involved
in linking bibliographic records, such as:

1) Determining when item data is similar enough that they probably
refer to the same item
1a) Dealing with misspellings, incomplete titles, different
formatting, etc.
1b) Determining when two items with similar titles by the same
author(s) are actually different
1c) Dealing with cases where the same record has multiple "unique"
identifiers (e.g., multiple ISBNs or multiple DOIs)
2) If using a probabilistic record linkage model, creating (manually)
a sufficiently large training set to model the data
3) Doing linkage for a large number of records very quickly
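
A minimal sketch of point 1, assuming a crude rule: records sharing
any identifier are treated as the same item, otherwise they must have
the same year and a sufficiently similar normalized title. The record
layout, the use of difflib and the 0.9 threshold are arbitrary
assumptions for illustration; this is not a probabilistic linkage
model.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and keep only letters, digits and single spaces."""
    kept = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(kept.split())


def title_similarity(a, b):
    """Similarity ratio between 0 and 1 for two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def probably_same(rec_a, rec_b, threshold=0.9):
    """Shared identifier wins; otherwise require same year and similar title."""
    shared_ids = set(rec_a.get("ids", [])) & set(rec_b.get("ids", []))
    if shared_ids:
        return True
    if rec_a.get("year") != rec_b.get("year"):
        return False
    return title_similarity(rec_a["title"], rec_b["title"]) >= threshold
```

Keeping a list of identifiers per record covers 1c (several ISBNs or
DOIs for one work). The title rule is exactly where 1a and 1b pull in
opposite directions: a looser threshold tolerates misspellings but
merges distinct works with similar titles.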

It's basically the duplicate detection problem magnified. It's not
unsolvable, but it is a large undertaking, and I think there are more
important objectives to complete at present (e.g., a server-side write
API).

> So there's user data about the item (notes, tags, creator, updated
> date-time, maybe key) and then there's item metadata itself (title,
> authors, etc.). They should be distinct in both the data layer and the
> UI.
>
> That allows you to get to the source data in different ways:
>
> - you can get to the user data for the source either with the user +
> key, or user + global URI
>
> - you can get to some representation of the data with only the global URI
>
> So I want to decouple processing from specific zotero item ids.

If you are limited to BibTeX, you are only allowed one key per item
and the key isn't a URI, so this isn't necessarily relevant to the
current discussion.

Also, to drive home my point regarding access dates: at present, a
global URI described with BIBO metadata would not provide the
information needed to cite a webpage, unless the webpage contains a
cite-able issue date.

Simon

Bruce D'Arcus

Feb 11, 2011, 4:32:05 PM
to zoter...@googlegroups.com

I'm not limiting this discussion to BibTeX; I'm trying to fold it into
a broader discussion. And in that context, it is relevant.

> Also, to drive home my point regarding access dates: at present, a
> global URI described with BIBO metadata would not provide the
> information needed to cite a webpage, unless the webpage contains a
> cite-able issue date.

I think this is trivial compared to much bigger issues around lack of
document portability.

Bruce

philkime

Feb 11, 2011, 5:09:31 PM
to zotero-dev
On Feb 11, 5:28 pm, Richard Karnesky <karne...@gmail.com> wrote:
> I don't know if much has changed since your post to the zotero forums
> a few weeks ago:

I was just trying to get a sense of the likely directions.
biblatex/biber is heading towards treating bibtex as a legacy format
because its data model is just too restricted, name handling is
rather clunky etc.
We have a customised btool library for parsing bibtex files which does
limited UTF-8 handling for generating initials etc. to get round some
of the limitations but we now have a pluggable driver architecture
where we can theoretically access arbitrary bib data sources and parse
them into biblatex's internal structures. So, I would rather not spend
much time on forcing various next generation utils like Zotero to
output a legacy format to use as an interface if I can help it. I'd
ideally like to directly access the data source, look for a key that a
user has cited by and pull the data. This is the natural model - the
unique key (or local key plus library context etc.), which is exactly
the bibtex model - keys are specific to a particular file or set of
files usually, which is fine.

I really don't like the idea of allowing users to use pseudo-keys
which are basically agglomerations of parts of the entry data and then
looking these up by data matching. That's a nightmare which is really
hard to implement in an extensible way for all of the reasons
mentioned in this thread - you have to deal with multi-volume,
multi-part, same-year, same-author-same-year etc. biblatex/biber will
soon have an automatic method of generating minimal unique names and
name lists for a given bib set - I don't know of any other tool that
does this and it was really hard to implement for exactly these sorts
of reasons. It's only possible in biblatex/biber at all because it
uses a full programming language to do these things (perl in biber)
and because such disambiguation is one of the core design features of
biblatex. Adding it onto a tool to try to allow pseudo-keys is
probably just about impossible to do in a sensible way and really
impossible to maintain. Unique, user-maintained keys are almost
certainly the way to go. I mean, you essentially need a primary key at
the data presentation level otherwise you can't extract anything if
you are not in a GUI browsing through your entries which nobody using
Zotero as a (potentially remote) data source will be.

I don't think it's possible to get around the fact that to cite
something, you need to know how to fetch some unique piece of data if
you don't have the entire data source "in front" of you. That needs
foreknowledge of some unique key. It doesn't really matter if a user
might change this, that's no different from the .bib files which get
passed around and are often changed. It breaks and you fix the key.
Trying to guess which entry a user wants by some list of entry fields
is, as I said, a swift road to madness even though it looks nice to
start with.

philkime

Feb 14, 2011, 3:48:23 PM
to zotero-dev
Is there some documentation/spec for the Zotero RDF format? That might
be an interim solution.

Richard Karnesky

Feb 15, 2011, 10:17:24 AM
to zotero-dev
> Trying to guess which entry a user wants by some list of entry fields
> is, as I said, a swift road to madness even though it looks nice to
> start with.

To play devil's advocate, this is precisely what the RTF scan feature
does, so I think you're overstating things. I won't press it too
hard, as I agree that local identifiers are the best solution to this
& look forward to their implementation. But I don't think that the
lack of them should stop you from building something useful.

--Rick

Richard Karnesky

Feb 15, 2011, 10:26:00 AM
to zotero-dev
> Is there some documentation/spec for the Zotero RDF format? That might
> be an interim solution.

Yes, Zotero RDF gets around some BibTeX limitations (many inherent,
though BibTeX export could still be improved). BIBO or MODS would
also do this & would be useful for non-Zotero tools. They are both
reasonably stable and well documented. But I'm not sure how a few
inherent BibTeX limitations are limiting biber at this point (given
the only issues you listed (names and utf-8) aren't addressed that
much better in other formats by Zotero) or how any of these export
formats would address the problem with keys, as the GUID that Zotero
uses is not "human friendly".

--Rick

skornblith

Feb 15, 2011, 11:52:11 AM
to zotero-dev
The Bibliontology RDF format is newer, better, and better documented,
and also supports all Zotero fields. You can look at the Bibliontology
documentation
(http://bibotools.googlecode.com/svn/bibo-ontology/trunk/doc/index.html)
for the basics; the mapping to Zotero fields is
documented at https://www.zotero.org/trac/wiki/BiboMapping (although
this might be slightly out of date) or in JSON at the top of
Bibliontology RDF.js.

Simon

philkime

Feb 16, 2011, 3:39:46 PM
to zotero-dev
The problem is trying to couple something to a typesetting system
which is bound to citing by key - I wouldn't really want to try loose
data matching if I could have a key at some point since people writing
with latex tend to want a guarantee that what they cite is what's in
the bibliography. Data matching can never do that (the guarantee, that
is). That's why I'm overstating a bit, I think: changing that model in
latex would change a major basic assumption - you cite the right thing
with the right key or you get a fatal error. There isn't much benefit
for such an audience in "best guess" matching of data entries - it has
to be right or just fail/throw a warning. It seems I'll have to wait
for this to be implemented in some way in Zotero or perhaps look at
the RDF format.

philkime

Feb 16, 2011, 3:56:35 PM
to zotero-dev
We're not really looking for a better format than bibtex as biber will
always support this format but now biber has modular drivers to access
other data sources so I looked around to see which ones are likely
candidates and Zotero is certainly one of them. I'd prefer a direct
connection to an "export first" route like Zotero RDF but perhaps
this isn't realistic at the moment given the slightly different
approach (the necessity of keys in the latex bibliography world).
Biber uses a hacked btparse library to allow the bibtex C routines to
deal with some UTF-8 cases it needs and the internal UTF-8
capabilities of biber are very good so it's not so much a limitation
that's driving this, it's more a desire to allow biber/biblatex users
to draw from non-bibtex data sources. We have a beta RIS driver and
are working on a dedicated biblatexml format which maps closely to the
internal data structures biber uses for bibliography processing but
there are clearly some major sources like Zotero which I'd like to be
able to deal with ...

philkime

Feb 16, 2011, 3:57:32 PM
to zotero-dev
Yes, I'll certainly be looking at this, thanks.

Bruce D'Arcus

Feb 16, 2011, 4:28:57 PM
to zoter...@googlegroups.com
On Wed, Feb 16, 2011 at 3:56 PM, philkime <Phi...@kime.org.uk> wrote:

> We're not really looking for a better format than bibtex as biber will
> always support this format but now biber has modular drivers to access
> other data sources so I looked around to see which ones are likely
> candidates and Zotero is certainly one of them. I'd prefer a direct
> connection to an "export first" route like Zotero RDF but perhaps
> this isn't realistic at the moment given the slightly different
> approach (the necessity of keys in the latex bibliography world).
> Biber uses a hacked btparse library to allow the bibtex C routines to
> deal with some UTF-8 cases it needs and the internal UTF-8
> capabilities of biber are very good so it's not so much a limitation
> that's driving this, it's more a desire to allow biber/biblatex users
> to draw from non-bibtex data sources. We have a beta RIS driver and
> are working on a dedicated biblatexml format which maps closely to the
> internal data structures biber uses for bibliography processing but
> there are clearly some major sources like Zotero which I'd like to be
> able to deal with ...

Have you looked closely at bibutils? This seems to me the most
comprehensive data and format library out there (supports mods, ris,
endnote/refer, OOXML, bibtex, biblatex, etc.), and the mappings are
all laid out in the C source as a series of simple maps. If nothing else,
you might be able to borrow the essential mapping logic and model
(which is the hard part).

FWIW, I had originally recommended Chris use MODS for his core format,
since he had a custom XML format.

BIBO RDF is based on things I learned working with MODS, but is also
designed to really exploit RDF and linked data principles, while fully
supporting (I hope) the Zotero data. That world is linked together not
by local keys, but global URIs. So the scope is more ambitious, you
might say.

To support BIBO in biber, though, a couple of things you'd want:

1) a generic RDF parser

2) a way to map those RDF triples to your own internal model; here's
just one example I wrote in Python using an RDF-object mapper (which
includes #1 above):

<https://github.com/bdarcus/bibo-py>

From comments in the models.py file:

"""
This provides basic object mapping for key classes and relations in the
Bibliographic Ontology (bibo). Examples:

>>> book = Book('<http://example.net/books/1>')
>>> book.date = "2001"
>>> publisher = Organization('<http://abcbooks.com>')
>>> publisher.name = "ABC Books"
>>> publisher.city = "New York"
>>> book.title = "Some Book Title"
>>> book.publisher = publisher
>>> print(book.publisher.name)
ABC Books
"""

3) a way to map these global objects to local document keys, and vice
versa (an easy step, but still a level of indirection*).

Bruce

* E.g., in the python example, you're talking something like:

dockey == zotero_item.source

... where you find the zotero item by its label property.
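
A minimal sketch of steps 2 and 3, assuming the RDF parser from step 1
has already produced plain (subject, predicate, object) tuples. The
Entry class is illustrative, and the citation-key predicate is a
hypothetical placeholder that is not part of BIBO; the title and date
predicates are the Dublin Core terms.

```python
from dataclasses import dataclass

DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_DATE = "http://purl.org/dc/terms/date"
# Hypothetical predicate for a user-assigned citation key; not part of BIBO.
LOCAL_KEY = "http://example.org/terms/citationKey"

FIELDS = {DCT_TITLE: "title", DCT_DATE: "date", LOCAL_KEY: "key"}


@dataclass
class Entry:
    uri: str
    key: str = ""
    title: str = ""
    date: str = ""


def entries_from_triples(triples):
    """Step 2: fold triples into Entry objects, one per subject URI."""
    entries = {}
    for subject, predicate, obj in triples:
        entry = entries.setdefault(subject, Entry(uri=subject))
        if predicate in FIELDS:
            setattr(entry, FIELDS[predicate], obj)
    return entries


def index_by_key(entries):
    """Step 3: map local document keys back to the global entries."""
    return {entry.key: entry for entry in entries.values() if entry.key}
```

Predicates that are not in the mapping are ignored here; a real driver
would need to decide whether to warn about them or carry them through.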

philkime

Feb 17, 2011, 7:48:52 AM
to zotero-dev
That's helpful, thanks. I have looked at this a little but I think I
will perhaps concentrate on this as it looks to be a good general
solution to many RDF based formats.

philkime

Feb 17, 2011, 3:24:50 PM
to zotero-dev
Is there a reason why the Wiki link on the BIBO main page is broken?

Bruce D'Arcus

Feb 17, 2011, 3:29:12 PM
to zoter...@googlegroups.com
On Thu, Feb 17, 2011 at 3:24 PM, philkime <Phi...@kime.org.uk> wrote:

> Is there a reason why the Wiki link on the BIBO main page is broken?

Which "wiki link"? And you mean this "main page"?

<http://bibliontology.com/>

Bruce

philkime

Feb 17, 2011, 3:47:11 PM
to zotero-dev
Ah, sorry, I was looking at the Google code page which has a broken
wiki link - I found the examples on the "real" web site.

I have installed a decent Perl RDF parsing suite (RDF::Trine,
RDF::Query) which seems to work well. Having looked at the BIBO model,
I'm a bit concerned that there is nothing explicit for name parts
beyond "first/last". For complex typesetting and sorting, we need to
discriminate name suffixes and prefixes, names with multiple parts or
hyphenated parts etc. We also really need the facility to explicitly
specify the initials for each part. One of the main motivations for
moving away from the bibtex format is the name parsing as we have a lot
of quite complex code to take bibtex name strings and parse them at a
high enough resolution to be able to do decent typesetting. A model which
forced us to continue to have to parse fairly high-level name
components wouldn't be much of a gain for us in the long term. Names are
first-class data items in bibliographies and so we need something
really fine-grained ideally. I understand that BIBO wasn't really
designed with typesetting in mind so this isn't a criticism and we
probably have to support it anyway but I was hoping to avoid writing
and supporting our own data format ...
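
To make the requirement concrete, here is a rough sketch of the granularity meant above. The class and field names are invented for illustration; they are not BIBO, FOAF, biblatex or biber structures:

```python
from dataclasses import dataclass


@dataclass
class NamePart:
    text: str           # full form, e.g. "Jean-Paul"
    initials: str = ""  # explicit initials, e.g. "J.-P."; guessed if empty

    def short(self):
        # Prefer initials supplied in the data over guessed ones.
        if self.initials:
            return self.initials
        return " ".join(word[0] + "." for word in self.text.split())


@dataclass
class Name:
    family: NamePart
    given: NamePart
    prefix: str = ""  # e.g. "van", "de la"
    suffix: str = ""  # e.g. "Jr."

    def sort_key(self, use_prefix=False):
        # Whether the prefix takes part in sorting is a style decision.
        parts = [self.prefix] if use_prefix and self.prefix else []
        parts += [self.family.text, self.given.text, self.suffix]
        return tuple(p.lower() for p in parts if p)


name = Name(NamePart("Beethoven"), NamePart("Ludwig"), prefix="van")
```

With the parts stored separately, nothing has to be re-parsed out of a single name string at typesetting time.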

Avram Lyon

Feb 17, 2011, 3:56:18 PM2/17/11
to zoter...@googlegroups.com
2011/2/17 philkime <Phi...@kime.org.uk>:

> enough resolution to be able to do decent typesetting. A model which
> forced us to continue to have to parse fairly high-level name
> components wouldn't be much of a gain for us in the long term. Names are
> first-class data items in bibliographies and so we need something
> really fine-grained ideally. I understand that BIBO wasn't really
> designed with typesetting in mind so this isn't a criticism and we
> probably have to support it anyway but I was hoping to avoid writing
> and supporting our own data format ...

I haven't explored name representations in RDF, but I'm sure there has
been some work done in this sphere already. citeproc-js also prefers
to consume more fine-grained name data than Zotero provides, and it
would be good to bring more flexible names to Zotero itself some day.
Working out a reasonable way to represent the names in BIBO/RDF could
be an important step in that direction.

Avram

Bruce D'Arcus

Feb 17, 2011, 4:01:38 PM2/17/11
to zoter...@googlegroups.com, philkime
On Thu, Feb 17, 2011 at 3:47 PM, philkime <Phi...@kime.org.uk> wrote:
> Ah, sorry, I was looking at the Google code page which has a broken
> wiki link - I found the examples on the "real" web site.
>
> I have installed a decent Perl RDF parsing suite (RDF::Trine,
> RDF::Query) which seems to work well. Having looked at the BIBO model,
> I'm a bit concerned that there is nothing explicit for name parts
> beyond "first/last". For complex typesetting and sorting, we need to
> discriminate name suffixes and prefixes, names with multiple parts or
> hyphenated parts etc. We also really need the facility to explicitly
> specify the initials for each part. One of the main motivations for
> moving away from the bibtex format is the name parsing as we have a lot
> of quite complex code to take bibtex name strings and parse them at a
> high enough resolution to be able to do decent typesetting. A model which
> forced us to continue to have to parse fairly high-level name
> components wouldn't be much of a gain for us in the long term. Names are
> first-class data items in bibliographies and so we need something
> really fine-grained ideally. I understand that BIBO wasn't really
> designed with typesetting in mind so this isn't a criticism and we
> probably have to support it anyway but I was hoping to avoid writing
> and supporting our own data format ...

Yeah, that's why we defer to FOAF for agent (including name) representation.

But it's worth keeping in mind that RDF is beautifully extensible. So
you could invent your own properties for these details (and as Avram
notes, we tackled some of this in the CSL/citeproc-js arena).

Perhaps a better approach would be to post a note on the bibo google
group (where foaf people also hang out) laying out your concerns, and
see if we could come up with a solution.

One wrinkle to consider is that RDF is, like a relational database, a
fundamentally unordered data model. So the best solutions are ones that
don't depend on order where that can be avoided.

Bruce

philkime

Feb 17, 2011, 4:03:50 PM2/17/11
to zotero-dev
A bigger problem for me at the moment is that in the Zotero RDF format,
the "key" (rdf:about) seems:

* random
* starts with '#'

This makes it almost impossible to use the data in biber because users
don't know necessarily what to use as a citation key and even if the
one in the RDF export was stable, it would mean hacking it because
'#' isn't legal in LaTeX as a citation key. Hmm.

Richard Karnesky

Feb 17, 2011, 4:18:25 PM2/17/11
to zotero-dev
On Feb 17, 4:48 am, philkime <Phi...@kime.org.uk> wrote:

> > Have you looked closely at bibutils?
> ....
> That's helpful, thanks. I have looked at this a little but I think I
> will perhaps concentrate on this as it looks to be a good general
> solution to many RDF based formats.

Just for clarity: Bibutils does not handle RDF-based formats yet: It
uses MODS XML as an intermediate format & can translate to/from other
bibliographic standards, but not BIBO or other RDF-based ones.

--Rick

Bruce D'Arcus

Feb 17, 2011, 4:25:20 PM2/17/11
to zoter...@googlegroups.com, philkime
On Thu, Feb 17, 2011 at 4:03 PM, philkime <Phi...@kime.org.uk> wrote:
> A bigger problem for me at the moment is that in the Zotero RDF format,
> the "key" (rdf:about) seems:
>
> * random
> * starts with '#'
>
> This makes it almost impossible to use the data in biber because users
> don't know necessarily what to use as a citation key and even if the
> one in the  RDF export was stable, it would mean hacking it because
> '#' isn't legal in LaTeX as a citation key. Hmm.

It's best not to get too focused on the syntax. If you were to run an
RDF parser on that file, you'd see that those nodes get expanded to a
full URI.

Keys in RDF are URIs, so as I said earlier, you'd need a way to map
the URI to a local key. Certainly Zotero will provide that, but it
doesn't yet.
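
To illustrate both points, here is a small Python sketch. The base URI, the key map and the function names are made up for the example; a real RDF parser does the expansion for you:

```python
from urllib.parse import urljoin

# Hypothetical location of the exported file. A parser resolves relative
# identifiers such as "#item_123" against the document's base URI.
BASE_URI = "http://example.org/export.rdf"


def expand_about(about, base=BASE_URI):
    """Resolve an rdf:about value against the base URI."""
    return urljoin(base, about)


def local_key(uri, key_map):
    """Look up the document key for a full URI in a user-maintained map."""
    return key_map[uri]


full_uri = expand_about("#item_123")
key_map = {full_uri: "doe:99"}  # the mapping step that is still missing
```

Here `full_uri` comes out as `http://example.org/export.rdf#item_123`, and an `rdf:about` that is already an absolute URI passes through unchanged.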

Also, I submitted a bug report here that explained those values are
more appropriate as <http://zotero.org/user/doe/12353124>.

Bruce

philkime

Feb 17, 2011, 4:41:18 PM2/17/11
to zotero-dev
Ah, I'm with you now. I am coming from a more XML-oriented angle and
so just looked at the RDF/XML as if it were coming out of the perl
XML::LibXML module ...

Still, as you say, the "keys" aren't much use if they are not easily
to hand for the user and also stable between exports.

philkime

Feb 25, 2011, 2:29:21 AM2/25/11
to zotero-dev
An update - I have a beta Zotero RDF/XML driver for biber now. It's
currently the only real option as it's an export format where users
can actually see the "keys" (rdf:about attribute) assigned to an
entry. It doesn't solve everything since these attributes often
contain arbitrary URI encoded strings which LaTeX cannot use as a key
at all due to all the special characters. I handle the default
"#item_nnn" keys by allowing users to cite by this sort of key, minus
the '#'. I'll open a feature request to have a user-visible "ID" field
which overrides all others if set.
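
For anyone curious, the key handling amounts to something like the following Python sketch. This is not the biber driver (which is Perl); the sample document and the function name are invented, and it only looks at top-level entries:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ABOUT = f"{{{RDF_NS}}}about"

# Invented sample in the general shape of an RDF/XML export.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:bib="http://purl.org/net/biblio#">
  <bib:Book rdf:about="#item_1">
    <dc:title>Some Book Title</dc:title>
  </bib:Book>
  <bib:Article rdf:about="http://example.net/a%20b?x=1">
    <dc:title>Some Article</dc:title>
  </bib:Article>
</rdf:RDF>"""


def citable_keys(rdf_xml):
    """Return {citation key: rdf:about} for entries with "#item_nnn" keys."""
    keys = {}
    for entry in ET.fromstring(rdf_xml):
        about = entry.get(ABOUT)
        # Only the default keys are usable once the leading '#' is dropped;
        # URI-style values are skipped because LaTeX cannot cite them.
        if about and about.startswith("#item_"):
            keys[about[1:]] = about
    return keys
```

The second sample entry shows the unsolved case: its `rdf:about` is a URI-encoded string, so it gets no citable key at all.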

philkime

Mar 4, 2011, 4:42:05 PM3/4/11
to zotero-dev
On Feb 11, 10:32 pm, "Bruce D'Arcus" <bdar...@gmail.com> wrote:

> I think this is trivial compared to much bigger issues around lack of
> document portability.

Just to re-cap, I've been looking into all this quite a lot since this
discussion and I think I can conclude the following:

1. There is currently little chance of us using Zotero or biblio or
indeed anything I've seen as the future data model for biblatex/biber
as they are not really geared towards typesetting and are not fine
grained enough for what we need. The extensibility of RDF etc. isn't
really any help as it has to be an implemented, recognised data model
and so there is no real advantage over straight XML or even the
current bibtex data model: the issue is not technical extensibility
but a controlled, strict model which is widely implemented. For
biblatex at least, that's important because the results (a typeset
document) are fairly precise (people quibble about commas ...)
2. The lack of a user-defined citation key in entries is a major
problem and there is no real way round it. The natural model for
citation key based systems is that users cite by key, the backend takes
the key, looks through a data source for an entry with that key and
constructs some object from that entry to use in e.g. typesetting.
This is fast since you don't have to look in entries without that key.
Using auto-generated keys is slow and messy - you have to open every
data source entry, construct a key from some information and then
compare it to the key the user used - slow, messy and ugly. The Zotero
model of using as "key" whatever it thinks is the best uniquely
identifying information (URL, DOI, whatever) doesn't play well with
this - the key needs to be stable. More importantly, it needs to be
user defined so that people can avoid special chars which are not
allowed in citation keys in some systems. URLs are horrible in this
regard - LaTeX certainly can't use most URLs as keys as they contain
all sorts of LaTeX unfriendly characters. Even the default Zotero key
which is used if there is no "identifying" information breaks LaTeX
(#item_nn). Also, when you are reading document sources with
citations, keys like URLs are pretty useless if you want a quick idea
of what the citation refers to. Traditional citation keys, user-defined,
like "Smithetal:2010" are much better.
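
The difference between the two lookup models in point 2 can be sketched in a few lines of Python. The entries and the key scheme in `make_key` are invented for the example and are not what Zotero or biber actually do:

```python
entries = [
    {"key": "Smithetal:2010", "author": "Smith", "year": 2010, "title": "A"},
    {"key": "Doe:1999", "author": "Doe", "year": 1999, "title": "B"},
]

# Stored keys: build the index once, then each citation is one lookup.
by_key = {entry["key"]: entry for entry in entries}


def find_stored(key):
    return by_key.get(key)


def make_key(entry):
    # Hypothetical key scheme; changing it silently changes every key.
    return f"{entry['author']}:{entry['year']}"


def find_generated(key):
    # No stored key: every entry is opened and a key derived for it.
    for entry in entries:
        if make_key(entry) == key:
            return entry
    return None
```

Note that `find_generated("Smithetal:2010")` finds nothing, because the derived key would be "Smith:2010": the user has to guess what the generator produced.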

Anyway, I have implemented an experimental biber driver for the Zotero
RDF/XML format but due to the key issue it will always be a bit of a
mess unless I can persuade you to add a GUI visible "citationkey"
field which users can define. It doesn't even have to be considered as
the real "identifying" key. It just has to be visible to the users so
they know what to cite with if using citation key based systems and it
has to make it into export formats in some form.

Bruce D'Arcus

Mar 4, 2011, 4:56:13 PM3/4/11
to zoter...@googlegroups.com
On Fri, Mar 4, 2011 at 4:42 PM, philkime <Phi...@kime.org.uk> wrote:
> On Feb 11, 10:32 pm, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
>
>> I think this is trivial compared to much bigger issues around lack of
>> document portability.
>
> Just to re-cap, I've been looking into all this quite a lot since this
> discussion and I think I can conclude the following:
>
> 1. There is currently little chance of us using Zotero or biblio or
> indeed anything I've seen as the future data model for biblatex/biber
> as they are not really geared towards typesetting and are not fine
> grained enough for what we need. The extensibility of RDF etc. isn't
> really any help as it has to be an implemented, recognised data model
> and so there is no real advantage over straight XML or even the
> current bibtex data model: the issue is not technical extensibility
> but a controlled, strict model which is widely implemented. For
> biblatex at least, that's important because the results (a typeset
> document) are fairly precise (people quibble about commas ...)

My suggestion for you, then, is to do what we've done in the CSL
world: create a JSON representation which matches your internal model.
It'd be nice if we could settle on a common one, but I have a feeling
that may be hard.

For comparison, CSL JSON currently looks like (though it's not
specified, and so subject to change):

{
  "id": "doe:99",
  "authors": [
    {
      "family": "Doe",
      "given": "Jane"
    }
  ],
  "title": "The Title",
  "container-title": "Publication Title",
  "issued": [2000, 3, 14]
}
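
As a rough illustration of how little code a consumer needs for such a representation, here is a Python sketch that reads the record above and indexes it by `id`. The helper names are made up, and the field names simply follow the example rather than any published schema:

```python
import json

CSL_JSON = """
{
  "id": "doe:99",
  "authors": [{"family": "Doe", "given": "Jane"}],
  "title": "The Title",
  "container-title": "Publication Title",
  "issued": [2000, 3, 14]
}
"""


def index_by_id(records):
    """Index records by their "id" so citations can be resolved by key."""
    return {record["id"]: record for record in records}


def short_citation(record):
    """Build an author-year string from the structured fields."""
    families = ", ".join(author["family"] for author in record["authors"])
    year = record["issued"][0]
    return f"{families} {year}"


library = index_by_id([json.loads(CSL_JSON)])
```

Calling `short_citation(library["doe:99"])` gives "Doe 2000".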

> 2. The lack of a user-defined citation key in entries is a major
> problem and there is no real way round it. The natural model for
> citation key based systems is that users cite by key, the backend takes
> the key, looks through a data source for an entry with that key and
> constructs some object from that entry to use in e.g. typesetting.
> This is fast since you don't have to look in entries without that key.
> Using auto-generated keys is slow and messy - you have to open every
> data source entry, construct a key from some information and then
> compare it to the key the user used - slow, messy and ugly. The Zotero
> model of using as "key" whatever it thinks is the best uniquely
> identifying information (URL, DOI, whatever) doesn't play well with
> this - the key needs to be stable. More importantly, it needs to be
> user defined so that people can avoid special chars which are not
> allowed in citation keys in some systems. URLs are horrible in this
> regard - LaTeX certainly can't use most URLs as keys as they contain
> all sorts of LaTeX unfriendly characters. Even the default Zotero key
> which is used if there is no "identifying" information breaks LaTeX
> (#item_nn). Also, when you are reading document sources with
> citations, keys like URLs are pretty useless if you want a quick idea
> of what the citation refers to. Traditional citation keys, user-defined,
> like "Smithetal:2010" are much better.

Yes, I've run into this myself today trying to use Zotero-sourced data
as a source for a pandoc/citeproc-based workflow. I ended up writing
code to add the right keys to the MODS output, and it was definitely a
PITA.

I'd be in favor of what I've previously suggested:

a) a field to hold this data; I'd call it something generic like label

b) a better function to create a default key/label and populate the field

c) but, it could be edited by the user

This label would then map to:

- bibtex key
- mods:mods/@ID
- a property on the RDF
- ID on RIS
- etc.
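
A rough Python sketch of (b), (c) and the bibtex mapping, under stated assumptions: the `label` field, the default scheme and the set of "safe" characters are all hypothetical, and the character set is deliberately conservative rather than an exact statement of what LaTeX accepts:

```python
import re

# Conservative guess at characters that are safe in a citation key.
SAFE_KEY = re.compile(r"^[A-Za-z0-9:_.+-]+$")


def default_label(family, year):
    """Propose a default label; the user may overwrite it afterwards."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", family)
    return f"{cleaned}:{year}"


def to_bibtex(item):
    """Emit a skeletal BibTeX entry that uses the item's label as its key."""
    label = item["label"]
    if not SAFE_KEY.match(label):
        raise ValueError(f"label not usable as a citation key: {label!r}")
    return f"@book{{{label},\n  title = {{{item['title']}}},\n}}"


item = {"label": default_label("D'Arcus", 2011), "title": "The Title"}
item["label"] = "darcus:label"  # (c): the user edits the generated default
```

The same stored label would be written out as the MODS ID, the RIS ID and so on, so one key works in every export.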

Bruce

philkime

Mar 4, 2011, 5:09:21 PM3/4/11
to zotero-dev
On Mar 4, 10:56 pm, "Bruce D'Arcus" <bdar...@gmail.com> wrote:

> My suggestion for you, then, is to do what we've done in the CSL
> world: create a JSON representation which matches your internal model.
> It'd be nice if we could settle on a common one, but I have a feeling
> that may be hard.

Hmm, ok, I will certainly look into this. Would be useful for us
anyway to have such a representation.

> This label would then map to:
>
> - bibtex key
> - mods:mods/@ID
> - a property on the RDF
> - ID on RIS
> - etc.

That would be ideal - is there anything I can do to help this along? I
can see several requests for exactly this in the feature request forum
going back, in some cases, a couple of years ...

Avram Lyon

Mar 5, 2011, 4:55:22 AM3/5/11
to zoter...@googlegroups.com
2011/3/5 philkime <Phi...@kime.org.uk>:

> That would be ideal - is there anything I can do to help this along? I
> can see several requests for exactly this in the feature request forum
> going back, in some cases, a couple of years ...

This has been green-lighted for the next revision of the Zotero data
model (probably Zotero 2.2) (see
https://github.com/ajlyon/zotero-bits/issues#issue/24), and so the
Zotero folks are interested in getting this into CSL soon too.

Avram

Bruce D'Arcus

Mar 5, 2011, 10:03:03 AM3/5/11
to zoter...@googlegroups.com

Not following the last point. What is missing in CSL?

Bruce

Avram Lyon

Mar 6, 2011, 6:52:40 AM3/6/11
to zoter...@googlegroups.com
2011/3/5 Bruce D'Arcus <bda...@gmail.com>:

>> This has been green-lighted for the next revision of the Zotero data
>> model (probably Zotero 2.2) (see
>> https://github.com/ajlyon/zotero-bits/issues#issue/24), and so the
>> Zotero folks are interested in getting this into CSL soon too.
>
> Not following the last point. What is missing in CSL?

Oh, that's just me not knowing much about CSL and making mistakes left
and right. This is of course already in CSL 1.0. This will hopefully
be in Zotero and mapped to the existing citation-label in CSL.

Avram
