We need to talk about citation keys

Emiliano Heyns

Jul 19, 2017, 8:12:06 AM
to zotero-dev
As I'm rebuilding BBT for Z5, I'm trying to minimise, preferably eliminate, the monkey-patches and sandbox-piercing that BBT employed to do its work. Right now I'm at a crossroads with regard to the generation of citation keys. My desiderata for this are:
  1. When generating citation keys, I need to be able to search the whole library the reference lives in, not just the subset being exported, for potential duplicates. Given the translator sandboxing, this means citation keys must be generated outside the sandbox.
  2. This citation key must be attached to the reference so it does not change unless explicitly triggered by the user.
  3. There are people with large libraries, so the search for duplicates needs to be efficient.
  4. The citation keys must be available within the translator, so the citation key generated outside the sandbox must find a way in.
What I did before was:
  1. Pierce the sandbox so key generation can be initiated from within the translator. 
  2. This key generation searches a secondary cache that holds the citation keys, and places the new key there, because searching the "extra" field is way too slow.
  3. If the key is meant to be pinned (that is, it doesn't change when e.g. the reference title changes), I also store it in the "extra" field so it will sync.
  4. Return the generated key to the translator and it's off to the races.
This however has a number of drawbacks that I'd love to get rid of while I'm rebuilding BBT.
  1. As the Phil Karlton quote goes, "There are only two hard things in Computer Science: cache invalidation and naming things." I'd dearly love to get away from my secondary cache. I'm seeing cases in the wild where it gets out of sync with the actual library, and recovering requires not just a full table scan but a parse of each result to find the pinned keys in the "extra" field, which gets me to
  2. There is no way to efficiently search for duplicates in the "extra" field, as it requires inspecting each of the extra fields to lift the key out.
  3. I don't want to pierce the translator sandbox anymore. The sandbox has some very decent defences in place that are tricky to work around. It's extra work for no benefit, and potentially risky.

The "clean" option from the Zotero point of view is of course to store all citation keys in the "extra" field, pinned or not, and retain the secondary cache for searches. I don't like this because cache invalidation remains an issue. The citekeys would get to the translators without sandbox-busting, which is better than it was. I know this is always picked as the preferred option because it doesn't interfere with the regular operations of Zotero, but it sure does interfere with mine because of the efficiency and/or cache invalidation problems.
I've been experimenting with several approaches, including:

Hacky alternatives that violate the spirit of the Zotero API but won't break it, and which would allow for efficient finding of duplicates:
  1. Store the citation keys in a tag. This will make the tag selector instantly useless because it will be flooded with citation key "tags", and I seem to recall there were performance problems when too many tags were present in a Zotero library. Perhaps the performance problem has been solved; I've poked around a bit in Zotero to see if I can just have the tags not show up in the tag picker, but haven't gotten far yet. Such hiding would involve monkey-patching.
  2. Store the citation keys in a specially formatted linked URL (e.g. url="zotero://better-bibtex/citekey", title="@citekey"). Downside is that every reference will show that it has an attachment when potentially that can be a sole "attachment" just holding the citekey. I've likewise looked into hiding those in the UI but haven't gotten far on that either. The hiding again would involve monkey-patching. "zotero://" urls cannot be imported back I've just found, so I may have to settle on something like "https://better-bibtex/citekey"
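
To make the linked-URL idea concrete, here is a minimal sketch of the round trip between a citation key and the url/title pair of such an attachment. It assumes the "https://better-bibtex/citekey" form floated above; the function names toLinkedUrl and fromLinkedUrl are hypothetical, and nothing here touches the Zotero API.

```javascript
// Assumed base URL; the thread only floats this as a candidate.
const CITEKEY_URL = 'https://better-bibtex/citekey';

// Build the url/title pair that a linked-URL attachment would carry.
function toLinkedUrl(citekey) {
  return { url: CITEKEY_URL, title: '@' + citekey };
}

// Recover the citation key from such a pair; return null for any other attachment.
function fromLinkedUrl(attachment) {
  if (!attachment || attachment.url !== CITEKEY_URL) return null;
  const title = attachment.title || '';
  return title.startsWith('@') && title.length > 1 ? title.slice(1) : null;
}
```

Because the key lives in the title and the URL is constant, finding all citekey attachments would come down to an exact match on the URL, which is the point of this alternative.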
Hacky alternatives that will (likely) break Zotero:
  1. Store the citation keys in spurious relation records. When I previously tried this some years ago this screwed up my library beyond any salvation (btw, the account "emilianoheyns" that linked to it can be killed AFAIC).
  2. Store the citation keys in the existing tables for custom keys. Will likely not sync, and may or may not break Zotero.
Finally then, my question:

I would very much love to see a non-hacky alternative that would just allow me to store the citation key associated with the reference, efficiently searchable, syncable, and available to the translators. I know support for custom fields has been talked about many times before, but I'm hopeful that, now that the major work on Z5 is done, there will be time to implement custom fields. If support for custom fields is not likely to emerge in the foreseeable future, I'm leaning towards the linked URL alternative at the moment.

Emiliano Heyns

Jul 19, 2017, 8:29:52 AM
to zotero-dev
BTW I was talking about custom fields as the generic solution, but if just one extra field specific for the citation key would be added, that would already solve the vast bulk of the problem for BBT. This citation key field would also be useful for Markdown and the RTF scanner, so it's not just a BBT thing, and would make the exporters I traditionally bundle fully compatible with Zotero without the presence of the BBT plugin. The plugin would still be available as it does a lot more than export, but as said, just dropping the translators in would work on any Z5 with that extra field present.

Bruce D'Arcus

Jul 19, 2017, 9:42:04 AM
to zoter...@googlegroups.com
Agreed on the "need a citation key field at minimum" request.



Dan Stillman

Jul 20, 2017, 12:30:54 AM
to zoter...@googlegroups.com
A citation key field (or fields) has been long planned as part of the
fixed field changes after 5.0. Custom fields will happen later. As we've
said, we'll be adding new fields as soon as we shut off 4.0 syncing,
which we're hoping to do later this year.

In the meantime, I'm not really clear on the problem with using the
Extra field. In a 6,700-item test library, I can pull out 1,000 PubMed
IDs stored in the Extra field in < 80 ms.

var d = new Date();
// fieldID 22 is the Extra field in this query
Zotero.DB.queryAsync("SELECT itemID, value FROM itemData JOIN itemDataValues USING (valueID) WHERE fieldID=22")
.then(function (rows) {
    var ids = [];
    for (let row of rows) {
        // Keep only the lines of the Extra value that start with "PMID"
        let lines = row.value.split(/\n/).filter(line => line.startsWith('PMID'));
        if (lines.length) {
            // "PMID: 12345" -> "12345"
            ids.push(lines[0].split(/:\s*/)[1]);
        }
    }
    Zotero.debug(ids.length);      // number of IDs found
    Zotero.debug(new Date() - d);  // elapsed milliseconds
});

I also don't see why you have to do anything to the sandbox. You're
already monkey-patching functions on the outside, so why can't you just
add the calculated citation key to the data that gets passed in (via
itemToExportFormat(), say)?
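
A rough sketch of that suggestion, under stated assumptions: the `serialize` argument stands in for whatever function is being patched (itemToExportFormat() or similar), `lookupKey` is a hypothetical function that returns the stored key for an item, and the `citationKey` property name is made up for illustration.

```javascript
// Wrap a serializer so that its result also carries the citation key.
// Neither argument is a real Zotero API here; both are stand-ins.
function withCitationKey(serialize, lookupKey) {
  return function (item, ...rest) {
    const exported = serialize.call(this, item, ...rest);
    const key = lookupKey(item);
    if (key) exported.citationKey = key; // hypothetical property name
    return exported;
  };
}

// Usage with stubs in place of the real serializer and key store:
const keys = new Map([[1, 'heyns2017']]);
const wrapped = withCitationKey(
  item => ({ title: item.title }),
  item => keys.get(item.id) || null
);
```

The translator would then read the key from the exported data like any other field, without reaching outside the sandbox.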

Emiliano Heyns

Jul 20, 2017, 4:53:45 AM
to zoter...@googlegroups.com
On Thu, Jul 20, 2017 at 6:30 AM, Dan Stillman <dsti...@zotero.org> wrote:

> A citation key field (or fields) has been long planned as part of the fixed field changes after 5.0. Custom fields will happen later. As we've said, we'll be adding new fields as soon as we shut off 4.0 syncing, which we're hoping to do later this year.

Ah, that sounds good. I'll go see if it's feasible for the BBT users who have chimed in to wait that out.
 
> In the meantime, I'm not really clear on the problem with using the Extra field. In a 6,700-item test library, I can pull out 1,000 PubMed IDs stored in the Extra field in < 80 ms.
>
> var d = new Date();
> Zotero.DB.queryAsync("SELECT itemID, value FROM itemData JOIN itemDataValues USING (valueID) WHERE fieldID=22")
> .then(function (rows) {
>     var ids = [];
>     for (let row of rows) {
>         let lines = row.value.split(/\n/).filter(line => line.startsWith('PMID'));
>         if (lines.length) {
>             ids.push(lines[0].split(/:\s*/)[1]);
>         }
>     }
>     Zotero.debug(ids.length);
>     Zotero.debug(new Date() - d);
> });

The problem with using the extra field is that I've had complaints from multiple users on either underpowered systems or with really large libraries that this was really, really slow. I didn't start adding my own DB for the fun of maintaining it; I'd dearly love to ditch it. I'm a little surprised to be honest given how extensively normalized the Zotero DB is; if parsing a text field would have been fast enough, why not just have all items as serialized JSON in a single-column table?
 
> I also don't see why you have to do anything to the sandbox. You're already monkey-patching functions on the outside, so why can't you just add the calculated citation key to the data that gets passed in (via itemToExportFormat(), say)?

For several reasons, one of which is that I need access to the citeproc dateparser. I also really, really want to get away from monkey patching wherever possible. Having the translators being drop-in compatible without the plugin present is a soft desideratum because it would allow having it running server-side to generate BibTeX from the Zotero online API.

Emiliano Heyns

Jul 20, 2017, 5:03:47 AM
to zoter...@googlegroups.com
Oh and I also pierce the sandbox in order to provide output caching. That's squarely on me of course; BBT is *lots* slower than the stock bibtex exporter, so I cached the generated output per-item. I am considering dropping that so I don't have to pierce the sandbox anymore. 

Emiliano Heyns

Jul 24, 2017, 8:31:40 AM
to zotero-dev
I have a slightly more complicated query right now:

select item.itemID, item.libraryID, extra.value as extra
from items item
left join itemData field on field.fieldID = 22 and field.itemID = item.itemID
left join itemDataValues extra on extra.valueID = field.valueID
where item.itemTypeId not in (14, 1) and item.itemID not in (select itemID from deletedItems)

but this takes 100-130ms on my 2015 MacBook Air on 53 items, not even doing any recognition on the "extra" field, just getting it and looping through the rows. I'll see if I can do further performance tweaks.

Are references guaranteed to have the "extra" field even if it's empty, or should I take into account (as I do here with the left joins) that it can also just be absent? 
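
Either way, the loop over the rows can be written so that a missing "extra" value is simply skipped. A minimal sketch, assuming rows shaped like the result of the query above and a hypothetical "Citation Key:" line marker (the marker and the names indexCitekeys and isTaken are illustrative, not part of Zotero or BBT):

```javascript
// Build a per-library index: libraryID -> (citation key -> [itemID, ...]).
// Each row is { itemID, libraryID, extra }; extra is null when the item
// has no Extra field, as the left joins allow.
function indexCitekeys(rows, marker = 'Citation Key:') { // marker is an assumption
  const byLibrary = new Map();
  for (const { itemID, libraryID, extra } of rows) {
    if (!extra) continue; // no Extra field on this item
    const line = extra.split(/\r?\n/).find(l => l.startsWith(marker));
    const key = line ? line.slice(marker.length).trim() : '';
    if (!key) continue; // Extra present, but no key stored in it
    if (!byLibrary.has(libraryID)) byLibrary.set(libraryID, new Map());
    const keys = byLibrary.get(libraryID);
    if (!keys.has(key)) keys.set(key, []);
    keys.get(key).push(itemID);
  }
  return byLibrary;
}

// True when the key is already used by some item in the given library.
function isTaken(byLibrary, libraryID, key) {
  const keys = byLibrary.get(libraryID);
  return Boolean(keys && keys.has(key));
}
```

Once the index is built, each duplicate check is a map lookup rather than another pass over the "extra" values; keeping the index per library matches the requirement that duplicates are searched within the library the reference lives in.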

Emiliano Heyns

Jul 24, 2017, 9:29:38 AM
to zotero-dev
I've simplified the scan (it still just grabs the rows and loops through them without doing anything to them, not even lifting the citekey out):

select item.itemID, item.libraryID, extra.value as extra
from items item
join itemData field on field.itemID = item.itemID
join itemDataValues extra on extra.valueID = field.valueID
where field.fieldID = 22 and field.itemID not in (select itemID from deletedItems)

But this still takes 100-130ms on my system. This happens right after I import 53 items into an empty DB; I do see a bunch of queries like

SELECT IA.itemID FROM itemAttachments IA NATURAL JOIN items I LEFT JOIN itemData ID ON (IA.itemID=ID.itemID AND fieldID=1) LEFT JOIN itemDataValues IDV ON (ID.valueID=IDV.valueID) WHERE parentItemID=? AND linkMode NOT IN (?) AND IA.itemID NOT IN (SELECT itemID FROM deletedItems) ORDER BY contentType='application/pdf' DESC, value=? DESC, dateAdded ASC [2, 3, '']

between the start and the end of my scan, but I don't see how the select I'm running could have triggered those; my current guess is that there's still something running async as a result of the import (the IDs match the freshly imported references).

Emiliano Heyns

Jul 24, 2017, 9:54:19 AM
to zotero-dev
Definitely the import running in the background. If I add a Zotero.Promise.delay(...) before I kick off my citation key scan (with the ... proportional to the number of references I'm importing), the scan is down to 2-5ms. I'll tinker on, but I'd rather not have the delay.

Randall O'Reilly

Jul 24, 2017, 2:43:16 PM
to zoter...@googlegroups.com
My lab has been using zotero productively, with the BBT plugin, for a few years now, and we’re excited about the new version 5.0, and particularly the planned advances beyond that which apparently will finally support a citekey field. I just spent some time reviewing the various threads over the years about the (lack of) support for citekeys in zotero, and wanted to attempt to summarize the situation and offer a bit of perspective from our “real world” experience.

I see two basic issues that have prevented more rapid adoption of citekeys:

1. Bibtex-style, human-readable citekeys are non-unique, and don't scale well.
+ Every attempt to solve the uniqueness / scalability of the citekey makes it more like a random hash.
+ So just use random hash keys in the first place

2. Bibtex/latex users are a small % of the user population

Here’s our actual experience relative to these issues:

1. We have over 22k citations in my lab group database (https://www.zotero.org/groups/340666/ccnlab), covering a significant chunk of the computational + cognitive + neuroscience literature, and use a simple last-name-of-first-three-authors+2-digit-year formula for our citekeys, and have *very few* collisions, to the point where it really isn't a major issue resolving the "a" or "b" version of a duplicate citekey -- happens very rarely.

+ Meanwhile, we obtain *considerable* productivity benefits from being able to just write out citations by knowing the paper's authors, without having to constantly stop and look things up. Moreover, the readability of the plaintext is extremely beneficial -- we can easily see what the references are in the plaintext (markup) version of a document, which would be impossible with a hash code.

2. A major component of our use of zotero is outside of bibtex, on mediawiki and now Hugo, for doing quick literature reviews on our lab wiki. All of the above productivity benefits apply in this case. Indeed, the increasing popularity of markdown and the broader JAMstack architecture for the web: https://jamstack.org/ puts all the same pressure on the use of citekeys as is the case in latex / bibtex — people are increasingly writing content in plaintext using markup languages, and could really benefit from a simple way to include references.

+ For example, in our new Hugo-based lab wiki, we write {{< cite "CiteKey99" >}} and it includes a nice APA-formatted reference (and {{< citedreferences >}} at the end of a page lists the full bibliography; we implemented similar templates for mediawiki as well).

So, as is often the case, “theoretical” limitations that would seem to doom a particular approach are actually not so much of a problem in practice (and our choice of the 3-author citekey format over the more popular single-author version is an important practical “trick” that makes this more workable in practice).
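
For concreteness, the formula described above can be sketched roughly as follows. This is an illustration rather than our actual tooling: the function names are made up, and the rule that the first item keeps the bare key while later ones get "a", "b", and so on is an assumption.

```javascript
// Base key: last names of up to the first three authors plus a two-digit year.
function baseKey(lastNames, year) {
  const names = lastNames.slice(0, 3).join('');
  const yy = String(year % 100).padStart(2, '0');
  return names + yy;
}

// Resolve a collision by appending "a", "b", ... (assumed scheme).
// `taken` is a Set of keys already in use.
function uniqueKey(base, taken) {
  if (!taken.has(base)) return base;
  for (let i = 0; i < 26; i++) {
    const candidate = base + String.fromCharCode(97 + i); // 97 is "a"
    if (!taken.has(candidate)) return candidate;
  }
  throw new Error('no free suffix for ' + base);
}
```

With three author names in the key, the suffix loop is rarely entered, which matches the low collision rate reported above.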

Anyway, I just wanted to do everything I can to help push things toward full support of citekeys (and BBT more generally) as this would make our life a lot easier. As it is, we had to create a separate linking database to be able to use our citekeys to access zotero items, and this then has the usual sync problems (which again are not horrible in practice, but occasionally annoying): https://grey.colorado.edu/CompCogNeuro/index.php/WikiCite

Thanks for all your hard work on making this great tool!

- Randy
----
Dr. Randall C. O'Reilly
Professor, Department of Psychology and Neuroscience
University of Colorado Boulder
345 UCB, Boulder, CO 80309-0345
303-492-0054 Fax: 303-492-2967
http://www.colorado.edu/faculty/oreilly


Emiliano Heyns

Jul 24, 2017, 3:07:25 PM
to zotero-dev
On Monday, July 24, 2017 at 8:43:16 PM UTC+2, Randall O'Reilly wrote:

> Anyway, I just wanted to do everything I can to help push things toward full support of citekeys (and BBT more generally) as this would make our life a lot easier.  As it is, we had to create a separate linking database to be able to use our citekeys to access zotero items, and this then has the usual sync problems (which again are not horrible in practice, but occasionally annoying):  https://grey.colorado.edu/CompCogNeuro/index.php/WikiCite


From what I've understood, Z4 needs to be phased out before it can happen, which is planned for later this year.
 