Zotero.org sync engine prototype for nodeJS?

panyasan

Mar 24, 2021, 8:44:56 AM
to zotero-dev
I want to create a local copy and mashup of my library and my Zotero groups for faster searching and more fine-grained queries than are currently possible with the zotero.org server (this will also avoid queries to zotero.org). For this, I want to use a local Couchbase database and a Node.js script that syncs just as the Zotero client does, saving all new and updated items in one big database.

https://www.zotero.org/support/dev/web_api/v3/syncing#full-library_syncing explains how the sync process works. Before I go and implement it from scratch in Node, I wonder if someone has already worked on and published a prototype of a Node.js-based sync engine which can then be connected to a storage engine, or if people have written sync engines for their own apps which could get me started. I will definitely also look at the Zotero client sources, but something more Node-ish would be useful.

Thanks for any suggestions you might have.

Christian

Emiliano Heyns

Mar 24, 2021, 11:15:17 AM
to zotero-dev
I created a Node.js solution for one-way Zotero-to-Postgres sync for https://github.com/kaplanrehab (zotero-sync-pg), but it was a paid gig, and it's not open source, so I'm not at liberty to share. You could see if he has an interest in sharing. It wasn't excessively complicated (under 1 kloc), but I didn't find it to be trivial either.

Emiliano Heyns

Mar 24, 2021, 11:27:12 AM
to zotero-dev
It's been a long while since I worked on this, but ISTR that Zotero can tell you mid-sync that the remote has changed and that you should start again, so you'll want robust transaction support. I don't know what Couchbase has on offer here.

Christian Boulanger

Mar 24, 2021, 4:28:36 PM
to zotero-dev
Thank you Emiliano, I will ask kaplanrehab if he is willing to set the code free. In my case, it's really a read-only copy of Zotero-data that doesn't change very often, so there might be less of a problem with transactions. But I'll see...

Christian Boulanger

Mar 26, 2021, 3:38:40 AM
to zotero-dev
Unfortunately, your client's firm no longer exists (he has joined another company), and there is no public contact info. I know this is a lot to ask, but would you be willing to write to him and ask if he would allow you to open-source the main sync engine code? I think this would be incredibly useful for the community, as there is currently no library available that allows one to back up (or otherwise redundantly store) Zotero library data. Of course, I totally understand if that's not possible.

Emiliano Heyns wrote on Wednesday, March 24, 2021 at 16:27:12 UTC+1:

Emiliano Heyns

Mar 26, 2021, 4:46:15 AM
to zotero-dev
Done.

Emiliano Heyns

Mar 26, 2021, 12:37:50 PM
to zotero-dev

Christian Boulanger

Mar 26, 2021, 12:47:26 PM
to zotero-dev
This is fantastic. Thank you very much!

Emiliano Heyns

Mar 26, 2021, 4:33:40 PM
to zotero-dev
A fair bit of that code is creating specialized views that Richard needed for his analysis. The actual sync isn't too hard now that I look at it again, but the note from earlier stands -- you can be told mid-sync you have to abort and retry, so transactions are going to be required. I stuff the Zotero objects into Pg JSON columns because I didn't see much value in converting the JSON objects you get from the API into a bunch of connected tables. Pg allows you to query the items as they are, as objects. But I see that the sqlite JSON1 extension would allow you to do the same, without the hassle of setting up Pg. I might actually take a stab at this for an Overleaf project I've been pondering.

Christian Boulanger

Mar 26, 2021, 5:31:51 PM
to zotero-dev
Couchbase does have transaction support so I need to look into that. Of course, it would be ideal to re-arrange the sync code to be backend-agnostic, so that people can plug in their particular backend by providing a connector that implements a simple interface.

Dan Stillman

Mar 26, 2021, 6:23:49 PM
to zoter...@googlegroups.com
On 3/26/21 4:33 PM, Emiliano Heyns wrote:
> A fair bit of that code is creating specialized views that Richard
> needed for his analysis. The actual sync isn't too hard now that I
> look at it again, but the note from earlier stands -- you can be told
> mid-sync you have to abort and retry, so transactions are going to be
> required.

Not really transactions. For a one-way pull, if the remote version
changes in the middle of a multi-request process, you'd just want to
restart the sync, probably after a brief pause to allow a remote process
to finish. There's no inherent problem if you don't, as long as you're
only storing the earliest library version you received and passing that
for your 'since' value (so that you don't miss updates), but you might
end up with an inconsistent view of the library that wouldn't be
resolved until the next sync. It could also result in a save failure if,
say, an item references a collection that was added after you started
the sync.

You don't need actual transactions in terms of your database store. A
sync process can involve dozens or hundreds of requests, and there's no
reason to roll everything back if there's a later problem.
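
A minimal sketch of the restart approach for a one-way pull, in TypeScript. The `Remote` interface, `pullItems`, and the `Map`-based store are hypothetical names introduced for illustration; `Remote` stands in for whatever HTTP client wraps the `?since=...&format=versions` and `?itemKey=...` requests and reads the `Last-Modified-Version` response header. Giving up after a few attempts and keeping the old `since` is a conservative choice made for this sketch, not something the API prescribes.

```typescript
// All names here are hypothetical; they are not part of any published library.
interface ZoteroObject { key: string; version: number; data?: Record<string, unknown> }

interface Remote {
  // Wraps GET <library>/items?since=<since>&format=versions
  fetchVersions(since: number): Promise<{ libraryVersion: number; versions: Record<string, number> }>;
  // Wraps GET <library>/items?itemKey=<k1,k2,...> (up to 50 keys per request)
  fetchItems(keys: string[]): Promise<{ libraryVersion: number; items: ZoteroObject[] }>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the library version to pass as `since` on the next sync.
async function pullItems(
  remote: Remote,
  store: Map<string, ZoteroObject>,
  since: number,
  maxAttempts = 3,
  pauseMs = 1000,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { libraryVersion, versions } = await remote.fetchVersions(since);
    const keys = Object.keys(versions);
    let remoteChanged = false;

    for (let i = 0; i < keys.length; i += 50) {
      const batch = await remote.fetchItems(keys.slice(i, i + 50));
      // Upserts by key are idempotent, so keeping this batch is harmless even if we restart.
      for (const item of batch.items) store.set(item.key, item);
      if (batch.libraryVersion !== libraryVersion) {
        remoteChanged = true; // the library was modified mid-sync
        break;
      }
    }

    if (!remoteChanged) return libraryVersion;
    if (attempt < maxAttempts) await sleep(pauseMs); // brief pause, then start over
  }
  // Gave up: keep the old `since` so that no update is missed next time.
  return since;
}
```

Nothing is rolled back on a restart: objects already written stay in the store, and the next attempt simply overwrites them with the same or newer data.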

Emiliano Heyns

Mar 27, 2021, 11:02:14 AM
to zotero-dev
Wouldn't item creations in the source lead to duplicates being created on the sink if you restart a sync? Are the item keys effectively GUIDs so that I could just use upsert?

Emiliano Heyns

Mar 27, 2021, 1:24:06 PM
to zotero-dev
On Friday, March 26, 2021 at 10:31:51 PM UTC+1 Christian Boulanger wrote:
> Couchbase does have transaction support so I need to look into that. Of course, it would be ideal to re-arrange the sync code to be backend-agnostic, so that people can plug in their particular backend by providing a connector that implements a simple interface.

That's the kind of thing that tends to spin out into being the bulk of the project. Zotero sync really isn't very complex, certainly given that I got the transaction requirement wrong, and then suddenly you have a situation where a document store makes much more sense than a SQL database. In fact, on re-reading the sync docs, a simple key-value document store should be drop-dead simple for one-way sync.

Christian Boulanger

Mar 27, 2021, 2:18:37 PM
to zotero-dev
On Saturday, March 27, 2021 at 6:24:06 PM UTC+1 Emiliano Heyns wrote:
> In fact, on re-reading the sync docs, a simple key-value document store should be drop-dead simple for one-way sync.

Yes, in fact, a "connector" interface would probably only need to provide "upsert(key)" and "delete(key)" for a given library. Once the data is in the document store, one can do much more complex queries than are possible using Zotero's Web API, much faster and without any rate limits. This will also take load off zotero.org. Probably not a problem right now, but I would expect that Zotero will increasingly be used not only as a personal or group reference database, but also as a backend for bibliographic data analysis. Off topic: I have often thought of the (untapped) wealth of data that has collectively been amassed on zotero.org...
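
A rough sketch of what such a connector could look like in TypeScript. `StoreAdapter`, `MemoryStore`, and the `libraryId` argument are hypothetical names chosen for illustration, and `upsert` is assumed here to take the whole object rather than just its key so that the adapter has something to store.

```typescript
// Hypothetical connector interface; the names are illustrative only.
interface ZoteroObject { key: string; version: number; data?: Record<string, unknown> }

interface StoreAdapter {
  upsert(libraryId: string, object: ZoteroObject): Promise<void>;
  delete(libraryId: string, key: string): Promise<void>;
}

// Trivial in-memory backend, useful for tests; a Couchbase or file-based
// adapter would implement the same two methods.
class MemoryStore implements StoreAdapter {
  private readonly libraries = new Map<string, Map<string, ZoteroObject>>();

  async upsert(libraryId: string, object: ZoteroObject): Promise<void> {
    const library = this.libraries.get(libraryId) ?? new Map<string, ZoteroObject>();
    library.set(object.key, object); // insert or overwrite by key
    this.libraries.set(libraryId, library);
  }

  async delete(libraryId: string, key: string): Promise<void> {
    this.libraries.get(libraryId)?.delete(key);
  }

  // Read helper for this sketch; not part of the adapter interface.
  get(libraryId: string, key: string): ZoteroObject | undefined {
    return this.libraries.get(libraryId)?.get(key);
  }
}
```

The methods return promises because most real backends are asynchronous, even though the in-memory version does not need to be.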


Dan Stillman

Mar 27, 2021, 3:08:47 PM
to zoter...@googlegroups.com
On 3/27/21 11:02 AM, Emiliano Heyns wrote:
> Wouldn't item creations in the source lead to duplicates being created
> on the sink if you restart a sync? Are the items keys effectively
> GUIDs so that I could just use upsert?

The object keys don't change — that's why they're keys. If you just
repeat a fetch of items, the items you downloaded before haven't
changed, so nothing changes in your database.

If you're following the syncing instructions [1], you're not even
looking at the data for the previously downloaded objects — you're
first just getting a map of object keys to object versions and skipping
those that are up to date based on the key and version in your database,
without even fetching the previously downloaded objects. So if new items
were added and you restarted the sync with the same `since=`, there
would just be a few items in the `?format=versions` response that were
either new or with updated versions, and those would be the only ones
you would fetch and process.

[1] https://www.zotero.org/support/dev/web_api/v3/syncing#sync_library_data
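
The key-and-version comparison described above can be sketched as follows. `keysToFetch` and `batches` are hypothetical helper names; `remote` is assumed to be the parsed JSON body of a `?since=...&format=versions` response, and `local` a map of the keys and versions already in the local store.

```typescript
// Returns only the keys whose stored version differs from the remote one
// (or that are not stored at all); everything else is skipped without a fetch.
function keysToFetch(remote: Record<string, number>, local: Map<string, number>): string[] {
  return Object.keys(remote).filter((key) => local.get(key) !== remote[key]);
}

// Item requests accept up to 50 keys in the itemKey parameter, so fetch in batches.
function batches<T>(values: T[], size = 50): T[][] {
  const result: T[][] = [];
  for (let i = 0; i < values.length; i += size) {
    result.push(values.slice(i, i + size));
  }
  return result;
}
```

Restarting a sync with the same `since` value just runs this comparison again, so objects that were already downloaded drop out of the list.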

Emiliano Heyns

Mar 27, 2021, 4:05:10 PM
to zotero-dev
Yeah, one-way sync is easier than I had figured. The reason why I was wondering about the keys is that they can be generated locally before first sync (right?), and they seemed a bit short to be GUIDs.

Dan Stillman

Mar 27, 2021, 4:21:35 PM
to zoter...@googlegroups.com
On 3/27/21 4:05 PM, Emiliano Heyns wrote:
> Yeah, one-way sync is easier than I had figured. The reason why I was
> wondering about the keys is that they can be generated locally before
> first sync (right?), and they seemed a bit short to be GUIDs.

They're unique to a given library and object type. They can be generated
locally, but you wouldn't generate a key that matched anything you
already had locally, so we're talking about the likelihood of generating
the exact same key that was generated between syncs on another device
synced to that library, and the odds of that are [math math math] very low.

And object versioning would ensure that you got a conflict in that case
(local version 0 does not match remote version n > 0), not an overwrite.

Emiliano Heyns

Mar 27, 2021, 4:22:52 PM
to zotero-dev
Right, makes total sense.

Christian Boulanger

Mar 29, 2021, 6:44:27 AM
to zotero-dev
It is true that one-way sync is dead simple with a key-value store. Here's what is good enough for my purposes now: https://gist.github.com/cboulanger/b3cdf02339e0e2087ac445b81126029c

Thank you for posting your code in any case!

Christian Boulanger

Mar 31, 2021, 3:35:10 AM
to zotero-dev
Dan, one question about the sync process:

in https://www.zotero.org/support/dev/web_api/v3/syncing#sync_library_data it states that "items" and "items/top" have to be retrieved separately. Why is that? Aren't the top items included in the library items?

Christian Boulanger

Mar 31, 2021, 9:55:13 AM
to zotero-dev

Here's a more fully featured function that backs up all libraries accessible to the owner of the API key to a local Couchbase server, using an "adapter" approach (so that the data could be saved in other key-value stores via an adapter class).

Dan Stillman

Mar 31, 2021, 2:03:09 PM
to zoter...@googlegroups.com
On 3/31/21 3:35 AM, Christian Boulanger wrote:
> it states that "items" and "items/top" have to be retrieved
> separately. Why is that? Aren't the top items included in the library
> items?

That's assuming 1) a relational database where child items have a
foreign-key dependency on their parent items and/or 2) a GUI-based
program that wants to show sync progress immediately by showing parent
items first.

(Now that there's a potential third level of items for annotations,
which are children of attachments, (1) alone isn't guaranteed to avoid
FK failures in the non-top request, though it reduces them.)
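
For a store that does enforce parent references, one way to cope with any nesting depth is to order each downloaded batch so that parents come before their children. The sketch below assumes each item carries its parent's key in `data.parentItem`, as in the API's item JSON; `parentsFirst` and `ItemLike` are hypothetical names, and the code assumes there are no parent cycles.

```typescript
interface ItemLike { key: string; data: { parentItem?: string } }

// Orders items so that every parent precedes its children, for any nesting depth.
// Items whose parent is not in the batch are treated as top-level.
function parentsFirst<T extends ItemLike>(items: T[]): T[] {
  const byKey = new Map<string, T>();
  for (const item of items) byKey.set(item.key, item);

  const depthCache = new Map<string, number>();
  const depth = (item: T): number => {
    const cached = depthCache.get(item.key);
    if (cached !== undefined) return cached;
    const parent = item.data.parentItem ? byKey.get(item.data.parentItem) : undefined;
    const value = parent ? depth(parent) + 1 : 0;
    depthCache.set(item.key, value);
    return value;
  };

  // A parent always has a smaller depth than its child, so sorting by depth suffices.
  return items.slice().sort((a, b) => depth(a) - depth(b));
}
```

A key-value store without foreign keys does not need this step and can save items in whatever order they arrive.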

Emiliano Heyns

Apr 12, 2021, 11:40:10 AM
to zotero-dev
What external object does "sandbox.getZoteroApi" refer to here? Is that an existing library that handles Zotero API access?

Christian Boulanger

Apr 12, 2021, 12:46:20 PM
to zotero-dev
Yes, this one: github.com/tnajdek/zotero-api-client, which is wrapped by a few higher-level constructs in https://gist.github.com/cboulanger/e3719a774af761048100aa2271521fb2#file-zotero-api-js and exposed by the sandbox object as a preconfigured instance.

Christian Boulanger

Apr 12, 2021, 12:49:20 PM
to zotero-dev
I am working on a library that models Zotero entities (Library, Item, Collection, Attachment, etc.) as classes based on @tnajdek's library, but it is so rudimentary that it is not publishable yet.

Emiliano Heyns

Apr 12, 2021, 12:59:12 PM
to zotero-dev
Ah I'd missed the other files in the gist. And that the backup script doesn't save collections. I need those too.

Are you planning to publish on npm? Then I might wait for it, depending on how far off it is. Otherwise I'd just bung something together in typescript. 


Christian Boulanger

Apr 12, 2021, 1:12:15 PM
to zoter...@googlegroups.com
Unfortunately, I won't be able to publish anything non-embarrassing for quite some time; my scripts are just ad hoc stuff that gets the job done. But I do think it would be nice to have a high-level library that can do things like synchronizing, and maybe one day I'll get to it, unless of course someone else beats me to it, which would be very nice :-)


Emiliano Heyns

Apr 16, 2021, 5:58:50 AM
to zotero-dev

Christian Boulanger

Apr 16, 2021, 9:59:21 AM
to zotero-dev
Very cool, thank you very much!!

Emiliano Heyns

Apr 16, 2021, 10:47:02 AM
to zotero-dev
I needed one anyhow, and with the new insights I gained here on one-way sync, it's really almost trivial. There's a bundled store that dumps to JSON files.

Christian Boulanger

Apr 22, 2021, 5:57:00 AM
to zotero-dev
This thread has already produced three new GitHub projects!


Thanks to Emiliano for designing a flexible sync engine that can sync Zotero data to anything that has a Node API (or needs one, like Bookends). There could be a sync backend for other reference managers which allow scripting in some form or have a Web API.