ItemAnnotation Understanding Fields


Ryan W. West

May 26, 2023, 4:30:40 AM
to zotero-dev
Hi, I'm working on a plugin that would provide a two-way sync of PDF annotations between Zotero (inside the sqlite db itemAnnotations table) and KOReader PDF annotations (via their pdffile.sdr/metadata.pdf.lua file) (more context). As part of this, I need to directly read and write rows from the itemAnnotations table and am thus trying to understand all of its contents. Most of it is self-explanatory but I had several questions:

- How exactly is 'rects' within 'position' calculated? I observe that the 'rects' array contains subarrays that each seem to map to one 'line' of highlight as rendered in a PDF (so an annotation spanning two lines would have two distinct rect subarrays). The contents of each subarray appear to be [x, y, x2, y2] to form a full rectangle and I believe x=0 y=0 corresponds to the bottom-left corner of each page. But is this always correct, and if so which part of the starting and ending character is selected as the first or second point (i.e. top of a t or bottom of a p)? I'd like to generate these assuming I only know the page and text to highlight (and different set of point coordinates from KOReader) - would the reader be okay with inexact (but nearby) rect values if I can't get the formula quite right?

- How does sortIndex work?
- Does the parentItemID correspond to an attachment PDF file or the parent Item that the PDF associates to? The name suggests the latter, but I ask because a parent item can have multiple attachments so I can't see how only having a unique SQL key field to the parent would distinguish multiple PDFs.

Martynas Bagdonas

May 26, 2023, 6:45:17 AM
to zotero-dev
- Yes, 'rects' is an array [[x1, y1, x2, y2], ...] where x1 < x2 and y1 < y2. The values are in the PDF page coordinate system, which originates from the bottom-left corner. While annotating, the Zotero PDF reader aims to keep the number of rects as small as possible, so it often equals the number of lines. However, for poorly OCRed PDFs or diagonal text, the number of rects may even equal the number of characters. When importing annotations, the original rects from the PDF annotation are preserved, so inexact values likely won't be an issue.
- 'sortIndex' takes the format of (page index)|(closest text offset)|(Y coordinate from the top), allowing for the sorting of annotations based on their page index, position in the text, or Y coordinate when the page contains no text.
- 'parentItemID' refers to the parentID of an annotation item, which is always an attachment.
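
A minimal Python sketch of the 'rects' convention described above, for anyone generating positions from another reader's coordinates. It only enforces x1 < x2 and y1 < y2 and serializes the result. The helper names are hypothetical, and the `pageIndex` key alongside `rects` is an assumption that should be checked against an existing 'position' value in your own library.

```python
import json


def normalize_rect(rect):
    """Return [x1, y1, x2, y2] with x1 <= x2 and y1 <= y2.

    Coordinates are PDF page coordinates (origin at the bottom-left corner).
    """
    xa, ya, xb, yb = rect
    return [min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb)]


def build_position(page_index, rects):
    """Serialize a position object as a JSON string.

    Assumption: the position JSON holds a 0-based "pageIndex" next to "rects".
    """
    return json.dumps(
        {"pageIndex": page_index, "rects": [normalize_rect(r) for r in rects]}
    )


# Two corner points given in the "wrong" order are swapped into place.
print(build_position(2, [[100, 700, 50, 690]]))
```

One rect per highlighted line is enough in the common case; nothing in the thread suggests the rects must match character boundaries exactly.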

Ryan W. West

May 26, 2023, 12:18:53 PM
to zotero-dev
Thank you, that's very helpful. I think I understand the others now, but I'm not sure about generating sortIndex "(page index)|(closest text offset)|(Y coordinate from the top)" and have a few more questions:

- (Y coordinate from the top). This treats the top as 0, so you can derive it from `rects[0][3]` by subtracting that from the page height and always rounding down to nearest integer (decimal truncation). My question is, is this always correct or would it be calculated differently if highlights were imported from PDF embeddings (and thus the rects values aren't 'natively' calculated by Zotero)?

- For (closest text offset), is this taken from a found text layer of the PDF and the integer value corresponds to the number of characters so far (so basically zero-indexing a string array)? This seems to line up with my experiments but I'm not sure how different text layers on the same page (e.g. header, column1, column2, footer, watermark) would be concatenated together to all have a unique and predictable offset. (From the other program I will know the starting and ending coordinates of the highlight to 'import' and the actual text therein, so I'm trying to figure out how I could get this offset from that data. Maybe there's a Zotero function I can call that gets the offset from a x y coordinate 'tap').

- Do you by chance know how a plugin can look up PDF page height? Perhaps there is documentation somewhere I need to familiarize myself with.

Ryan W. West

May 26, 2023, 3:38:54 PM
to zotero-dev
I'd also like to ask what the best way to read and write to the itemAnnotations table (insert, modify, and delete row) would be for Zotero 7 (Beta - hopefully it won't change a ton in this area). It looks like the Web API and Javascript API are possibly options (https://www.zotero.org/support/dev/web_api/v3/write_requests) and read somewhere that directly modifying the sqlite database could cause corruption and is discouraged. If the above options don't work then maybe I'd need to write a plugin. Was wondering if any option would be preferable or easiest, and if there is any documentation or command to specifically use one to read/write this table. 

Martynas Bagdonas

May 30, 2023, 4:47:05 PM
to zotero-dev
- Yes, the Y-coordinate from the top is precisely as you described and always the same.
- The Zotero PDF reader depends on the original character order in the PDF file. This is different from some other viewers which reflow the text by separating it into columns, paragraphs, lines, and words. To determine the closest offset, the rectangle of each character needs to be known.
- It is possible to obtain the page height from a loaded PDF tab, but not otherwise.
- Unfortunately, I do not have an answer to your question regarding database reading and writing.
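
The sortIndex pieces discussed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the reader's actual code: the zero-padded field widths (5, 6, 5 digits) are an assumption to verify against existing rows, and `closest_text_offset` uses a simple nearest-character-center rule as a stand-in, since the thread only says that each character's rectangle must be known. All function names are hypothetical.

```python
import math


def y_from_top(page_height, rects):
    """Y coordinate measured from the top of the page, truncated to an integer."""
    return math.floor(page_height - rects[0][3])


def closest_text_offset(char_rects, point):
    """Index of the character whose center is nearest to `point`.

    `char_rects` must be in the original character order of the PDF file.
    Assumption: nearest-center distance is only a stand-in for whatever
    rule the Zotero reader applies.
    """
    px, py = point

    def squared_distance(rect):
        cx = (rect[0] + rect[2]) / 2
        cy = (rect[1] + rect[3]) / 2
        return (cx - px) ** 2 + (cy - py) ** 2

    return min(range(len(char_rects)), key=lambda i: squared_distance(char_rects[i]))


def format_sort_index(page_index, offset, y_top):
    """Join the three fields with '|'. The padding widths are an assumption."""
    return f"{page_index:05d}|{offset:06d}|{y_top:05d}"


rects = [[50, 690.5, 100, 700.25]]
print(format_sort_index(3, 1542, y_from_top(792, rects)))
```

The page height (792 here, a US Letter page in points) has to come from somewhere else, such as a loaded PDF tab or a PDF library reading the file directly.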

Ryan W. West

May 30, 2023, 5:00:21 PM
to zotero-dev
Thanks for all the help. I'm actively working on something that can convert annotations from Zotero to KOReader and vice versa. The biggest issues, I think, will be calculating pageLabel and the character offset. These fields seem possibly less important than the others, but I'll still try to find solutions. Maybe a plugin would be required to access the right variable in the character order case. I'll see if I can calculate the height from the PDF file itself with some other library if I don't find it somewhere in the web API's attachments section.

Regarding your final point, I think this will help me: https://github.com/urschrei/pyzotero/issues/154.

Dan Stillman

May 30, 2023, 5:13:56 PM
to zoter...@googlegroups.com
On 5/26/23 3:38 PM, Ryan W. West wrote:
> I'd also like to ask what the best way to read and write to the
> itemAnnotations table (insert, modify, and delete row) would be for
> Zotero 7 (Beta - hopefully it won't change a ton in this area). It
> looks like the Web API and Javascript API are possibly options
> (https://www.zotero.org/support/dev/web_api/v3/write_requests) and
> read somewhere that directly modifying the sqlite database could cause
> corruption and is discouraged. If the above options don't work then
> maybe I'd need to write a plugin. Was wondering if any option would be
> preferable or easiest, and if there is any documentation or command to
> specifically use one to read/write this table.

Yes, you definitely wouldn't use SQL — all updates need to go through
either Zotero application code or the web API.

Whether you use the web API or the JS API really depends on your
implementation. We'll also soon be introducing a local HTTP API in the
desktop app that mirrors the web API, so if you're using an external
local process, you could write code now that uses the web API and switch
that to the local API later if you wanted offline functionality.
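
For an external process going through the web API, a write could be prepared roughly as below. This Python sketch only builds the request and does not send it. The `annotation*` field names and the `parentItem` key are assumptions based on the Zotero item schema and should be checked against the write requests documentation linked earlier in the thread; `build_annotation_request` and its parameters are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.zotero.org"


def build_annotation_request(user_id, api_key, parent_key, text, position, sort_index):
    """Build (but do not send) a POST request creating one highlight annotation.

    Assumption: field names follow the Zotero item schema for annotation items.
    `parent_key` is the key of the PDF attachment, not of the top-level item.
    """
    item = {
        "itemType": "annotation",
        "parentItem": parent_key,
        "annotationType": "highlight",
        "annotationText": text,
        "annotationComment": "",
        "annotationColor": "#ffd400",
        "annotationPageLabel": "",
        "annotationSortIndex": sort_index,
        # The position is sent as a JSON string, not as a nested object.
        "annotationPosition": json.dumps(position),
    }
    return urllib.request.Request(
        f"{API_BASE}/users/{user_id}/items",
        data=json.dumps([item]).encode("utf-8"),
        method="POST",
        headers={"Zotero-API-Key": api_key, "Content-Type": "application/json"},
    )


req = build_annotation_request(
    "123", "secret", "ABCD2345", "highlighted text",
    {"pageIndex": 0, "rects": [[50, 690, 100, 700]]}, "00000|000010|00091",
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`. If the local HTTP API mirrors the web API as described, switching later should mostly mean changing `API_BASE`.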