Issues with search_after

Ian Simpson

Jul 23, 2019, 6:22:51 AM
to dev
Hi All

I have a weird problem when using the search_after parameter to retrieve a group of several thousand annotations.

I have used a search query of the following form (in Python):-

r = requests.get(api_url_base+'search?'+'group=<GROUP>&user=<USER>&search_after='+last_record+'&order=desc&sort=created&limit=200&tag=<TAG>', headers=headers)

where the "last_record" entry is the retrieved date-created time stamp of the last annotation in the previous batch.

Retrieving in batches of 200 (which is the single search limit) I iterate through to get the full set, which is 2589 annotations.

When I do this I can never retrieve all of the annotations in the set; the most I can recover is 1819. I have tested this exhaustively and checked that I am ordering and sorting the queries in exactly the same way each time.

Bizarrely, the maximum number of annotations I can recover varies depending on how I set the limit in each batch. So, looping until "complete" I get the following:-

batch_size (max return)

20 (499)
30 (739)
40 (899)
50 (1069)
100 (1519)
150 (1819)
200 (1819) [limit]

I've spent a long time debugging and testing each element of this, but cannot find an explanation for why it is happening. Perhaps I'm misunderstanding how this is supposed to work, but from the blog and the API entry it seems extremely straightforward; it just doesn't work as described. Does anyone know what I'm doing wrong, or whether this reveals a problem with the search function?

Also, is there a better or more efficient way of doing this? It seems incredibly clunky.

Any help, much appreciated.

Best wishes

Ian

Robert Knight

Jul 23, 2019, 7:09:27 AM
to Ian Simpson, dev
Hi Ian,

Would you be able to share the complete script you are using to fetch annotations or a simplified version of it? That may help us to look into the issue more quickly.

Kind Regards,
Rob.

Jack Park

Jul 23, 2019, 10:19:29 AM
to Ian Simpson, dev
I can offer this thought, based on my experience building a platform that creates topic maps from annotations: I found it valuable to add a time delay of a few seconds between search iterations (I started with 10 seconds and am settling at 5). I'm just guessing here, but h might have problems with rapid hit rates on iterative searches. The time delay cleared everything up, at least for my project.

Cheers,
Jack

Patrik Hoyer

Jul 23, 2019, 4:30:47 PM
to dev
Hi all,

I noticed the following while experimenting with the API (by just entering these URLs in the browser):


seems to yield results from the correct date (2019-07-21), but the results begin from the start of the day (00:00:00 onwards). However, stripping off the final "+00:00" gives the desired behavior of starting at the correct date and time. To see this, try:


The documentation at 


explicitly gives an example that includes the "+00:00" at the end. Thus, it would seem that either the API is not behaving correctly, or alternatively the documentation should be fixed?
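
One plausible explanation, assuming the server decodes the query string with standard form-encoding rules: an unescaped "+" in a query string is read as a space, so the timezone offset no longer arrives intact, which might explain why only the date appears to be honoured. The sketch below uses Python's urllib.parse to show the difference. The timestamp value is made up for illustration, and parse_qs only stands in for whatever decoding the server really performs.

```python
from urllib.parse import parse_qs, quote_plus

# Hypothetical timestamp in the same shape as an annotation's "created" field.
timestamp = "2019-07-21T12:34:56.000000+00:00"

# Unencoded: a form-style decoder turns the "+" into a space.
raw_query = "search_after=" + timestamp
decoded_raw = parse_qs(raw_query)["search_after"][0]

# Percent-encoded: "+" is sent as "%2B" and survives decoding unchanged.
safe_query = "search_after=" + quote_plus(timestamp)
decoded_safe = parse_qs(safe_query)["search_after"][0]

print(repr(decoded_raw))
print(repr(decoded_safe))
```

If this is what is happening, either dropping the "+00:00" suffix or percent-encoding the value should give the same result.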

Regards,
Patrik

Jon Udell

Jul 23, 2019, 5:52:10 PM
to dev

Jack Park

Jul 23, 2019, 9:49:21 PM
to dev
Just noticed that a PDF document I annotated shows up in the group as an "Untitled Document". I went back and opened the view again, but did not find a way to edit the title.

New feature suggestion?

Over

Ian Simpson

Jul 24, 2019, 5:26:21 AM
to dev
Many thanks to all for their speedy and excellent suggestions. I've tried them out, and Patrik and Jon are correct: it is the format of the timestamp. You can fix it either by removing the +00:00 or, more simply, as Jon suggested, by correctly encoding the URL.

For reference I used the quote_plus function from Python's urllib.parse module to do this:-

import urllib.parse as up

r = requests.get(api_url_base+'search?'+'group=<GROUP>&user=<USER>&search_after='+up.quote_plus(last_record)+'&order=desc&sort=created&limit='+str(batch_size)+'&tag=<TAG>', headers=headers)
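
For anyone reusing this, here is a minimal sketch of the whole loop. It is not an official client: build_search_url and fetch_all are hypothetical helper names, https://example.org/api/ is a placeholder base URL, and get_rows stands for whatever function performs the request and returns the list of annotations (assumed here to be dicts carrying a "created" field). urlencode applies quote_plus to each value by default, so the "+" in the timestamp is escaped automatically.

```python
from urllib.parse import urlencode


def build_search_url(api_url_base, search_after=None, limit=200, **filters):
    """Build a search URL with every parameter percent-encoded."""
    params = dict(filters, sort="created", order="desc", limit=limit)
    if search_after is not None:
        params["search_after"] = search_after
    return api_url_base + "search?" + urlencode(params)


def fetch_all(get_rows, limit=200):
    """Collect all rows, feeding the last "created" value back as search_after.

    get_rows(search_after, limit) is a hypothetical callable returning a list
    of annotation dicts, e.g. a thin wrapper around requests.get(...).
    """
    rows, search_after = [], None
    while True:
        batch = get_rows(search_after, limit)
        if not batch:  # an empty batch means there is nothing left to fetch
            break
        rows.extend(batch)
        search_after = batch[-1]["created"]
    return rows


# Placeholder values; GROUP_ID and TAG stand in for real ones.
url = build_search_url(
    "https://example.org/api/",
    search_after="2019-07-21T12:34:56.000000+00:00",
    limit=200,
    group="GROUP_ID",
    tag="TAG",
)
print(url)
```

Passing the same values as a params dict to requests.get should have the same effect, since requests encodes query parameters itself.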

It may be worth making a note in the API doc to clarify this.

Best wishes

Ian

Katelyn Lemay

Jul 24, 2019, 10:50:40 AM
to dev, sup...@hypothes.is
Hi Jack! We have an existing issue for that: https://github.com/hypothesis/product-backlog/issues/127. Something we hope to do in the future!

-Katelyn

Jon Udell

Jul 24, 2019, 11:04:11 AM
to dev
> It may be worth making a note in the API doc to clarify this.

Patrik Hoyer

Jul 25, 2019, 1:17:20 AM
to dev
Hi Jon,

Of course! Apologies for making such a simple mistake. (It was fortunate though that this was precisely the problem that Ian was also having.) Anyway, might be good to mention something about this in the docs, for the benefit of others that might bump into this...

Thanks again!

Patrik