Incrementally retrieve data matching search query on a daily schedule without missing new articles


Daniel Ecer

Oct 3, 2022, 1:32:43 PM
to Europe PMC Developer Forum
Hi,

We are incrementally retrieving data from the API for a search query by adding `FIRST_IDATE` to the query.
This has worked well so far.
But there has been at least one occasion where articles that appear to be indexed on the day have not been included.
Is `FIRST_IDATE` the correct filter to be used in this case?
Is it possible that data from the whole response is omitted when paging via the cursor?

The DOIs in question are:
10.1101/2021.05.08.21256896
10.1101/2021.05.16.21257255
10.1101/2021.05.16.21257283
10.1101/2021.05.17.21257241

Those appear to be indexed on `2021-05-19` (i.e. they are now included when the `FIRST_IDATE` includes that date).

I only know they were missing because we had them referenced elsewhere.

At the time the query was:

(FIRST_IDATE:[2021-05-19 TO 2021-05-19]) ((PUBLISHER:"bioRxiv" OR PUBLISHER:"medRxiv" OR PUBLISHER:"Research Square" OR PUBLISHER:"arXiv" OR PUBLISHER:"SSRN" OR PUBLISHER:"PsyArXiv" OR PUBLISHER:"OSF Preprints" OR PUBLISHER:"Lancet Preprints" OR PUBLISHER:"Authorea Preprints" OR PUBLISHER:"ChemRxiv"))
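For readers running a similar daily job, the query above can be assembled from a date and a publisher list. This is only a minimal Python sketch: `build_daily_query` and `PREPRINT_PUBLISHERS` are hypothetical names introduced for illustration, and only the query syntax itself comes from the post.

```python
from datetime import date
from typing import Iterable

# Publisher names copied from the query in the post.
PREPRINT_PUBLISHERS = [
    "bioRxiv", "medRxiv", "Research Square", "arXiv", "SSRN",
    "PsyArXiv", "OSF Preprints", "Lancet Preprints",
    "Authorea Preprints", "ChemRxiv",
]


def build_daily_query(index_date: date, publishers: Iterable[str]) -> str:
    """Build a query for preprints first indexed on a single day."""
    day = index_date.isoformat()  # e.g. "2021-05-19"
    publisher_clause = " OR ".join(f'PUBLISHER:"{name}"' for name in publishers)
    return f"(FIRST_IDATE:[{day} TO {day}]) (({publisher_clause}))"


if __name__ == "__main__":
    print(build_daily_query(date(2021, 5, 19), PREPRINT_PUBLISHERS))
```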

The result produced 1202 results across two requests made on 2022-05-11.

How do I ensure that all of the data is retrieved?

Thank you






Mohamed Selim

Oct 4, 2022, 5:30:30 AM
to Europe PMC Developer Forum, d.e...@elifesciences.org
Hi,
Thank you for your message. I think this incident was caused by a Solr node being out of sync. You are using the correct filter. However, it is expected that the same query will sometimes return fewer results when it is run again later. This only happens for preprints for which we receive new versions: the old version is then no longer retrievable via search, only through the article page of the new version.
Example: https://europepmc.org/article/PPR/PPR486813
This process is in place to make sure search doesn't return duplicate results that are just different versions of the same preprint.
I hope that clarifies everything. Please let me know if you have any questions.
Kind regards,
Mohamed

Daniel Ecer

Oct 4, 2022, 8:04:28 AM
to Europe PMC Developer Forum, mse...@ebi.ac.uk, Daniel Ecer
Hi Mohamed,

Thank you for clarifying that.

The issue for me is that I often don't know what articles are missing.
Is there a filter for the updated date, and would that maybe be a "safer" option?
Receiving updates would be even better and I don't mind duplicates.

Thank you
Daniel

Mohamed Selim

Oct 4, 2022, 8:47:15 AM
to Europe PMC Developer Forum, d.e...@elifesciences.org, Mohamed Selim
Hi Daniel,
I believe you are using the correct filter. If any change is found, it is due to Solr nodes being out of sync; they are eventually consistent, especially if these documents are being reindexed at that time (we reindex thousands of preprints every day). We do have another field for the update date, which you can use in the same way as FIRST_IDATE:

(FIRST_IDATE:[2021-05-19 TO 2021-05-19]) (UPDATE_DATE:[2021-05-19 TO 2022-10-04]) ((PUBLISHER:"bioRxiv" OR PUBLISHER:"medRxiv" OR PUBLISHER:"Research Square" OR PUBLISHER:"arXiv" OR PUBLISHER:"SSRN" OR PUBLISHER:"PsyArXiv" OR PUBLISHER:"OSF Preprints" OR PUBLISHER:"Lancet Preprints" OR PUBLISHER:"Authorea Preprints" OR PUBLISHER:"ChemRxiv"))

Does this help?
Thanks,
Mohamed

Daniel Ecer

Oct 7, 2022, 12:06:15 PM
to Europe PMC Developer Forum, mse...@ebi.ac.uk, Daniel Ecer
Hi Mohamed,

Thank you again for your response.

If I understand it correctly, if a solr node is out of sync, there isn't a way of telling that from the response?
I'm assuming the "hitCount" would also be lower in that case (we didn't record it in the past but will in the future).
My main issue is that I don't know when I am missing data (without running the query again sometime in the future).
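Recording `hitCount` alongside the retrieved records could look like the minimal Python sketch below. It makes several assumptions that are not confirmed in this thread: the endpoint URL, the `pageSize` and `cursorMark` parameters, and the `nextCursorMark` and `resultList.result` response fields reflect one reading of the public REST API, and `fetch_page`, `fetch_all` and `is_complete` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, List, Optional, Tuple

# Assumed endpoint of the Europe PMC REST search service.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def fetch_page(query: str, cursor_mark: str, page_size: int = 1000) -> dict:
    """Fetch one page of search results as parsed JSON (parameters assumed)."""
    params = urllib.parse.urlencode({
        "query": query,
        "format": "json",
        "pageSize": page_size,
        "cursorMark": cursor_mark,
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}") as response:
        return json.load(response)


def fetch_all(
    query: str, fetch: Callable[[str, str], dict] = fetch_page
) -> Tuple[Optional[int], List[dict]]:
    """Follow the cursor until a page is empty or the cursor stops advancing."""
    cursor_mark = "*"  # assumed start value for cursor-based paging
    hit_count = None
    results: List[dict] = []
    while True:
        page = fetch(query, cursor_mark)
        if hit_count is None:
            hit_count = page.get("hitCount")  # reported total, from the first page
        page_results = page.get("resultList", {}).get("result", [])
        results.extend(page_results)
        next_cursor = page.get("nextCursorMark")
        if not page_results or not next_cursor or next_cursor == cursor_mark:
            break
        cursor_mark = next_cursor
    return hit_count, results


def is_complete(hit_count: Optional[int], results: List[dict]) -> bool:
    """True when the number of retrieved records matches the reported total."""
    return hit_count is not None and len(results) == hit_count
```

A check like `is_complete` would only catch records lost while paging. If the index itself is out of sync and `hitCount` is lower as well, the two numbers would still agree, so storing `hitCount` per run and comparing it with a later re-run remains the more useful signal.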

I think I didn't initially understand your comment regarding revised preprints correctly. I understand it better now, and it opened up new questions for another day.
It probably didn't affect the issue originally observed, but it did lead us to think about updates.

We played a bit with UPDATE_DATE and ran some queries on 5th October...

We selected a DOI indexed the previous day, which was included as expected in the following query:
"(DOI:10.31234/osf.io/xdbf9) (FIRST_IDATE:[2022-10-04 TO 2022-10-04]) (SRC:PPR)"

But it wasn't included when using UPDATE_DATE instead:
"(DOI:10.31234/osf.io/xdbf9) (UPDATE_DATE:[2022-10-04 TO 2022-10-04]) (SRC:PPR)"

It was, however, included when using the 5th October (it no longer is):
"(DOI:10.31234/osf.io/xdbf9) (UPDATE_DATE:[2022-10-05 TO 2022-10-05]) (SRC:PPR)"

That suggests there was some update on the 5th and the filter is only looking at the latest update.
But if we always get data only until yesterday, we may not retrieve the article, if it keeps getting updates.

That is why we are considering combining it with FIRST_IDATE using an OR:
"(DOI:10.31234/osf.io/xdbf9) (FIRST_IDATE:[2022-10-04 TO 2022-10-04] OR UPDATE_DATE:[2022-10-04 TO 2022-10-04]) (SRC:PPR)"

Some questions:

- Do you think that would be a good approach?
- What kind of updates would you expect?

Thank you
Daniel

Mohamed Selim

Oct 10, 2022, 3:29:57 AM
to Europe PMC Developer Forum, d.e...@elifesciences.org, Mohamed Selim
Hi Daniel,
I just want to clarify that the out-of-sync issue can happen for different reasons, but the main point is that it is a glitch, and we investigate it once it is reported. It is not something you should expect to happen frequently, and the results are eventually consistent afterwards.
You are correct: the update date is always the date of the last update, and we frequently reindex articles for various reasons, so it is always changing.
I am not sure about your query: if you are already searching for a specific DOI, you don't need dates at all.
Also, if you are falling back on the first index date, the update date doesn't add value to your query.
If you want to get all results, not just the ones you suspect you missed, you don't need to OR with the update date, because the first index date will already return all of them.
Instead you can use AND, e.g.:
(FIRST_IDATE:[2021-05-19 TO 2021-05-19]) (UPDATE_DATE:[2022-10-04 TO 2022-10-10]) ((PUBLISHER:"bioRxiv" OR PUBLISHER:"medRxiv" OR PUBLISHER:"Research Square" OR PUBLISHER:"arXiv" OR PUBLISHER:"SSRN" OR PUBLISHER:"PsyArXiv" OR PUBLISHER:"OSF Preprints" OR PUBLISHER:"Lancet Preprints" OR PUBLISHER:"Authorea Preprints" OR PUBLISHER:"ChemRxiv"))

The update date range will be: [{last query run date} TO {today's date}]
Of course, it might return more than the missing 5 (it returns 9 already), but it would be easier to exclude duplicates from it. As I said, you can use this as a defensive mechanism to make sure you are getting everything, but this scenario shouldn't occur often; the last time it was reported to us was around April-May this year. If you spot that it is still happening now, please tell us and we will try to investigate from our side.
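One way to apply this defensive re-check is sketched below in Python, purely as an illustration. `build_recheck_query` and `only_new` are hypothetical helpers, and the `doi` field name on each result record is an assumption about the response format; only the query syntax comes from this thread.

```python
from datetime import date
from typing import Dict, Iterable, List


def build_recheck_query(
    first_index_date: date,
    last_run_date: date,
    today: date,
    publishers: Iterable[str],
) -> str:
    """AND the original FIRST_IDATE day with UPDATE_DATE:[last run TO today]."""
    day = first_index_date.isoformat()
    publisher_clause = " OR ".join(f'PUBLISHER:"{name}"' for name in publishers)
    return (
        f"(FIRST_IDATE:[{day} TO {day}]) "
        f"(UPDATE_DATE:[{last_run_date.isoformat()} TO {today.isoformat()}]) "
        f"(({publisher_clause}))"
    )


def only_new(results: Iterable[Dict], known_dois: Iterable[str]) -> List[Dict]:
    """Drop records whose DOI is already stored (compared case-insensitively).

    Records without a DOI are kept, because they cannot be compared.
    """
    known = {doi.lower() for doi in known_dois}
    new_records = []
    for record in results:
        doi = (record.get("doi") or "").lower()  # "doi" key is an assumption
        if doi:
            if doi in known:
                continue
            known.add(doi)
        new_records.append(record)
    return new_records
```

With this shape, the re-check can return more records than were actually missed, and `only_new` filters out the ones already stored.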
Let me know if you have further questions.
Kind regards,
Mohamed
