Using the Snapshot and downloading millions of work IDs via the API


Rainer M Krug

Mar 3, 2026, 4:12:35 AM
to OpenAlex Community, OpenAlex Support
Hi

I have a way of converting the OpenAlex Snapshot to Parquet, indexing it by OpenAlex ID, and retrieving about 4.5 million records by ID in about 20 minutes (8 cores, parallel processing). I will post about the workflow very soon.

Now I can easily emulate filtering by type, year, etc. What I can’t do easily is full-text search, so for that I use the OpenAlex API and retrieve only the IDs as results.

My question, which also goes to OpenAlex: is there a more efficient way of doing this than using the standard search API and downloading page by page, with a maximum of 200 works per page? Yesterday this took me about 12 hours for 4.5 million works (IDs only).
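To put the numbers above side by side, here is a rough back-of-the-envelope calculation. It treats the "about" figures (4.5 million works, 200 per page, 12 hours, 20 minutes) as exact, which is an assumption, and the function names are only illustrative.

```python
import math


def requests_needed(total_works: int, per_page: int) -> int:
    """Number of paged requests needed to walk through all results."""
    return math.ceil(total_works / per_page)


def records_per_second(total_works: int, elapsed_seconds: float) -> float:
    """Average throughput over the whole run."""
    return total_works / elapsed_seconds


TOTAL_WORKS = 4_500_000        # figure quoted in the post, taken as exact
PER_PAGE = 200                 # maximum page size mentioned in the post
API_SECONDS = 12 * 3600        # about 12 hours via the search API
PARQUET_SECONDS = 20 * 60      # about 20 minutes from the local Parquet files

n_requests = requests_needed(TOTAL_WORKS, PER_PAGE)
seconds_per_request = API_SECONDS / n_requests
api_rate = records_per_second(TOTAL_WORKS, API_SECONDS)
parquet_rate = records_per_second(TOTAL_WORKS, PARQUET_SECONDS)

print(f"{n_requests} requests, {seconds_per_request:.2f} s per request")
print(f"API: {api_rate:.0f} IDs/s, local Parquet lookup: {parquet_rate:.0f} records/s")
```

Under these assumptions the ID download averages a little under two seconds per page, so the paging itself, not the local lookup, dominates the workflow.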

The only thing needed for this workflow would be a CSV file containing the OpenAlex IDs. Is anything like that planned? Does it already exist and I missed it?
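For readers who want to reproduce the ID-only download described above, here is a minimal sketch of paging through search results and writing the IDs to a CSV file. It is not the workflow from the R package; it assumes the public works endpoint with the `search`, `select`, `per-page` and `cursor` query parameters and a `meta.next_cursor` field in the response, as I understand the OpenAlex API documentation. Authentication, rate limiting and retries are left out, and the helper names (`fetch_page`, `iter_work_ids`, `write_ids_csv`) are hypothetical.

```python
import csv
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator

API_URL = "https://api.openalex.org/works"  # assumed endpoint


def fetch_page(params: dict) -> dict:
    """Fetch one page of results and return the decoded JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_work_ids(
    search: str, fetch: Callable[[dict], dict] = fetch_page
) -> Iterator[str]:
    """Yield work IDs for a search, following the cursor until it runs out."""
    cursor = "*"  # assumed start value for cursor paging
    while cursor:
        page = fetch(
            {"search": search, "select": "id", "per-page": 200, "cursor": cursor}
        )
        for work in page["results"]:
            yield work["id"]
        # A missing or null next_cursor is taken to mean the last page was reached.
        cursor = page["meta"].get("next_cursor")


def write_ids_csv(ids: Iterable[str], path: str) -> int:
    """Write one ID per row under an 'id' header; return the number of IDs."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id"])
        for work_id in ids:
            writer.writerow([work_id])
            count += 1
    return count
```

A call such as `write_ids_csv(iter_work_ids("some search terms"), "ids.csv")` would then produce the CSV of IDs that the local Parquet lookup needs. Passing `fetch` as a parameter keeps the paging logic testable without network access.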

Thanks

Rainer

Stephan Gauch

Mar 3, 2026, 4:19:02 AM
to Rainer M Krug, OpenAlex Community, OpenAlex Support
Dear Rainer,

This sounds positively wicked!

I guess many self-hosting folk will be very interested in this.

I just wanted to point out that some “how-to-ish” documentation would probably be much appreciated.

Sorry for butting into the discussion, as you obviously have a specific question aimed at OAX, but these numbers are pretty impressive given the option of running it decentralised.

Thanking you in too many words and wishing you the best,
Stephan

Rainer M Krug

Mar 3, 2026, 4:45:58 AM
to Stephan Gauch, OpenAlex Community, OpenAlex Support
Dear Stephan

Thanks for the positive feedback.

I will definitely post a how-to. It is part of an R package, and vignettes describing workflows for something like this are essential. Watch this space for announcements.

Cheers

Rainer