
Running time of fetching records in batch


Y Han

Sep 16, 2024, 6:11:22 PM
to OpenAlex Community
Hi All,

I have tested the running time of fetching works records in batch using openalexR (code below). Currently it takes about 3 hours to fetch 240,000 works, which is roughly 1,300 works per minute (80,000 works per hour). The number of records fetched per call has little influence on the running time. I wish it could be quicker.

Is this the normal speed? 

Thank you, 
Yan Han
The University of Arizona Libraries

###################################
# The number of works fetched per call has little influence on oa_fetch() running time:
# 2024-09: fetch_number = 1,000 reduced total running time by ~10% compared to fetch_number = 100
# 2024-09: fetching 241,000 works took 188 minutes

library(openalexR)

fetch_number <- 1000
works_cited_df <- NULL  # accumulator for all fetched works

time_taken <- system.time({
  for (i in seq(1, num_of_works, by = fetch_number)) {
    batch_identifiers <- org_ref_works_cited[i:min(i + fetch_number - 1, num_of_works)]
    batch_data <- oa_fetch(identifier = batch_identifiers)
    works_cited_df <- rbind(works_cited_df, batch_data)
  }
})
print(paste("time to run:", time_taken["elapsed"] / 60, "minutes"))

Trang Le

Sep 16, 2024, 8:19:48 PM
to OpenAlex Community
Hi Yan,

My guess is that most of the time is spent by R converting the resulting list to a tibble, so it's not OpenAlex's fault. Also, I see rbind() on a growing data frame, which is not optimal. You can try:

range_i <- seq(1, num_of_works, by = fetch_number)
works_cited_ls <- vector("list", length = length(range_i))
for (j in seq_along(range_i)) {
  i <- range_i[j]
  batch_identifiers <- org_ref_works_cited[i:min(i + fetch_number - 1, num_of_works)]
  works_cited_ls[[j]] <- oa_fetch(identifier = batch_identifiers)
}
works_cited_df <- do.call(rbind, works_cited_ls)

A few other things that might speed up your call:
- Use options = list(select = ...) to pull only the columns you're interested in
- Use output = "list"
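Putting the two suggestions together, a minimal sketch of a single batch call might look like this (the fields under select are illustrative; pick the OpenAlex work fields you actually need):

```r
library(openalexR)

# batch_identifiers is a vector of OpenAlex work IDs, as in the original loop
batch_data <- oa_fetch(
  identifier = batch_identifiers,
  options = list(select = c("id", "display_name", "publication_year")),
  output = "list"  # skip the costly list-to-tibble conversion
)
```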

See June Choe's comment. We're also working on a vignette on optimizing oa_fetch() calls; maybe that will give you some more concrete examples.

Best,
Trang

Samuel Mok

Sep 17, 2024, 7:17:33 AM
to Y Han, OpenAlex Community
Whoops, I replied just to Yan instead of the mailing list. This might be useful info for others as well, so here's my reply again:

It does make sense that changing the batch size on your end makes no real difference, as OpenAlex has a hard cap of 50 ids per query. This can also be found in the source code of oa_fetch, by the way.
However, 3 hours is far slower than the API rate limit allows: the limit is a generous 10 requests per second (max 100k requests per day). With 50 items per request, that's 500 items per second, so with zero delay/latency/processing time it would take about 10 minutes to retrieve your data from OpenAlex. For a more realistic reference: I just pulled 30k works from the API in about 6 minutes, which extrapolates to roughly 45 minutes for 240k works. My script runs in Python and stores the results locally in MongoDB, which are definitely not the fastest solutions possible, so higher speeds should be achievable!

So I suggest looking into optimizing your code! Asynchronous API calls and data storage help a lot in speeding things up. If storing/ingesting data is a bottleneck for you, try limiting the returned fields to only those you need, or use a more performant data storage solution, e.g. duckplyr to use duckdb as a drop-in replacement for the standard dplyr stack.
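One storage-side idea along these lines, sketched in R rather than Python: write each batch to disk as it arrives instead of accumulating everything in memory, then combine at the end. The directory and file names here are illustrative, not from the thread, and range_i/fetch_number are the variables defined in the earlier messages:

```r
library(openalexR)

out_dir <- "works_batches"  # illustrative path
dir.create(out_dir, showWarnings = FALSE)

for (j in seq_along(range_i)) {
  i <- range_i[j]
  ids <- org_ref_works_cited[i:min(i + fetch_number - 1, num_of_works)]
  batch <- oa_fetch(identifier = ids, output = "list")
  # one file per batch; a crash loses at most the current batch
  saveRDS(batch, file.path(out_dir, sprintf("batch_%05d.rds", j)))
}

# Combine later, in a fresh session if needed:
files <- list.files(out_dir, full.names = TRUE)
works_cited_ls <- lapply(files, readRDS)
```

This also makes long pulls restartable: skip batches whose files already exist.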

Cheers,
Samuel



Y Han

Sep 20, 2024, 7:11:03 PM
to OpenAlex Community
Thank you Trang and Samuel. 

I re-ran the code with your suggestions. The running time dropped from 180 minutes to 120 minutes. I also tested smaller batch pulls (e.g. 10,000) and posted the results on June Choe's comment: https://github.com/ropensci/openalexR/issues/271#issuecomment-2338416246