Hi
Apologies for the long post. Here is the TL;DR:
Keywords: OpenAlex, openalexR, file-based instead of memory-based
Open for comment.
Branch openalexPro_2
Here is the complete post with details:
I would like to present a new package project of mine and ask for your input: is it useful to you, which direction should it take, and so on.
It works together with [openalexR](https://docs.ropensci.org/openalexR/), but is file-based. This makes it possible to work with much larger datasets. I used a precursor of the package to retrieve 4.5 million works and work with them, and it worked flawlessly.
At the same time, I am trying to stay compatible with the tibble/data.frame-based data structures of openalexR. This is illustrated by the function `plot_snowball()`, which plots the snowball search results from `openalexPro::pro_snowball()` as well as from `openalexR::oa_snowball()`.
The package is currently in an alpha stage (if one can call it that), but the basic features work: running searches, group_by queries, retrieving individual works, and snowball searches, all in a file-based way.
The process is:
1. Build the query using the `openalexR::oa_query()` function.
2. Download the JSON files returned for each page (raw responses) using a re-implemented function. They are saved on disk as one file per page returned.
3. Convert them into jsonl files using [jq](https://jqlang.org): extract the results, convert the abstract_inverted_index into an abstract, create a citation for shorter reference, and add some fields (the files contain the data including these changes). They are saved on disk as one file per page returned.
4. Convert the jsonl files into a parquet database, which is the source for any further processing using duckdb. It is saved on disk as a parquet database.
The combined use of `jq` and duckdb makes many workflows that are slow when done in memory in R really fast.
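To make the conversion from raw page responses to jsonl more concrete, here is a minimal sketch of the idea, written in Python purely for illustration. It is not the package's implementation (which uses `jq`), and the function names `abstract_from_inverted_index` and `page_to_jsonl_lines`, the example work ID, and the choice to drop the `abstract_inverted_index` field after conversion are all assumptions made for this example. It only assumes that a page is a JSON object with a `results` list and that each work may carry an `abstract_inverted_index` mapping words to their positions.

```python
import json


def abstract_from_inverted_index(inverted_index):
    """Rebuild plain abstract text from an inverted index.

    The index maps each word to the list of positions where it occurs.
    Returns None when the work has no abstract index.
    """
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then sort by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


def page_to_jsonl_lines(page):
    """Turn one raw page (a dict with a "results" list) into JSONL lines."""
    lines = []
    for work in page.get("results", []):
        record = dict(work)  # shallow copy, the input page stays untouched
        # Assumption: the inverted index is replaced by the plain abstract.
        index = record.pop("abstract_inverted_index", None)
        record["abstract"] = abstract_from_inverted_index(index)
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Tiny made-up page, shaped like a paged API response (hypothetical data).
example_page = {
    "meta": {"count": 1},
    "results": [
        {
            "id": "W1",
            "abstract_inverted_index": {
                "file": [1],
                "A": [0],
                "approach": [3],
                "based": [2],
            },
        }
    ],
}

jsonl_lines = page_to_jsonl_lines(example_page)
print("\n".join(jsonl_lines))
```

Writing one such line per work and one file per page keeps every step streamable, so no step needs the full dataset in memory.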
I have included tests which demonstrate how the package can be used and which compare the retrieved works with those retrieved using openalexR.
I also included a plot function, `plot_snowball()`, which demonstrates that the results from openalexPro and openalexR are largely interchangeable.
I have the following questions:
1. Is this useful in general? It is useful for my use cases, but is it useful for others as well?
2. Would full compatibility with openalexR data structures be a plus?
3. Is there any interest in taking this further? I would like to:
a. add error handling to the API requests
b. add proper handling of the API calls in the tests, so that the tests can also be run on CRAN and rOpenSci
c. submit it at a later stage to rOpenSci and CRAN
d. create additional packages, like `openalexPlot` with plot functions and `openalexNet`, which contains functions for citation network analysis, extract
Please let me know your thoughts. Thanks.
---
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)
Orcid ID: 0000-0002-7490-0066
Department of Geography
University of Zürich
Winterthurerstrasse 190
8075 Zürich
Switzerland