Definition of maximum results returned for works?


Rainer M Krug

Jul 16, 2024, 10:05:57 AM
to OpenAlex Community
Hi

I am doing a huge number of snowball searches and saving the results into parquet files (for ease of search). Now I regularly get errors that certain fields are of the wrong type or structure, which is caused by them being missing. The problematic fields identified are:

ab
authors
doi
grants
counts_by_year
referenced_works
related_works
concepts
topics

To solve this, I have to create the entries by hand if they are missing, e.g. by resetting them to `as.character(NA)`.
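For illustration, this is roughly what I do, as a minimal sketch (assuming `rec` is a single work as returned by openalexR; the empty `authors` structure is only a hypothetical example of the hand-defined complex case):

```r
# Sketch of the workaround: make sure missing fields exist before writing.
simple_fields <- c("doi", "ab")

for (f in simple_fields) {
  if (is.null(rec[[f]])) {
    rec[[f]] <- as.character(NA)   # create the field as a character NA
  }
}

# Complex fields need a hand-defined empty structure, e.g. (hypothetical):
if (is.null(rec[["authors"]])) {
  rec[["authors"]] <- list(
    data.frame(au_id = character(0), au_display_name = character(0))
  )
}
```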

This works for simple types like doi or abstract (which is already converted into a real abstract at that stage). But it really becomes cumbersome for the more complex types like grants or particularly authors - in these cases I essentially have to define the structure by hand and always change it when changes occur in the returned data structures. Therefore my question:

Is there a way to get the definition of the maximum result? Something like a schema as in https://json-schema.org/learn/getting-started-step-by-step, or anything else which can easily be used to complete retrieved records?

Thanks,

Rainer

--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Orcid ID: 0000-0002-7490-0066

Department of Evolutionary Biology and Environmental Studies
University of Zürich
Office Y19-M-72
Winterthurerstrasse 190
8075 Zürich
Switzerland

Office: +41 (0)44 635 47 64
Cell: +41 (0)78 630 66 57
email: Raine...@uzh.ch
Rai...@krugs.de

PGP: 0x0F52F982



Samuel Mok

Jul 16, 2024, 3:40:40 PM
to Rainer M Krug, OpenAlex Community

Hi Rainer,

Do you mean the complete SQL schema for the data? You can grab that here: https://github.com/ourresearch/openalex-documentation-scripts/blob/main/openalex-pg-schema.sql
You can find this schema and more details in the docs here: https://docs.openalex.org/download-all-data/upload-to-your-database/load-to-a-relational-database

I'm not sure what you're using as a backend exactly -- most JSON stores are document stores without strict schema validation, so this shouldn't matter much. If you're using a relational database of some sort, the schema above should be enough! Maybe the docs for ingesting data into a data warehouse fit your use case better though: https://docs.openalex.org/download-all-data/upload-to-your-database/load-to-a-data-warehouse

Cheers,
Samuel



Rainer M Krug

Jul 17, 2024, 3:56:49 AM
to Samuel Mok, OpenAlex Community
Hi Samuel

I am downloading the works using openalexR via API calls and saving them into a parquet dataset, partitioned by a column `bm` I added (one bm at a time, one after the other). Therefore the first set of records defines the schema used. If I can

1. Save a dummy record which contains all the fields possible with the correct types and
2. Make sure that following records do have the structure (even if NA values are used)

I could save all records into that parquet dataset.
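What I have in mind is something along these lines (a sketch only; `template` would be the hypothetical dummy record with every possible column and the correct types, and `new_works` one batch of downloaded works):

```r
library(arrow)

complete_record <- function(batch, template) {
  # add any columns this batch is missing, typed like the template
  for (nm in setdiff(names(template), names(batch))) {
    batch[[nm]] <- template[[nm]][rep(NA_integer_, nrow(batch))]
  }
  batch[names(template)]  # keep the column order of the template
}

# every batch now has the same structure, so appending to the partitioned
# parquet dataset never changes its schema
write_dataset(complete_record(new_works, template),
              "works_parquet", partitioning = "bm")
```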

Hope this clarifies.

So the source you provided helps - thanks. 

But following your backend statement - is there an easy file-based, fast, space-efficient backend which I could use for this? I am not married to parquet - but it fulfils all the needs: fast, space efficient, interoperable between R and Python, open, file based (no overhead, portable), and supported in R in an efficient way (arrow package).

Thanks,

Rainer





Samuel Mok

Jul 17, 2024, 6:40:43 AM
to Rainer M Krug, OpenAlex Community

Hi Rainer,


I think DuckDB would be a great fit here; you can use the R package for it (https://duckdb.org/docs/api/r.html). It can also read in & export parquet files, so it should fit right into your workflow with minimal changes. 

DuckDB will handle schema creation for you, and you can set it up to combine schemas or easily add or change columns while ingesting data.
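For example (a sketch only, with a hypothetical layout of your parquet partitions), `read_parquet` with `union_by_name` lets DuckDB combine partitions whose columns differ:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# read all partitions; union_by_name aligns columns by name and fills
# missing ones with NULL, so differing schemas are not a problem
works <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet('works_parquet/**/*.parquet', union_by_name = true)
")

# write a consistently structured copy back out as parquet
dbExecute(con, "
  COPY (SELECT *
        FROM read_parquet('works_parquet/**/*.parquet', union_by_name = true))
  TO 'works_combined.parquet' (FORMAT PARQUET)
")

dbDisconnect(con, shutdown = TRUE)
```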

You could also import the SQL schema from OpenAlex directly into DuckDB to get the full set of columns, but that might be overkill for your use case, and it will definitely slow things down because of the restrictions in the schema.

DuckDB is free, open source, very performant and efficient, and it's extremely portable -- just like SQLite, the entire DB is stored as a single file. It has client APIs for every popular programming language, plus APIs for interacting with various web services and other databases like PostgreSQL. Check out the docs, they're pretty good!

Cheers,
Samuel 

Rainer M Krug

Jul 18, 2024, 2:58:55 AM
to Samuel Mok, OpenAlex Community
OK - I managed to split the schema definition you provided into CREATE TABLE statements and to create the tables in DuckDB. So this would be a way to go.
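In case it is useful for others, this is roughly how I did it (a sketch; the file name for the split-out CREATE TABLE statements is made up):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), dbdir = "openalex.duckdb")

# CREATE TABLE statements extracted from openalex-pg-schema.sql and adapted
# for DuckDB, saved into a local file (hypothetical name)
ddl <- paste(readLines("openalex-duckdb-schema.sql"), collapse = "\n")

# run each statement separately
for (stmt in strsplit(ddl, ";", fixed = TRUE)[[1]]) {
  stmt <- trimws(stmt)
  if (nchar(stmt) > 0) dbExecute(con, stmt)
}

dbDisconnect(con, shutdown = TRUE)
```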

But I am actually not looking for the schemas of the data snapshot, but for the schema of the return values from the API calls. So, for example, I am looking for the schema of the call


Which includes all fields - if some are missing in the OpenAlex data, nothing is returned for them. This is also true if the returned data of that field would be a data structure (e.g. grants).
 
So I should be able to reconstruct that from the info in the SQL, but I am not entirely sure how. But anyway, I will post a similar question this afternoon in the openalexR webinar.

Thanks and greetings,

Rainer

