Handling of Case Sensitive abstract_inverted_index


Krugs.de

May 12, 2025, 4:07:51 AM
to OpenAlex Community
Hi

I am running into a problem with the abstract_inverted_index of the work https://openalex.org/W2741809807.

I am using DuckDB to read the raw JSON files and write them to a Parquet file, which works nicely. However, I hit a wall with the work https://openalex.org/W2741809807, because its abstract_inverted_index contains both the term `open` and the term `Open`.
I think it is a DuckDB problem: when parsing for import, DuckDB compares the keys case-insensitively and therefore treats the two terms as identical.

I understand why they are duplicated (beginning of abstract), and I understand that JSON is case sensitive, but this makes the handling difficult in e.g. DuckDB (and I assume other databases as well?).

Has anybody any solution to this problem, or is the conversion using DuckDB a dead end in this case?


Any other suggestions?

Thanks,

Rainer

Any opinions?

Thanks

Rainer
---
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Orcid ID: 0000-0002-7490-0066

Department of Geography
University of Zürich
Winterthurerstrasse 190
8075 Zürich
Switzerland

Cell:        +41 (0)78 630 66 57
email:      Raine...@uzh.ch

Ivo Bleylevens

May 13, 2025, 5:35:50 AM
to OpenAlex Community
Hi Rainer,

I have been facing the same problem: when importing and parsing the JSON files, a duplicate key (differing only in case) gave me an error. This occurred very often for the abstract_inverted_index field. I was using PowerShell and, I think, the default ConvertFrom-Json cmdlet, which produced the error message. I solved this by using another method to read and parse the JSON, which did not raise the duplicate-key error. With it I was able to (re)construct the abstract text.
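
To find out which records are affected before importing, a check along these lines may help. This is only a sketch in Python, not the PowerShell method described above; the helper name `case_colliding_keys` and the sample record are made up for illustration. Python's standard `json` module is case sensitive, so both keys survive parsing.

```python
import json
from collections import defaultdict


def case_colliding_keys(inverted_index):
    """Group keys that differ only by letter case (hypothetical helper)."""
    groups = defaultdict(list)
    for key in inverted_index:
        groups[key.casefold()].append(key)
    # Keep only the groups where more than one spelling folds to the same key.
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


# Made-up sample record, not the real content of W2741809807.
record = json.loads(
    '{"abstract_inverted_index": '
    '{"Open": [0], "access": [1, 4], "is": [2], "open": [3]}}'
)
collisions = case_colliding_keys(record["abstract_inverted_index"])
print(collisions)
```

An empty list means the record has no keys that would clash under case-insensitive matching.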

What exactly is your error message?

Kind regards,
Ivo
Maastricht University

On Monday, May 12, 2025 at 10:07:51 UTC+2, Rai...@krugs.de wrote:

Chris Gebert

May 13, 2025, 6:01:20 AM
to Rai...@kruge.de, OpenAlex Community
Hi Rainer,

I’m wondering if this is a problem with the conversion to Parquet. I don’t know very much about Parquet data types and how they’re written, but DuckDB should be able to differentiate between the keys `Open` and `open` in the abstract_inverted_index struct.

For example:

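-- JSON path lookups keep the case of the key, so these are two different lookups.
select
  json_extract(abstract_inverted_index, '$.Open') as mixed_case,
  json_extract(abstract_inverted_index, '$.open') as lower_case
from works;  -- "works" is a placeholder table name, not from the original post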

Maybe this would require unnesting the field, but I suppose it depends on your use case for the abstract_inverted_index data.
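
As a rough illustration of what unnesting could look like, here is a Python sketch that flattens the inverted index into (position, token) rows. The tokens then become values rather than field names, so `Open` and `open` no longer compete for the same name. The function name `unnest_inverted_index` and the sample data are assumptions for illustration, not part of OpenAlex or DuckDB.

```python
def unnest_inverted_index(inverted_index):
    """Flatten {token: [positions]} into (position, token) rows, sorted by position."""
    rows = [
        (position, token)
        for token, positions in inverted_index.items()
        for position in positions
    ]
    return sorted(rows)


# Made-up sample, not the real abstract of any work.
sample = {"Open": [0], "access": [1, 4], "is": [2], "open": [3]}
rows = unnest_inverted_index(sample)
for position, token in rows:
    print(position, token)
```

Rows in this shape can be written as two ordinary columns (position and token) alongside the work ID.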

Chris


Krugs.de

May 13, 2025, 6:10:48 AM
to Chris Gebert, Rai...@kruge.de, OpenAlex Community
Hi Chris

Parquet is case sensitive. The problem is DuckDB, which is not case sensitive when it handles these keys.

I am now converting the abstract_inverted_index to the abstract text (using jq) and deleting the abstract_inverted_index from the converted data (I keep the original JSON).
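
For readers without jq, the same idea can be sketched in Python. This is not the jq script used here, just an illustration under the assumption that the index maps each token to a list of zero-based word positions; the function name `abstract_from_inverted_index` and the sample data are made up.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild the abstract text from {token: [positions]} (hypothetical helper)."""
    token_at = {}
    for token, positions in inverted_index.items():
        for position in positions:
            token_at[position] = token
    # Join the tokens in position order; keys stay case sensitive throughout.
    return " ".join(token_at[position] for position in sorted(token_at))


# Made-up sample, not the real abstract of W2741809807.
sample = {"Open": [0], "access": [1, 4], "is": [2], "open": [3]}
abstract = abstract_from_inverted_index(sample)
print(abstract)
```

Because the result is a plain string, the case-sensitive keys never reach the database as field names.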

Cheers,

Rainer

