Problem with latest snapshot


Jörg Prante

Feb 4, 2025, 9:07:17 AM
to OpenAlex Community
Hello,

I'm indexing the latest OpenAlex snapshot (the "works" set) of January 29, 2025 into Elasticsearch, but it seems to contain invalid JSON. Older snapshots did not have this problem. Can anyone confirm that there is a problem?

Best,

Jörg

Sol Lederman

Feb 4, 2025, 12:29:31 PM
to Jörg Prante, OpenAlex Community
Hi,

I scanned all of the JSON files in the latest OpenAlex snapshot with the Python json library and found no errors loading any of them. Can you provide more detail? Which works file failed? What mechanism are you using to extract and validate the JSON? What error did you get?

Sol
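
For readers who want to repeat this kind of check, here is a minimal sketch in Python. It is not necessarily the script used above. It assumes the snapshot has already been downloaded to a local directory and that each file is a gzip-compressed JSON Lines file (one JSON object per line); the function names find_invalid_lines and scan_directory are made up for this example.

```python
import gzip
import json
from pathlib import Path


def find_invalid_lines(path):
    """Yield (line_number, error) for each line of a gzip JSON Lines file that fails to parse."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as error:
                yield line_number, error


def scan_directory(root):
    """Return (path, line_number, message) for every unparsable line in *.gz files below root."""
    problems = []
    for path in sorted(Path(root).rglob("*.gz")):
        for line_number, error in find_invalid_lines(path):
            message = f"{error.msg} (column {error.colno})"
            problems.append((str(path), line_number, message))
    return problems
```

An empty result from scan_directory means that Python's parser accepted every line. It does not rule out a problem in a different parser or further down the pipeline.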


Jörg Prante

Feb 11, 2025, 7:17:25 AM
to OpenAlex Community
Hi,

the error message is

"Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character (',' (code 44)): expected a value",
" at [Source: (byte[])\"{\"id\":\"W2143785713\",\"doi\":\"10.1093/pan/2.1.97\",\"doi_registration_agency\":\"Crossref\",\"display_name\":\"Traits versus Issues: Factor versus Ideal-Point Analysis of Candidate Thermometer Ratings\",\"title\":\"Traits versus Issues: Factor versus Ideal-Point Analysis of Candidate Thermometer Ratings\",\"publication_year\":1990,\"publication_date\":\"1990-01-01\",\"language\":\"en\",\"language_id\":\"https://openalex.org/languages/en\",\"ids\":{\"openalex\":\"https://openalex.org/W2143785713\",\"doi\":\"https://doi.org/10.1093/pan\"[truncated 13991 bytes]; line: 1, column: 880]",
"at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:637) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2622) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:857) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:754) ~[jackson-core-2.10.4.jar:2.10.4]",
"at com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:237) ~[jackson-core-2.10.4.jar:2.10.4]",

It does not help much, because the quoted JSON is truncated and does not show the problem.

I'm fetching the S3 objects with the AWS Java SDK, version 2.30.16. It does not help that the OpenAlex JSON data is published without checksums.

All my runs, with different setups and versions, bail out at work W2143785713.

But I cannot tell whether the error is triggered by the AWS SDK or by my program.

I will try to verify the JSON syntax directly after the data is pulled by the AWS SDK, and to make sure there is no Elasticsearch quirk (I have never encountered one).

In the end, I have to skip the whole chunk with the error, in the hope that there are not many more.

Thanks for your interest.

Best,

Jörg
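
Since the parser message truncates the source, one way to see what sits at the reported position is to re-parse the raw line outside the pipeline and print the text around the first error. The following is a minimal sketch in Python, assuming the offending line has been saved as a string; error_context is a hypothetical helper name. Python's json module is a different parser from Jackson, so it may accept input that another consumer rejects, in which case the function returns None.

```python
import json


def error_context(document, radius=40):
    """Return None if the document parses, else the text around the first parse error."""
    try:
        json.loads(document)
    except json.JSONDecodeError as error:
        # error.pos is the character index at which parsing failed
        start = max(error.pos - radius, 0)
        return document[start:error.pos + radius]
    return None
```

A None result for a line that still fails during indexing suggests the JSON itself is well formed and the problem lies in how the consumer handles a particular value.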

Jörg Prante

Feb 11, 2025, 7:48:18 AM
to OpenAlex Community
Hi,

it turns out that my program was having difficulties with the newly added host_organization_lineage_names: [null]

Sorry for the noise!

Best,

Jörg
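
An array containing only null is valid JSON, which is consistent with the earlier scan finding no errors. The thread does not say how the program was changed. As one possible approach, the sketch below removes null entries from arrays before a record is passed on. The helper name drop_null_items is hypothetical, and whether dropping nulls is appropriate depends on the consumer.

```python
import json


def drop_null_items(value):
    """Recursively remove null entries from lists, leaving all other values untouched."""
    if isinstance(value, list):
        return [drop_null_items(item) for item in value if item is not None]
    if isinstance(value, dict):
        return {key: drop_null_items(item) for key, item in value.items()}
    return value


# [null] parses without error; the null becomes None inside a Python list
record = json.loads('{"host_organization_lineage_names": [null]}')
cleaned = drop_null_items(record)
print(cleaned)  # {'host_organization_lineage_names': []}
```

Null values that are not list items, such as a field set directly to null, are left as they are.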