parsing the datasets

Mike

Jun 21, 2022, 9:12:53 AM
to OpenAlex users
Hi,
Glad to find this group.
Am I the only one having trouble parsing, in Python, the "venue dataset" from the latest snapshot (20220603)?

I am getting the error JSONDecodeError: Expecting ',' delimiter: line 1 column 164 (char 163)
Any tips on how to fix this?

Thanks in advance,
Mike

Here is some code to reproduce it:
import json

b'{"id": "https://openalex.org/V4210237741", "issn_l": "0582-6152", "issn": ["0582-6152", "2444-2992"], "display_name": "Anuario del Seminario de Filolog\xc3\xada Vasca \\\\"Julio de Urquijo\\\\"", "publisher": "UPV/EHU Press", "works_count": 82, "cited_by_count": 31, "is_oa": true, "is_in_doaj": true, "homepage_url": "https://ojs.ehu.eus/index.php/ASJU/index", "ids": {"openalex": "https://openalex.org/V4210237741", "issn_l": "0582-6152", "issn": ["0582-6152", "2444-2992"]}, "counts_by_year": [{"year": 2022, "works_count": 7, "cited_by_count": 4}, {"year": 2021, "works_count": 13, "cited_by_count": 11}, {"year": 2020, "works_count": 0, "cited_by_count": 11}, {"year": 2019, "works_count": 45, "cited_by_count": 0}, {"year": 2018, "works_count": 17, "cited_by_count": 0}, {"year": 2016, "works_count": 0, "cited_by_count": 2}, {"year": 2014, "works_count": 0, "cited_by_count": 1}], "x_concepts": [{"id": "https://openalex.org/C142362112", "wikidata": "https://www.wikidata.org/wiki/Q735", "display_name": "Art", "level": 0, "score": 39.0}, {"id": "https://openalex.org/C138885662", "wikidata": "https://www.wikidata.org/wiki/Q5891", "display_name": "Philosophy", "level": 0, "score": 32.9}], "works_api_url": "https://api.openalex.org/works?filter=host_venue.id:V4210237741", "updated_date": "2022-06-03", "created_date": "2022-02-03"}\n'

json.loads(q1)

Richard Orr

Jun 21, 2022, 2:40:20 PM
to OpenAlex users
Hi Mike,

Thank you very much for letting us know about this. It's a bug we introduced when we changed part of our snapshot export process, and we missed it in QA. We've fixed it in the current snapshot. This Venue is now parseable as JSON and the display name appears as intended:

$ aws s3 cp 's3://openalex/data/venues/updated_date=2022-06-03/part_000.gz' - | zcat | grep 'V4210237741' | jq -r '.display_name'
Anuario del Seminario de Filología Vasca "Julio de Urquijo"

For anyone maintaining an existing snapshot using the updated_date partitions, you only need to replace the partitions you downloaded after the 2022-06-09 release. The rationale in the documentation still applies: partitions you had already downloaded before then don't contain any updates you need.

If you'd rather not download the snapshot again, it is possible to repair your copy. The problem is with the backslash-escaping in sequences like \\"Julio de Urquijo\\". JSON needs quotes inside strings to be backslash-escaped, and our export process is (wrongly) replacing all instances of \ with \\, which, to a JSON parser, means "this is a literal backslash, it's not escaping anything". So it's effectively undoing backslash-escaping everywhere, breaking JSON parsing.
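To see the effect in miniature, here is a small Python sketch (illustration only, not our actual export code). The first byte string is escaped correctly; the second has the doubled backslashes the export produced:

import json

# Correctly escaped JSON: one backslash before each inner quote.
good = b'{"display_name": "Filologia Vasca \\"Julio de Urquijo\\""}'
print(json.loads(good)["display_name"])   # Filologia Vasca "Julio de Urquijo"

# The broken export doubled the backslash, so the parser sees a literal
# backslash and then treats the next quote as the end of the string.
bad = b'{"display_name": "Filologia Vasca \\\\"Julio de Urquijo\\\\""}'
json.loads(bad)   # raises JSONDecodeError: Expecting ',' delimiter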

The fix is to replace all double-backslashes with single ones in all Entity files. For example, you could do this for each file:

gunzip data/authors/updated_date=2022-02-11/part_006.gz
sed -i 's|\\\\|\\|g' data/authors/updated_date=2022-02-11/part_006
gzip data/authors/updated_date=2022-02-11/part_006

or do the equivalent at a convenient place in your ingestion pipeline.
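For example, a minimal Python sketch of the same repair done at read time might look like this (the path is just an illustration; json.loads accepts the raw bytes directly):

import gzip
import json

# Example path only; apply the same treatment to every part file you ingest.
path = "data/authors/updated_date=2022-02-11/part_006.gz"

with gzip.open(path, "rb") as f:
    for raw_line in f:
        # Collapse each doubled backslash back to a single one, then parse.
        record = json.loads(raw_line.replace(b"\\\\", b"\\"))
        # ... load `record` into your database here ...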

Thanks,
Richard