Convert unpaywall jsonl to csv


renate.mo...@gmail.com

Mar 17, 2019, 8:15:27 AM
to Unpaywall discussion
Good day, 

I am new to Unpaywall and JSONL files. Is there a converter available to convert a JSONL file to CSV?
Thanks for the help in advance.

Renate

Martin Czygan

Mar 17, 2019, 4:01:11 PM
to renate.mo...@gmail.com, Unpaywall discussion
JSON Lines [1] is simply one JSON object per line. JSON [2] itself is
typed and supports nested structures, which CSV does not.

Hence there is no single canonical mapping from JSON or JSONL to CSV.

For JSON on the command line, I found the jq [3] tool useful; it
includes a CSV filter [4].

An example extracting the DOI, year, and title from the Unpaywall snapshot:

$ unpigz -c unpaywall_snapshot_2018-09-24T232615.jsonl.gz | jq -rc '[.doi, .year, .title] | @csv'


The result looks something like this:

"10.1080/21645515.2017.1330236",2017,"Advancements in DNA vaccine
vectors, non-mechanical delivery methods, and molecular adjuvants to
increase immunogenicity"
"10.1088/0004-6256/135/4/1201",2008,"SHARC-II 350 μm OBSERVATIONS OF
THERMAL EMISSION FROM WARM DUST INz⩾ 5 QUASARS"
"10.1088/0022-3727/48/43/435001",2015,"Tuning magnetic anisotropy and
the damping constant using substrate-induced strain in a
FeCo/Pb(Mg1/3Nb2/3)O3-PbTiO3heterostructure"
"10.2478/v10172-012-0058-8",2012,"Examination of Coil Pipe Butt Joint
Made of 7CrMoVTiB10-10(T24) Steel After Service"
"10.2307/40203576",2001,"The Internet Revolution in China: The
Significance for Traditional Forms of Communist Control"
"10.1515/9783111445908.88a",,"Ideale Wahrheit. 1837"
"10.1515/9783110972306.83",,"FROM BUFFALO (NEW YORK) IN 1960"
"10.1515/bchm2.1926.153.1-3.10",1926,"Einfluß der Hefenvorbehandlung
auf die Affinitätskonstanten der Saccharase. II. Achte Mitteilung über
die Affinitätsverhältnisse der Saccharase."


--martin



[1] http://jsonlines.org/
[2] http://json.org/
[3] https://stedolan.github.io/jq/
[4] https://stedolan.github.io/jq/manual/#Formatstringsandescaping
> --
> You received this message because you are subscribed to the Google Groups "Unpaywall discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to unpaywall+...@googlegroups.com.
> To post to this group, send email to unpa...@googlegroups.com.
> Visit this group at https://groups.google.com/group/unpaywall.
> To view this discussion on the web visit https://groups.google.com/d/msgid/unpaywall/2635bd75-deda-433b-b24d-ee78b98dacc1%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Bryan Newbold

Mar 17, 2019, 8:25:40 PM
to Martin Czygan, renate.mo...@gmail.com, Unpaywall discussion
I would second the use of jq with the @csv filter to create flat CSV files (or @tsv for TSV). If you read the jq manual (which is quite long), you'll find it is possible to select only a subset of input lines, or to generate multiple output rows (e.g., one CSV row for every URL instead of one per DOI).

An alternative is to write a quick Python script to do the same thing, instead of learning the jq tool.
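For instance, a minimal Python sketch of such a script (the fields here are the same top-level Unpaywall keys as in the jq example above; adjust to taste):

```python
import csv
import json


def jsonl_to_csv(lines, out, fields=("doi", "year", "title")):
    """Write one CSV row per JSON line, keeping only the given top-level fields."""
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    for line in lines:
        record = json.loads(line)
        # Missing fields become empty cells, like jq's null -> "".
        writer.writerow(record.get(f, "") for f in fields)
```

Hooked up to stdin and stdout (`jsonl_to_csv(sys.stdin, sys.stdout)`), it drops into the same unpigz pipeline as the jq version.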

Both jq and Python can be single-thread bottlenecks when processing very large input files (like the Unpaywall dump). If you have a powerful machine, you can use the `parallel` command-line tool (GNU Parallel) to speed things up; I recommend its '--linebuffer --round-robin --pipe' mode. It can also help to compress the output (using gzip or pigz) in the same pipe to save disk space and I/O time.
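If you'd rather stay in Python than add GNU Parallel to the mix, the same fan-out idea can be sketched with the standard multiprocessing module (the `extract` worker below is a made-up example of per-line work, not part of any Unpaywall tooling):

```python
import json
from multiprocessing import Pool


def extract(line):
    """CPU-bound per-line work: parse one record and flatten it to a CSV-ish row."""
    record = json.loads(line)
    return f"{record.get('doi')},{record.get('year')}"


def process_lines(lines, workers=4, chunksize=1000):
    """Fan lines out to a pool of worker processes, preserving input order."""
    with Pool(workers) as pool:
        # imap streams results back without holding the whole file in memory.
        return list(pool.imap(extract, lines, chunksize=chunksize))
```

The chunksize matters: handing workers batches of lines, rather than one line at a time, keeps inter-process overhead from eating the speedup.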

--bryan

Jason Priem

Mar 18, 2019, 11:57:30 AM
to Bryan Newbold, Martin Czygan, renate.mo...@gmail.com, Unpaywall discussion
Thanks everyone for those ideas! I'd add one more: if you are just looking to run a small number of DOIs, the online Simple Query Tool returns a CSV file with a subset of what you'd get in the full JSONL dump.
j




--
Jason Priem, co-founder 
Impactstory: We make tools to power the Open Science revolution
follow at @jasonpriem and @impactstory

rajath...@gmail.com

Dec 28, 2020, 9:52:22 PM
to Unpaywall discussion

Does anyone here use the 'jq' and 'split' command-line tools on a Mac? I am facing a lot of issues. split says: split: illegal option -- -

ric...@ourresearch.org

Dec 29, 2020, 12:06:18 PM
to Unpaywall discussion
Some command-line tools on a Mac aren't the same versions you're most likely to find tutorials and examples for, and the versions often have different options:


You could make sure you're using the options for the versions you have, or try installing their GNU counterparts: https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/

If you post the full command you're trying to run, someone with a mac may be able to spot the problem.

Rajath

Jan 5, 2021, 12:47:24 AM
to ric...@ourresearch.org, Unpaywall discussion
Sure,

Below is the command I ran on the Mac, with its error. The same command used to produce output on Windows.

Your-MacBook-Pro:license-info-extract jrxyter$ split -l 10000 --additional-suffix=.csv unpaywall_snapshot_2020-10-09T153852.jsonl.gz

split: illegal option -- -

usage: split [-a sufflen] [-b byte_count] [-l line_count] [-p pattern]

             [file [prefix]]




Richard Orr

Jan 5, 2021, 5:12:06 PM
to Rajath, Unpaywall discussion
Yep, that's the problem: split on your Mac is a different program and doesn't have an --additional-suffix option. You could split the files and then add a suffix with something like

for f in *; do mv "$f" "$f.csv"; done

Or install GNU coreutils (brew install coreutils) and run your original command with gsplit.
--
Richard Orr
Lead Developer, Unpaywall
Our Research: We build tools to make scholarly research more open, connected, and reusable, for everyone.

Rajath

Jan 5, 2021, 7:01:49 PM
to Richard Orr, Unpaywall discussion
I tried this:

beambooxs-MacBook-Pro:license-info-extract beambox$ gsplit -l 10000 --additional-suffix=.csv unpaywall_snapshot_2020-10-09T153852.jsonl.gz

But it is giving me encoded data in a very haphazard format; please find attached.

I am actually working on license-type analysis.

Thanks & Regards,
Rajath C S

Attachments: xaa.csv, xad.json

Bryan Newbold

Jan 5, 2021, 7:04:34 PM
to Rajath, Richard Orr, Unpaywall discussion
I think you need to run 'gsplit' on the uncompressed .jsonl file, not the compressed .jsonl.gz.

--bryan

Rajath

Jan 6, 2021, 12:54:23 AM
to Bryan Newbold, Richard Orr, Unpaywall discussion
Well yeah, lesson learnt:

beamboxs-MacBook-Pro:license-info-extract beambox$ gzcat unpaywall_snapshot_2020-10-09T153852.jsonl.gz | jq -r "[.doi, .oa_locations[0].license] | @csv" > t1.csv

The above worked for me.

Quick question, though: how can one go about processing the same gzip file into a database? I know Python can do it, but it seems highly tedious and heavy on overhead, and it isn't an in-memory computation.
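For what it's worth, Python can stream the gzipped JSONL straight into SQLite without decompressing to disk and without holding the file in memory; a rough sketch (the three columns are just an illustrative subset of the Unpaywall schema):

```python
import gzip
import json
import sqlite3


def load_snapshot(jsonl_gz_path, db_path, batch=10_000):
    """Stream a gzipped JSONL snapshot into a SQLite table, one batch at a time."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS articles (doi TEXT PRIMARY KEY, year INTEGER, is_oa INTEGER)"
    )
    rows = []
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            r = json.loads(line)
            rows.append((r.get("doi"), r.get("year"), r.get("is_oa")))
            if len(rows) >= batch:
                # Batched executemany keeps transaction overhead low.
                con.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)", rows)
                rows.clear()
    if rows:
        con.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()
```

The same batched-insert pattern works with a MySQL or Postgres driver in place of sqlite3.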

Thanks & Regards,
Rajath C S

dan.p...@deltathink.com

Jan 7, 2021, 12:35:06 PM
to Unpaywall discussion
To get the file into a database, I used an R script to convert the file into a CSV, then imported the CSV into the database. Creating the CSV took ages. You need to spend time getting your table structure correct: in MySQL, for example, I found VARCHAR fields worked better than (fixed) CHARs, and other optimisations didn't seem to yield much benefit. YMMV. The R script ignored the nested JSON records, which I didn't need for my analysis.

I have one "raw import" table in MySQL. The initial import took 2-3 hours (I think), and the same again to add indexes, on my 2016 Mac laptop. Things were then quicker, measured in minutes or tens of minutes.

The trick is to roll up your data as early in the process as possible. E.g. I want per-journal, per-year analysis with some bells & whistles. I copied the first 1m or so records to another table to help me debug my queries, then basically did a multi-level GROUP BY to roll the main data up into another "staging" table and get the record count down to a few million rows. At that point I could leave the master data set behind, and things became very malleable: seconds or tens of seconds to slice & dice the data. Credit to the Unpaywall team: the data is really clean.
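That roll-up step can be illustrated in miniature; this sketch uses SQLite rather than Dan's MySQL, and the column names (`journal`, `year`, `is_oa`) are invented for the example:

```python
import sqlite3


def roll_up(con):
    """Collapse the raw import into a small per-journal, per-year staging table."""
    con.execute("DROP TABLE IF EXISTS staging")
    con.execute(
        """
        CREATE TABLE staging AS
        SELECT journal, year, COUNT(*) AS n_articles, SUM(is_oa) AS n_oa
        FROM raw_import
        GROUP BY journal, year
        """
    )
    con.commit()
```

Once the staging table exists, queries run against a few million aggregated rows instead of the full snapshot.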

I'm curious to hear if anyone has used Google's BigQuery to connect directly to the snapshot online. I suspect it would make shorter work of this than my local copy of MySQL, and would let me skip the CSV step entirely. However, now that my complex queries are written, further updates become easier: it's mostly waiting for the lengthy initial CSV conversion & import to run, which can happen in the background.

HTH,
Best,
Dan

Rajath

Jan 7, 2021, 12:50:00 PM
to dan.p...@deltathink.com, Unpaywall discussion
Great to know you converted the dump to CSV. Yes, the Unpaywall data is surely clean and well maintained.

Here are some pointers towards using BigQuery: https://subugoe.github.io/scholcomm_analytics - but even this is not a direct connection to the online snapshot.

Thanks & Regards,
Rajath C S

Bianca Kramer

Jan 7, 2021, 2:39:24 PM
to Rajath, dan.p...@deltathink.com, Unpaywall discussion
Cameron Neylon's group at COKI (Curtin Open Knowledge Initiative) also uses Google BigQuery to query the Unpaywall database snapshot (and to combine it with data from other sources; see e.g. this OA dashboard: http://openknowledge.community/dashboards/coki-open-access-dashboard and the executable eLife paper: https://elifesciences.org/articles/57067/executable). I've had the opportunity to try their instance of GBQ and it works really nicely.

Example queries are included in the file 'queries.zip' in this Zenodo record: https://doi.org/10.5281/zenodo.3974291

I don't know about the initial processing to load the UPW snapshot into GBQ though. 

kind regards, Bianca 
