Best approaches for CSV with metadata and full text for thousands of items, also "Find available PDFs" on thousands of items.


Stan Rhodes

Mar 11, 2021, 1:48:21 PM
to zotero-dev
Hello,

I'm not sure of the best way to use Zotero to prepare thousands of articles and their full text for a massive, broadly interdisciplinary systematic quantitative literature review. The overall objective is to 1) use structural topic modeling across articles, which would include both some of the metadata and the full text; and 2) search the full text for certain key phrases which suggest that a term is being defined in that area of text. Thus, it's essential that I have both metadata and full text for the items.

A rough overview of the envisioned process:
1. Generate large lists of articles via Scopus.com searches, exported as RIS files of thousands of articles which match specific search queries. Depending on how well Scholarly can get decent metadata (e.g. not truncating journal names) for Google Scholar searches, I may use it too, since Google Scholar has better overall broad disciplinary coverage.
2. Minimal data cleaning, probably by converting RIS to CSV, cleaning in R/Python, then converting back to RIS files.
3. Import RIS files into Zotero.
4. !!! Run Zotero's "Find available PDFs" on all (potentially thousands of) items. This should generate full text for all items which have a PDF (probably 30-50% of the items).
5. !!! Using JS on the client side, or the Zotero API, generate a CSV for all items with full text.
6. Perform multiple analyses on CSV of items that includes metadata and full text.
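
Step 2 could be sketched roughly as below: a minimal RIS-to-CSV converter in Python. The default tag list and the function names `parse_ris` and `ris_to_csv` are assumptions for illustration, not anything Scopus or Zotero prescribes. Repeated tags such as AU are joined with "; ", and continuation lines are ignored, so a real export may need more handling.

```python
import csv
import io
import re
from collections import defaultdict

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: (.*))?$")


def parse_ris(text):
    """Return a list of records; each record maps an RIS tag to a list of values."""
    records = []
    current = None
    for raw in text.lstrip("\ufeff").splitlines():
        match = RIS_LINE.match(raw.rstrip())
        if not match:
            continue  # blank and continuation lines are skipped in this sketch
        tag = match.group(1)
        value = (match.group(2) or "").strip()
        if tag == "TY":  # TY opens a new record
            current = defaultdict(list)
            current["TY"].append(value)
        elif tag == "ER":  # ER closes the current record
            if current is not None:
                records.append(dict(current))
            current = None
        elif current is not None:
            current[tag].append(value)
    return records


def ris_to_csv(text, tags=("TY", "TI", "AU", "PY", "DO", "AB")):
    """Write one CSV row per record, one column per tag (assumed tag list)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(tags)
    for record in parse_ris(text):
        writer.writerow(["; ".join(record.get(tag, [])) for tag in tags])
    return out.getvalue()
```

Going back from the cleaned CSV to RIS would be the reverse loop: split each cell on "; " and emit one `TAG  - value` line per value, closing each record with `ER  - `.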

Steps 4 and 5 are where I'd super appreciate some insight into approaches and caveats.

I am definitely a novice when it comes to working with APIs, and also with JS and JSON. I have been able to modify some JS to retrieve full text from all selected items in Zotero, client-side, using the "Run Javascript" box. I haven't yet figured out how to generate CSVs which include full text and metadata for each item. Suffice to say, being able to use the Export menu with a checkbox option for including full text in the CSV would be a dream come true.

Cheers,
Stan

Emiliano Heyns

Apr 8, 2021, 3:15:29 AM
to zotero-dev
On Thursday, March 11, 2021 at 7:48:21 PM UTC+1 stanle...@gmail.com wrote:
> The issue I've had with MSA is that it has a semantic query layer which is "smart" but doesn't take exact phrases. But by doing that it outsmarts attempts to do reliable research and the results it returns are terrible. But perhaps a more targeted search for longer title strings from single articles gives much better results. I do have an email out to another researcher who is rumored to have code which might help me get what I need from MSA, but it's sort of a shot in the dark. Currently I am not accessing MSA via its API.


This is the work on MSA retrieval I did earlier. I thought it searched on title reliably, but I haven't worked on this in ages.

I don't know how smoothly step 4 would run or how much time it'd take, but it sounds like you're willing to tolerate it taking a bit of time. The full-text generation is automatic, although I believe it is clipped at a certain length by default. That is configurable though.

WRT 5, you can get the full-text either locally or via the API. The API is fairly simple but I haven't worked with it much. I have the most experience with the internal JS API. The choice seems to me to be between

  1. Use the API, which is probably simpler because it is fully documented, but since you want to write results back to Zotero, you have to deal with sync.
  2. Use the internal JS API. You don't have to deal with sync (Zotero does that for you), but the internal API is not (to my knowledge) documented. It's pretty consistent, but I've learned about it by just reading the relevant parts of the Zotero source code. If you're not already pretty comfortable with JS, that is likely going to be a hurdle.

As far as I know, the client is going to be involved in any case, because it is the client that extracts the full-text from PDFs.
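
For the API route, reading one attachment's indexed full text might look like the sketch below. It assumes the Zotero Web API v3 full-text endpoint (`/users/<userID>/items/<itemKey>/fulltext`) and an API key with read access to the library; the user ID, item key, and helper names are placeholders. The item key has to be that of the PDF attachment, not of its parent item, and the text only exists on the server once the client has indexed and synced it.

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.zotero.org"


def fulltext_request(user_id, attachment_key, api_key):
    """Build the GET request for one attachment's indexed full text."""
    url = f"{API_ROOT}/users/{user_id}/items/{attachment_key}/fulltext"
    headers = {"Zotero-API-Version": "3", "Zotero-API-Key": api_key}
    return urllib.request.Request(url, headers=headers)


def fetch_fulltext(user_id, attachment_key, api_key):
    """Return the indexed text, or None if the server has none for this item."""
    request = fulltext_request(user_id, attachment_key, api_key)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response).get("content")
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no full text stored for this attachment
            return None
        raise
```

A call would be something like `fetch_fulltext("123456", "ABCD2345", "my-api-key")` with your own values. For thousands of items you would want to pace the requests and honor any backoff the server asks for.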

Stan Rhodes

Apr 14, 2021, 6:28:39 PM
to zotero-dev
I tried BulkMAS.js but it returns 0 items. It took a bit of work to get there, as I haven't messed with translators before. Debug output is "Translate: BulkMAS: 1: no title, skipping" with a number for each of the rows. I tried a .csv with both title and Title as a column name. I also tried putting the entries in quotes but then the format was rejected. My basic format was title,abstract for the first row and then for subsequent rows titletext,no-abstract. I also tried it as a single column with title at the top and no commas at all. I don't know if it's the translator or something with my CSV, but I tried every variation of reasonable CSV I could think of. I looked at your code on github, but didn't see anything obvious. My understanding is that it wouldn't say "no title, skipping" unless it found the CSV and its title header legitimate. But why are the rows after that not being detected as titles? If you find a fix for it, I'll test the result and make a pull request to update the readme with more precise instructions for those new to downloading translators (sorry, I don't have much more to offer in return).

Yes the docs at https://www.zotero.org/support/dev/client_coding/javascript_api seem very much under construction. But using the tidbits on the Zotero internal JS API I have been able to get an array of Zotero items. At least, I think it's an array, which has an object in it for each item, like "0": { "key": "12345" ... } "1": { ...}. And I have been able to output a basic text file from that Run JS window, so I'm working on combining the two and parsing the array into a CSV format and writing a CSV locally.
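
The parsing step described above could be sketched as follows, in Python rather than in the Run JS window. It assumes the items were saved as JSON text, either as a real array or as an object keyed "0", "1", and so on. Only `key` comes from the description above; `title` and `fullText` are hypothetical field names, as is the function name `items_to_csv`.

```python
import csv
import io
import json


def items_to_csv(json_text, columns=("key", "title", "fullText")):
    """Flatten a JSON array (or index-keyed object) of items into CSV text."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        # An array-like object: order the items by their numeric index.
        items = [data[index] for index in sorted(data, key=int)]
    else:
        items = list(data)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells; fields outside `columns` are dropped.
        writer.writerow({column: item.get(column, "") for column in columns})
    return out.getvalue()
```

The csv module quotes cells that contain commas, quotes, or line breaks, which matters here because extracted full text is full of all three. Building rows by hand with string joins tends to break on exactly those cases.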

Thanks SO much for mentioning the full text length limit setting. I had no idea about this. It seems to be a reasonable length, but I need to check some papers like the long ones you see in Physical Review B or whatever, which can be a 40-page review of some topic (e.g. networks). I very much appreciate your reply; it's been helpful.


Emiliano Heyns

Apr 15, 2021, 6:00:09 AM
to zotero-dev
If you give me a few titles I can take a look.

Stan Rhodes

Apr 15, 2021, 12:29:40 PM
to zotero-dev
Below is the title column I was testing with as a csv:
title
Zika virus infection in pregnant women in Rio de Janeiro
De novo mutations in the sodium-channel gene SCN1A cause severe myoclonic epilepsy of infancy
Group techniques for program planning A guide to nominal group and Delphi processes
A de novo paradigm for mental retardation
Memoires for Paul de Man The Wellek Library Lectures at the University of California Irvine
De novo CNV analysis implicates specific abnormalities of postsynaptic signalling complexes
Manual de peixes marinhos do sudeste do Brasil VI
De novo carcinogenesis promoted by chronic inflammation is B lymphocyte dependent
The evolution of exchange rate regimes since 1990 evidence from de facto policies
Effect of gemtuzumab ozogamicin on survival of adult patients with de-novo acute myeloid
Pitfalls of liver stiffness measurement a 5-year prospective study of 13369 examinations
Prevalence and independent risk factors for erectile dysfunction in Spain results of the
Diagnosis of fibrosis and cirrhosis using liver stiffness measurement in nonalcoholic fatty
Mixed biodiversity benefits of agri-environment schemes in five European countries
Proposed minimal standards for the use of genome data for the taxonomy of prokaryotes
Forbidden mass range for spin-2 field theory in de Sitter spacetime
Myocardial infarction redefined-A consensus document of The Joint European Society
Thermodynamics of irreversible processes
Third Reference Catalogue of Bright Galaxies Volume III

Matthew Robertson (mpr1255)

Feb 27, 2023, 8:37:43 PM
to zotero-dev
Stan, did you make any progress on this? I have a very similar use case and it appears the key missing piece is access to the "Retrieve Metadata" API. I wish Zotero would simply expose that, and even charge for using it. It's a very powerful API for identifying the DOI of a PDF file. That, and finding the PDF itself, are the two most powerful functions of Zotero, yet it appears they can only be performed inside the GUI, which severely limits large-scale quantitative bibliographic analysis. Stan, if you have any further information or insights, they're most welcome.

Stan Rhodes

Feb 28, 2023, 1:54:01 AM
to zotero-dev
Hi Matthew,

I've roughly completed this project, which is the first chapter of my dissertation, but have not gone back and cleaned, diagrammed, and ordered all the scripts and processes I used to enable public replicability. That will be happening, and is a necessary part of me fully completing this project. This sort of computer-aided systematic quantitative lit review is a real bear to do. Right now, anyway.

I do have this code up on my GitHub; see if it can help you get started on what you want to do: https://github.com/stanleyrhodes/ch1-hier-casqlr/blob/main/zotero_json_fulltext.js

I did just use "Find Available PDFs" from within the GUI. There can be issues with doing it and hanging... if you can find my thread in the Zotero forum about it, the Zotero folks had some insight into what was happening and why, and I think perhaps a bug report was filed for one aspect of it.

Oh, I will also put out a general warning that I put the lit review directory in my library, which then ended up mixing the project with my general and large collection which I usually managed through the root library. I used WindingWind's very nice zotero tag addon to tag all lit review lit in the lit review folders with tags to help keep them straight and ensure I didn't have any cross-library leakage--accidental deletions or updates when working with the large bibs. Ideally, some future Zotero will have a feature that allows a sort of project library "container" that keeps everything nice and separate and helps with replicability and peer review.

I'm afraid I don't really have the bandwidth to answer further questions at this point (I'm laser-focused on my dissertation and defense), but I will be updating my GitHub account with everything related to this project in the next couple of months. The goal would be to accurately and with-decent-usability document exactly what I did rather than try to build a toolchain. Many people are more capable than I am on that toolchain front, but I am hoping this example will show them the value of working a bit in that direction. On the one hand, it's hard to see the average Zotero user even considering doing a lit review like this. User quality of life issues and features obviously outrank this madness. On the other hand, we're in the stone age for literature reviews, and we really do need to claw our way to better tech and better times. Although the Zotero stuff was work, my biggest gripe is actually about the lengths one has to go to in order to automate search and get data from Google Scholar, the only broadly academic full-text search engine I've found.

Cheers,
Stan