JATS support and file formats


eleanor...@gmail.com

Apr 20, 2022, 11:39:28 AM
to EPrints UK User Group
Hi all,

For those I haven't met yet, I'm part of the digital research team at CoSector. We've been looking into options for supporting JATS in EPrints in response to Plan S and other policies around machine readability. 
It looks like there are some fairly well-established options for this but, from what I can tell, there's much more support for conversion from docx than from PDF. I wanted to reach out to the group to get your input on a couple of things:
  1. In my experience of repository content review, text files were most commonly uploaded in PDF format. Is this the case in other people's experience?
  2. Would you expect resistance from users to uploading in other formats such as docx (e.g. concern about the editability of docx)?
  3. The libraries that I've looked into are: docxToJats, meTypeset, pandoc, grobid, and CERMINE. If you know of other open source libraries facilitating this kind of conversion (ideally to JATS but those like grobid converting to TEI might be of interest too), I'd be very grateful to hear about them!
  4. Likewise, if anyone has experience with any of those I've already looked into that might be helpful it would be great to hear about that.
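
As a rough illustration of the pandoc route from point 3, here is a minimal sketch of wrapping the pandoc command line from Python. It assumes pandoc is installed and on the PATH; the function names and the extension mapping are hypothetical, chosen only for this example. PDF is left out of the mapping because pandoc has no PDF reader, which fits the docx-versus-PDF imbalance noted above.

```python
from pathlib import Path
import shutil
import subprocess

# Hypothetical mapping from file extension to pandoc reader name.
# PDF is deliberately absent: pandoc can write PDF but cannot read it.
PANDOC_READERS = {".docx": "docx", ".odt": "odt", ".tex": "latex"}


def build_pandoc_command(source: Path, target: Path) -> list[str]:
    """Return the pandoc argument list for converting `source` to JATS XML."""
    reader = PANDOC_READERS.get(source.suffix.lower())
    if reader is None:
        raise ValueError(f"no pandoc reader configured for {source.suffix!r}")
    return [
        "pandoc",
        "--from", reader,
        "--to", "jats",
        "--standalone",  # emit a complete document rather than a fragment
        "--output", str(target),
        str(source),
    ]


def convert_to_jats(source: Path, target: Path) -> None:
    """Run pandoc; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(build_pandoc_command(source, target), check=True)
```

Keeping the command construction separate from the call that runs pandoc makes the mapping easy to check without pandoc installed. How good the resulting JATS is will depend heavily on how consistently styles were used in the source file.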

Best,
Eleanor

eleanor...@gmail.com

Apr 22, 2022, 4:40:32 AM
to EPrints UK User Group
Replying to myself with some further considerations:
  • Conversion from other word processors like Apple Pages and Google Docs probably has an impact on file structure - any idea of how widely these are used would be helpful! I know a lot of us keep a hand in on the authoring side so might have thoughts.
  • The ability to convert from other formats like LaTeX and odt might make users more willing to upload in the format they're writing in (instead of requiring them to convert to docx). Again, an idea of how common these are would be helpful.
Best,
Eleanor

lindsayw...@googlemail.com

Apr 22, 2022, 11:03:46 AM
to EPrints UK User Group
Hi Eleanor,

1) Text documents (journal articles, conference items, book items) mostly come in to our repository service as PDFs currently. That may be because that is what we share/make open. We convert other received content into PDF format, as that was viewed as the best format back in the 2010s: consistent across platforms, free PDF viewers available, maybe less editable/fixed. We do get content sent to us in Word: whole articles, or sections/tables in multiple Word documents that we combine with image files. Physics / Maths / Stats / Computer Science papers are likely written in LaTeX/TeX and then converted into PDFs for our benefit.
2) I think there would be resistance to any change, but not necessarily that much, unless there had been a bad experience with DOCX. PDFs are still editable if someone wanted to edit them, unless security is set/locked for editing. *Receive* as DOCX or LaTeX or XML, but *distribute* as PDF on demand or XML might work, like PLOS ONE or Open Journal Systems. We would need to 'sell' why DOCX was now preferred for receipt. At a tangent, it seems like compliance with the legal accessibility requirements for repository content could be improved along the way, maybe?
3) Long term you would write in XML, or in a GUI writer that is XML/JATS underneath, like Overleaf (?) or Substance (which seems to have closed down?). DOCX is XML underneath within a Zip archive? Different subject disciplines are likely to adopt different approaches at different rates. The Janeway open journal system team, OJS (?) and the PLOS team might know about conversion libraries / preferences?
4) Sorry, no. The Plan S promotion of JATS seemed strange when I read it a while ago: more focused on the publishing industry than on current repository usage.
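
On the question in 3) about DOCX being XML inside a Zip archive: that is how the format is packaged, and the main body text lives in a part named word/document.xml. A minimal sketch using only Python's standard library is below. The helper names are hypothetical, and the in-memory package built at the end is a stand-in with the same container layout, not a valid Word document.

```python
import io
import zipfile


def list_package_parts(docx_file) -> list[str]:
    """Return the names of the parts stored inside a .docx (or any Zip) file."""
    with zipfile.ZipFile(docx_file) as package:
        return package.namelist()


def read_main_document_xml(docx_file) -> str:
    """Return the raw XML of the main document part as text."""
    with zipfile.ZipFile(docx_file) as package:
        return package.read("word/document.xml").decode("utf-8")


# Build a tiny stand-in package in memory so the example runs without a real
# file. Assumption: this mimics only the container layout of a .docx.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<w:document><w:body/></w:document>")

print(list_package_parts(buffer))
```

Passing the path of a real .docx to either helper works the same way, which is why conversion tools can read the document structure directly rather than guessing it from layout as they must with PDF.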

Apple Pages / Google Docs - not seeing these used a lot or sent our way.

Good that someone is considering implementation of Plan S in repositories.

Linds

eleanor...@gmail.com

Apr 27, 2022, 7:56:52 AM
to EPrints UK User Group
Hi Lindsay,

Thanks for this, lots to think about! 
In an ideal world, we'd be looking to ingest files in docx/LaTeX/odt and then convert them to PDF (or ideally PDF/A) for archiving and JATS for machine readability. Within that there are two possible scenarios: generating these at the time of upload, or generating them on the fly when they're requested. But that step is quite far in the future!
One of our main ideas is to minimise steps and changes in routine for researchers. Allowing them to upload in the format they've written in rather than converting to PDF might make the change in the upload process a bit easier. So if the conversion is possible they wouldn't need to convert e.g. LaTeX to docx or PDF.
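
To make the two scenarios concrete (convert at upload versus convert on request), here is a minimal sketch, not a design for EPrints itself. The class name, the format labels and the idea of passing converters in as plain functions are all assumptions made for illustration; a real converter would call out to whichever conversion library is chosen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# A converter takes the uploaded file's bytes and returns the derivative's bytes.
Converter = Callable[[bytes], bytes]


@dataclass
class DerivativeStore:
    """Hypothetical store: keeps the uploaded source, derives other formats."""

    converters: Dict[str, Converter]  # e.g. {"jats": ..., "pdfa": ...}
    eager: bool = False               # True: convert everything at upload time
    _sources: Dict[str, bytes] = field(default_factory=dict)
    _cache: Dict[Tuple[str, str], bytes] = field(default_factory=dict)

    def upload(self, doc_id: str, data: bytes) -> None:
        """Store the source as uploaded, in whatever format it was written."""
        self._sources[doc_id] = data
        # Drop derivatives of any earlier version of this document.
        for key in [k for k in self._cache if k[0] == doc_id]:
            del self._cache[key]
        if self.eager:
            for fmt in self.converters:
                self.get(doc_id, fmt)

    def get(self, doc_id: str, fmt: str) -> bytes:
        """Return a derivative, converting on first request and caching it."""
        key = (doc_id, fmt)
        if key not in self._cache:
            self._cache[key] = self.converters[fmt](self._sources[doc_id])
        return self._cache[key]
```

With eager=False nothing is converted until someone asks for a format, which saves work for items that are never requested. With eager=True conversion failures surface at deposit time, while the depositor is still around to fix the source file.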
You're absolutely right about accessibility. I know when we were auditing repositories using the WAVE tool, they got dinged for linking to PDFs, and machine readability obviously makes documents screen reader accessible too.
The question of Pages and Google Docs is definitely a bit more obscure. It stemmed from my testing these libraries with my own research, which was written in those two programs along with the occasional Markdown file. I don't imagine the difference between a docx/odt created in either of those and one created in Word would be immediately obvious, so it's perhaps more a question of how many people use alternative word processors themselves. I suspect I may be in the minority!

Best,
Eleanor