Version of Record's future.

Luciano Panepucci

Aug 16, 2016, 9:54:22 AM
to elife-continuum-list
In the last few years we have seen all the big players investing heavily in converting XML/JSON into HTML as a richer, media-, data- and link-enabled format.

How do you see this trend and are you considering moving the version of record from PDF to a more modern format or will the PDF continue to be THE version of record and all else only by-products?

Siân Roderick

Aug 18, 2016, 11:08:53 AM
to elife-continuum-list
Currently at eLife the XML is the basis for several formats (PDF, HTML on the website, and the eLife Lens format), so it might be more accurate to say that the XML is the version of record.

Luciano Panepucci

Aug 18, 2016, 2:51:07 PM
to Siân Roderick, elife-continuum-list
Yes, I tend to agree with you, especially because the XML (usually) carries more metadata and information than can (or must/should) be displayed in other formats. An example is the granular date information that a publisher may use to keep track of versions etc.

But IF the XML is to be treated as the version of record, then shouldn't one (anyone) be able to download it just as we do with PDFs? But then again, a lot of the information used by the publisher in the XML is relevant to the scientific record. Does all of that really need to be made available, and may it even be?

Maybe a more important discussion would be: What information should the version of record preserve?

Kaveh Bazargan

Aug 23, 2016, 10:34:31 AM
to elife-continuum-list
XML should certainly be the version of record. But there is a problem – and please note I am not referring to eLife here but talking generally. Very few publishers actually look at their XML. Authors, proofreaders, desk editors, all look primarily at the PDF, or perhaps HTML. Hence there is a lot of bad data and even errors in publisher XMLs. So they are understandably reluctant to have that as their VoR.

Melissa Harrison

Aug 31, 2016, 4:31:29 AM
to elife-continuum-list
Hi Kaveh

Couldn't agree more, and that is why organisations like JATS4R are so important! Also, as more and more publishers build their PDF and HTML from the XML, the XML has to be the version of record. At eLife we are currently going through a review process of our XML tagging and will be giving our vendor new instructions. We are also building Schematron validation to try to catch errors that could be missed using the more usual HTML/PDF proofing methods. We are starting with the references and will make the Schematron files available on GitHub so other publishers can use them as well.
However, I agree with you that the XML is not the "proofing" method for most publishers, so poor XML can creep through without anybody being aware. Also, production departments are often not responsible for the XML, so again there is a disconnect. How we resolve this as an industry is tricky. At eLife the production department is responsible for the XML and works closely with the production vendor to generate good-quality XML that conforms to our requirements. We aim to think of XML first and of the PDF and HTML as byproducts of it. However, we are aware our XML is littered with examples to the contrary and we are working on ways to rectify this going forward.

Thanks!

Melissa

Kaveh Bazargan

Aug 31, 2016, 4:44:42 AM
to elife-continuum-list
All sounds good, Melissa. I really want to get to a situation where publishers only publish the XML and nothing else. PDF, HTML etc are then generated on the fly from the XML. Then there is one "record" only. If the vendors who claim they generate PDFs from XML automatically are really doing that, then this should be trivial. (We are creating PDF in the cloud from XML.) BTW I gave this talk on what I called the Format of Record.

Luke Skibinski

Aug 31, 2016, 7:41:09 AM
to elife-continuum-list
Kaveh, all of our XML is available online (https://github.com/elifesciences/elife-article-xml), so if you have a process that takes JATS XML and outputs a nicely formatted PDF I'd like to see it. The caveat here is 'nicely formatted' - any old PDF is never enough, it must be *pretty*. As a developer and sysadmin I'm happy with 'pretty enough', but there is an awful lot of hand tweaking that goes into the PDF authoring I've seen to make it beautiful. I haven't spent enough time on it to see how practical it would be to automate.

Luciano, the great thing about XML is the DTD as it describes the 'shape' of the data expected and provides some degree of certainty. DTDs can be extended with new rules. Other tools like Schematron can layer rules on the top. The problem with XML is that it's difficult to navigate and transform. Many tools exist, but it's a very specialized area. Most interactions with XML and HTML these days are shallow - scraping/extraction and output (generating html/xml for other unlucky souls/browsers to deal with).
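
To make the idea of layering rules on top of the DTD concrete, here is a minimal Python sketch of a Schematron-style check. It is not Schematron itself and not eLife's actual rule set: the rule (a journal citation should carry a year and a source), the sample document, and the function name `check_references` are all assumptions made for illustration. The element names (`ref`, `element-citation`, `year`, `source`) are standard JATS.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two JATS references, the second one missing its <year>.
SAMPLE = """<article><back><ref-list>
  <ref id="bib1"><element-citation publication-type="journal">
    <year>2016</year><source>Example Journal A</source></element-citation></ref>
  <ref id="bib2"><element-citation publication-type="journal">
    <source>Example Journal B</source></element-citation></ref>
</ref-list></back></article>"""

# Assumed house rule: these children must be present and non-empty.
REQUIRED_CHILDREN = ("year", "source")


def check_references(xml_text):
    """Return one message per journal citation that breaks the assumed rule."""
    root = ET.fromstring(xml_text)
    problems = []
    for ref in root.iter("ref"):
        citation = ref.find("element-citation")
        if citation is None or citation.get("publication-type") != "journal":
            continue  # the rule only applies to journal citations
        for name in REQUIRED_CHILDREN:
            child = citation.find(name)
            if child is None or not (child.text or "").strip():
                problems.append(f"{ref.get('id')}: journal citation lacks <{name}>")
    return problems


for message in check_references(SAMPLE):
    print(message)
```

A DTD would accept both references, since the missing element is optional in the grammar; a rule of this kind is what catches the gap that HTML/PDF proofing might not.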

Because of the unwieldy nature of XML and its tooling across the different programming languages we use here at eLife, we're specifying a version of article data in JSON using JSON Schema (which also describes data 'shapes', like the DTD). It's a huge undertaking, and while it doesn't remove our dependency on XML, DTDs, Schematron, etc., it is allowing us to move faster with development, knowing that a whole host of our applications 'downstream' from the article JSON generator can ignore XML entirely.

As Melissa said, XML is our VOR, but, if anything, JSON is the likely contender for our VOR in the (distant) future, with JATS XML, HTML, PDF, etc as derivatives from it.

Luciano Panepucci

Sep 1, 2016, 2:40:28 PM
to Luke Skibinski, elife-continuum-list
I agree with Kaveh that publishers should publish only the XML and nothing else. But this is a looong path... It would only be conceivable if anyone could pick up any XML and read it as if it were just a piece of paper (nice talk, Kaveh!). Well.. for this to become a reality the XML should be minimally standardized AND there should be a free XML Reader, just as there are free PDF Readers.

The minimally standardized XML gets closer every day as more publishers agree on AND USE JATS and JATS4R.
As for the free XML Reader, if anyone is interested in funding such a project I have a bunch of ideas of how such a reader should work, look and feel... anyone?

Luke: I have a huge problem with the "nicely formatted" requirement. Don't get me wrong. I do like things to look pretty, and I understand that editors want their PDF "byproduct" to look as much nicer than their competitors' journals as possible. However, "nice" is a very subjective thing. You are completely right that XML is a great thing because it has rules (the DTD). This is why your citing JSON as a possible contender for the future VOR confuses me... JSON is schemaless and does not have a DTD counterpart.

Kaveh Bazargan

Sep 1, 2016, 3:18:32 PM
to Luciano Panepucci, Luke Skibinski, elife-continuum-list

Sorry for the late reply from me...

Thanks for your thoughts Luciano...

Please allow me to agree with all of you, including Luke's requirement for pretty PDFs! I am actually a typography fan and I love good page design and beautiful PDFs. It is not correct to think that automated PDF from XML cannot be beautifully typeset too. We are doing that daily. Here are some PDFs that have been generated by our pagination system fully automatically from client XML:

Design Science Journal (fully Open Access)
Journal of Fluid Mechanics (this page is Open Access)
Nature Materials (Subscription)

In case you are interested, the XML is converted to a LaTeX file on the fly and typeset through TeX. 

And the big question of how we cater for pagination niceties, e.g. 
  • Place figure at top of page
  • Do not hyphenate this word
  • Keep with next line
  • etc
well, we embed these in the XML as processing instructions or comments, and they pop out in the pagination stage. So the XML is correct and the PDF looks nice too. ;-)
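
The approach described above can be sketched roughly as follows. This is not River Valley's system: the `typeset` processing-instruction target, the hint names `float-top` and `keep-with-next`, and the `to_latex` function are all hypothetical, and LaTeX escaping of the text is left out. The point is only that the hints travel inside the XML without touching its content and surface when the LaTeX is generated.

```python
from xml.dom import minidom

PI_TARGET = "typeset"  # hypothetical processing-instruction target

SAMPLE = """<article>
  <p>Flow past a cylinder.</p>
  <?typeset keep-with-next?>
  <p>The wake is shown in the figure.</p>
  <?typeset float-top?>
  <fig id="f1"><caption><p>Wake structure.</p></caption></fig>
</article>"""


def text_of(node):
    """Concatenate all text beneath a node, normalising whitespace."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return " ".join("".join(parts).split())


def to_latex(xml_text):
    """Turn the toy article into LaTeX, honouring the pagination hints."""
    doc = minidom.parseString(xml_text)
    out = []
    placement = "htbp"  # default float placement when no hint is given
    for node in doc.documentElement.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            if node.target != PI_TARGET:
                continue
            hint = node.data.strip()
            if hint == "float-top":
                placement = "t"  # applies to the next figure only
            elif hint == "keep-with-next":
                out.append(r"\nopagebreak")
        elif node.nodeType == node.ELEMENT_NODE and node.tagName == "p":
            out.append(text_of(node) + "\n")
        elif node.nodeType == node.ELEMENT_NODE and node.tagName == "fig":
            out.append(r"\begin{figure}[%s]" % placement)
            out.append(r"\caption{%s}" % text_of(node))
            out.append(r"\end{figure}")
            placement = "htbp"
    return "\n".join(out)


print(to_latex(SAMPLE))
```

Any consumer that ignores processing instructions sees exactly the same article, which is why the XML stays correct for other uses.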

Of course we have to keep working to get the XML standardized, so that it records data and is not just a way to get PDFs. That way we can have truly exchangeable XML that can be converted to any readable format on the fly. (And Melissa is to be thanked for driving forward JATS4R.)

Love to discuss XML rendering with you, Luciano...

Regards
Kaveh


Luke Skibinski

Sep 2, 2016, 4:31:59 AM
to elife-continuum-list
Kaveh - is your work available under an open source licence? We're really not interested in more vendor lock-in if it's not.

> for this to become a reality the XML should be minimally standardized

JATS is a standard. JATS4R is helping to tighten things up. We may argue that the standards in JATS are too loose but are there other efforts to describe article structure in existence?

> JSON is schemaless and does not have a DTD counterpart..

JSON is just a data serialization method. Data is data and there are any number of tools for describing data schemas. Relational databases are famous for that. Data represented as JSON has JSON Schema available to it: http://json-schema.org/
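
As a rough illustration of describing the 'shape' of JSON data, here is a small Python sketch. The schema below is invented for this example and is not eLife's article schema; the checker is hand-written and covers only the `type`, `required`, `properties`, and `items` keywords for objects, arrays, and strings. A real project would use a full JSON Schema validator rather than this subset.

```python
import json

# Hypothetical schema, written with standard JSON Schema keywords.
ARTICLE_SCHEMA = {
    "type": "object",
    "required": ["id", "title", "published"],
    "properties": {
        "id": {"type": "string"},
        "title": {"type": "string"},
        "published": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
    },
}

# Only the JSON types this sketch knows how to check.
PYTHON_TYPES = {"object": dict, "array": list, "string": str}


def check(instance, schema, path="$"):
    """Return a list of violations of the (subset of) schema rules."""
    expected = schema.get("type")
    if expected and not isinstance(instance, PYTHON_TYPES[expected]):
        return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(check(instance[key], subschema, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for index, item in enumerate(instance):
            errors.extend(check(item, schema["items"], f"{path}[{index}]"))
    return errors


good = json.loads('{"id": "example-1", "title": "A title", "published": "2016-09-02"}')
bad = json.loads('{"id": 1, "title": "A title"}')

print(check(good, ARTICLE_SCHEMA))
print(check(bad, ARTICLE_SCHEMA))
```

The schema plays the same role for JSON that the DTD plays for XML: downstream code can rely on the declared shape instead of re-checking it.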

And our own humble efforts at defining an article schema for eLife: 

As for the XML reader, most browsers support displaying XML natively. I suspect what you're after is not so much an XML reader as a JATS XML reader? In which case there are efforts underway to make our Lens project usable offline as well as this effort by the people at Sciencefair: https://github.com/codeforscience/sciencefair

Kaveh Bazargan

Sep 2, 2016, 4:47:18 AM
to Luke Skibinski, elife-continuum-list
On 2 September 2016 at 09:31, Luke Skibinski <l.ski...@elifesciences.org> wrote:
> Kaveh - is your work available under an open source licence? We're really not interested in more vendor lock-in if it's not.

No, it is not open source, Luke. I was just describing the technology we are using. My point is that we have built an engine that goes from pure XML to a beautiful PDF. And I have described how we do it. I am surprised that we still seem to be the only people doing this. If there were others doing it, no one would be locked in as you simply switch to another vendor doing XML > PDF. 

I love open source and we have released a lot of open source software, but it has taken some 15 years to build this system. You may know how competitive the market is and how prices are forced down by big publishers every year, so I cannot release this as open source.

As it happens I am just in the middle of a blog post arguing that to prevent vendor lock-in, open standards are more important than open source!
 

> > for this to become a reality the XML should be minimally standardized
>
> JATS is a standard. JATS4R is helping to tighten things up. We may argue that the standards in JATS are too loose but are there other efforts to describe article structure in existence?

We have just implemented our XML ProofCheck system for a big publisher who said all their content was standard JATS. So far we have implemented 20 journals, and every one is done differently. So it is a standard in name only, as it was deliberately made to be flexible. JATS4R is a great initiative to fix that.

I do like Scholarly HTML as a possible alternative. Time will tell.
 


--
Kaveh Bazargan 
Director
River Valley Technologies

Luke Skibinski

Sep 2, 2016, 5:02:47 AM
to Kaveh Bazargan, elife-continuum-list
I didn't know about Scholarly HTML, I'll give that link a read.

I appreciate the need to make a profit and applaud you for your open source contributions.
--

Luke Skibinski

Formally defined,
Extensively tested,
Cryptic, abstract, thing.

--

eLife Sciences Publications, Ltd is a limited liability non-profit non-stock corporation incorporated in the State of Delaware, USA, with company number 5030732, and is registered in the UK with company number FC030576 and branch number BR015634 at the address First Floor, 24 Hills Road, Cambridge CB2 1JP.

Kaveh Bazargan

Sep 8, 2016, 4:28:44 AM
to elife-continuum-list
Hi Luke

A short blog post from me for your interest. ;-)


Regards
Kaveh