Exploring Scholarly HTML – the Hackfest
To some of us it's pretty clear that HTML is THE document format for scholarly documents – and the web is the glue that will bind together C21 research. In March we're going to bring together hackers and thinkers in Cambridge, England to riff on the topic “Exploring and defining Scholarly HTML”.
What should scholarly documents look like on the web? What do we call the research object of the future? How do we link to data files in robust, sustainable ways, and allow research objects to come alive when placed in the right environment? How do we embed metadata, scientific semantics, formal semantic statements, citations, etc. in a research object? How do we write these documents? How do we let machines contribute, annotate, and take over writing the boring bits? How do we ship research to mobile devices, the tablet du jour, and to discipline and institutional repositories? How do we get rid of empty rituals like citation formatting? (Hint: a semantic link to a reliable shared bibliography would do.) How do we seed documents with annotation points for comment and review and have that work across all the places research reaches?
We're sure that the hackfest will bring together people and their pet technologies: word processors, WordPress, repositories like ePrints and Dspace, file management like Subversion, Git and DropBox ™, ePub, SWORD, and the Blue Obelisk Chemistry toolkit.
The goal? Running code ready for early adopter researchers, with an emphasis on chemistry, implementing and defining "Scholarly HTML".
I'm new to BtPDF, but working in a tiny way on some similar issues.
Though I dabble in the peer-reviewed lit., most of my work is lower-
class reports and writing for the public. I'm currently working with
a Django programmer to build tools for publishing technical content
Effectively, they already did. PLoS Currents, for example (published by
PLoS, obviously), runs on Google Knol. PLoS Currents is a great
idea, I think, although the dependency on Knol is less so. Ultimately, I
think, we need to fit in with existing scientific tooling.
> I think that if there were some stunning examples of publications that used
> this sort of approach, some scientists would follow. I wonder if it would
> be possible to publish in a conventional journal, but provide an interactive
> version of the publication as an online supplement?
This would be twice the work. Publishing is already a hell of a lot of
> I've found it difficult to get across the idea that if I build an
> interactive report, a .pdf version is necessarily incomplete. And it can be
> hard to communicate to readers that they can do things like zoom in on a
> I can definitely see reasons to focus on just getting scientists into html
> as a first step. But maybe in parallel we can explore some further steps
> outside the inertia of the peer-reviewed literature. The nonprofits and
> consultants I work with are really interested to make their reports readily
> accessible, so maybe we can come up with some tools that will ultimately be
> useful to academia as well.
This is part of the idea behind my knowledgeblog.org project. A
light-weight publishing platform suitable for formal publications.
WordPress already gives us publishing, tool integration, RSS, media
handling, word clouds, and the rest. We've borrowed and repurposed a
peer-review system. We've added citation, maths and some cite indexing
support. Archiving and DOIs come from colleagues. It's all pluggable --
you can take what you want and you can add what you want to it.
If you want zoomable maps, then you could add for instance this...
It's all just a plugin away. This is the advantage of adapting commodity
software to fit, as opposed to the current bespoke system. Someone has
already done most of the work for you.
http://ontogenesis.knowledgeblog.org has been running for a year now.
Our primary focus for this was getting a useful resource out; that is,
it is the content that counts. It's a small experiment, to see whether
we could replace the existing academic book publishing process. We
got 12k reads in the first year. A book wouldn't have been published
> The WordPress map insertion relies on Google's My Maps... There are
> severe limitations here. The main ones are that the size and number
> of data objects is limited, and there's no facility for sophisticated
> data management... e.g. dynamic content generation. It's fine for a
> traditional blog, but if we're talking scientific publications then
> you'd want to have your maps talking to datasets.
Yes, but then it would be good enough for many purposes. It's always
possible to find "what if" examples where commodity technology is not
enough. The solution, then, is that you have to build something more
Given that the current technology is a) take a picture of the map, b)
stick it into a PDF, any advance is a good thing.
> Two versions of a paper isn't really twice the work... make the
> online/interactive version and then generate a static image of it
> (this is what I'm doing for a client who insists on a .pdf
Yes, and we've achieved both .pdf and epub versions in the same way. Of
course, these format translations are not lossless or perfect.
> Analysis tools (e.g. an analytical machine at some far-away lab) are
> linked to an online database as soon as they begin running samples.
This already happens in some areas, although it's not the norm.
> This sort of system would also be a natural setting to handle a series
> of revisions and reviews, possibly extending beyond a publication that
> has entered the scientific literature proper. As soon as the author
> is comfortable with the public seeing their work, they publish a
> draft. This draft is given a unique permanent url, and people can
> reference it. After peer review (which might be publicly posted) a
> final draft is concocted.
Yes. Ontogenesis works in this way.
> Anyone who navigates to the original draft sees a prominent note at
> the top that there's an updated version. Similarly, if mistakes are
> discovered or additions concocted after final publication, a new final
> publication could be generated that is referenced from the original
> "final" publication.
We have a variation on that theme, but yes, all versions are accessible.
> Given the limitations of what is available right now, open-source
> custom software seems like not a bad way to go. Presumably there will
> be many visions of what is "beyond the .pdf" before we actually do
> move beyond the .pdf.
Absolutely. It's a slow process.
Bretwood <hig...@gmail.com> writes:
> I like the open peer reviews on Ontogenesis. I know some are
> skeptical of review being open, I think in part because that creates
> social pressure to hold back controversial criticism. But to me it
> seems like there are as many issues with closed peer review. And
> reviews can include important content that it's a shame to hide.
I agree with this. I think it can hold back criticism, but mostly
unsupported criticism. There are some questions remaining in my mind for
reviewing -- for ontogenesis, our authors have generally said that they
prefer a more collaborative relationship. Perhaps, reviewers should in
future be considered to be "one-step-removed" authors? Open review, as
it stands, though, also makes it more worthwhile for reviewers; as they
are no longer anonymous, they can demonstrate their own work.
We chose open review in the first instance for pragmatic reasons --
it saves messing around with privacy and access control. I think it has
academic benefits also.
> I'm a geologist, which puts maps at a high priority. Google provides
> lots of great functionality in its maps API, so it's not that far out
> there to make a more functional version.
> I have a couple of somewhat interactive reports up that use some of the
> elements I'm interested in. The first hand-built report looks a
> little nicer and has some elements the newer one lacks, but hopefully
> within months the admin-supported report will catch up in styling and
> functionality. The biggest hurdle is supporting the maps, but we're
> getting there.
> 2009 report hand-built in HTML: http://www.groundtruthtrekking.org/Reports/FaultHunt01/
> 2010 report supported by a Django admin:
These look really nice. If you are interested in working on something
similar for knowledgeblog please let me know. If not, I'd just be
interested in the sort of functionality you need.
We're getting a bit off topic here -- if you want to talk, can I suggest
knowledgeb...@knowledgeblog.org, which is publicly accessible.
Bretwood <hig...@gmail.com> writes:
> Interesting stuff. Can you point to some prime examples of
> KnowledgeBlog in action?
The general plan is to CREATE something during the time that PT is here. PT runs a world-class team at the University of Southern Queensland which has created a proven Open toolset based on WordPress for high-quality scholarly documents (e.g. course materials, papers, theses). Martin has likewise pioneered many plugins for WordPress.
We shall invite Peter and Martin to give presentations (but this will need to be on a weekday).
The theme is Scholarly HTML with particular emphasis on data publication. It is to give authors the freedom to author as they wish, not as they are constrained by the recipient. A consequence is that all data should be semantic (i.e. understandable by machine). This means that bitmaps such as PNG should be replaced or augmented by – say – SVG or HTML5. Much of the impetus for the meeting came from “Beyond the PDF” run by Phil Bourne and Anita de Waard.
In general we would like to be able to publish:
- Semantic (mainly rectangular) tables where columns have defined semantics
- Semantic graphs where axes are semantic and points, lines, bars etc are first-class objects
- Maths (MathML)
- Semantic bibliography (technically solved, but we’d like to include online OPEN resources, e.g. from Open Bibliography)
- Scalable diagrams (probably SVG)
- Chemistry/crystallography as CML
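As a rough illustration of the first item in this list, here is a minimal sketch of a table whose columns carry machine-readable semantics. Nothing here is an agreed Scholarly HTML convention: the `data-property` and `data-unit` attribute names, the `semantic_table` helper and the `example.org` term URIs are all assumptions made up for this example.

```python
from html import escape


def semantic_table(columns, rows):
    """Render rows as an HTML table whose header cells declare their meaning.

    columns: list of (label, property_uri, unit) tuples; unit may be None.
    rows: list of lists, one value per column.

    The data-property / data-unit attribute names are hypothetical, chosen
    only to show where column semantics could live in the markup.
    """
    head_cells = []
    for label, prop, unit in columns:
        attrs = f' data-property="{escape(prop, quote=True)}"'
        if unit:
            attrs += f' data-unit="{escape(unit, quote=True)}"'
        head_cells.append(f"<th{attrs}>{escape(label)}</th>")

    lines = [
        "<table>",
        "  <thead><tr>" + "".join(head_cells) + "</tr></thead>",
        "  <tbody>",
    ]
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("each row needs one value per column")
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append(f"    <tr>{cells}</tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder columns and values, for illustration only.
    columns = [
        ("Bond", "http://example.org/terms#bond", None),
        ("Length", "http://example.org/terms#bondLength", "angstrom"),
    ]
    print(semantic_table(columns, [["C1-C2", 1.54]]))
```

A reader (human or machine) that understands the property URIs can then treat the second column as bond lengths in ångströms rather than as bare numbers; whether that is best expressed with `data-*` attributes, RDFa or something else is exactly the kind of question the hackfest could settle.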
There will be many ideas but as a focus we have come up with a unifying project. After discussion with Simon Hodson (JISC) and Brian McMahon (IUCr) we plan to implement the following idea in our JISCXYZ project and to start this during the hackfest. (Simon and Brian hope to be present for some of the time).
A data-journal for crystallography
Every week Crystaleye aggregates (automatically) a few hundred structures and creates fully semantic CML. These are currently published as HTML pages with embedded CML and PNGs (http://wwmm.ch.cam.ac.uk/crystaleye). A typical page (there are ca 250,000) is http://wwmm.ch.cam.ac.uk/crystaleye/summary/acta/c/2008/01-00/data/av3113/av3113sup1_I/av3113sup1_I.cif.summary.html (you can twiddle the molecule and create the unit cell by clicking). We wish to create a “data publication” from this material.
The proposed data journal will automatically select ca 10 interesting structures per week and publish these as a Scholarly HTML blog. The hackfest will educate us in the best ways of representing these as Scholarly HTML and allowing the best modes of presentation. Because we shall be using a blog, readers can comment on these structures using the blog mechanism and also add their own ideas about interesting structures that we have not included. In this way we hope to build up a sense of publication and comment.
There is also the possibility for readers to submit their own structures which will be automatically validated during the submission process. We’ll work very closely with the IUCr during this. We can add to the interest by having ranking tables for authors or contributors and having various “records” such as largest structure.
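The weekly selection, ranking tables and "records" described above could be sketched roughly as follows. This is only an illustration under stated assumptions: how a structure is judged "interesting" is not defined anywhere yet, so the `score` function is left to the caller, and the record layout (`id`, `authors`, `atom_count`) and function names are hypothetical rather than anything Crystaleye actually produces.

```python
from collections import Counter


def pick_weekly_issue(structures, score, n=10):
    """Return the n highest-scoring structures, best first.

    score is any callable mapping a structure record to a number; what
    counts as "interesting" is an open question, so it is passed in.
    """
    return sorted(structures, key=score, reverse=True)[:n]


def contributor_ranking(structures):
    """Return (author, structure_count) pairs, most prolific first."""
    counts = Counter(
        author for record in structures for author in record["authors"]
    )
    return counts.most_common()


def largest_structure(structures):
    """Return the record with the most atoms (one possible 'record')."""
    return max(structures, key=lambda record: record["atom_count"])


if __name__ == "__main__":
    # Made-up placeholder records, not real Crystaleye data.
    week = [
        {"id": "s1", "authors": ["A"], "atom_count": 12},
        {"id": "s2", "authors": ["A", "B"], "atom_count": 40},
        {"id": "s3", "authors": ["A"], "atom_count": 25},
    ]
    # Atom count stands in for an "interest" score purely for the demo.
    issue = pick_weekly_issue(week, score=lambda r: r["atom_count"], n=2)
    print([r["id"] for r in issue])
    print(contributor_ranking(week))
    print(largest_structure(week)["id"])
```

Keeping the scoring function separate from the selection step means the community can argue about (and swap) the definition of "interesting" without touching the publishing pipeline.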
I added the hackfest to the BtPDF calendar. I'd like to encourage people to put events on the calendar because I think it is helpful to see with just a glance what everyone else is up to even if we can't attend all the events. I wish I could be there in person, but I'll definitely jump on etherpad during the event.
We are planning to continue the impetus of the Beyond the PDF "Writing" group in an informal hack event in Cambridge UK in mid-March. We are making arrangements for Peter Sefton and Martin Fenner to be physically present in our group during the dates Mar 11-Mar 20 2011. The precise dates on which each of them will be here will be posted later.
A hackfest consists of a collection of geeks fuelled by geek food and drink. Dates and times depend on local availability of rooms, etc. The possibilities are some or all of
* 2-day hackfest on Mar 12-13. Sat/Sun
* 1-day hackfest on Mar 20. Sun (it's science open day on Sat so we might go to the Panton Arms or the Open Knowledge Foundation)
* an ad hoc seminar given by whomever.
In January we ran a hackfest for #pmrhack which was very successful and had about 16 people over a Sat and Sun. It consisted of some tables, free wifi, laptops, and some geek food. Our group will be involved - and it's likely that we'll be looking at integrating web services such as OSCAR (chemistry annotation) and OPSIN (chemistry name2structure). Ben O'Steen is also on our projects and we hope he'll be here for significant chunks. We shall also welcome visitors during the week but we may have to specify particular days as this is in term and rooms are scarce.
The goal of the coming hackfest is to create a prototype of "Scholarly HTML". This prototype will allow us to create a (probably declarative) approach to compound scientific documents, with embedded behaviour such as chemistry using CML, or data visualisation. It is platform-agnostic, but the likely tools are WordPress and ePrints, although others can be used.