Authoring tools / Deliverables

Peter Sefton

Nov 12, 2010, 5:28:44 PM

to Beyond the PDF

Hi all,

Lots of talk here about RDF and NLM XML but we have not got to talking
about how these might be captured when someone sits down to write
their paper ("narrative", people are calling it).

There are several dimensions to the toolset needed (data management
framework with IDs for data, packaging mechanisms for bundling
research inputs and outputs), but one important one is an authoring
environment which fits with researchers' practice. As Lee noted, lots
use MS Word - but in an environment where we also have people wanting
WordPress and others working in OpenOffice.org and using text-based
formats like LaTeX, what can we build?

I am planning to demo some ideas that would take us towards
word-processor and web-based authoring tools in which authors can make
explicit links between their articles (and pre-articles) and data,
disambiguate terms, and embed metadata and structural labels
(abstract, intro, method, etc) from the start. I think some variant of
the MS Research Ontology and NLM Add-in for Word would be a potential
deliverable, although both have some serious limitations, not the
least of which is interoperability with other tools (even other MS
tools) in their current form. I will post a suggestion I put to MS
Research about what we could do to improve the Ontology Add-in soon.

Phil - I'd still like to get my hands on the author's Word manuscript
(or a similar one) - one of the things I think is important is closing
the gap between author tools and publisher tools. At the moment we
have people talking about doing stuff with XML - but where is the XML
coming from?

(One more deliverable I'd like to see, but which I think might be
contentious, would be a series of microformats and conventions that
together allowed us to build an HTML 5 based schema for research
communications - I am dubious about the value of very hierarchical XML
schemas like NLM or DocBook in this world where authors are using
tools that are not optimized for such deep hierarchies. Converting
arbitrary text to XML is expensive and hard - and no, Lemon8 XML does
not do the job at all well.)

--
Peter Sefton
Manager, Software Research and Development Laboratory,
Australian Digital Futures Institute,
University of Southern Queensland
Toowoomba Queensland 4350 AUSTRALIA

Work: sef...@usq.edu.au
Private: p...@ptsefton.com

IM accounts:
Gmail: ptse...@gmail.com
Yahoo: peter_...@yahoo.com
MSN: p...@ptsefton.com
AIM: ptsefton

p: +61 (0)7 4631 1640
m: +61 (0)410 326 955

USQ Website: http://www.usq.edu.au
Personal Website: http://ptsefton.com

Paul Groth

Nov 12, 2010, 5:46:36 PM

to beyond-...@googlegroups.com, Beyond the PDF

Hi Peter,

Just one comment on your idea for microformats: there's really no reason not to use RDFa there and leverage all the work that has gone into things like SALT, SWAN, etc. If you want simple HTML authoring, this can be easily accomplished using RDFa profiles.

Paul

Sent from my iPhone

Phil B

Nov 13, 2010, 2:55:44 PM

to Beyond the PDF

Peter - I added an EarlyDraft.zip file under "Files" on the Workshop
Website that hopefully meets your needs.

Cheers../Phil

Jodi Schneider

Nov 13, 2010, 3:03:29 PM

to beyond-...@googlegroups.com

On Sat, Nov 13, 2010 at 7:55 PM, Phil B <pebo...@gmail.com> wrote:

Peter - I added an EarlyDraft.zip file under "Files" on the Workshop
Website that hopefully meets your needs.

Direct link:

https://sites.google.com/site/beyondthepdf/file-cabinet/EarlyDraft.zip?attredirects=0&d=1

-J

Peter Sefton

Nov 13, 2010, 9:13:11 PM

to beyond-...@googlegroups.com

Hi Paul,

We will need some kind of profile, for sure - but even this simple
example on the RDFa profiles wiki is impossible to author in a word
processor:

<div class="haudio">
<span class="fn">Start Wearing Purple</span> by
<span class="contributor">Gogol Bordello</span>
</div>

I am working on examples that encode triples in URLs - a kind of
'nanoformat' - see my blog:
http://ptsefton.com/2010/11/14/before-beyond-the-pdf-authoring-tools-for-document-semantics.htm

Peter

Pgroth

Nov 14, 2010, 3:56:04 AM

to beyond-...@googlegroups.com, beyond-...@googlegroups.com

Hi Peter,

Cool, thanks for the blog; I see what you are trying to do. So you don't need users to write their own RDFa (a good thing); you need to mint new URLs identifying some metadata.

Question: do you care if the URLs are meaningful to people?

I'm thinking of a URL-shortener-like thing that produces these "metadata URLs".

Cheers,
Paul

Sent from my iPad

Peter Sefton

Nov 14, 2010, 4:01:40 AM

to beyond-...@googlegroups.com

Paul,

I think they need to be readable by machines so you can decode them
without resolving them; each URL can be read like a triple.

http://ontologize.me?triplink=http://perl.org/triplink/v/0.1&p=http://purl.org/dc/terms/creator&o=http://some-exmple-id

Here the subject is implicit (the page that holds the assertion). p is
the predicate or property and o is the object.

I would not expect users to mint these by hand, but they should be
usable in a wide variety of contexts.
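
A minimal sketch of how such a link could be decoded without resolving it, assuming the query parameters `p` and `o` carry the predicate and object as in the example URL above. The function name `decode_triplink` and the page URL `http://example.org/my-paper` are hypothetical, introduced only for illustration; the subject is supplied by the caller because it is implicit in the link.

```python
from urllib.parse import urlsplit, parse_qs


def decode_triplink(url, page_url):
    """Return (subject, predicate, object) for a triplink-style URL.

    The subject is implicit: it is the page that holds the link,
    so the caller passes it in as page_url.
    """
    query = parse_qs(urlsplit(url).query)
    predicate = query["p"][0]  # "p" is the predicate or property
    obj = query["o"][0]        # "o" is the object
    return (page_url, predicate, obj)


# The example link from the message, copied unchanged.
link = ("http://ontologize.me?triplink=http://perl.org/triplink/v/0.1"
        "&p=http://purl.org/dc/terms/creator&o=http://some-exmple-id")

# Hypothetical page that contains the link.
triple = decode_triplink(link, "http://example.org/my-paper")
print(triple)
```

Nothing is fetched over the network here: the triple is recovered purely from the text of the URL, which is the property being asked for.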

Peter

Martin Fenner

Dec 6, 2010, 2:07:19 AM

to Beyond the PDF

I do see a lot of value in WordPress as an authoring tool, and I hope
that there is some discussion around that topic at the workshop. There
are already several initiatives based on WordPress (e.g. Code4Lib,
Knowledge Blogs), and I have just created the website "Blogging Beyond
the PDF" (http://blogs.xartrials.org) to further work on this. My
first attempt using the PLoS Comp Biology paper provided by Phil gave
interesting results. I'm still gathering best practices, plugins,
themes, etc., so please drop me a line if you are also interested. A
good starting point might be my blog post
http://blogs.plos.org/mfenner/2010/12/05/blogging-beyond-the-pdf/ My
main arguments for WordPress are that it is cheap, easily modified or
extended, is based on a workflow concept, and is used by millions of
people - a much larger user base than any tool developed specifically
for scientists will ever have.

Martin

Hannover Medical School
Germany

Carl Leubsdorf, Jr.

Dec 15, 2010, 6:04:03 PM

to Beyond the PDF

Martin,

On Dec 6, 2:07 am, Martin Fenner <fenner.mar...@mh-hannover.de> wrote:
> I do see a lot of value in WordPress as an authoring tool, and I hope
> that there is some discussion around that topic at the workshop.

[..snip...]

> main arguments for WordPress are that it is cheap, easily modified or
> extended, is based on a workflow concept, and is used by millions of
> people - a much larger user base than any tool developed specifically
> for scientists will ever have.
>

Your thoughts are spot on. As someone who has long advocated the use
of WordPress in a variety of content-management and online publishing
applications, the notion of using WordPress with suitable enhancements
as a primary authoring and publishing tool for scholarly journals
resonates very strongly with me.

And, while the details of my project are still being finalized, I can
inform this group that I'll be developing a complete suite of
scholarly authoring, import, export, workflow, and publishing tools
based on WordPress.

I have some initial notes here for what I am tentatively calling
"Annotem" -- an authoring and publishing platform for scholarly
journals.

https://sites.google.com/site/beyondthepdf/workshop-papers/annotem-an-open-source-journal-authoring-and-publishing-platform-based-on-wordpress

I welcome any and all feedback, either via this discussion forum or at
the conference (though my exact role and activities there are of
course TBD).

-C

Peter Sefton

Dec 15, 2010, 9:35:49 PM

to beyond-...@googlegroups.com

Carl,

That's an impressive feature list.

I'm interested in whether you have any ideas about how the validating
XML authoring will work - will this be an adaptation of the built-in
WYSIWYG editor, a different existing component, or something
completely new?

Peter Sefton

Carl Leubsdorf, Jr.

Dec 15, 2010, 11:00:58 PM

to Beyond the PDF

Peter Sefton wrote:
>
> That's an impressive feature list.
>

Well, it is of course an initial list, subject to change (and budget!)
but I think these features are a minimum set for a complete working
system.

> I'm interested in whether you have any ideas about how the validating
> XML authoring will work - will this be an adaptation of the built-in
> WYSIWYG editor, a different existing component, or something
> completely new?
>

Most likely an adaptation of an existing WYSIWYG control, if possible.

At present the idea is to take an existing WYSIWYG control and apply a
set of filters (possibly using the existing control's plugin
architecture) to do two main things:

1. Lock down formatting to a relatively limited list of predefined
styles. You would for example be permitted to add a level 1 heading,
but not change the font or font size or color. All presentation
detail would be a function of the stylesheet applied in the 'publish',
'preview' or 'export to PDF' CSS. Elements such as tables, exhibits,
equations, or citations would be added only via the toolset provided
-- for example no HTML <IMG /> tags that are not part of a structured
figure or exhibit entity would be allowed, and we would require the
appropriate metadata (caption, number, source, etc.) as well.

2. Have a 'cleanup' function that would strip out any prohibited tags,
whether from pasted MSWord content or authors' monkeying about in the
HTML view or using some other editing method.
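
The 'cleanup' step in item 2 can be sketched roughly as follows. This is only an illustration of the idea using Python's standard `html.parser`, not the Annotem implementation; the whitelist `ALLOWED_TAGS` and the names `Cleanup` and `cleanup` are hypothetical, and a real list of permitted elements would be derived from the target DTD. It is not a security sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real one would be derived from the target DTD.
ALLOWED_TAGS = {"h1", "h2", "p", "em", "strong", "ul", "ol", "li"}


class Cleanup(HTMLParser):
    """Keep text and whitelisted tags; drop other tags and all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes (style, font size, color, ...) are discarded on purpose.
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is always kept, re-escaped for output.
        self.out.append(escape(data, quote=False))


def cleanup(html):
    parser = Cleanup()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


pasted = ('<p style="color:red"><font size="7"><b>Results</b></font>'
          ' were <em>good</em>.</p>')
cleaned = cleanup(pasted)
print(cleaned)
```

In this sketch the prohibited `font` and `b` wrappers and the inline style disappear while the text survives, which is the point at which the editor could prompt the author to mark large, bold text as a heading instead.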

This approach may seem somewhat draconian, but it keeps everything in
valid markup and ensures conformance with the DTD for export (and
later import to other systems). Much of the work will lie in making
it very clear what is happening, and prompting the author in an
intelligent yet non-annoying way that their large, bold text should
perhaps be marked as a heading.

As I said, this approach represents only my preliminary thoughts; I
suspect the thinking on this will evolve over time as we get into the
formal requirements phase in the coming weeks.

-C

Peter Murray-Rust

Dec 16, 2010, 12:17:10 AM

to beyond-...@googlegroups.com, Joe Townsend, Alex Wade

On Thu, Dec 16, 2010 at 4:00 AM, Carl Leubsdorf, Jr. <ca...@solvitor.com> wrote:

Peter Sefton wrote:
>
> That's an impressive feature list.
>

Well, it is of course an initial list, subject to change (and budget!)
but I think these features are a minimum set for a complete working
system.

I'd agree. We (Joe Townsend, copied, with support from Microsoft (Alex Wade)) have been working on doing this for chemistry. Joe has insisted that we adhere to strict validation, and this goes far beyond the DTD (which is almost useless in chemistry).

I will annotate the list. I see the chemistry as a component that can fit into the other requirements (editing, reviewing, etc.)

Rich, web-based scholarly article authoring and editing:

“What you see is what you get” (WYSIWYG) authoring with rich toolset (chemical molecules). YES. The chemistry is authored in a Word environment. The code and the format (OOXML) are open and could, with resources, be adapted to other tools such as OpenOffice or possibly LaTeX editors (but it needs funding).

Multiple import and export formats

YES. We can import from CDX, MOL, SMILES, and a lot more (some through web services)
Import XML formats for “round-tripping” of content. YES, we can round-trip CML. The other formats are broken and it's almost impossible to round-trip them: they lose masses of information. I should emphasize that IMO ONLY XML can support round-tripping.

A set of rich, visually beautiful templates for publishing online journals. YES. We are developing Chemical Stylesheets (ChSS), similar to CSS, so that the chemistry can be rendered in a wider variety of ways.

We would be delighted to be involved in providing the chemistry for this. We cover ca. 80% of chemical information in chemical publications, and the toolset supports the semantics but not always the user interaction.

Note that much chemical information involves data files which should NOT involve human interaction. Most journals require people to convert correct ASCII into broken Word. A common example is that minus signs are converted to em-dashes, which are then converted to an unknown character. This means that negative quantities are converted to positive. Most journals don't care - to them it's a minor point - it looks beautiful rather than being correct.

I should stress that encoding and code points are far more important than beauty. If you don't attend to this you can convert micro (mu) to milli (m) just by cut and paste.
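
A rough sketch of the kind of code-point check this implies, using only the Python standard library. The rule in `restore_minus` is a hypothetical heuristic, not an established algorithm: it treats a dash-like character as a sign only when it directly precedes a digit and does not follow a digit or letter, so ranges such as "5–10" are left alone. The function names are illustrative.

```python
import re
import unicodedata

# Dash-like code points that a word processor may substitute for a minus sign.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"

# Heuristic (an assumption): a dash directly before a digit, and not directly
# after a digit or letter, is meant as a sign rather than a range or hyphen.
SIGN = re.compile(rf"(?<![0-9A-Za-z])[{DASHES}](?=\d)")


def restore_minus(text):
    """Replace sign-like dashes with ASCII hyphen-minus."""
    return SIGN.sub("-", text)


def describe(text):
    """List the Unicode name of every non-ASCII character in text."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text if ord(ch) > 127]


print(restore_minus("\u20133.2 kJ/mol"))   # en dash used as a sign
print(restore_minus("pages 5\u201310"))    # en dash used as a range, unchanged

# The micro sign and the Greek letter mu are different code points that
# usually look identical; NFKC normalization maps the former to the latter.
print(describe("5 \u00b5L"), describe("5 \u03bcL"))
print(unicodedata.normalize("NFKC", "\u00b5") == "\u03bc")
```

A check like this cannot recover a micro sign that has already been pasted as a plain "m"; it only makes visible which code points a value actually contains before it goes any further down the workflow.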

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Fenner, Martin Dr.

Dec 16, 2010, 12:48:11 AM

to beyond-...@googlegroups.com

Carl,

This looks great; your workshop paper describes 100% of what I have in mind with WordPress. Based on the discussion on this list I am positive that we will be able to form a good-sized group of people willing to move this concept forward in the next 12 months. It would be lovely if we also found a publisher willing to work with us on this with one of its journals.

Kind regards, Martin

Peter Murray-Rust

Dec 16, 2010, 2:01:36 AM

to beyond-...@googlegroups.com

On Thu, Dec 16, 2010 at 5:48 AM, Fenner, Martin Dr. <Fenner...@mh-hannover.de> wrote:

Carl,

This looks great; your workshop paper describes 100% of what I have in mind with WordPress. Based on the discussion on this list I am positive that we will be able to form a good-sized group of people willing to move this concept forward in the next 12 months. It would be lovely if we also found a publisher willing to work with us on this with one of its journals.

I agree.

I would be delighted for a publisher to be involved. I don't want to bash publishers, but my colleagues and I have had uniformly awful experiences with publishers in trying to go anywhere beyond Word/PDF. Many publishers have a workflow which consists of flattening any semantic input to "epaper" and often having it retyped - which completely destroys the process. Publishers in general have very slow-moving production processes which don't map well onto any innovations.

Therefore the publisher has to work with what comes out of this project rather than constrain it to their workflow. This can be done - it happened with TeX where the maths and science publishers have accepted it and tooled up for it rather than imposing their own systems.

Moreover, every publisher has their own way of doing things. The way ahead is for US to decide and create what we want to use and then persuade the publishers to work with it. It might even become a minor selling point.

Phil B

Dec 16, 2010, 9:47:07 AM

to Beyond the PDF

I agree we need a visionary publisher; if we cannot find one, the
emergent tools will enable us to become one. The key issue in my mind
is then to get the broader scientific community, i.e. the authors, to
buy in. New scientific discoveries resulting from new ways of
communicating and analyzing open science will be the driver in my
view.

Phil B.

Peter Murray-Rust

Dec 16, 2010, 10:03:28 AM

to beyond-...@googlegroups.com

On Thu, Dec 16, 2010 at 2:47 PM, Phil B <pebo...@gmail.com> wrote:

I agree we need a visionary publisher; if we cannot find one, the
emergent tools will enable us to become one. The key issue in my mind
is then to get the broader scientific community, i.e. the authors, to
buy in. New scientific discoveries resulting from new ways of
communicating and analyzing open science will be the driver in my
view.

I agree completely. In the data-rich subjects, particularly physical science, there will be great opportunities for re-using a semantic journal for making discoveries.

We have put in a grant application to work with the European Geosciences Union (EGU), which publishes journals such as Atmospheric Chemistry and Physics. This is a prime candidate for semantic markup. If we are funded we should know before BTPDF, and this would make a really excellent example of relevant physical/geo science. From initial dealings they would be very receptive, I think. Another obvious sympathetic and technically oriented publisher is IUCr/Acta Crystallographica. Committed ISUs are among the best places to start - they understand the needs of the community, they have a long-term view and they are not driven by commercial interests.

We should also consider the reward structure. An author will - at least initially - have to go to greater trouble. So they should get points for others using their data. A very good model for reward is the Stack Overflow system for programming and other questions.

P.

Beck, Jeff (NIH/NLM/NCBI) [E]

Dec 16, 2010, 10:09:15 AM

to beyond-...@googlegroups.com

If we can come up with a tool that works (even one that mostly works), we
shouldn't have any trouble finding a few early adopter publishers.

We've been working with PLoS to get the contents of their PLoS Currents
title into PubMed Central (http://www.plos.org/journals/currents.php).
Currents uses a web-based WYSIWYG editor for all article authoring. Once
the articles are published on the Currents site (part of Google Knol), we
get a (mostly structured) XML submission for PMC.

I hope to get into the gory details of which parts of this workflow are
working and which are not during the Workshop. This leads nicely into why
we are interested in a WYSIWYG (or better WYSIWYM) authoring/editing tool
that writes truly structured content at the beginning - like the one that
Carl is working on.

Jeff Beck
Be...@ncbi.nlm.nih.gov

Carl Leubsdorf, Jr.

Dec 16, 2010, 10:34:37 AM

to beyond-...@googlegroups.com

A significant benefit of Annotem is the possibility for anyone to become a publisher. As Jeff indicated we will work with existing publishers to the extent possible, but any scientist or group will be free to set up their own journal at very low cost. Such self-published journals may generate their own following, a la TechCrunch, recently acquired by AOL, or they may be incorporated into 'legacy' publications.

Consider the case of Nate Silver's FiveThirtyEight.com political/statistics blog. Originally "just" a self-published blog, it is now a key part of the New York Times' political coverage. Both CNN [ http://wordpress.org/showcase/?s=CNN ] and the Wall Street Journal use WordPress itself for their own blog content.

--
_______________________
Carl Leubsdorf, Jr.
Solvitor ► Problem Solved.
+1 240 389 2255
ca...@solvitor.com
Skype: carlthewebmaster

Peter Murray-Rust

Dec 16, 2010, 10:36:42 AM

to beyond-...@googlegroups.com

On Thu, Dec 16, 2010 at 3:09 PM, Beck, Jeff (NIH/NLM/NCBI) [E] <be...@ncbi.nlm.nih.gov> wrote:

If we can come up with a tool that works (even one that mostly works), we
shouldn't have any trouble finding a few early adopter publishers.

I think it will be a few rather than a lot. When, for example, MathML came out and was supported in Word, many publishers rejected it. That's probably a higher-quality combination than we can possibly hope to manage across the board.

We've been working with PLoS to get the contents of their PLoS Currents
title into PubMed Central (http://www.plos.org/journals/currents.php).
Currents uses a web-based WYSIWYG editor for all article authoring. Once
the articles are published on the Currents site (part of Google Knol), we
get a (mostly structured) XML submission for PMC.

We need to distinguish between XML for structuring the document (chapter, para, reference, etc.) and the domain-specific non-textual markup. The general document stuff is addressed by NLM or DocBook or possibly HTML5. The rest - tables, graphs, chemistry, matrices, tensors, polyhedra, geospatial coordinates, scientific units of measurement, dates, ranges, errors etc. - is not. That's a multi-namespace problem. In some cases the namespaces don't interact - in some they do.

Bioscience is a hard conceptual subject with severe ontological problems and the community is addressing them as fast as humans can. But the concepts are often text/language oriented. How do you mark up a 3D map of time-dependent chemical pollution involving reactions? It can be done (I have started on this, but it's hard). I think we have to be very careful not to come up with solutions for areas in bioscience that don't translate to others.

I hope to get into the gory details of which parts of this workflow are
working and which are not during the Workshop. This leads nicely into why
we are interested in a WYSIWYG (or better WYSIWYM) authoring/editing tool
that writes truly structured content at the beginning - like the one that
Carl is working on.

If it does graphs it will be a great advance. I think graphs are one of the key areas we should concentrate on. It may be that a good tool there will work for people in other areas than authoring papers (e.g. authoring talks).

P.

Carl Leubsdorf, Jr.

Dec 16, 2010, 11:43:14 AM

to beyond-...@googlegroups.com, Joe Townsend, Alex Wade

Peter,

Thanks for your additional comments.

The question of additional markup and content (beyond the written, graphical, article content itself), and how to capture it, is probably one worthy of further discussion.

I would agree that dealing with the em-dash and curly quote issues involved when MSWord is part of the authoring chain is a core requirement for our system in the initial version. However, I expect that the first version of the tool will defer anything not explicitly part of the NLM Journal Article DTD, such as data, domain-specific markup, and the like.

Such add-ons could be added in a later release, and others would certainly be free to develop additional functionality in the form of plugins or ancillary tools.

Peter Murray-Rust

Dec 16, 2010, 1:11:14 PM

to beyond-...@googlegroups.com, Joe Townsend, Alex Wade

On Thu, Dec 16, 2010 at 4:43 PM, Carl Leubsdorf, Jr. <ca...@solvitor.com> wrote:

Peter,

Thanks for your additional comments.

The question of additional markup and content (beyond the written, graphical, article content itself), and how to capture it, is probably one worthy of further discussion.

I think this is the critical point! The NLM DTD is designed for bioscience, not for math or physical science. The "additional markup" is just as important for physical scientists as the NLM DTD is for bioscientists. This might constrain us to Beyond the Bioscience PDF.

I would agree that dealing with the em-dash and curly quote issues involved when MSWord is part of the authoring chain is a core requirement for our system in the initial version.

It's not just Word - it's almost any WYSIWYG word processor.

However, I expect that the first version of the tool will defer anything not explicitly part of the NLM Journal Article DTD, such as data, domain-specific markup, and the like.

It would be useful to know what the NLM DTD adds beyond, say, DocBook or OOXML (I'm ignorant here). I'm not promoting Word, but what would a WYSIWYG editor add (apart, of course, from being Open Source)? What might not be done with Open Office?

Such add-ons could be added in a later release, and others would certainly be free to develop additional functionality in the form of plugins or ancillary tools.

I think it would be years before this happened.

Beck, Jeff (NIH/NLM/NCBI) [E]

Dec 16, 2010, 1:34:41 PM

to beyond-...@googlegroups.com, Joe Townsend, Alex Wade

Peter,

It was not our intention, when we put the NLM DTD together, to design it for bioscience. Actually we made significant efforts to do just the opposite: design it for journal articles in general. And I think we have done that pretty well (AND, we are constantly taking user suggestions to do it better with each new version).

But, the general nature of the article model means that there is nothing specifically designed in for physical science or mathematics either. We do have a number of "escape valves" where domain-specific content can be tagged using the existing element and attributes. To be done well (that is, to control your domain-specific application of the general article model), you will need to apply a domain-specific validation layer on top of your basic schema validation.

The advantage of this is that we have a general article model that we can use to archive and exchange information with a shared set of tools. The disadvantage is that you don't get the domain-specific stuff for free with those general tools. One idea I've had recently that I haven't shared with Carl yet is to allow users (groups, really who are defining these domain-specific applications of the general model) … to allow these users to define another validation layer that can be pulled into the authoring/editing tool and used for validation and real-time checking during the authoring process.

In my non-tool-builder mind, this could be done with something as simple as a Schematron. (Schematron allows you to make assertions (with error messages or warnings) about a document and test those assertions.)
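
To make the idea of an extra validation layer concrete, here is a small sketch in Python. It is not Schematron itself (Schematron rules are written in XML with XPath tests); it only mimics the same pattern of "context, assertion, message" on top of a parsed document. The rules, the `validate` function and the sample document are hypothetical, and the element names are used purely for illustration.

```python
import xml.etree.ElementTree as ET

# Each rule is (context path, test function, message), loosely mirroring a
# Schematron rule with one assertion. These example rules are hypothetical.
RULES = [
    (".//fig", lambda el: el.find("caption") is not None,
     "figure has no caption"),
    (".//fig", lambda el: bool(el.get("id")),
     "figure has no id attribute"),
]


def validate(xml_text):
    """Return a list of messages for every assertion that fails."""
    root = ET.fromstring(xml_text)
    problems = []
    for path, test, message in RULES:
        for el in root.findall(path):
            if not test(el):
                problems.append(f"<{el.tag}>: {message}")
    return problems


doc = """<article>
  <body>
    <fig id="f1"><caption>Yield over time</caption></fig>
    <fig><graphic/></fig>
  </body>
</article>"""

problems = validate(doc)
for line in problems:
    print(line)
```

A domain group could maintain its own rule list in this style and have the authoring tool run it alongside the basic schema validation, reporting failures as warnings while the author is still writing.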

Jeff

Waard, Anita de A (ELS-AMS)

Dec 16, 2010, 1:31:06 PM

to beyond-...@googlegroups.com, Joe Townsend, Alex Wade

So, in the W3C Health Care and Life Sciences group we are trying to come up with an 'ontology of rhetorical blocks' that does not only work for biology - and incorporates EPub, PRISM, BiBo, FRBR, and other bibliographic standards: http://esw.w3.org/HCLSIG/SWANSIOC/Actions/RhetoricalStructure

Currently a first pass of an 'Ontology of Rhetorical Blocks' is out - http://esw.w3.org/HCLSIG/SWANSIOC/Actions/RhetoricalStructure/models/blocksontology - and we are working on a 'medium-grained' model - http://esw.w3.org/HCLSIG/SWANSIOC/Actions/RhetoricalStructure/alignment/mediumgrain

Paolo Ciccarese, Tim Clark, Jodi Schneider and I are all working on this, and very much invite comments, contributions, and discussions at or before the San Diego meeting.

Anita de Waard
Disruptive Technologies Director, Elsevier Labs
http://elsatglabs.com/labs/anita/
a.de...@elsevier.com

Carl Leubsdorf, Jr.

Dec 16, 2010, 1:43:11 PM

to beyond-...@googlegroups.com

Good points Jeff, thank you.

Peter, I would add that certainly the issues you raise are important to discuss. But when thinking about actually implementing a tool -- really getting in and building it, I would argue that a certain practical perspective is important to retain.

Software by its very nature "can do" just about anything. Whether it "should do" those things is of course the meat of this discussion, or any discussion about product scope.

However, from a practical standpoint, you have to start somewhere, with a version 1.0. I think there can and should be a vigorous discussion of what should and should not be included in that, but candidly that feature set will not include everything. The key is to make it useful enough that it is in fact used.

With respect to the use of Open Office, Word or other tools -- this project is very explicitly and by choice a web-based, WYSIWYG (better: WYSIWYM) system. Again, other projects may seek to provide DTD-based interoperability via various authoring tools -- this project seeks to enable the blossoming of authoring that the Web 2.0/blog revolution has sparked for non-scientific publishing efforts, albeit with semantic/markup controls that make the content more interchangeable.

-C

Peter Murray-Rust

Dec 16, 2010, 3:13:03 PM

to beyond-...@googlegroups.com, Joe Townsend, Alex Wade

On Thu, Dec 16, 2010 at 6:34 PM, Beck, Jeff (NIH/NLM/NCBI) [E] <be...@ncbi.nlm.nih.gov> wrote:

Peter,

It was not our intention, when we put the NLM DTD together, to design it for bioscience. Actually we made significant efforts to do just the opposite – design it for journal articles in general. And I think we have done that pretty well (AND, we are constantly taking user suggestions to do it better with each new version).

Understood and accepted, on the basis that a journal article is primarily structured text, where the structural components may have publishing-specific semantics representable by text.

But, the general nature of the article model means that there is nothing specifically designed in for physical science or mathematics either. We do have a number of "escape valves" where domain-specific content can be tagged using the existing element and attributes. To be done well (that is, to control your domain-specific application of the general article model), you will need to apply a domain-specific validation layer on top of your basic schema validation.

That is what we do in CML and Joe's toolkit for chemistry. However, DTD validation is incredibly weak for science. It tells you that you have certain chunks present in a certain order (possibly) and that's about it.

The advantage of this is that we have a general article model that we can use to archive and exchange information with a shared set of tools. The disadvantage is that you don't get the domain-specific stuff for free with those general tools. One idea I've had recently that I haven't shared with Carl yet is to allow users (groups, really who are defining these domain-specific applications of the general model) … to allow these users to define another validation layer that can be pulled into the authoring/editing tool and used for validation and real-time checking during the authoring process.

In my non-tool-builder mind, this could be done with something as simple as a Schematron. (Schematron allows you to make assertions (with error messages or warnings) about a document and test those assertions.)
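
A minimal Python sketch of the assertion idea, not Schematron itself: each rule pairs a context path with a test and a message, and the checker reports every element that fails. The two rules and the sample document are assumptions made up for illustration; only the element names `fig`, `caption`, `sec` and `title` are borrowed from the NLM article model.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: (context path, test on the matched element, message).
RULES = [
    (".//fig", lambda el: el.find("caption") is not None, "figure has no caption"),
    (".//sec", lambda el: el.find("title") is not None, "section has no title"),
]

def check(xml_text):
    """Return one message for every element that fails its assertion."""
    root = ET.fromstring(xml_text)
    failures = []
    for path, test, message in RULES:
        for el in root.iterfind(path):
            if not test(el):
                failures.append(f"{el.tag} {el.get('id', '?')}: {message}")
    return failures

# Illustrative document, not taken from any real article.
SAMPLE = """<article>
  <sec id="s1"><title>Methods</title>
    <fig id="f1"><caption>Yield by solvent</caption></fig>
    <fig id="f2"/>
  </sec>
  <sec id="s2"><p>Untitled section.</p></sec>
</article>"""

for problem in check(SAMPLE):
    print(problem)
```

A group defining a domain-specific application of the general model would supply its own rule list; the authoring tool would only need to run the checker and show the messages.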

We have found that Schematron is too weak and bloated for our purposes. Joe has built validation using a set of XSLT 1.0 stylesheets which does a good deal more.

Validation is a necessary process, but it's small compared with the totality of what has to happen in an editing tool. What validation gives you is the promise that your input will not crash the toolchain, or rather that your toolchain is prepared to read it in and use domain-specific checks.

I am not running down the idea of DTDs for authoring certain types of document, but we found we had to get rid of them for our work - the errors they catch are relatively minor.

We also have to distinguish - as I expect you do - between validating (which is automatic) and DTD-enhanced authoring - i.e. suggesting to an author what they can do. My concern is that they tend to constrain - they are useful for formal documents.

There is a more general concern which is that by applying DTDs to BTPDF we shall end up with results that look very like current articles and limit the imagination.

But thanks for the discussion.

Jeff

Martin Fenner

Dec 29, 2010, 12:31:20 PM

to beyond-...@googlegroups.com

Those on this list interested in Wordpress as a platform for authoring and publishing scholarly works might be interested in the BibTeX Importer Wordpress plugin (http://wordpress.org/extend/plugins/bibtex-importer/) that I released yesterday. The plugin imports a BibTeX file into the Wordpress Links Manager; from there, a reference can easily be inserted into a blog post or page. More info here: http://blogs.plos.org/mfenner/2010/12/28/wordpress-for-reference-management/

Wordpress is of course only one of several platforms for authoring/publishing, but I was positively surprised how easy it was to write a plugin. A group of people can certainly write a functioning plugin in a day. This makes Wordpress a very interesting prototyping platform for the ideas we will discuss at the workshop.

Kind regards,

Martin

Phillip Lord

Jan 5, 2011, 8:40:14 AM

to beyond-...@googlegroups.com

My apologies for zombie posting on this -- I was on leave at the time of
the discussion, and have just got back.

We've been working on wordpress as the basis for a publication
environment for most of last year. You can see the results at

http://www.knowledgeblog.org or

http://ontogenesis.knowledgeblog.org

which is a journal/book/tutorial for ontologies and their use in
biology. Initially, this started off as a pet project of my own; since
then, we have used funding from an EPSRC Network to generate content and
are now funded by JISC.

We see wordpress primarily as a publication tool. At our first workshop
where much of the content for ontogenesis was developed people moaned a
lot about the editing environment. We recently held another workshop
which produced http://taverna.knowledgeblog.org/. Here people used a
variety of tools -- mostly word, but also live writer and some text
tools (textmate/markdown, asciidoc/blogpost); our experience is that
they managed this with little effort beyond that of authoring the
article.

We've also used google docs, open office and latex. The point is, I
think, people already have their tool chains and already have their
collaborative environments (SVN, email, dropbox). Not that I am against
additions to wordpress to support collaborative editing; I just don't
think it is that important.

Of course, we lose the absolute WYSIWYG element, but then WYSIWYG isn't
really WYSIWYG anymore -- after all the acronym means "What you see (on
screen) is what you get (when you print it on paper)" -- I rarely do the
latter these days. In most cases, it's close enough, and it's easier
than traditional publishing where it takes several weeks to see what the
final form is; here, the author can update their article and see the end
published form automatically and immediately.

So far, from the JISC money, we have produced or repurposed

- lots of documentation (process.knowledgeblog.org)
- a process for formal review, based around the EditFlow plugin
- Support for maths with Mathjax (which gives scalable fonts, rather
than images), using mathml or tex markup
- A site-wide table of contents
- Article level table of contents
- Multiple author support
- Some customized themes
- Some latex to wordpress support (it works but is a little hard at the
moment).
- DOIs from DataCite.
- Archiving from the British Library.
- Content!

Ironically, given my background, at the moment, we haven't added any
support for semantic markup, but we hope this will come. We're also
working on references, both server-side (basically, the idea is, drop a
DOI into your article, wordpress will generate a link, and the full
reference list) and inside the tool chain (so latex users should get
bibtex, word users should get word tools). Our hope is that all of this
can be done "for real" -- that is as a usable process.
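
The server-side idea (drop a DOI into the article, get a link and a reference list) can be sketched roughly as follows. This is a Python illustration only, not the knowledgeblog code: the DOI pattern is deliberately simplified, the `https://doi.org/` resolver prefix is an assumption, and a real implementation would still have to look up metadata for each DOI to build full references.

```python
import re

# Simplified DOI pattern (an assumption): "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def link_dois(text):
    """Replace each bare DOI with a numbered HTML link and collect the DOIs cited."""
    refs = []

    def replace(match):
        raw = match.group(0)
        doi = raw.rstrip(".,;)")      # drop sentence punctuation caught by the pattern
        trailing = raw[len(doi):]     # and put it back after the link
        if doi not in refs:
            refs.append(doi)
        number = refs.index(doi) + 1
        return f'<a href="https://doi.org/{doi}">[{number}]</a>{trailing}'

    return DOI_RE.sub(replace, text), refs

# The DOI below is a made-up placeholder, not a real reference.
html, refs = link_dois("This is reviewed in 10.1234/abc.5678, and used in 10.1234/abc.5678.")
print(html)
print(refs)
```

Repeated DOIs share one reference number, so the list at the end of the article stays free of duplicates.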

We're hoping to finish off with some slightly more innovative and forward
looking stuff as demonstrators; we'd like to produce a microarray paper,
where the author generates no figures, but submits R and an array
express ID. Figures get generated on-the-fly, with the R still
accessible in the published version. I think this form of customisability,
adding features that are useful for some areas of science, is a key
advantage of this sort of environment.

We are actively pursuing this; we'd welcome collaboration from anyone else
who is generating tooling for use within wordpress. Like Martin and
Carl, I think wordpress gets us 90% of the way there. We need to plug
the gaps, support more tools, work with scientists' existing practices
and exploit the extensibility. I think that the structuring support
mentioned in Carl's paper sounds excellent, for instance, and is not
something we had in sight at the moment.

I hope that the workshop goes well; I would have loved to come, but I
can't travel at the moment.

Phil

Carl Leubsdorf, Jr.

Jan 13, 2011, 9:37:33 AM

to beyond-...@googlegroups.com

Thanks Phillip for your thoughts and comments, and apologies for the delayed response.

You wrote:

<<
The point is, I think, people already have their tool chains and already have their
collaborative environments (SVN, email, dropbox). Not that I am against
additions to wordpress to support collaborative editing; I just don't
think it is that important.

>>

I think where a collaborative authoring environment shines is in avoiding the 'which version is this' problem. MSWord has very nice tools for comparing versions, but it has to, because in most cases involving multiple authors you end up with many, many copies of the same document, often with cryptic filenames like BioChemPaper-v2-dec-21-with-CPL-comments-revised.doc. With hosted collaborative editing (e.g. within WordPress), having the edits in one place makes the experience more like version-controlled software development -- the code [content] is all in one place, and you can easily see who contributed what. The trick is going to be getting the WYSIWYG browser control to display those differences and changes in a meaningful way.
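
To make "display those differences" concrete, here is a rough Python sketch of a word-level comparison between two stored revisions, using the standard `difflib` module. The `[-...-]` and `{+...+}` markers are an arbitrary convention chosen for the example; a browser control would presumably render the same spans as strikethrough and highlight instead.

```python
import difflib

def word_diff(old, new):
    """Mark words removed from `old` as [-...-] and words added in `new` as {+...+}."""
    a, b = old.split(), new.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
            continue
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

print(word_diff("the results were poor", "the results were good"))
```

Attaching an author and timestamp to each stored revision would be enough to show who contributed what, since the comparison itself only needs the two texts.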

<<

Like Martin and Carl, I think wordpress gets us 90% of the way there. We need to plug
the gaps, support more tools, work with scientists existing practices
and exploit the extensibility. I think that the structuring support
mentioned in Carl's paper sounds excellent, for instance, and is not
something we had in sight at the moment.

>>

That's what Annotum is intended to do - plug those gaps.

Enforcing document structure is a major challenge with the use of current tools including MSWord (and most current WYSIWYG browser controls).

It's not so much a matter of WYSIWYG, but rather WYGIWYW - what you GET is what you WANT. I've worked with browser-based content management systems for a long time, and the challenges always come down to two things: making it as 'easy' or comfortable as the tools authors prefer (today this is likely MSWord; for us years ago at fool.com it was the AOL Mail client), and getting content that the system can use: headings properly marked, exhibits and figures properly captioned and sourced, and so on. This gets back to the point of using a centrally-hosted, collaborative editing paradigm: If all content can be entered via a mechanism that enforces structure, you can be sure that it can be formatted, displayed, and exported consistently.
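
As a rough illustration of what "a mechanism that enforces structure" might check on the HTML coming out of a browser editor, here is a Python sketch using the standard `html.parser` module. The two rules (no skipped heading levels, every `figure` has a `figcaption`) are assumptions picked for the example and are not a description of how Annotum works.

```python
from html.parser import HTMLParser

class StructureChecker(HTMLParser):
    """Collect structural problems: skipped heading levels and uncaptioned figures."""

    def __init__(self):
        super().__init__()
        self.problems = []
        self.last_level = 0
        self.in_figure = False
        self.figure_has_caption = False

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            level = int(tag[1])
            if level > self.last_level + 1:
                self.problems.append(f"heading jumps from h{self.last_level} to h{level}")
            self.last_level = level
        elif tag == "figure":
            self.in_figure, self.figure_has_caption = True, False
        elif tag == "figcaption" and self.in_figure:
            self.figure_has_caption = True

    def handle_endtag(self, tag):
        if tag == "figure":
            if not self.figure_has_caption:
                self.problems.append("figure without figcaption")
            self.in_figure = False

def check_structure(html):
    """Return the list of structural problems found in an HTML fragment."""
    checker = StructureChecker()
    checker.feed(html)
    checker.close()
    return checker.problems

print(check_structure("<h1>Title</h1><h3>Methods</h3><figure><img src='a.png'></figure>"))
```

An editor could run a check like this on save and refuse, or merely warn, depending on how strict the publication wants to be.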

-C

Phillip Lord

Jan 13, 2011, 12:26:37 PM

to beyond-...@googlegroups.com

"Carl Leubsdorf, Jr." <ca...@solvitor.com> writes:

>>>
> The point is, I think, people already have their tool chains and already
> have their
> collaborative environments (SVN, email, dropbox). Not that I am against
> additions to wordpress to support collaborative editing; I just don't
> think it is that important.
>>>
>

> With hosted collaborative editing (e.g. within WordPress), having the
> edits in one place makes the experience more like version-controlled
> software development -- the code [content] is all in one place, and
> you can easily see who contributed what. The trick is going to be
> getting the WYSIWYG browser control to display those differences and
> changes in a meaningful way.

Sure. But these tools exist as well. Google docs or Live Writer give you
hosted collaboration. Dropbox gives you pass-the-parcel semantics. SVN
gives you concurrent edit and merge (at least with latex). Git gives you
versioning on steroids.

It's a busy application space. My worry would be that anything added
into wordpress would just be the poor relation of these tools.

> <<
> Like Martin and Carl, I think wordpress gets us 90% of the way there. We
> need to plug
> the gaps, support more tools, work with scientists existing practices
> and exploit the extensibility. I think that the structuring support
> mentioned in Carl's paper sounds excellent, for instance, and is not
> something we had in sight at the moment.
>>>
>
> That's what Annotum is intended to do - plug those gaps.
>
> Enforcing document structure is a major challenge with the use of current
> tools including MSWord (and most current WYSIWYG browser controls).
>
> It's not so much a matter of WYSIWYG, but rather WYGIWYW - what you GET is
> what you WANT. I've worked with browser-based content management systems
> for a long time, and the challenges always come down to two things: making
> it as 'easy' or comfortable as the tools authors prefer (today this is
> likely MSWord; for us years ago at fool.com it was the AOL Mail client), and
> getting content that the system can use: headings properly marked, exhibits
> and figures properly captioned and sourced, and so on. This gets back to
> the point of using a centrally-hosted, collaborative editing paradigm: If
> all content can be entered via a mechanism that enforces structure, you can
> be sure that it can be formatted, displayed, and exported consistently.

I agree. There is a tension here. If you use existing tools, you make
the author's life easy, but the presentation harder. If you use bespoke
tools, you make the author's life slightly harder but the presentation
and structuring easier.

We're going for the former approach; I think that this is practical when
using wordpress because most of the hard work for linking between the
tools and wordpress has already been done. MS Word can already do blog
posting, so can live writer, so can google docs. If we had to get word
talking to wordpress ourselves, well, I wouldn't have gone this route.

None of these approaches contradict, of course. A collaborative
environment for wordpress would be excellent.

Phil
