Annotation


Phil B

Nov 9, 2010, 1:13:53 AM
to Beyond the PDF
I see this as a separate thread. I think of annotation in the broadest
sense, that is taking existing data/information and adding additional
data/annotation (ie value) to it in some way. So I can imagine
discussing here:

- semantic tagging
- metadata
- curation

etc.

I have added a section under projects and links on the website to
support content posting.

/Phil

Jakob Voss

Nov 9, 2010, 6:17:43 AM
to beyond-...@googlegroups.com
On 09.11.2010 07:13, Phil B wrote:

> I see this as a separate thread. I think of annotation in the broadest
> sense, that is taking existing data/information and adding additional
> data/annotation (ie value) to it in some way. So I can imagine
> discussing here:
>
> - semantic tagging
> - metadata
> - curation
>
> etc.

On the one hand you are right, but on the other this subsumes
everything under annotation. Ted Nelson called the same concept
"hyperlink", but it was misunderstood and implemented as a degenerate,
embedded, one-way link instead of a bidirectional connection of
document fragments. The crucial points are:

1. annotations are not linked to documents, but to specific segments of
documents. For instance you can highlight a whole paragraph.

2. each annotation is a document itself.

Technically there is no difference between versioning and annotating.
Instead of having a workshop where everyone tries to get his personal
piece of the cake, ignoring the whole picture, you could just try to
implement Xanadu as planned and described 50 years ago.

But I doubt that many will follow, and fancy bullshit terms like
"semantic tagging" (subject indexing with controlled vocabularies has
been known for decades) are much more attractive. End of the rant.

Practically it makes sense to separate annotations as one topic. But you
need a clear definition of annotations and use cases, instead of trying
to capture "the broadest sense". In the broadest sense, annotations are
hyperlinks in their *original* form, which subsume packaging, provenance,
and versioning.

I think we should at least try to distinguish (see point 1 above):

- Annotations that connect whole documents (comments, reviews, tag sets
etc.) to another document as a whole

- Annotations that refer to a specific part of a document

Unluckily the distinction is less easy than it seems. For instance
metadata can refer to a whole document, but also to parts of it. As soon
as you deal with a package format, some metadata will likely refer to
single parts (e.g. files) of the package.

Moreover (point 2 above), most systems do not treat annotations as
first-class documents at the same level as the annotated documents. This
is a design failure that is difficult to get rid of. At least you should
clearly articulate whether the annotations you speak of can also have
annotations, and whether the latter have a different status than the
former.
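
To make these two points concrete, here is a minimal sketch (Python with
rdflib; the example.org URIs and the "annotates"/"body" properties are
invented for illustration, not taken from any existing vocabulary) of an
annotation that targets a fragment of a document and is itself a
first-class, annotatable document:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS

    EX = Namespace("http://example.org/vocab/")  # made-up vocabulary, illustration only

    g = Graph()

    paper = URIRef("http://example.org/paper/123")
    fragment = URIRef("http://example.org/paper/123#paragraph-7")  # a specific segment
    note = URIRef("http://example.org/annotation/456")             # the annotation itself

    # Point 1: the annotation targets a segment of the paper, not the whole paper
    g.add((note, EX.annotates, fragment))
    g.add((fragment, DCTERMS.isPartOf, paper))

    # Point 2: the annotation is a first-class document with its own metadata and body
    g.add((note, DCTERMS.creator, Literal("Some Reader")))
    g.add((note, DCTERMS.created, Literal("2010-11-09")))
    g.add((note, EX.body, Literal("This paragraph overstates the case.")))

    # ...and it can therefore be annotated itself, with no special status
    reply = URIRef("http://example.org/annotation/789")
    g.add((reply, EX.annotates, note))

    print(g.serialize(format="turtle"))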

> I have added a section under projects and links on the website to
> support content posting.

Peter Sefton already pointed [1] to an existing summary of annotation
projects [2], which should be linked.

Jakob

[1]
http://ptsefton.com/2010/11/05/towards-beyond-the-pdf-a-summary-of-work-weve-been-doing.htm/comment-page-1#id9

[2] https://fascinator.usq.edu.au/trac/wiki/Annotate/existing

--
Jakob Voß <jakob...@gbv.de>, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de

Jodi Schneider

Nov 9, 2010, 7:07:53 AM
to beyond-...@googlegroups.com
On Tue, Nov 9, 2010 at 11:17 AM, Jakob Voss <jakob...@gbv.de> wrote:
...

> you could just try to implement
> Xanadu as planned and described 50 years ago.

That would be nice!

...


> I think we should at least try to distinguish (see point 1 above):
>
> - Annotations that connect whole documents (comments, reviews, tag sets
> etc.) to another document as a whole
>
> - Annotations that refer to a specific part of a document

Well-said, Jakob!

>
> Unluckily the distinction is less easy than it seems. For instance metadata
> can refer to a whole document, but also to parts of it. As soon as you deal
> with a package format, some metadata will likely refer to single parts (e.g.
> files) of the package.

This suggests the need to model granular parts as individual objects,
in a (hierarchical?) container structure. Perhaps the entire document
is a single object until further granularity is needed.

Even reviews (which you describe as referring to the 'whole document')
frequently refer to parts ("the introduction", "chapters 3-9", quoted
passages, etc.).

To me, the act of pointing out a particular part or section is what
calls that part into being as a first-class object of interest.

> Moreover (point 2 above), most systems do not treat annotations as first
> class documents at the same level as annotated documents. This is a design
> failure that is difficult to get rid of. At least you should clearly
> articulate whether annotations you speak of, can also have annotations, and
> if the latter have different status than the first.

One question is whether ALL annotations need to be first class objects.

I added these to the project areas page:
https://sites.google.com/site/beyondthepdf/project-areas

Meanwhile I also added the Annotation Ontology [3] and related paper [4].

-Jodi

[3] http://code.google.com/p/annotation-ontology/
[4] Ciccarese P, Ocana M, Das S, Clark T. AO: An Open Annotation
Ontology for Science on the Web. Paper at Bio-ontologies 2010
http://esw.w3.org/images/c/c4/AO_paper_Bio-Ontologies_2010_preprint.pdf

Tim Clark

Nov 9, 2010, 7:36:25 AM
to beyond-...@googlegroups.com
Thank you for adding AO, the Annotation Ontology, Jodi.

I'll just also point out

- AO is a lineal descendant of W3C's Annotea project and is based on extending that previous work.
- AO was developed in tandem with an Annotation Framework (AF) which will be available for use as part of the NIF (Neuroscience Information Framework) and supports both manual and automatic annotation of web documents.

Going back to a previous point, I think it is important to recognize the centrality of scientific papers as the core units of work, credit, and collaboration in the information ecosystem of science. This may seem obvious but I want to suggest that thinking about how we can go "Beyond the PDF" should not decouple us from the central importance of papers & associated social discourse in science (since around the mid-seventeenth century!). Science as a whole has evolved with scientific papers as its fundamental boundary objects and scientific publication as its fundamental social activity.

Therefore I would tend to place critical importance on annotation, i.e., associating metadata with the original scientific documents, as a means to "unlock" and extend content in the existing scientific paper, and suggest others consider this line of thinking.

Best,

Tim

Waard, Anita de A (ELS-AMS)

Nov 9, 2010, 9:09:44 AM
to beyond-...@googlegroups.com
I completely agree with Tim,
a) "science has evolved with scientific papers as its fundamental boundary objects and scientific publication as its fundamental social activity"
b) "[we should focus on] associating metadata with the original scientific documents, as a means to "unlock" and extend content in the existing scientific paper"

I have not seen meaningful, smaller-grained publication formats that offer anything other than 'a machine-readable summary of pertinent results' - but a publication is *not* the same as a set of results.

This is one reason we are hellbent on employing linked data principles with our content - I also think this is the way out of the commercial-open conundrum, and it offers help with provenance (something we should all be worrying about, a lot!).

The way I see it, the XML an author creates (with help from the publisher/journal perhaps, or with tools that just work well) is (after editing and review and such) the final version. The XML allows finely grained entry points that are URI-able (e.g. every paragraph gets a hashtag) and therefore externally accessible and discussable. The XML is frozen at time of 'publication' (however that happens), but the discussions and ensuing interpretations are liquid and open. The trick is how to access research data in that same way, and how research data standards can best be integrated with document standards: how do I point to a point in my data set within my document, using linked data standards?
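
Very roughly, the pattern I have in mind looks like this (a sketch only: all of the URIs and the property name below are invented, purely to illustrate; they are not an existing publisher or repository scheme):

    # Every paragraph of the frozen, published XML gets a stable entry point...
    ARTICLE = "https://doi.example.org/10.9999/example-article"
    paragraph_uri = f"{ARTICLE}#para-17"

    # ...and so does every addressable point in the deposited research data.
    DATASET = "https://data.example.org/experiment-42"
    datapoint_uri = f"{DATASET}#plate-3/well-B07"

    # A linked-data statement can then tie a claim in the text to the data
    # behind it (the property URI is hypothetical):
    triple = (paragraph_uri, "http://example.org/vocab/citesData", datapoint_uri)
    print(triple)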

As far as I know these two worlds are currently disjunct; uniting them would be a lofty goal.

-a.

Anita de Waard
Disruptive Technologies Director, Elsevier Labs
http://elsatglabs.com/labs/anita/
a.de...@elsevier.com


Phil B

Nov 9, 2010, 1:15:19 PM
to Beyond the PDF
Personal comment:
A good scientific paper is a beautiful thing; take 5 minutes or less
to read http://www.nature.com/nature/dna50/watsoncrick.pdf to be
convinced. We certainly do not want to hobble the ability to
communicate science. Good papers are literary works and stories to be
preserved. Having said that, they are also supposed to speak to
reproducibility and suggest how the work can be built upon. These two
fundamentals are missing in most cases. Data and methods (where
available in digital form) integrated with the discourse would go a
long way towards solving the first problem, and creating more of a
living document, where any point in time can be easily viewed, would
go a long way towards resolving the second.

Cheers../Phil


Eduard Hovy

Nov 9, 2010, 4:21:26 PM
to beyond-...@googlegroups.com

Hi all,

As the various recent contributions make clear, there are lots of
different perspectives one can take on the problem of 'beyond' PDF.

One can evaluate a work or some piece of it; one can extend, correct,
change ... some piece of work or datum; one can link additional
papers, authors, data, etc., to a work or some piece of it; one can
interpret [sections of] a work and assign labels or types to it; etc.,
etc.


But always, there are these three things:
- the central piece of work, as Theme
- the comment/extension/evaluation/etc., as Annotation
- the commenter/author/interpreter (which may be a machine), as Annotator


It might be helpful to organize our discussion (now, and at the
meeting) along these three themes.
Viz.:

1. Theme.
We seem agreed that the Theme is in PDF. But of course it might be
raw data files as well, or video. Does it matter? Isn't all we need
a reliable storage medium and delivery mechanism, with appropriate
linkage and archiving capabilities? Yes? No?

2. Annotation.
We seem agreed that the details of the Annotation are important. But
perhaps we should decouple issues about the _content_ of the
Annotation from the _format_ and _mechanism_ of it. The AO is a nice
[beginning] for the kinds of content one can use; various other sorts
of document metadata are another; Gully's abstraction of experimental
design provides more; etc. What of popular tags and folksonomy-style
labels, using crowdsourcing? Though it might be fun to see an
inventory of just what kinds of annotation contents one could
consider useful, it is perhaps more useful in the near term to focus
on what we'd like the format and mechanism of Annotations to look
like. Should a timestamp be mandatory? Would it help to develop a
generic Annotation tool? And so on.

3. Annotator.
We seem to be arguing that _anyone_ can make an Annotation, but stop
short of openly saying so. Perhaps there should be some typology of
Annotator types, ranging from formal Reviewers (as in program
committees and funding reviewers) to Colleagues (and super-colleagues
of some elevated status) to popular PageCounts and citation counts to
(even?) things like an academic Consumer's Report. Also, some
indication of the Annotator's Qualifications, Motivations,
Conflict-of-Interest, etc., at the time of annotation? What else?
State-of-Inebriation?
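
To make the triad slightly more tangible, here is a toy sketch (Python;
the field names are mine and purely illustrative, not a proposed
standard) of the kind of record one might keep per Annotation:

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Annotator:
        name: str
        kind: str                      # e.g. "reviewer", "colleague", "machine"
        qualifications: str = ""
        conflict_of_interest: str = ""

    @dataclass
    class Theme:
        uri: str                       # the central piece of work: PDF, data file, video...
        fragment: str = ""             # optional pointer into the work

    @dataclass
    class Annotation:
        theme: Theme
        annotator: Annotator
        content: str                   # the comment / extension / evaluation itself
        kind: str = "comment"          # e.g. "evaluation", "correction", "link", "tag"
        timestamp: datetime = field(default_factory=datetime.utcnow)

    note = Annotation(
        theme=Theme(uri="http://example.org/paper.pdf", fragment="#figure-2"),
        annotator=Annotator(name="E. Hovy", kind="colleague"),
        content="The axis labels in figure 2 appear to be swapped.",
    )
    print(note)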

Regards,
E

--
Eduard Hovy
email: ho...@isi.edu USC Information Sciences Institute
tel: 310-448-8731 4676 Admiralty Way
fax: 310-823-6714 Marina del Rey, CA 90292-6695
http://www.isi.edu/natural-language/

Steve Pettifer

Nov 10, 2010, 4:31:50 AM
to beyond-...@googlegroups.com
I also completely agree with Tim, Anita and Phil on this thread (and to avoid the danger of us collapsing into violent agreement even before we have the workshop, I'll act as Devil's Advocate a little later in this email).

Reflecting on my own use of the literature (which I don't think is atypical), I use articles in one of two ways:

a) As a means of understanding some new concept. Assuming I've selected a particular article as being appropriate, I want a well-crafted, honed linear narrative that I can read on the bus. I want it to have been written by the world's expert on Subject X, and to have an idea explained to me clearly. I would expect to read it in totality, from top to bottom. At this stage the data is important to back up the narrative, but I'm likely to take whatever subset of data has been presented (somewhat) at face value, and am unlikely to explore it right away. I think Anita's description of an article as a 'Story that persuades with data' hits this nail perfectly on its head.

b) As a means of finding evidence for an idea of my own. Here I'm treating an article much more like a reference work: does it tell me that A interacts with B or that X is a kind of Y, and does the data that's associated with the article really back this up (or provide me with a way of drawing my own conclusions). In this phase I'm treating an article much more like a database entry, or as a mixed bag of facts, the validity of which I want to establish as painlessly as possible. I also want to be able to ask of the literature in general 'is there any article that claims that X is a kind of Y'.

I think both are important. For example, the journal Cell has two recent initiatives: the graphical abstract (a single diagram that summarises the paper's contents in a standard, fairly stylised way), and their new online article format, which turns an article into a mini website, and separates out pages with Introduction, Discussion, Data and such. I'm not a cell biologist, but I'd imagine that the graphical abstract (whilst being totally opaque in itself to machines) is a Good Thing for human readers. The 'mini website' presentation of a paper, however, I just find difficult to use for both a) and b) since it breaks up the linear narrative across multiple sections, and makes even 'Alt+F' finding of terms in an article much harder. It seems a shame, given that the author has presumably spent a lot of effort crafting a linear story, to then try to pretend that what they actually built was a website. As an auxiliary point, I would hate to see a publishing model where authors are not forced (to some extent, at least) to generate a linear narrative. There is something about the restriction of linear discourse that causes thoughts to be clarified in a way that generating a hyperlinked bunch of stuff doesn't do. I for one have had many fantastic ideas over a beer that, when I try to commit them to paper, simply fall apart.

I'd also like to ask the question (and apologies if I've missed the answer to this in previous posts) "What is the scope of this workshop?". It's called "Beyond the PDF", but most of our discussion so far has been focussed on scientific (perhaps with a bias towards Life Science and related?) articles. Do we want to consider other sciences (e.g. Social Science)? Humanities? Scholarly publishing in general? I suspect that in reality, focussing on a particular scientific domain is already sufficiently challenging, but it might be useful to keep an eye on whether our musings are applicable in a broader context.

And so, for those of you who don't know me, and in the knowledge that I'm in a friendly and supportive group....

Hello, my name is Steve, and I like PDFs.

.
.
.

I like PDFs for a number of reasons: they are self contained in a small single file, and thus easy to manage. I can read them on every platform that I use. I can store them, safe in the knowledge that no one will ever revoke my right to read them (even if I change institution, job etc). They are generally well typeset, drawing on centuries of typographical and stylistic craft. They are typically devoid of clutter. They are an Article Of Record -- I know this is the original unadulterated publication, and the fact that thousands of other people have their own copies of it makes it very hard for anyone to claim otherwise. And they give me excellent access as a human to the author's original linear narrative, and thus to their argument and thought processes involved in its creation. I don't believe I'm alone in this; publishers admit that in spite of all the various initiatives and online enhancements, over 80% (and probably much more) of their content is simply downloaded as PDF.

Of course, I don't like PDFs for all the reasons that have been mentioned in this list so far, but I would like to suggest that the problem is not that PDFs are an 'insult to science' (as they have been called on a number of occasions), but rather that we are trying to use them for the wrong purpose. A PDF is not a good mechanism as the primary vehicle for storing metadata, or linked data, or for including nanopublications, or anything of that kind. Neither is HTML. These, I would claim, are merely mechanisms for presenting a view on an underlying article *for a particular purpose*. If you like, for projecting an underlying article into a particular space. So rather than worry about whether PDF is a good thing in and of itself, I would like to suggest that we think of the PDF merely as a projection of an underlying article into 'Reading On The Bus Space'; HTML is another projection into 'Viewing and Browsing On Line Space', and RDF yet another projection into 'Machine Readable Linked Data Space'. The challenges then become i) what is the nature of the underlying article (much discussion on this on the list already), and ii) how do we keep the various projections/views up-to-date, 'live', and in-sync with the original. [At this point I have to declare more than a passing interest in this viewpoint: our first attempt at doing some of this synchronisation between PDF and underlying data -- a tool called Utopia Documents -- is available at http://www.getutopia.com, and there's a video of it in action at http://www.scivee.tv/node/17389].

In the future, I can imagine that some alternative 'Research Object' format supersedes the PDF in terms of convenience, and also adds the richness we desire. But we're not there yet, and I think that if treated correctly there's still a fair amount of life left in the PDF. There's undoubtedly a significant amount of momentum behind its use, and I think it would be helpful in terms of getting our ideas exposed to a wider audience if whatever model we propose / develop can work alongside the existing format-of-choice.

Cheers,

Steve

Alexander Griekspoor

Nov 10, 2010, 6:10:23 PM
to beyond-...@googlegroups.com
Hi, I'm Alex and I like PDFs as well**.

Hear Hear! I couldn't agree more with Steve and was hoping someone
would play the Devil's advocate here.

Steve has already pointed out the most important points. I'd like to
add a few, and also raise some more questions, as someone who doesn't
believe science will switch to any kind of "object model" soon.

First about PDFs themselves.

I think we have to keep in mind where the PDF came from and what we do
with it. We READ PDFs. The PDF is a digital representation of a
journal article as it is supposed to be read, focused on layout rather
than content (hence the pain extracting text). PDFs are the result of
content laid out by a professional, designed by a professional, with a
focus on things like readability and typography. I have yet to see
any decent replacement for that. Anyone who thinks you can just
extract the text from a PDF, throw it into an HTML page, or squeeze
your PDF through some ePub converter and get something pleasant to
read really has no clue. In order to get the equivalent you have to
fix the layout; someone has to decide where to break each line. It's
pretty simple: flexible layout, great typography, great reusability,
pick two.

I love great layouts and great-looking PDFs; I have not yet seen a
scientific publisher come anywhere close. Reading on the web starts to
fall apart the moment you get anything longer than 1 or 2 pages.


Then there's the discussion around moving to alternative models of
representing research findings.

I'm skeptical that we can ever move to a simple world of connecting
objects with lines, having computers then help us figure it all out.
Why? Well, because it takes us 9 freaking pages in a Cell paper to set
the stage, the context, etc. to properly set up what is basically 9
pages of disclaimer surrounding some very cautious statements. This is
then followed by a review process in which the language is scrutinized
and we're sometimes debating commas and periods so at least 3 people
can agree.

How can we ever summarize that into hard triples of data?!
Protein A interacts with Protein B? Nobody dares to say that; what
follows is 9 pages of "under this condition, in this cell line, only
when the moon is rising and the blot is incubated 3.5 minutes on a
Sunday afternoon". The point is that we need the narrative, we need the
full power of the (English, in most cases) language to describe our
findings and for anyone else to properly judge them. The models people
talk about, simple entities that we can define in committees and
study groups, will IMHO never be flexible enough to describe any
real-world results, nor will we be able to judge whether the claims are
valid.

It's not all doom and gloom in my corner though; I do think the type
of overview diagram Cell is promoting adds value, a lot of value. But
these are graphical abstracts, not more. This is not very new; most
review articles and textbooks contain simplified models. They work
well individually, and convey the point (with the whole narrative to
back them up and add all the disclaimers!). But to think you could
connect them or converge on one big "model of the truth" is an
illusion IMHO. In no time you would need such a high dimensionality
that it becomes more complex than the current system of writing a
story each time you find something.

I personally see a big value for alternatives to the "PDF" as
supplements, but don't think we'll ever be able to get rid of the
storytelling and linear narrative as the main body of evidence.

Cheers,
Alex

(** no surprise there perhaps for those who know my background ;-).

--
****************************************************
         ** Alexander Griekspoor  PhD **
****************************************************
                  mekentosj.com

 Papers - Your Personal Library of Science
     Winner of the Apple Design Awards
      Best Mac OS X Scientific Solution
         http://mekentosj.com/papers

 New: Papers for iPad - all your papers
           available wherever you go:
       http://mekentosj.com/papers/ipad
****************************************************

Rebholz

Nov 10, 2010, 6:59:34 PM
to beyond-...@googlegroups.com, Alexander Griekspoor
Hi,

good to see that you have not dropped your vigor. :-) Only to
counterbalance a bit of your "black or white" reasoning ...

Ok, PDF always looks the same and is meant for optimal layout and
printing. It can even contain links, if processed properly, and others
have managed to combine different data types with it (see Utopia).

On the other hand, achievements such as indexing, improved document
retrieval, categorisation of document sections, qualification of
hypothetical statements in the document, and information extraction can
only be achieved efficiently if the document is available in a less
printer-friendly format, e.g. XML or a similar format. It is not
surprising that publishers currently keep scientific publications in
their archives in XML formats and make the publications available as Web
pages rendered in HTML. If PDF were the solution, why support XML or
HTML at all? Only for the sake of direct access to the content, not
just for rendering.

Whether or not all facts will be present in triples in the future is a
different question. And there is no doubt that the discourse of a paper
("narrative") is essential to convey the message.

Just imagine you got all your emails in PDF, or even as a fax.

Cheers,
Dietrich

>>>> Going back to a previous point, I think it is important to recognize the centrality of scientific papers as the core units of work, credit, and collaboration in the information ecosystem of science. This may seem obvious but I want to suggest that thinking about how we can go "Beyond the PDF" should not decouple us from the central importance of papers& associated social discourse in science (since around the mid-seventeenth century!). Science as a whole has evolved with scientific papers as its fundamental boundary objects and scientific publication as its fundamental social activity.

--
Dietrich Rebholz-Schuhmann, MD, PhD - Research Group Leader
EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD (UK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TM support: www.ebi.ac.uk/Rebholz-svr | tm-su...@ebi.ac.uk


Peter Sefton

Nov 10, 2010, 7:18:26 PM
to beyond-...@googlegroups.com
A few points:

The PDFs in many journals are NOT laid-out by experts, in many cases amateur editors do a Save as PDF from Word, and many are automatically generated using LaTeX. I think that PDF is a useful format to have, but as Dietrich notes, having richer, structured versions upstream is also important and useful. (One idea that I'd like to come back to is coming up with ways to add semantics to documents via meaningful links so that you can express things like authorship, or the subject of a paper in a way that will work in Word, WordPress, Media Wiki, PDF etc.)

While some people do care about layout, many don't, or are happy to trade it off for the flexibility of consuming automatically reflowed content on ebook readers, iPads, phones, and in web browsers.

For some really useful articles on this issue from someone who does understand typography and design see Craig Mod's site, for example this one on eBooks: http://craigmod.com/journal/ebooks/

Or this on how the reading experience should work: http://craigmod.com/satellite/bad_ereaders/

The question is how can we provide the best possible reading experience on the range of devices and modes that people want AND make machine processing possible.
--
Peter Sefton
Manager, Software Research and Development Laboratory,
Australian Digital Futures Institute,
University of Southern Queensland
Toowoomba Queensland 4350 AUSTRALIA


Work: sef...@usq.edu.au
Private: p...@ptsefton.com

IM accounts:
Gmail: ptse...@gmail.com
Yahoo: peter_...@yahoo.com
MSN:  p...@ptsefton.com
AIM: ptsefton

p: +61 (0)7 4631 1640
m: +61 (0)410 326 955


USQ Website: http://www.usq.edu.au
Personal Website: http://ptsefton.com


Leonard Rosenthol

Nov 10, 2010, 10:32:41 PM
to beyond-...@googlegroups.com, Alexander Griekspoor
PDF supports all of those things you listed:

indexing, improved document retrieval, categorisation of document sections, qualification of
hypothetical statements in the document, and information extraction

However, it does require that the producer of the PDF actually take advantage of the features and not just produce a "for print only" document. A recent blog post at <http://www.appligent.com/talkingpdf-eachpdfpageisapainting> highlights this...

(and I'd be VERY happy to get all my emails as PDFs ;).

Matthew Cockerill

Nov 11, 2010, 12:51:47 AM
to beyond-...@googlegroups.com, meke...@gmail.com
Nice to see Alex's contribution, which mirrors some of my own slight concerns about the phrase 'Beyond the PDF'.

PDFs work because:
1. they are single encapsulated files which give you security of access, on all platforms in all circumstances, and are easy to manage
2. they make for easy/attractive reading, presenting a narrative in a digestible way.

(Though I'd point out that these days many publisher PDFs are created with very little human input, via automatic layout algorithms, with the odd manual tweak. And still look pretty good)

But in terms of adding value/semantics/data, I think it's worth considering how these can best be embedded in / linked from PDFs, rather than envisioning them as mechanisms to kill the PDF.

BMC has recently been seeking to do more to embed additional structured metadata into PDFs to allow software like Papers and Mendeley to deal with those PDFs in a smarter way, and I think there's potential to do more of this in terms of semantics and linked data.

Alex's points about the 8 pages of qualification which precede a couple of sentences of assertion in a Cell paper are reasonable, and I too am a bit pessimistic about building a universal model of biomedical knowledge by combining all the papers computationally.

Having said which, less ambitiously, I think that capturing semantics for those couple of sentences of assertion in principle could be a useful aid to discovery (esp for filtering to find papers which support or contradict the paper you are looking at). Although like MeSH headings, it is somewhat difficult to imagine such structured info being used directly by many end users. But behind the scenes it could certainly allow for smarter discovery.

Cf on a trivial level, I was impressed the other day that when I was searching for
"Rat proof cat5 cable" (don't ask...)
Google came back with good results, having converted the search into 'rodent proof cat5 cable'. Ie it was doing thesaurus substitution, but in a smart context-aware fashion.

A key challenge is how to identify those few key sentences of assertion. Unclear whether experiments in getting authors to explicitly provide them as a structured abstract have proved viable. And whether there is sufficient info in the standard abstract of a paper to capture the relevant
assertions is also a bit doubtful I think.

Matt

Alexander Griekspoor

Nov 11, 2010, 3:01:47 AM
to Rebholz, beyond-...@googlegroups.com
That is indeed the thing that still baffles me: the PDF format has
become so amazingly complex and bloated (I'm fighting it every day with
the spec open ;-), and yet why did it never evolve to preserve the
content properly? Nowadays it can contain complete protein structures,
images, links, annotations, etc. The programs used to create these PDFs
(converters, Word, InDesign, whatever) know perfectly well about word
boundaries, column orders, etc. So why oh why did the PDF format
never evolve to properly preserve this and give us the best of both
worlds, both a perfect layout AND easy access to the content in
an XML-type format?! That's where, for me, both the PDF format and the
publishers (who should have demanded this) failed.

Cheers,
Alex

--

Alexander Griekspoor

Nov 11, 2010, 3:20:22 AM
to beyond-...@googlegroups.com
Hi Peter,

> While some people do care about layout, many don't, or are happy to trade it
> off for the flexibility of consuming automatically reflowed content on ebook
> readers, iPads, phones, and in web browsers.

Fully agreed, hence my statement: flexible layout, great typography,
great reusability, pick two.

> The question is how can we provide the best possible reading experience on
> the range of devices and modes that people want AND make machine processing
> possible.

Fully agreed. As Leonard and Matt pointed out, PDF in principle CAN do
this. Yet the file format is anything but "computational end user"
friendly.
I guess for its primary use case as a reading document there's very
little that's wrong with PDF, as many have pointed out here.
What many would like, however, is for it to be trivial to just ask the
document for an XML-like representation of the content. We would need
at least two things:

1) the publishers have to agree on a common XML representation (keep
it simple, that's just fine). If we get a dozen different ones we can
just as well not do it. They have to embed it in the PDFs properly.

2) tool builders have to work on making XML extraction a non-issue and
super simple, at the moment it's just too much hassle (e.g. XMP)

At least the publishers now start to get the importance of being able
to identify PDFs in the first place, and start to bake in XMP metadata
(hooray!). For the longest time this never happened, I guess because
of the classic chicken-and-egg problem where the publishers would say
"but there's nothing out there that uses it, why bother". Once tools
like Papers start to come out that actually make use of it, you
suddenly see publishers jumping on board. So why not do the same with
the XML content? We need publishers to agree on an XML standard and
embed it in their PDFs, and we need tools that actually make it
accessible. Both to end users and to power users (name your favourite
scripting/programming languages to "one-line" extract the data).
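
Roughly what I'd want that to look like (a sketch only: it assumes the
publisher has embedded an "article.xml" attachment in the PDF, and it
relies on pypdf's attachments mapping to list embedded files -- check
your own library's documentation; the element names are likewise
invented):

    from pypdf import PdfReader
    import xml.etree.ElementTree as ET

    reader = PdfReader("some-paper.pdf")

    # Assumption: the publisher embedded the article's XML as an attachment
    # named "article.xml"; attachments maps names to lists of byte strings.
    xml_bytes = reader.attachments["article.xml"][0]

    root = ET.fromstring(xml_bytes)
    for para in root.iter("p"):        # or whatever element the agreed schema uses
        print(para.text)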

Cheers,
Alex

On Thu, Nov 11, 2010 at 12:18 AM, Peter Sefton <ptse...@gmail.com> wrote:
> A few points:
> The PDFs in many journals are NOT laid-out by experts, in many cases amateur
> editors do a Save as PDF from Word, and many are automatically generated
> using LaTeX. I think that PDF is a useful format to have, but as Dietrich
> notes, having richer, structured versions upstream is also important and
> useful. (One idea that I'd like to come back to is coming up with ways to
> add semantics to documents via meaningful links so that you can express
> things like authorship, or the subject of a paper in a way that will work in
> Word, WordPress, Media Wiki, PDF etc.)

>

--

Velterop

Nov 11, 2010, 4:40:50 AM
to beyond-...@googlegroups.com, Alexander Griekspoor
I haven't read all the contributions to this thread yet, but I hope someone has already commented that "Beyond the PDF" shouldn't be read as "Let's get rid of PDFs", but rather as "Not by PDFs alone".

In reality we need both PDFs and triples (single assertions, or clusters of those). Qualified triples, that is, in order to take account of conditionals – see "The Anatomy of a Nanopublication" http://iospress.metapress.com/content/ftkh21q50t521wm2/?p=bbef1b0556ae428da886da104fd67e53&pi=4
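
Loosely sketched (Python with rdflib; the URIs and property names are invented, and this only approximates the assertion-plus-provenance structure described in the nanopublication paper), a qualified assertion could be packaged like this:

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/vocab/")          # invented URIs throughout
    ASSERTION = URIRef("http://example.org/np1/assertion")

    ds = Dataset()

    # The bare assertion lives in its own named graph...
    assertion = ds.graph(ASSERTION)
    assertion.add((EX.ProteinA, EX.interactsWith, EX.ProteinB))

    # ...and the conditions/provenance are stated about that graph, alongside it.
    provenance = ds.graph(URIRef("http://example.org/np1/provenance"))
    provenance.add((ASSERTION, EX.observedIn, EX.CellLineX))
    provenance.add((ASSERTION, EX.reportedBy, Literal("Author Y, 2010")))

    print(ds.serialize(format="trig"))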

There's too much literature to take in, in many areas, and triples can give us a rough overview of the lay of the land (dynamically, i.e. including the continual changes in that picture) upon which we can decide what to invest time in to read in detail. Triples do not serve as replacements of linear text, but as a powerful complement, to enable us to deal with massive amounts of knowledge.

An analogy of how to use triples may be the hologram. Shine the right light on what looks like a sheet of 'white noise', and you'll get a 3D picture, the resolution of which varies with the information content of the sheet of 'white noise' one has. Approach a large amount of assertions (triples) with the right tools, and the resultant overview of how they are connected (more information content = higher resolution) gives us potentially strong clues as to where it may be worth our while to drill down to – and read linearly – the articles from which the triples come that define our detailed area of interest.

Horses for courses.

Best,

Jan Velterop





Jakob Voss

Nov 11, 2010, 5:19:24 AM
to beyond-...@googlegroups.com
Alex wrote:

> Hear Hear! I couldn't agree more with Steve and was hoping someone
> would play the Devil's advocate here.

Well, I try to. Let me first summarise some statements from this thread:

Tim Clark stressed "the centrality of scientific papers as the core
units of work" and argued to "place critical importance on annotation,
i.e., associating metadata with the original scientific documents, as a
means to 'unlock' and extend content in the existing scientific paper,
and suggest others consider this line of thinking."

Anita de Waard agreed, but also stated: "I have not seen meaningful,
smaller-grained publication formats that offer anything other than 'a
machine-readable summary of pertinent results' - but a publication is
*not* the same as a set of results."

I could not agree less. Let me repeat the second of my crucial points
about annotations:

"Each annotation is a document itself."

Phil praised the value of a "good scientific paper". Yes, great papers
are great. But great comments, revised editions, responses, collections
etc. of papers are great as well. In many cases both coincide. What
makes you treat a paper that summarises, rephrases, corrects ... other
papers as a paper and not as an annotation? The distinction between "a
paper" and "an annotation" is purely arbitrary. If you look at the
history of academia, scientific journals and papers evolved from letters
between scholars. Would you reject treating this mail posting, or a good
blog article, as a publication at the same level as a scientific paper?
I bet you would, but this distinction is based only on formal criteria.

Steve Pettifer asked about the "role of linear narrative". There will
always be good narration, but the form is changing. He also wrote "most
of our discussion so far has been focussed on scientific (perhaps with a
bias towards Life Science and related?) articles. Do we want to consider
other sciences (e.g. Social Science)? Humanities? Scholarly publishing
in general?" In fact the view on new (digital) forms of scientific
publishing is often narrowed to the "hard" sciences and to a very naive,
uncritical and ahistorical view of scientific communication.
uncritical and ahistoric view of scientific communication.

I often observe such ignorance in computer science, but you can also
find it in the life sciences and elsewhere. Pretending to make progress
and to find new things, authors repeat and apply old ideas without even
mentioning earlier works. Many papers are full of bullshit, to justify
funding and to increase publication output. The top of this stupidity is
the idea of extracting the "results" of a publication as data triples.
But the true extract will always be only "author X tells a story about
his ideas and works". I don't reject data-driven science, but the idea
of fitting human communication into triples is nonsense. Reality is
always socially constructed to a large degree, and it is permanently
reproduced by communication. In science, most of this communication
takes place in publications, but it's still communication.
One participant writes a paper, another cites and quotes it, and so on.

But this referencing is not just a single pointer to another document.
You refer to specific statements, phrases, arguments, and results that
are included *inside* another paper. So let me repeat the first of my
crucial points about annotations:

"Annotations are not linked to documents, but to specific segments of
documents."

I must admit that this view goes too far for most people - they always
want things to stay as they are. That's why PDF, and other techniques
that simulate paper, were invented, even though Ted Nelson had already
shown how to do better. If you do not want to revolutionise digital
publication (which is fine, for practical reasons), you should at least
try to define basic concepts like "paper" and "annotation" without
pretending that your definition catches the true spirit of the concepts.
It is just a simplification for practical reasons.

So coming back to annotations and PDF, you could for instance just
define that a paper must be a PDF file and every PDF file is a paper.
And you could say that an annotation always refers to a whole paper
only, and that everything that refers to a paper while not being a paper
itself is an annotation. These are purely arbitrary definitions. But
unless you agree on some definition, you will not get any practical
outcome, only a (hopefully nice and inspiring) discussion.

Cheers
Jakob

--
Jakob Voß <jakob...@gbv.de>, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany

Alexander Griekspoor

Nov 11, 2010, 6:57:32 AM
to beyond-...@googlegroups.com
Hi Jakob,

I couldn't agree more, and IMHO you're spot on with the fact that a
paper is in essence a human communication tool that can't be fitted
into triples.

> So coming back to annotations and PDF, you could for instance just define that a paper must be a PDF file and every PDF file is a paper. And you could say that an annotation always refers to a whole paper only, and that everything that refers to a paper while not being a paper itself is an annotation. These are purely arbitrary definitions. But unless you agree on some definition, you will not get any practical outcome, only a (hopefully nice and inspiring) discussion.

Great point, and it struck me that this is exactly what happened with
our Mac app "Papers": it indeed defines a paper as a PDF file and
every PDF file as a paper. Although you could discuss whether this
makes sense and is semantically true, it does work, because it sets up
the correct mental model for the user and the expectations he/she has.
The simpler the definition, the better it often works.

As far as I can see, most scientists really see a paper as a PDF. And
if we want them to ever move to something else as the leading way to
communicate science, it should be something that is at least as
trivially defined and simple as that, or it won't work.

This is clearly what's going on with trying to get other forms of
communication credited, like blog posts, db entries, tweets, etc. It
breaks the simple world we live in where 1 PDF equals 1 Paper, leading
to discussion and confusion -> syntax error -> reject reject.

Cheers,
Alex

--

Leonard Rosenthol

Nov 11, 2010, 7:04:59 AM
to beyond-...@googlegroups.com
>I guess for its primary use case as a reading document there's very
>little that's wrong with PDF, as many have pointed out here.
>What many would like, however, is for it to be trivial to just ask the
>document for an XML-like representation of the content.
>

PDF DOES support the ability to embed "an XML-like representation of the content" (called structure or tagging); however, it's not "easy/trivial" to obtain that representation (or to embed it at authoring time, for example). Some tools, such as Acrobat, Word, InDesign, FrameMaker and newer versions of pdfTeX, support this at authoring time, but most do not, and many users turn the option off to keep file size down.

The other alternative is to use PDF's ability to embed any file and simply put an XML representation of the content into the PDF so that it can be found and used instead of the PDF when necessary. OpenOffice, for example, does this to enable their "hybrid PDF" functionality. However, PDF consumers have no way to identify this content as anything other than an attachment.

To improve on both of these things, one of the features that is part of PDF 2.0 (ISO 32000-2) is something called "Source Content" which is an extension to "embedded files" and designed to address both this problem as well as some others that have been mentioned. It enables the embedding of "source content" (be it the original source or some other semantically rich version) and associating it with one or more graphics objects in the document. One of the big use cases for it was scientific publication. For example, the ability to directly associate MathML with the visual representation of the equation or ChemML with the picture of a molecule (etc. etc. etc.).


>2) tool builders have to work on making XML extraction a non-issue and
>super simple, at the moment it's just too much hassle (e.g. XMP)
>

XMP is for metadata - and it does an excellent job at that - but it shouldn't be used to carry actual data. That's another reason for the "source content" work above, to keep XMP as strictly metadata and not misused for other things.

I will also mention, FWIW, that Adobe has recently turned XMP over to the ISO and it is currently being standardized as an ISO standard!


>So why not do the same with
>the XML content? We need publishers to agree on a XML standard and
>embed it in their PDFs, and we need tools that actually make it
>accessible. Both to end users and to power users (name your favourite
>scripting/programming languages to "one-line" extract the data).
>

AMEN!!!


Leonard

Alexander Griekspoor

Nov 11, 2010, 7:44:50 AM
to beyond-...@googlegroups.com
Hi Leonard,

Great to see the PDF format itself continuing to evolve and trying to
allow for these needs. I agree that putting the content in the XMP
metadata wouldn't be a good thing, but what I tried to say is that we
need something similar, and the source content stuff you talk about
seems to be exactly that. That's definitely an enabling move; now we
need the publishers on board and a good set of tools.
Cheers,
Alex

--

Jodi Schneider

Nov 11, 2010, 7:53:49 AM
to beyond-...@googlegroups.com
On Thu, Nov 11, 2010 at 8:20 AM, Alexander Griekspoor
<meke...@gmail.com> wrote:

>> The question is how can we provide the best possible reading experience on
>> the range of devices and modes that people want AND make machine processing
>> possible.

'range of devices' is particularly important here.

> I guess for its primary use case as a reading document there's very
> little that's wrong with PDF

I disagree. Perhaps you mean "for its primary use case as a PRINT
reading document there's very little that's wrong with PDF".
Sure, I agree with that. :)

My main struggle with PDFs is as a reader. I want to read on my laptop
screen, I want to read on my iPhone. I want to annotate documents
*easily* and save the annotations. You could argue that this is just a
matter of time, for appropriate screens and software to emerge:
Portable A4/letter-size devices are coming to the consumer market
(e.g. the Kno textbook reader).

I think there's a larger issue: producer control versus consumer control.

Producer control, the fixed format that means (for the most part) 'you
see what I see', is a key advantage of PDF. Yet this control over
page-level formatting is incompatible with reflowability. That makes
PDF's fixed format a disadvantage when I want to read on screens of
different shapes and sizes.

As Leonard points out, it's not necessarily PDF (the standard) that's
the problem: production values and producers' choices cause some of
the problems. For instance, landscape PDFs are far more readable
onscreen; the ASIS Bulletin is a great example
http://www.asis.org/Bulletin/

Embedded XML and smart PDF reading software could potentially allow
consumers to have some control over display (providing reflowable
versions for different-sized screens) while keeping the printable
producer-endorsed view. Somewhat like the ability to zoom in and zoom
out, showing different levels of magnification.

-Jodi

Jodi Schneider

Nov 11, 2010, 8:08:30 AM
to beyond-...@googlegroups.com
Jakob,

If I understand you correctly, you think that we should be talking
about communication, not just publications. And that to get further we
may want to define 'paper' and 'annotation'. :)

You assert that "great comments, revised editions, responses,
collections" have value, and should be treated as (perhaps informal)
publications. I think you're calling all of these annotations--in so
far as they relate to a part or all of a document, and add context,
commentary, or relations to other documents.

Is that right?

I think you're setting "a set of results" up against "the story". But
I thought you were rejecting both of those. So this part confuses me,
since you seem to take "the story" point of view:

> "The top of this stupidity is the idea of extracting the
> "results" of a publication as data triples. But the true extract will always
> be only "author X tells a story about his ideas and works". I don't reject
> data-driven science, but the idea of fitting human communication into triples
> is nonsense. Reality is always socially constructed to a large degree, and
> it is permanently reproduced by communication. In science, most of this
> communication takes place in publications, but it's still communication.
> One participant writes a paper, another cites and quotes it, and so on."

I don't think that we can fit all human communication into triples
(humans are messy; definitive statements don't allow for ambiguity).
But I do think that CiTO shows one way of moving towards the kind of
annotation you're talking about, expressed in triples. We need more
granular identifiers to get there fully: right now it's only easy to
express what you are citing, and why, at the whole-document level.
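
For example (a sketch; the document URIs and the fragment identifier
below are invented -- only the CiTO property is real):

    from rdflib import Graph, Namespace, URIRef

    CITO = Namespace("http://purl.org/spar/cito/")  # the Citation Typing Ontology

    g = Graph()
    citing = URIRef("http://example.org/paper/A")
    cited = URIRef("http://example.org/paper/B")

    # What is easy today: a typed, document-level citation
    g.add((citing, CITO.disputes, cited))

    # What we would also like: the same statement at finer granularity, which
    # needs stable fragment identifiers (this one is hypothetical)
    g.add((URIRef(str(citing) + "#discussion-para-4"), CITO.disputes, cited))

    print(g.serialize(format="turtle"))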

I guess overall, I'm positing that you see a difference between
overall scholarly communication (annotations of any sort) and
scholarly papers (which must tell a story). Not sure if I got what you
were saying, though.

-Jodi

Jodi Schneider

Nov 11, 2010, 8:12:14 AM
to beyond-...@googlegroups.com
On Thu, Nov 11, 2010 at 10:19 AM, Jakob Voss <jakob...@gbv.de> wrote:
> ...Let me first summarise some statements from this thread:

I've been trying to 'scribe' this discussion on the Projects & Links page
https://sites.google.com/site/beyondthepdf/project-areas

Here's the core. Please edit and improve it! If this is useful, we may want to break it out from the projects & links.

-Jodi

What PDFs do right:

File and format
  • self-contained in a small single file, and thus easy to manage
  • readable on every platform that I use
  • no one will ever revoke my right to read them (even if I change institution, job etc)
  • well-typeset
  • typically devoid of clutter
  • Article Of Record
    • lots of independent copies -- tamper-resistant
  • ability to search entire file (Ctrl-F)
  • can see entire thing with minimal clicking between sections
  • "they are single encapsulated files which give you security of access, on all platforms in all circumstances, and are easy to manage"
  • "they make for easy/attractive reading, presenting a narrative in a digestible way."

Rhetoric
  • linear narrative, author's argument and thought processes

What PDFs do wrong:
  • sized for printing, not laptop-screen (let alone hand-held)
  • challenging to annotate on-screen with most tools
  • not granular -- for citing and reading

Hope for PDFs:
  • production values vary
    • "indexing, improved document retrieval, categorisation of document sections, qualification of hypothetical statements in the document, and information extraction can only be achieved efficiently if the document is available in a less printer-friendly format, e.g. XML or a similar format"
  • embed metadata in PDFs ("BMC has recently been seeking to do more to embed additional structured metadata into PDFs to allow software like Papers and Mendeley to deal with those PDFs in a smarter way, and I think there's potential to do more of this in terms of semantics and linked data.")

Challenges and opportunities for PDFs/communicating paper contents themselves:
  • Data and methods sections need to be digital for reproducibility
  • Living documents that are auto-updated with new information
  • "flexible layout, great typography, great reusability, pick two."
  • "Reading on the web starts to fall apart the moment you get anything longer than 1 or 2 pages."
  • argument-based indexing (smarter discovery by capturing assertions, etc.)
  • "Embedded XML and smart PDF reading software could potentially allow consumers to have some control over display (providing reflowable versions for different-sized screens) while keeping the printable producer-endorsed view."

Challenges and opportunities beyond the PDF itself, for science/IR/etc in general:
  • Tracking reuse
  • Motivation for sharing data
  • 'Domain Specific Reasoning Models' - provide a mechanism whereby one can reason over a model and generate hypotheses that can be tested experimentally.
  • context-aware thesaurus substitution

Ways of using papers:
  • Honed linear narrative (story that persuades with data): As a means of understanding some new concept. Assuming I've selected a particular article as being appropriate, I want a well-crafted, honed linear narrative that I can read on the bus. I want it to have been written by the world's expert on Subject X, and to have an idea explained to me clearly. I would expect to read it in totality, from top to bottom. At this stage the data is important to back up the narrative, but I'm likely to take whatever subset of data has been presented (somewhat) at face value, and am unlikely to explore it right away. I think Anita's description of an article as a 'Story that persuades with data' hits this nail perfectly on its head.
  • Reference work: As a means of finding evidence for an idea of my own. Here I'm treating an article much more like a reference work: does it tell me that A interacts with B or that X is a kind of Y, and does the data that's associated with the article really back this up (or provide me with a way of drawing my own conclusions). In this phase I'm treating an article much more like a database entry, or as a mixed bag of facts, the validity of which I want to establish as painlessly as possible. I also want to be able to ask of the literature in general 'is there any article that claims that X is a kind of Y'.
  • Publication venue as a proxy for quality, used for promotion/tenure/etc ("quantifying of impact albeit implicitly by "sorting" publications into journals that have pre-defined windows of impact")
  • Annotation connecting various ideas or papers?
  • ...

        Tim Clark

        unread,
        Nov 11, 2010, 8:18:59 AM11/11/10
        to Jakob Voss, beyond-...@googlegroups.com
        Jakob brings up important points worth discussing, so I'll add my two cents' worth.

        1 - "Each annotation is a document itself". 

        That is true from the standpoint of the web, and as developers of web communities such as PD Online http://pdonlineresearch.org, clearly we are committed to supporting high-velocity and high-bandwidth online scientific discussion.

        But it is also NOT true from the standpoint of the accepted documentary record of science, which is what people are socially / materially rewarded for participating in.  You cannot cite your blog posts in your CV to get a promotion. 

        Therefore, in developing web discussion communities, we pay very close attention to the ways we motivate participation in the discussion and create social validity for it.   You cannot sidestep the social process of science.  

        2 - "Scientific journals and papers evolved from letters between scholars"

        Fjällbrandt [1] identifies “artifact closure" in scientific communications as having occurred at the end of the seventeenth century with the selection of the scientific journal, with peer review, “historical” presentation of experiments, and all the associated paraphernalia becoming “locked in” to science communications up until the present day.  

        The relevant question is - WHY did the journal article become dominant over the personal letter as what Steven Shapin [2] calls the "literary technology" of the "new experimental philosophy", i.e. what has been called since the 19th century, "science"? 

        This brings us to:

        3 - "You would reject treating this mail posting, or a good blog article as publication at the same level as a scientific paper? I bet, but this distinction is only based on formal criteria."

        This mail posting is a contribution to scientific discourse, but it does not rise to the same level as a scientific paper, at least in the experimental sciences, because it neither reports, directly or indirectly (i.e. as a review paper would), on the results of experiments in a way that justifies belief in the reported results by giving enough detail to allow reproduction of the experiment, nor does it situate the interpretation of the experimental results within (or counterpose it to) a citable body of existing theory.

        These requirements have to do with the epistemic basis of scientific knowledge.  Scientific knowledge is "socially constructed" but not independently of the properties of the material world, and we should remember that this social construction is done by interrogating that material world.  So science requires discussion, but cannot be built on discussion alone - the content of the discussion has to meet certain requirements.

        At the same time I would claim that we have a problem in current biomedical research with reproducibility, so the current scientific paper is not really living up to its own standards.  And we definitely have a problem in terms of the capabilities of the Web, which are not nearly fully exploited as journal articles are presented today.

        4 - "Annotations are not linked to documents, but to specific segments of documents."

        Agree completely.  In our work on the Annotation Ontology (AO), we explicitly link annotations to particular sections of text (or images).
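
        As a rough sketch of the idea (in RDF/XML, to match the other snippets in this thread; the namespace and property names below are illustrative placeholders rather than the exact AO vocabulary): the annotation is a first-class resource with its own URI and author, and it points to a selector that pins it to a specific segment of the target document rather than to the document as a whole.

        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:ex="http://example.org/annotation-sketch#">
          <!-- hypothetical URIs throughout; the shape, not the names, is the point -->
          <rdf:Description rdf:about="http://example.org/annotations/42">
            <ex:annotates rdf:resource="http://example.org/articles/123"/>
            <ex:selector>
              <rdf:Description>
                <!-- the quoted text (plus surrounding context) anchors the annotation to one passage -->
                <ex:exact>These results suggest that A interacts with B</ex:exact>
                <ex:prefix>As shown in Figure 3, </ex:prefix>
              </rdf:Description>
            </ex:selector>
            <ex:body>Does the reported sample size really support this claim?</ex:body>
            <ex:createdBy rdf:resource="http://example.org/people/reviewer-1"/>
          </rdf:Description>
        </rdf:RDF>

        Because the annotation has its own URI it can of course be annotated in turn, which is exactly the first-class status Jakob is asking for.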

        5 - "Stupidity of ... extracting the "results" of a publication as data triples"

        Whether extracting results as triples is mistaken or not depends in part upon whether those triples are understood to be "knowledge".

        If they are classed as "knowledge" and the triples are asserted to contain "knowledge", not only is that a naive position (because of the massive rhetorical qualification, shading and hedging that takes place in real scientific writing), but you also get into an intractable truth-maintenance problem and you get away from the most interesting science, which is usually controversial.

        However, I do think it useful to extract scientific *claims and evidence* from papers.

        I think we have to be clear - and here I agree with Jakob - that a scientific paper is a "communication", i.e. information, not "knowledge".  The reduction of sets of scientific papers to "knowledge" is a social process in continual development, and it is this development that is where science lives and that we have to better support with technology.

        Best

        Tim

        [1] Fjällbrandt, N. (1997) Scholarly Communication: Historical Development and New Possibilities. In International Association of Scientific and Technical University Libraries (IATUL). IATUL, Trondheim, Norway. http://www.iatul.org/conferences/pastconferences/1997proceedings.asp

        [2] Shapin, S. (2003) Pump and Circumstance: Robert Boyle's Literary Technology. In The Scientific Revolution (Hellyer, M., ed.). Blackwell, Oxford.



        Jodi Schneider

        unread,
        Nov 11, 2010, 8:56:35 AM11/11/10
        to beyond-...@googlegroups.com
        On Thu, Nov 11, 2010 at 1:18 PM, Tim Clark <tim_...@harvard.edu> wrote:
        > You cannot cite your blog posts in your CV to get a promotion.

        I think we need to step back and ask about the scope of the
        discussion. Are we covering science? Or all fields?

        In science, you cannot cite your blog posts for promotion and tenure.
        In new media, even listserv posts may count [1]. Professors in some
        fields (e.g. writing) are already citing blog posts and twitter as
        professional development and/or service [2,3]. In humanities, it's
        becoming common to select items posted in social media for more formal
        publication [4].

        There's also a very well-thought out series, “Making Digital
        Scholarship Count”, about some of the issues, particularly what's
        scholarship, what's service, and what's teaching, in the digital
        realm; it's written by Mills Kelly, from a digital humanities
        perspective [5,6,7].

        > Therefore, in developing web discussion communities, we pay very close
        > attention to the ways we motivate participation in the discussion and create
        > social validity for it.   You cannot sidestep the social process of science.

        This point remains. :)

        -Jodi

        [1] University of Maine, New Media Promotion & Tenure criteria.
        January 2007. http://newmedia.umaine.edu/interarchive/new_criteria_for_new_media.html
        [2] http://chronicle.com/blogs/profhacker/talking-about-blogging-in-tenureapplication-documents/22748
        [3] http://williamwolff.org/composingspaces/on-blogging-tweeting-professional-course-web-sites-and-tenure/
        [4] http://www.nytimes.com/2010/08/24/arts/24peer.html
        [5] http://edwired.org/2008/06/13/making-digital-scholarship-count/
        [6] http://edwired.org/2008/06/16/making-digital-scholarship-count-2/
        [7] http://edwired.org/2008/06/27/making-digital-scholarship-count-3/

        Paul Groth

        unread,
        Nov 11, 2010, 9:02:20 AM11/11/10
        to beyond-...@googlegroups.com
        I agree with Jodi here. While a blog post or a tweet may not get you a promotion, they are very important emerging parts of scientific communication. I doubt they will ever be considered on par with a peer-reviewed publication, but if we can measure these alternative forms of communication and show that they have impact, that will be a boon to promoting these quicker forms of communication. See http://altmetrics.org/manifesto/ for one take on this.

        Paul

        P.S. thanks for the links Jodi

        Waard, Anita de A (ELS-AMS)

        unread,
        Nov 11, 2010, 9:30:13 AM11/11/10
        to beyond-...@googlegroups.com
        Hi, ok to fork off this topic (which I'm very much enjoying and so happy to see Alex involved; also, I think the Utopia project has raised the bar in terms of making PDFs useful, so it'd be good to get Steve Pettifer's take on this...)

        Two things to add to this 'how useful is PDF?':

        - Just read a paper in ACM communications that states that yet again, usage of PDFs far outstrips use of any other format on their websites - this is very much what all publishers I know experience, despite desperate (and costly) attempts to make the html/xml documents more machine and user-friendly

        - Elsevier has been including xmp in all PDFs for a little over a year, using PRISM and DC standards. I absolutely agree with:


        > XMP is for metadata - and it does an excellent job at that - but it shouldn't be used to carry actual data. That's > another reason for the "source content" work above, to keep XMP as strictly metadata and not misused for other things.

        I do think this battle has been fought, and won, by PRISM and Dublin Core, and XMP supports these. So although perhaps not exactly the same as that of other publishers, our XMP typically looks something like the snippet below (sorry for the pointy brackets, but I suppose this audience won't mind?)

        More on identifying narrative components in a separate email, in an attempt to focus these fascinating discussions...

        <?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
        <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c316 44.253921, Sun Oct 01 2006 17:14:39">
        <rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:pdfx = "http://ns.adobe.com/pdfx/1.3/" xmlns:pdfaid = "http://www.aiim.org/pdfa/ns/id/" xmlns:xap = "http://ns.adobe.com/xap/1.0/" xmlns:xapRights = "http://ns.adobe.com/xap/1.0/rights/" xmlns:dc = "http://purl.org/dc/elements/1.1/" xmlns:dcterms = "http://purl.org/dc/terms/" xmlns:prism = "http://prismstandard.org/namespaces/basic/2.0/">
        <rdf:Description rdf:about="">
        <dc:format>application/pdf</dc:format>
        <dc:title>Selfsimilarity of pedotaxa distributions at the planetary scale: A multifractal approach</dc:title>
        <dc:creator> <rdf:Seq> <rdf:li>J. Caniego</rdf:li> <rdf:li>J.J. Ibáñez</rdf:li> <rdf:li>F. San José Martínez</rdf:li> </rdf:Seq> </dc:creator>
        <dc:subject> <rdf:Bag> <rdf:li>Pedodiversity</rdf:li> <rdf:li>Pedotaxa-abundance distributions</rdf:li> <rdf:li>Multifractal analysis</rdf:li> <rdf:li>Singularity exponents</rdf:li> <rdf:li>Rényi dimensions</rdf:li> </rdf:Bag> </dc:subject>
        <dc:description>Geoderma, 134 (2006) 306-317. doi:10.1016/j.geoderma.2006.03.007</dc:description>
        <prism:aggregationType>journal</prism:aggregationType>
        <prism:publicationName>Geoderma</prism:publicationName>
        <prism:copyright>Copyright © 2006 Elsevier B.V. All rights reserved.</prism:copyright>
        <dc:publisher>Elsevier B.V.</dc:publisher>
        <prism:issn>0016-7061</prism:issn>
        <prism:volume>134</prism:volume>
        <prism:number>3-4</prism:number>
        <prism:coverDisplayDate>15 October 2006</prism:coverDisplayDate>
        <prism:coverDate>2006-10-15</prism:coverDate>
        <prism:issueName>Fractal Geometry Applied to Soil and Related Hierarchical Systems - Fractals, Complexity and Heterogeneity</prism:issueName>
        <prism:pageRange>306-317</prism:pageRange>
        <prism:startingPage>306</prism:startingPage>
        <prism:endingPage>317</prism:endingPage>
        <prism:doi>10.1016/j.geoderma.2006.03.007</prism:doi>
        <prism:url>http://dx.doi.org/10.1016/j.geoderma.2006.03.007</prism:url>
        <dc:identifier>doi:10.1016/j.geoderma.2006.03.007</dc:identifier>

        Anita de Waard
        Elsevier


        Waard, Anita de A (ELS-AMS)

        unread,
        Nov 11, 2010, 10:40:59 AM11/11/10
        to beyond-...@googlegroups.com, Jakob Voss
        Great points, Tim and Jakob!

        Some comments:

        Re. 1 and 3: Annotation = document?

        A research paper contains, well, research. This means a restatement of existing ideas and identification of the issue addressed (what Swales calls 'creating a research space'), several experiments to test this issue, and a discussion of what the experimental evidence means, in light of accepted knowledge. The way I usually phrase this is that it is like a fairytale, where the research question is the protagonist (or hero), the experiments are the dragons/witches/battles, and the conclusion describes the renewed, changed protagonist, who will set forth on a new adventure in future work. I think as a whole, it would be very interesting for us to connect more with the community investigating narrative structures in mythology and fairytales, since a lot of this work is very relevant! (See e.g. [1])

        By this process of putting up claims, holding them to existing knowledge, and 'hardening' claims by quoting them, 'knowledge' is created - and, with Tim, I agree with Bruno Latour's quote: "A fact is a claim, agreed upon by a committee." Here, a "fact" is a component of a model, a representation of the experimental reality. (See e.g. [2] for an illustration of this process.)

        Next to this stands data. Data is an (almost always) non-textual representation of experimental results, which underpins the claims the author makes. The link between claim and data is almost always the point of greatest critique of papers. This is where argumentation and hedging take place, in 90% of the cases (in biology, at least) in the form of a clause like 'These results suggest that...'.

        So - there is annotation in a research paper, but a research paper is not (just) annotation. This is why I still think a paper is not *just* a collection of research results: it is also the motivation to do the research, and the interpretation of the results in light of the canon of known, well, accepted fact.

        There are other paper types, of course, such as Letters to the Editor, Review papers, etc. which can be 'only' annotation.

        Re. 5: How can we represent papers vs. triples?

        I firmly believe 'triples are not enough' to adequately represent scientific discourse (see e.g. [3]), and am really delighted to see the emphasis on, and acceptance of, things like narrative structure and reasoning and author stance on this list.

        I tried in 2004 to make a simple model of a Cell paper as a set of entities and relationships (see [4] for an incredibly simplistic demo: if you click on the 'things' in the picture you go to their representation in the NCBI Map Viewer, if you click on an arrow you go to the point in the text where this relationship is argued) and the thing is, as Alex also pointed out, that blobs-and-lines just don't represent the complexity contained inside the text. This model was just not scalable - you can make useful graphs for people, and you can link to document portions, but the two are separate, and can work either for people, or for computers, but not both.

        5b. Representing 'Claims and Evidence': so, yes, that seems still the best way forward: see the list of projects under 'Hypothesis/Claim-Based representation of the Argument Structure of a Scientific Paper' on [5], and links on [6].

        Currently, there is a small-scale project underway to compare various annotation formats (including NaCTeM's BioEvent representation, EBI's COreSC annotation, and our Segment Type annotation) on a small corpus of full-scale text; and we (all) hope to use Harvard's Annotation Framework as a way to compare and contrast those annotations and offer them to 'real' users, to see if we can eke out the claims and evidence from a set of documents in a mutually acceptable and feasible way.

        So we come back to Phil's earlier point: apart from making authors do this (as has been mentioned before), we don't seem to have any hard evidence that any of it (fun though it is as an academic exercise) is of actual use to readers of the papers. Does anyone know of any user studies where actual readers benefited from this type of representation? Maybe defining, as Phil said earlier, what success looks like in terms of new forms of publishing, would be one of the main contributions (and goals?) of the workshop.

        Although I have to say, this discussion in itself is an awful lot of fun.

        Anita

        [1] http://ilk.uvt.nl/amicus/amicus_ws2010.htm
        [2] http://elsatglabs.com/labs/anita/SWASDHyper/epistemics.html_files/epistemics.002-006.jpg
        [3] http://www.slideshare.net/anitawaard/a-de-waard: Talk at C-SHALS 2010: "Representing scientific discourse, or: why triples are not enough"
        [4] http://elsatglabs.com/labs/anita/demos/xIedemo062004/index.htm
        [5] https://sites.google.com/site/beyondthepdf/project-areas
        [6] http://elsatglabs.com/labs/anita/SWASDHyper/

        Anita de Waard
        Disruptive Technologies Director, Elsevier Labs
        http://elsatglabs.com/labs/anita/

        a.de...@elsevier.com



        barend mons

        unread,
        Nov 11, 2010, 11:33:05 AM11/11/10
        to beyond-...@googlegroups.com, Jakob Voss
        Hi, Barend here...

        I am enjoying the lively discussion from the background, but unfortunately I am in meetings all day and not able to respond regularly. I think that Tim (and Jan) have put it very precisely:
        we need to (even when we get carried away a bit) make a good distinction between data, databases and papers, as well as between Information and Knowledge.

        Let me take (for clarity) the extreme view that 'knowledge is only in the heads of people', so indeed, the paper and 'triples' are both in the information category, where 'triples' (as in simple subject-predicate-object representations) even border on 'data'.

        Here is the take I follow in a perspectives paper I am writing on these issues: the classical paper should stay among us as the 'minutes of science' (Jan Velterop). However, as also stated in various forms by others in this thread, the classical paper is lousy for knowledge discovery, as we need more and more CPUs to do this and computers hate text (as in ambiguous and redundant symbol collections).

        As one of the early movers in the field of in-line semantic enrichment, I can be self-critical to my field and say that we were 'only partly right' (nice euphemism for 'wrong').

        Here is my plea: leave the paper as it is and ONLY improve it further in the direction of 'findability' and especially 'readability' (which might include some reader-directed semantic enrichment/links/popovers etc.), but do NOT treat it as a 'launching platform' of web services and applications, so that it becomes an unreadable christmas tree.

        Now on 'triples': unless they are richly annotated (say, nanopublications) they have very little added value, particularly because the redundancy problem in the information on the Web is not solved by putting the same assertional content in 'RDF'.
        IMHO there is NO other way in 'my' field, biology, than to go 'computational' for knowledge discovery. Papers are, as said, utterly inconvenient for that, and 'enriching them' will NOT help much.
        We should therefore SEPARATE very clearly the collection, cleaning, deduplication, sharing and use of rich assertional, computer-readable elements.

        My prediction is that the fine line between 'data publication', curation into databases and classical narrative publications will blur once formats such as nanopublications become widely used.
        However, we will ALWAYS need nicely written argumentative narrative texts to underpin simple assertions in the 'triple space'.

        VERY central (for me at least) in this discussion is that we should not see 'triples' as a new form of 'metadata' of papers; we should leave paper-centric thinking altogether, and focus on a network of interconnected concepts (an evidence cloud, if you wish), where individual elements (predictions, factoids etc.) are supported by papers which I want to download and quietly study during a train ride. Once I am in reading and reflection mode I do usually not wish 40 hyperlinks and popovers in the paper to distract me from that quest.

        Therefore, let's shift some perspective and see the data, databases and 'cardinal assertions' in papers as the primary scientific exchange mechanism that feeds computational biology, and treat the paper as the 'minutes of science' with its own role, which may hardly change over the decade to come. The challenge for 'reverse search engines', or rather 'assertion-projectors', is now to use the provenance associated with single assertions to point to the background resources that support (or contest) them, which can be data sets, curated databases or papers, or anything.

        Last point: I am talking intensively to the established 'impact' players about assertion-citation, so that datasets, databases and papers can be cited alike (and community-reviewed, as well as cited, at the individual assertion level).

        So, can we separate the two lines of thinking and be less paper centric ?

        now I will be silent again for a while....Phil, better implement some post-workshop-post-PDF trauma teams... 

        **************************************
        Dr. Barend Mons
        Scientific Director 
        Support and external relations
        Netherlands Bioinformatics Centre (NBIC)
        and Biosemantics Group
        Leiden University Medical Centre
        Phone:  +31 (0)24 36 19 500
        Fax:       +31 (0)24 89 01 798

        Mail: Netherlands Bioinformatics Centre
        260 NBIC
        P.O. Box 9101
        6500 HB Nijmegen

        Visiting address:
        LUMC building 2, Einthovenweg 20
        2333 ZC Leiden, The Netherlands










        Steve Pettifer

        unread,
        Nov 12, 2010, 5:02:52 AM11/12/10
        to beyond-...@googlegroups.com
        > Hi, ok to fork off this topic (which I'm very much enjoying and so happy to see Alex involved; also, I think the Utopia project has raised the bar in terms of making PDFs useful, so it'd be good to get Steve Pettifer's take on this...)

        I think I've succeeded in my ploy to create some controversy on the list, and failed in my attempt to explain my view on the relationship between PDFs and articles.

        In computer science we have the idea of design patterns; these are 'recipes' for helping us spot common, well, patterns in data and architecture, with lists of handy hints and tips for good ways to build things when you've spotted a pattern (and pitfalls too).

        The one that seems relevant here is the 'Model View Controller', which separates out underlying concepts from their representation. (http://en.wikipedia.org/wiki/Model–View–Controller)

        I think a lot of the disagreement about the role of the PDF can be put down to trying to overload its function: to try to imbue it with the qualities of both 'model' and 'view'. It's clear that publishers have been doing this for some time (historically of course it was the only viable solution: when all you have is a hammer, every problem looks like a nail). One of the things that software architects (and I suspect designers in general) have learned over the years is that if you try to give something functions that it shouldn't have, you end up with a mess; if you can separate out the concerns, you get a much more elegant and robust solution.

        My personal take on this is that we should keep these things very separate, and that if we do this, then many of the problems we've been discussing become more clearly defined (and I hope, many of the apparent contradictions, resolved).

        So... a PDF (or come to that, an e-book version or an HTML page) is merely a *view* of an article. The article itself (the 'model') is a completely different (and perhaps more abstract) thing. Views can be tailored for a particular purpose, whether that's for machine processing, human reading, human browsing, etc etc. The relationship between the views and their underlying model is managed by the concept of a 'controller'.

        For example, if we represent an article's model in XML or RDF (its text, illustrations, associated nanopublications, annotations and whatever else we like), then that model can be transformed into any number of views. In the case of converting XML into human-readable XHTML, there are many stable and mature technologies (XSLT etc). In the case of doing the same with PDF, the traditional controller is something that generates PDFs. The thing that's been (somewhat) lacking so far is the two-way communication between view and model (via controller) that's necessary to prevent the views from ossifying and becoming out of date (i.e. there's no easy way to see that comments have been added to the HTML version of an article's view if you happen to be reading the PDF version, so the view here can rapidly diverge from its underlying model). Our Utopia software is an attempt to provide this two-way controller for PDFs.

        I believe that once you have this bidirectional relationship between view and model, then the actual detailed affordances of the individual views (i.e. what can a PDF do well / badly, what can HTML do well / badly) become less important. They are all merely means of channeling the content of an article to its destination (whether that's human or machine).

        The good thing about having this 'model view controller' take on the problem is that only the model needs to be pinned down completely (and I would hope that this is something we can address in some depth at the workshop); after that you can have multiple controllers and views, and they can all live happily in the same ecosystem.

        I personally believe that getting the model right (which is where most of the discussion was focussing before I rather derailed it) is far more important than worrying about the individual views or controllers (though improving those is a Good Thing too of course).

        Perhaps separating out our concerns in this way -- that is, treating the PDF as one possible representation of an article -- might help focus our criticisms of the current state of affairs? I fear at the moment we are conflating the issues to some degree.

        > - Just read a paper in ACM communications that states that yet again, usage of PDFs far outstrips use of any other format on their websites - this is very much what all publishers I know experience, despite desperate (and costly) attempts to make the html/xml documents more machine and user-friendly

        Anita, could you post the reference to this please, I'd like to follow it up.

        Best wishes

        Steve

        Alexander Griekspoor

        unread,
        Nov 12, 2010, 5:13:48 AM11/12/10
        to beyond-...@googlegroups.com
        Hi Steve,

        I very much like this approach and I agree with your view. The thing
        where it might be getting confusing and where you might want to
        further clarify things is the following. You mention:

        "So... a PDF (or come to that, an e-book version or a html page) is
        merely a *view* of an article. The article itself (the 'model') is a
        completely different (and perhaps more abstract) thing."

        however at the end you come to the conclusion:

        "The good thing about having this 'model view controller' take on the
        problem is that only the model needs to be pinned down completely (and
        I would hope that this is something we can address in some depth at
        the workshop); after that you can have multiple controllers and views,
        and they can all live happily in the same ecosystem"

        and

        "I personally believe that getting the model right (which is where
        most of the discussion was focussing before I rather derailed it) is
        far more important than worrying about the individual views or
        controllers (though improving those is a Good Thing too of course)."

        Am I right in that I have the feeling you are talking about two
        different things here? The first "model" is more a mental or
        conceptual model of what an article actually is. In the second part it
        seems you are talking rather about the more classical computer related
        concept of a model as a data container/storage form. In general this
        discussion has been mixing these two freely, I must admit (switching
        between "Paper == PDF" and the concept of a scientific article).

        Perhaps you can elaborate on this aspect?
        Cheers,
        Alex

        barend mons

        unread,
        Nov 12, 2010, 5:32:12 AM11/12/10
        to beyond-...@googlegroups.com
        Steve, Alex, love following this on the side when I am supposed to
        listen to evaluation talks at FIZ Karlsruhe...
        can you also reflect on your take(s) on 'projecting' individual
        assertions 'when needed' onto their context (read: the paper/PDF)?
        So going from a 'triple store' back to the source paper, rather
        than thinking about 'enriching text' through mining or annotation (only)?
        Sorry if this comes across messy, but I am trying to multitask....

        **************************************
        Dr. Barend Mons
        Scientific Director
        Support and external relations
        Netherlands Bioinformatics Centre (NBIC)
        http://www.nbic.nl
        and Biosemantics Group
        Leiden University Medical Centre
        http://www.biosemantics.org
        Mobile: +31-624879779
        E-mail: Baren...@nbic.nl
        Phone: +31 (0)24 36 19 500
        Fax: +31 (0)24 89 01 798

        Mail: Netherlands Bioinformatics Centre
        260 NBIC
        P.O. Box 9101
        6500 HB Nijmegen

        Visiting address:
        LUMC building 2, Einthovenweg 20
        2333 ZC Leiden, The Netherlands

        Waard, Anita de A (ELS-AMS)

        unread,
        Nov 12, 2010, 10:15:43 AM11/12/10
        to beyond-...@googlegroups.com
        >>> - Just read a paper in ACM communications that states that yet again, usage of PDFs far outstrips....
        > Anita, could you post the reference to this please...

        Sure, it's behind closed walls unfortunately but it's http://cacm.acm.org/magazines/2010/11/100637-a-preference-for-pdf/

        > The one that seems relevant here is the 'Model View Controller'...

        I am not sure I quite get the model view... I do think there are three realms reflected in a scientific paper:
        - The conceptual realm (the models that are created and co-created between researchers; this is described in figures (pictures, schemas) and text, and generally shared)
        - The experimental realm (the methods and results, these are created by labs, generally not shared; represented in figures and tables, generally)
        - The discourse realm (this is the text connecting and creating all this communication, it is essentially in text)

        The paper reflects moves between these three realms.

        Then, on a more pragmatic level, of course there can be the representations of text/images/etc. These can be human or machine-palatable. I'd say the real issue here is who sees what and how - what is made available to whom, what formats can be exchanged etc.

        I have a hard time imagining an abstract, conceptual 'article', that does not exist as narrative text, although I can imagine a workflow/eLab output that feeds into an authoring tool (and perhaps intermediary steps are recorded for later access). But in my experience, as you write a paper you shape the story - and reflect on, and reassess, even what happened in the lab. You select data points and references to support your story - and should be allowed to do so, since the story is what transmits the higher-order knowledge...

        Do you think we can find a united view, Steve, or are we coming from too different a background here?

        Anita


        -----Original Message-----
        From: beyond-...@googlegroups.com on behalf of Steve Pettifer
        Sent: Fri 11/12/2010 5:02
        To: beyond-...@googlegroups.com
        Subject: Re: Was: The role of linear narrative? Is: How to enrich PDFs?

        > Hi, ok to fork off this topic (which I'm very much enjoying and so happy to see Alex involved; also, I think the Utopia project has raised the bar in terms of making PDFs useful, so it'd be good to get Steve Pettifer's take on this...)

        I think I've succeeded in my ploy to create some controversy on the list, and failed in my attempt to explain my view on the relationship between PDFs and articles.

        In computer science we have the idea of design patterns; these are 'recipes' for helping us spot common, well, patterns in data and architecture, with lists of handy hints and tips for good ways to build things when you've spotted a pattern (and pitfalls too).

        The one that seems relevant here is the 'Model View Controller', which separates out underlying concepts from their representation. (http://en.wikipedia.org/wiki/Model-View-Controller)

        Alexander Griekspoor

        unread,
        Nov 12, 2010, 10:29:44 AM11/12/10
        to beyond-...@googlegroups.com
        Hi Anita,

        Steve is referring to the Model View Controller (MVC) paradigm used in
        computer programming (see
        http://en.wikipedia.org/wiki/Model–View–Controller ). Model refers
        here to the data model / data container, not so much the conceptual
        model, or "the model" the authors distill from their results and
        present as a e.g. de visual abstract that Cell not tries to promote.

        Having said that, as I outlined in my previous email I did have the
        feeling as well that in Steve's argument he was referring to / mixing
        both roles, hence my question to elaborate on exactly this aspect.
        Cheers,
        Alex

        Gully Burns

        unread,
        Nov 12, 2010, 7:04:54 PM11/12/10
        to beyond-...@googlegroups.com
        I think it's worth attempting to link our discussion explicitly to the
        currently existing structures in scientific work generally. Since we're
        looking a lot at the content of an individual primary research paper
        (each one of which creates a 'story' that not only contextualizes but
        justifies the research work), I think it's worth thinking of the
        structure of a review paper or a textbook description of a field.

        There are a lot of very different 'types of assertion' within the
        narrative text of a review and these will be based on the domain
        (gene-specific, neuroscience-specific, etc) and may invoke different
        levels of acceptance or validity (implicit, cited, debatable,
        well-accepted, etc.). There is a well-defined relation between primary
        research papers (which contain both original observations and
        contextualizing interpretations), review articles (which contain
        aggregated interpretations that act as a mid-level summary) and
        textbooks (which contain aggregated interpretations that act as a
        high-level summary). Thus, I think that it is important to somehow
        include (and improve upon) this pre-existing framework.

        Gully

        Steve Pettifer

        unread,
        Nov 13, 2010, 4:48:16 AM11/13/10
        to beyond-...@googlegroups.com
        > I am not sure I quite get the model view... I do think there are three realms reflected in a scientific paper:
        > - The conceptual realm (the models that are created and cocreated between researchers, this is described in figures (pictures, schema's) and text, and generally shared)
        > - The experimental realm (the methods and results, these are created by labs, generally not shared; represented in figures and tables, generally)
        > - The discourse realm (this is the text connecting and creating all this communication, it is essentially in text)
        >
        > The paper reflects moves between these three realms.

        As Alex suggests in his followup email, I was trying to apply the Model / View / Controller (MVC) approach at a rather more fundamental technological level than what you're suggesting here (though the nature of the MVC means that it can quite reasonably be applied hierarchically, so your breakdown above is also entirely valid). What I was trying to get to is perhaps a 'MVC Knowledge Architecture' for storing and representing an article.

        The model in this instance relates to the fundamental components of an article, and how they relate to one another. At this level, it doesn't matter much what the article is about (though I'm assuming it is some kind of scientific article, rather than a bunch of words in general). At this level I would say that the Model component of an article has properties something along the lines of:

        For a 'traditional' article:

        - A title
        - Some authors and their institutions
        - A series of words, which are broken down into sections (which in turn have titles) [we could go further and say 'there are characters that make words that make sentences that make paragraphs, that make sections' if we really wanted to]
        - Links to internal components (such as tables, figures, cross references within the article)
        - Links to external documents (which could be data, or citations etc)
        - Images / Figures and their captions

        A more advanced article can conceivably then be annotated with:

        - a unique article identifier
        - identifiers for authors, institutions etc
        - identifiers for other things
        - annotations that further identify or enhance any of the traditional components, for example Named Entities, or regions of text that are in themselves assertions or information.
        - annotations that confer credit

        [These lists are not intended to be exhaustive, just examples of things that would go in a model]

        The point I'd like to make is that when we are talking about the Model of an article, we are free to store this data in any format we like, without having much regard to how it will be transformed for consumption by an outside agent (whether that be machine or human). So at this level, we can represent these concepts in RDF or XML or whatever mechanism we like. Indeed, we could actually store these things in various different formats, as long as they are described using some kind of controlled and agreed vocabulary that would allow stable translations between different formats of the same model.

        You'll notice that at this Model level, I've avoided talking about pages, or fonts, or layout, or what one might do when one views the article or processes it electronically -- these are all matters for consideration when we are thinking about the various Views. Also, I'm not talking about the scientific domain of an article; so it could be about Cheese Graters, Astro Physics or Protein Interactions, and the above Model (or some variation of it) would remain valid [and annotations could be added that are domain specific]. For example, in the Model, there could be a set of annotations that define/describe the Conceptual Realm, Experimental Realm and Discourse Realm.

        So that's the Model aspect.
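
        (As a purely illustrative aside: a minimal XML serialisation of such a Model might look something like the sketch below. The element names and identifiers are invented for the example, not a proposed standard; the point is only what is present (content, structure, links, annotations) and what is absent (pages, columns, fonts).)

        <article id="doi:10.1234/example.0001">  <!-- hypothetical article identifier -->
          <title>An Example Article</title>
          <authors>
            <author id="http://example.org/people/a-n-other">
              <name>A. N. Other</name>
              <institution>Example University</institution>
            </author>
          </authors>
          <section title="Introduction">
            <p>Some words, with an internal link to <xref target="fig1"/> and an
               external link to <cite target="doi:10.1000/example.cited"/>.</p>
          </section>
          <figure id="fig1" src="figure1.svg">
            <caption>A figure and its caption.</caption>
          </figure>
          <!-- an enhancement layered on top of the traditional components -->
          <annotation target="fig1" type="named-entity">example protein</annotation>
        </article>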

        The View then requires us to think about how that model is used and consumed. So for printing out on to paper, the view will need to deal with fonts, and layout of text (e.g. the classic two column presentation), and the placing of images in sensible locations relating to their reference in the text, and so on. For reading on a large screen, the view's requirements are similar (though perhaps might require a slightly different layout). For reading on a small screen, we may want to transform the Model's underlying text in to a single column view, with different image placement. Note that here I'm not talking about 'reflowing' the text since beyond the natural ordering of words and sentences the model shouldn't contain any layout information; if we found out that we'd included some information in the model that was 'tailored' for one particular view (e.g. which page some text was on, or which column it was in), then we would have our model wrong, since all views are essentially to be considered equal, with no special treatment. For processing by machine, it may well be that the view IS simply the model represented in RDF/XML.

        The Controllers are merely the bits of technology that take the model and transform it into a particular view, whilst retaining in the views a link to the underlying model.
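
        (Again purely as a sketch: a Controller for one particular View could be as small as an XSLT stylesheet over the invented markup above, producing a bare-bones HTML reading view and ignoring everything to do with pages and columns. A different stylesheet, against the same Model, could target a small screen, or simply emit RDF for machines.)

        <xsl:stylesheet version="1.0"
                        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="/article">
            <html>
              <head><title><xsl:value-of select="title"/></title></head>
              <body>
                <h1><xsl:value-of select="title"/></h1>
                <!-- each section of the model becomes a heading plus its paragraphs -->
                <xsl:for-each select="section">
                  <h2><xsl:value-of select="@title"/></h2>
                  <xsl:copy-of select="p"/>
                </xsl:for-each>
              </body>
            </html>
          </xsl:template>
        </xsl:stylesheet>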

        For example, it is increasingly common for articles to be given a DOI, which could easily be a mechanism of relating the view of an article (e.g. a PDF file) to its underlying model. However, that link is often not made. Some publishers embed the DOI as text in the article's view, which means it is accessible to humans. Others put the DOI in the PDF's metadata fields, which means the link back to the underlying model can be made by machine. Some do both. Many do neither (even though they have given the article a DOI); this means that the 'backward link' from view to model gets broken.

        I should add two more points. First, many of the concepts of the MVC pattern are *already* present in the way articles are generated and manipulated (e.g. the use of NLM XML in combination with HTML and PDF views). So I'm not at all suggesting that I've invented this idea, merely proposing that we make it a more explicit part of our discussions. Second, in the interest of fairness I should point out that there are well known downsides and costs to thinking in terms of MVC (these are documented in Design Patterns: Elements of Reusable Object-Oriented Software, ISBN 0-201-63361-2); I'm confident that the pros outweigh the cons in this case though.

        My reason for introducing this idea is primarily as a mechanism of helping us structure our discussions; are we talking about concepts that affect the Model, the View or both? I'm certainly not trying to dictate the contents, technology or structure of either the model or view, but I hope it will help target our discussions. For example, I think that on the whole the PDF file makes a lousy mechanism for storing the Model, a lousy view for consumption by machine, but a pretty good View for Printing An Article To Read On The Bus.

        > Do you think we can find a united view, Steve, or are we coming from too different a background here?

        I think the two views are entirely complementary!

        Steve

        Dave Argue

        unread,
        Nov 18, 2010, 10:17:19 AM11/18/10
        to beyond-...@googlegroups.com
        I think Steve is spot-on with the Model-View-Controller approach.  Abstracting out the model storage into whatever agreed-upon model we come up with (RDF, XML, etc.) provides infinite flexibility with view presentation on any current device (laptop, iPhone/iPad, mobile Android platform, etc.), yet-to-be-released future devices/platforms, as well as yesterday's technologies (Internet Explorer 6) likely to be found within libraries, which, as Anita so accurately pointed out, we need to support.  The challenge will be funneling data into the model when scientists are using such a broad range of tools today that aren't necessarily set up for exporting into a model format (e.g. Microsoft Word).

        David Argue
        Analyst/Programmer
        Mayo Clinic
        Rochester, MN  USA

        Peter Murray-Rust

        unread,
        Nov 18, 2010, 12:06:48 PM11/18/10
        to beyond-...@googlegroups.com
        I also support the MVC approach. I have other concerns with PDF which I'll address in that thread. However, to emphasize here: the XML vision of the Web, which I attribute to Jon Bosak ("the father of XML"), is very close to MVC.

        The XML mantra (derived from its predecessor SGML) is separating presentation from content (http://en.wikipedia.org/wiki/Separation_of_presentation_and_content ). The design is to represent the abstraction (Math, Chemistry, text) as formal objects to which style was applied (CSS or XSL-FO). Now not everybody agrees on the same abstraction but in many fields it's pretty good. It's certainly good enough for scientific document structure (chapters, tables, citations, etc.). Therefore what many scientists want is increasingly the semantic content, not the styled presentation.
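
        (A small concrete illustration, using MathML, which makes the split explicit: the same expression, x squared, written once as content markup that a machine can compute over, and once as presentation markup that only describes how the glyphs should be laid out.)

        <!-- content markup: the abstract object "x raised to the power 2" -->
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <apply><power/><ci>x</ci><cn>2</cn></apply>
        </math>

        <!-- presentation markup: layout only; the meaning has to be guessed back -->
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <msup><mi>x</mi><mn>2</mn></msup>
        </math>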

        Adding style normally destroys some or all of the semantics. It doesn't have to, but for most publishers their conversion to PDF destroys almost everything and corrupts some of the rest. [1]

        In CML Henry Rzepa and I set out to produce a complete chemical abstraction. That was 15 years ago. Most people still don't understand the value of content/presentation. We see tools that create "wiggly bond", "large blob". The reader may (or often may not) guess what these mean. But Linus Pauling did not write "The Nature of the Wiggly Bond" or about the "green part of the molecule". These are annotations on reality.

        Some of you may not know that publishers actually use XML in their publication. It has the complete semantics of the document structure. It doesn't do actual scientific content like chemistry ( very few might do a very little bit). It would be very useful for scientists to have this XML. They could use it for text- and data mining, and for many other things.

        But the publishers absolutely refuse to make their XML available. I've asked them publicly and privately. There was a silly gesture where Nature published bits of its text but with all sentences in alphabetical order so you couldn't make any sense of them (http://opentextmining.org/wiki/Main_Page). The point of this was to stop scientists actually reading the complete paper in XML. They thought that simply giving a wordlist of content was useful. It isn't.

        A typical problem in machine reading of scientific papers is interpreting tables. They can be very valuable. PDF COMPLETELY destroys tables. You cannot even tell there are tables in the paper, unless perhaps there is the word "Table". Vast amounts of human endeavour are spent trying to reconstruct tables from PDF. If the publishers treated readers with any respect they would make the tables available as XML (they have them of course), but instead they subject us to this appalling insult. All in the name of commerce.
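
        (To make that concrete: even the most minimal table markup, sketched here as plain XHTML-style XML with invented values, preserves exactly the structure and the characters that a PDF rendering throws away.)

        <table>
          <caption>Table 1. Example measurements (values invented for illustration).</caption>
          <thead>
            <tr><th>Sample</th><th>Temperature (K)</th><th>ΔG (kJ/mol)</th></tr>
          </thead>
          <tbody>
            <tr><td>A</td><td>298</td><td>-12.4</td></tr>  <!-- the minus sign survives as a real character -->
            <tr><td>B</td><td>310</td><td>7.9</td></tr>
          </tbody>
        </table>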

        Yes - I feel strongly because I have spent many months of my life writing software that "reads" PDF.  That's why I never want to see a publisher's PDF again. If I want a PDF, give me the XML and it's trivial to create a PDF. But the other way around is like "trying to turn a hamburger back into a cow".




        [1] If you can't accept this look at http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2735 which is obviously corrupted (numbers to pixels - in a PDF) and http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2666 which shows the corruption of minus signs to garbage. The destruction of minus signs is one of the commonest and most insidious effects of presentation over content. Undiscovered corruption is one of the worst aspects of semantic destruction.

        --
        Peter Murray-Rust
        Reader in Molecular Informatics
        Unilever Centre, Dep. Of Chemistry
        University of Cambridge
        CB2 1EW, UK
        +44-1223-763069

        Gully Burns

        unread,
        Nov 18, 2010, 1:57:33 PM11/18/10
        to beyond-...@googlegroups.com
        On 11/18/10 9:06 AM, Peter Murray-Rust wrote:
        > I also support the MVC approach. I have other concerns with PDF which
        > I'll address in that thread. However to emphasize here that the
        > XML-vision of the Web, which I attribute to Jon Bosak ("the father of
        > XML") is very close to MVC.
        >
        My view of the MVC approach is not so much that the 'model' is based on
        XML / RDF or even OWL. Those are really just formats (or knowledge
        formulations, they can be made interchangeable and are pretty much
        ambivalent to their content). The key here is *what* the design of the
        model should be. Probably more sensibly, we should ask:

        "What framework of these formulations for models of the *underlying
        science* should we be thinking about?"

        In other words, let's imagine that this will form a framework for
        nanopublications that provide a mechanism for citing each other and
        creating a chain of reasoning, for reviewing the existing body of
        knowledge and evaluating new knowledge, for argumentation, for credit
        attribution and (most importantly) for making predictions that may be
        tested experimentally. These mechanisms are going to have to be
        formulated so that they're not completely new, they need to be
        recognizable in the existing framework of modern scientific publishing
        and they need to scale up very efficiently to be workable.

        I think we should be able to begin to formulate these ideas given the
        current toolset and given the combined expertise of this community. I
        have my own ideas on this and would welcome some dialog on the matter.

        :-)

        Gully

        Leonard Rosenthol

        unread,
        Nov 18, 2010, 6:07:03 PM11/18/10
        to beyond-...@googlegroups.com

        Let me start by saying that I am in full agreement with you: that the separation of presentation from content (and/or structure) is quite important, and I fully support the use of structured content (be it in XML or something else) for authoring/editing the content.  I also believe it is extremely important that that structure be carried along with the presentation, to enable a single document that offers BOTH…

         

        Which is why PDF 1.3 (Acrobat 4.0, 2000) introduced the idea of “Tagged PDF” (ISO 32000-1, 14.7 & 14.8) where an XML-like tagging structure was added to the PDF language to enable it to carry the structure of the content ALONG WITH the presentation.  The original set of tags match those of HTML 4.x, though it has been extended over the years to include tags from other standards such as DITA and MathML.   Therefore, a properly prepared PDF document includes all the necessary tags that delineate such things as tables, figures, etc.  You can easily create such documents with tools from Adobe, Microsoft, OpenOffice and even modern versions of pdfTeX.

         

        Now granted, not every PDF is prepared in this manner (in fact, it’s unfortunate that most are not).  However, that is NOT a limitation of PDF, but a limitation of the tooling that is used to produce the documents.  So PLEASE put blame where blame is due – the “quick and dirty” PDF creation tools and NOT the file format itself. There are also products on the market (including Adobe Acrobat itself) that can convert an untagged PDF into a tagged PDF using machine heuristics.   Granted, it’s not perfect – but it’s also better than having no tagging at all.

         

        I will also point you to my previous posts about the forthcoming “Source Content” feature of PDF 2.0 which takes this to the next level…

         

        As for your Unicode rant – I agree 100% - but again, I don’t see what that has to do with PDF.  PDF has supported Unicode since 1.2 (Acrobat 3, 1996).  (and Adobe Acrobat/Reader are also smart about all the issues you raise, including ligature search, copy/paste of both Unicode and non-Unicode info, etc.)

         

        Leonard

         

        From: beyond-...@googlegroups.com [mailto:beyond-...@googlegroups.com] On Behalf Of Peter Murray-Rust
        Sent: Thursday, November 18, 2010 12:07 PM
        To: beyond-...@googlegroups.com
        Subject: Re: Was: The role of linear narrative? Is: How to enrich PDFs?

         

        I also support the MVC approach. I have other concerns with PDF which I'll address in that thread. However to emphasize here that the XML-vision of the Web, which I attribute to Jon Bosak ("the father of XML") is very close to MVC.

        Peter Sefton

        unread,
        Nov 19, 2010, 4:38:56 AM11/19/10
        to beyond-...@googlegroups.com
        Right, Gully, the Model we seek needs to be abstracted away from XSD /
        OWL etc. I think we seek an abstract model of research practice (which
        is maybe not quite as abstract as 'underlying science') that can be
        mapped on to technologies like XML schemas etc.

        I have been unsure about which thread to jump back into, but this one
        seems to be as good as any. One thing we could do is look at what
        research communications might look like if we designed the system now
        (others have said this in other threads).

        If we tear down all the assumptions about how things work now with
        journals and peer review and libraries and repositories and so on,
        then what do you actually _need_ to do science in a web-connected
        world?

        1. Identification and capture of data. (This is wildly different for
        different disciplines but there will be some commonalities) - as Peter
        M-R points out in some disciplines the data can function on their own
        without really needing the next bits to do with articles and so on.

        2. Articles and theses that report on them, with some way for readers
        (human and machine) to judge their reliability. Traditionally, this
        judgement involves what journal they're in and the like.

        3. Processes to establish the trust ratings needed for 2.
        Traditionally that's peer review or examination, but it could be an
        open system where annotation and ratings (see 4) allow a blog post to
        turn into the equivalent of a refereed paper maybe after some
        (tracked) changes. Cameron Neylon talked about the importance of this.

        4. The system of citation and annotation and trust needed to support 2
        & 3, and to provide the metrics that lead to the all-important reward
        structures.* Citation in particular (which is a kind of annotation)
        could be vastly simplified in the web-world (and yes, in PDF too); we
        have this enormously costly legacy of formatting citations just-so,
        left over from the days when citations needed to be carefully
        constructed by hand for a print world. In some of the sciences we
        could throw that out and use simple links to DOIs with a formal
        relation such as http://purl.org/spar/cito/cites (see the sketch
        after this list). Machines can compile reference lists and display
        them to your taste. (In the humanities it's much, much more
        complicated than that, as I know from many interactions with
        Bruce D'Arcus).
        5. Somewhere(s) to keep it all - like a repository or a federation.
        There are already some subject repositories that serve as models
        although large data sets might have to live back in the institutions
        that produce them.
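
        To illustrate point 4, a citation expressed as a typed link really
        can be a single triple. A minimal sketch, assuming the rdflib
        library and two invented example DOIs (only the cito:cites relation
        comes from the list above):

        from rdflib import Graph, Namespace, URIRef

        CITO = Namespace("http://purl.org/spar/cito/")

        g = Graph()
        g.bind("cito", CITO)
        citing = URIRef("https://doi.org/10.1000/example.1")  # hypothetical DOI
        cited = URIRef("https://doi.org/10.1000/example.2")   # hypothetical DOI
        g.add((citing, CITO.cites, cited))

        # machines can rebuild and restyle reference lists from triples like this
        print(g.serialize(format="turtle"))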


        As Gully points out, this needs to be overlaid on the existing
        systems we're using, but I gather that some of the participants here
        have the credibility to pull something like this off and have it
        count in the old research economy, maybe as a new journal, as well
        as to start creating the new research economy/ecology.

        My interest and expertise is in how to co-opt the existing tools
        people use to support these new models, assuming that we are going to
        be dealing with word processors (including stuff like Google Docs) and
        blog platforms and the like. I'll be showing some demos using the
        samples.

        (One thing I think is often lost in this kind of conversation is that
        the abstract Model of a generic document is pretty well supported by
        the likes of MS Word, which evolved over a number of years to support
        outlines and tables, and embedded data (via spreadsheets) etc. And
        there are long-standing ways to add semantics to Word processing
        documents using styles and fields and so on. Likewise for the web,
        where there are increasingly mainstream efforts to add basic generic
        structure to documents via HTML5, Microformats and RDFa. I think the
        research community would do better to align with this work rather than
        attempt to go off in search of The One True Model about which there is
        not likely to be agreement anyway.)
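
        To make the "styles and fields" point concrete: a named paragraph
        style is already a crude semantic label that survives editing in
        Word. A minimal sketch, assuming the python-docx library, a
        hypothetical file name and a hypothetical style called "Chemistry":

        from docx import Document

        doc = Document("manuscript.docx")            # hypothetical file name
        for p in doc.paragraphs:
            if p.style.name == "Chemistry":          # hypothetical semantic style
                print("Tagged as Chemistry:", p.text)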

        * I know I've left out the possibility of embedding scientific
        semantics in the Model, I don't understand how that would work.

        ptsefton (you can call me pt or ptsefton or
        <http://nla.gov.au/nla.party-541658> to avoid confusion with other
        Peters).

        --
        Peter Sefton
        Manager, Software Research and Development Laboratory,
        Australian Digital Futures Institute,
        University of Southern Queensland
        Toowoomba Queensland 4350 AUSTRALIA


        Work: sef...@usq.edu.au
        Private: p...@ptsefton.com

        IM accounts:
        Gmail: ptse...@gmail.com
        Yahoo: peter_...@yahoo.com
        MSN:  p...@ptsefton.com
        AIM: ptsefton

        p: +61 (0)7 4631 1640
        m: +61 (0)410 326 955


        USQ Website: http://www.usq.edu.au
        Personal Website: http://ptsefton.com

        Peter Murray-Rust

        unread,
        Nov 19, 2010, 6:00:23 AM11/19/10
        to beyond-...@googlegroups.com, Joe Townsend
        On Fri, Nov 19, 2010 at 9:38 AM, Peter Sefton <ptse...@gmail.com> wrote:
        My interest and expertise is in how to co-opt the existing tools
        people use to support these new models, assuming that we are going to
        be dealing with word processors (including stuff like Google Docs) and
        blog platforms and the like. I'll be showing some demos using the
        samples.

        (One thing I think is often lost in this kind of conversation is that
        the abstract Model of a generic document is pretty well supported by
        the likes of MS Word, which evolved over a number of years to support
        outlines and tables, and embedded data (via spreadsheets) etc. And
        there are long-standing ways to add semantics to Word processing
        documents using styles and fields and so on. Likewise for the web,
        where there are increasingly mainstream efforts to add basic generic
        structure to documents via HTML5, Microformats and RDFa. I think the
        research community would do better to align with this work rather than
        attempt to go off in search of The One True Model about which there is
        not likely to be agreement anyway.)

        I agree with this. I've spent a considerable time hacking Word/XML. I declare an interest: part of my research (though not me personally) is supported by Microsoft (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2407).

        My objective opinion is that WordXML (OOXML) is a reasonable container for semantic science. You will not get anything much simpler (it's possible to hack XHTML for some things but Word is more powerful). So the Word document per se is a good place to start. It's Open in that the spec is published and there are also Open Source libraries. For what I wanted I wrote my own.

        What is more difficult is adding interactive functionality within the Word/Office environment. We (Cambridge + MS) have made good progress, but you really have to be an expert in Word APIs.
         
        * I know I've left out the possibility of embedding scientific
        semantics in the Model, I don't understand how that would work.

        This is a relatively good aspect of OOXML. It supports semantic links and foreign namespaces. We can add CML into OOXML either from within Word or completely independently using Java APIs. That's not pretty, but it can be done, and it only has to be done once.
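
        Because a .docx is just a zip of XML parts, foreign-namespace data
        such as CML can in principle travel inside it as an extra part, even
        without Word or the Java APIs mentioned above. A minimal sketch
        using only the Python standard library, with an invented part name
        and file name; a real implementation would also need to register the
        part in [Content_Types].xml and the relationships, which is omitted
        here:

        import zipfile

        CML = b"""<?xml version="1.0"?>
        <cml xmlns="http://www.xml-cml.org/schema">
          <molecule id="m1"><atomArray><atom id="a1" elementType="C"/></atomArray></molecule>
        </cml>"""

        with zipfile.ZipFile("paper.docx", "a") as docx:  # hypothetical file name
            # inspect docx.namelist() to see the existing OOXML parts
            docx.writestr("customXml/item100.xml", CML)   # invented part name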

        As pt says - there is little point in reinventing it. The problem is the toolchain. I think we have to create a toolchain completely separate from Microsoft. The simplest thing would be to inject our semantics into Word containers. That will only work for geeks, but it would certainly get data wrapped and transported, and much better than PDF/XMP. The next step would be to add functionality into Open Office. But that seems a major disappointment. PT knows more about it. So I don't have any very clear suggestion.

        But whatever we do we have to create XML components for our science. There is currently no other semantic alternative.

        Kevin B. Cohen

        unread,
        Nov 19, 2010, 11:45:35 AM11/19/10
        to beyond-...@googlegroups.com
        > 1. Identification and capture of data. (This is wildly different for
        > different disciplines but there will be some commonalities) - as Peter
        > M-R points out in some disciplines the data can function on their own
        > without really needing the next bits to do with articles and so on.

        I'm trying to think through how this would work out in different
        fields...I work in text mining, where we deal with perennial problems
        related to copyright,
        access, etc. Making data available is often not an option, and more
        or less never an option for people who work on clinical data. Not
        sure how these
        barriers to letting the data "function on [its] own"...

        Kev

        --
        Kevin Bretonnel Cohen, PhD
        Biomedical Text Mining Group Lead, Center for Computational
        Pharmacology, U. Colorado School of Medicine
        and
        Lead Artificial Intelligence Engineer, The MITRE Corporation, Human
        Language Technology Division
        303-916-2417 (cell) 303-377-9194 (home)
        http://compbio.ucdenver.edu/Hunter_lab/Cohen

        Peter Murray-Rust

        unread,
        Nov 19, 2010, 1:01:54 PM11/19/10
        to beyond-...@googlegroups.com
        On Fri, Nov 19, 2010 at 4:45 PM, Kevin B. Cohen <kevin...@gmail.com> wrote:
        > 1. Identification and capture of data. (This is wildly different for
        > different disciplines but there will be some commonalities) - as Peter
        > M-R points out in some disciplines the data can function on their own
        > without really needing the next bits to do with articles and so on.

        I'm trying to think through how this would work out in different
        fields...I work in text mining, where we deal with perennial problems
        related to copyright,
        access, etc.  Making data available is often not an option, and more
        or less never an option for people who work on clinical data.  Not
        sure how these
        barriers to letting the data "function on [its] own"...

        Making data available is often simple, but it can be complex. There are often legal or ethical reasons why data cannot be distributed. In some cases funders and researchers provide mechanisms by which appropriate researchers can access them.

        There are other cases where data could simply be published but are not. This is sometimes laziness on the part of the author, sometimes ignorance, sometimes cost-cutting on the part of the publisher (e.g. the recent decision of J. Neuroscience to stop publishing supporting information). It is almost universal that non-OpenAccess publishers will allow humans to read papers but forbid you to extract data by machine (there is no legal or ethical reason for this). As a text-miner I share your frustration.

        There are several reasons for text mining.
        (a) to recover semantics that were lost on publication.
        (b) to analyse human language style and try to extract intent, co-reference, etc.
        (c) to build data models out of human language
        (d) to analyse human discourse

        (a) could be avoided by effort in publishing. (b), (c) and (d) are serious fields of scientific endeavour.
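
        As a trivial illustration of (a), much of this recovery work amounts
        to pulling identifiers and quantities back out of prose that was
        structured before publication. A minimal sketch with an invented
        example sentence and a deliberately simplistic DOI pattern:

        import re

        text = "Data are available under doi:10.1000/example.42 on request."  # invented example
        doi_pattern = re.compile(r"\b10\.\d{4,9}/\S+")
        print(doi_pattern.findall(text))   # -> ['10.1000/example.42']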


        P.

        Kevin B. Cohen

        unread,
        Nov 19, 2010, 1:49:48 PM11/19/10
        to beyond-...@googlegroups.com
        On Fri, Nov 19, 2010 at 11:01 AM, Peter Murray-Rust <pm...@cam.ac.uk> wrote:
        > There are several reasons for text mining.
        > (a) to recover semantics that were lost on publication.
        > (b) to analyse human language style and try to extract intent, co-reference,
        > etc.
        > (c) to build data models out of human language
        > (d) to analyse human discourse
        >
        > (a) could be avoided by effort in publishing. (b), (c) and (d) are serious
        > fields of scientific endeavour.

        There are many more use cases for text mining than this, hence many
        situations where the inability to make the input data available is
        sub-optimal for science.
