capturing workflows and embedding in word documents


Jill Mesirov

Nov 12, 2010, 6:12:04 PM
to beyond-...@googlegroups.com
For those of you who haven't seen the Science piece I wrote on
accessible reproducible research: it discusses the need for two
things, a system to capture the analysis automatically and an easy
way to embed it in the manuscript itself. "Accessible" in this case
means accessible to someone who doesn't program and never wants to.

I'm enclosing the relevant links for your amusement - there's a video
that shows the doc in action.

http://www.broadinstitute.org/cancer/software/genepattern/grrd/AddIn.html
- Has all the relevant links available.
In particular - even if you don't have a Science subscription you can
get to the paper from there.

The video for a quick peek is at
http://www.broadinstitute.org/cancer/software/genepattern/grrd/WordAddInDemo.mov

Best,
J


--
Jill P. Mesirov, Ph.D.
Associate Director and Chief Informatics Officer
Director, Computational Biology and Bioinformatics

Broad Institute of MIT and Harvard
7 Cambridge Center
Cambridge MA 02142
phone: 617-714-7070
fax : 617-714-8991
email: mes...@broad.mit.edu

Leonard Rosenthol

Nov 13, 2010, 8:26:05 PM
to beyond-...@googlegroups.com
This is GREAT - and exactly the type of thing that we were envisioning for the "Source Content" feature of PDF 2.0.

I'd personally love to see a companion plugin for Adobe Acrobat and/or Reader to enable this: the Word plugin would embed the necessary information into the produced PDF, which could then be picked up in Acrobat/Reader to enable the same views, reruns, etc.

Leonard

Jodi Schneider

Nov 14, 2010, 5:18:55 AM
to beyond-...@googlegroups.com
Thanks, Jill. I'm really impressed that you're embedding interactivity in a way that's both easy for the author and seems to suit the science perfectly! Native Mac and Linux versions of this plugin would be interesting; I took a look but don't run Parallels or VMware.

Leonard, you might be interested in looking at the Utopia Documents PDF viewer and enhanced PDFs. I took a look yesterday which I wrote about here:
There are 2 short screencasts of some of the interactive content.

-Jodi

Leonard Rosenthol

Nov 14, 2010, 9:51:43 AM
to beyond-...@googlegroups.com

The ideas of Utopia are excellent, but their implementation isn’t (IMO) the right approach. The PDF itself doesn’t contain any of that rich information, so that it can be used/mined/extracted – instead, it appears to be sitting in one (or more) databases or data repositories online that Utopia is able to “magically” locate and then enable.

I’d prefer to see the same user experience (which is quite well done!) applied to a PDF with that type of rich semantics embedded…

Leonard

Tim Clark

Nov 14, 2010, 10:18:56 AM
to beyond-...@googlegroups.com
Hi Leonard,

This "in the PDF" approach is all good if you are talking only about contributions / annotations of a single person, or about something that is both completely authoritative and public.  But there is a strong use case for multiple, shareable perspectives - for example within a lab or collaboration. This is why Steve with Utopia and our MGH + NIF with Annotation Framework use a standoff metadata model, and why we (speaking for myself but I believe Steve is likely to be in agreement) advocate standardizing and opening the model of metadata, which can be done using a fairly simple ontology model.  

IMO what you are advocating with the "baked in the PDF" approach is simply the addition of more and more metadata to the existing stuff already there. But what if this is not *fully authoritative* metadata? What if it is discussion, or comes from multiple sources, or contains contradictory views? You will end up with more and more bloating, among other negative results. Also, what about private metadata, i.e. notes?

If you have a PDF and I have a copy of the same PDF, and we make notes about the same content, we should be able to share - or not share - them freely without getting into all the mess of multiple file copies etc.  If there is a group of ten people working on the same problem, they should be able to share equally.  This should not require that they all have shared access to a single file copy. 

But I also realize PDF has always been about a "self contained" model of information.  If you disconnect from the Web, PDFs still work. So is there perhaps a way to implement this as a spectrum where the metadata can exist within or outside of the PDF?  In fact, "outside the PDF" means on the Web, and there are various existing and emerging standards for how to do this. 

I believe that if we were to achieve agreement on a model of annotation metadata that could exist in the same form within the PDF, or outside the PDF, or both, that would be ideal.  
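
As an illustration only (this sketch is not part of the original message, and every field name in it is a made-up assumption rather than an existing standard): one way to read "the same form within the PDF, or outside the PDF, or both" is that an annotation is a small self-describing record that points at a document by a stable ID rather than by a file path. The same serialized record could then be embedded in the file or kept on a server, and a reader could merge the two sets.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Annotation:
    """One note about a span of a document (hypothetical field names)."""
    document_id: str   # stable ID or content hash of the PDF, not a file path
    selector: str      # where the note applies, e.g. "page=3;chars=120-180"
    body: str          # the note itself
    author: str
    visibility: str = "private"   # "private", "group", or "public"


def to_json(annotations):
    """Serialize annotations to one JSON string.

    The same string could be embedded in the PDF or stored on a server.
    """
    return json.dumps([asdict(a) for a in annotations], sort_keys=True)


def from_json(text):
    """Rebuild annotations from JSON, wherever that JSON was stored."""
    return [Annotation(**record) for record in json.loads(text)]


def merge(embedded, external):
    """Combine both sources, dropping exact duplicates and keeping order."""
    merged = []
    for note in list(embedded) + list(external):
        if note not in merged:
            merged.append(note)
    return merged


# Two readers annotate the same passage of the same document.
mine = Annotation("sha256:abc123", "page=3;chars=120-180",
                  "Check this antibody lot.", "reader_a")
yours = Annotation("sha256:abc123", "page=3;chars=120-180",
                   "Disagree; see Fig. 2.", "reader_b", "group")

inside_pdf = to_json([mine])          # what might be baked into the file
on_the_web = to_json([mine, yours])   # what a shared store might hold
combined = merge(from_json(inside_pdf), from_json(on_the_web))
```

Because both copies use one format, nothing in `merge` needs to know where a note came from; deciding which notes to show or share would be a policy layered on top of the `visibility` field.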

Also - ideally when I open a PDF that contains annotation referencing some entity that is commonly studied or used outside the document itself - e.g. a protein, a database, a reagent, a computational tool, a workflow - my Web browser should just natively be able to connect to all other sources of information about that entity wherever they are on the Web, and use these connections to enhance the information I see without jumping all over the place. Annotation itself is, or at least should be, an independently sharable boundary object.

Best

Tim

Jill Mesirov

Nov 14, 2010, 11:02:45 AM
to beyond-...@googlegroups.com
I agree about the Mac/Linux versions - Microsoft sponsored the
implementation for Windows only but did make the code open source and
available. I think the implementation is very .NET-dependent and we
don't have that kind of expertise in house. We'd love to collaborate
with Apple on a native Mac version - know anyone who might be interested?
J


Leonard Rosenthol

Nov 14, 2010, 12:33:31 PM
to beyond-...@googlegroups.com

There are definitely two different, but potentially related, items here…

1 – richer material included natively into the PDF at the time of publication

This is where the actual data (XML, XLS, etc.) would be “attached” to the visual table/graph/chart in the PDF, or the MathML or ChemML (or whatever) associated with a given equation or molecular structure, etc. This would enable the type of extended UI that Utopia presents on various elements in the PDF – which are indeed the types of things you want to be able to do, whether you are connected to the internet (or some subset thereof) or not.

2 – annotations, added after publication

At Adobe, because we believe that a PDF should be “self-contained”, we’ve approached document collaboration via a “synchronization model”. Everyone can work on their own copies of the document, and their comments are submitted (when they want) up to a “repository”. At any time, each person can either manually or automatically synch their comments with all others in the repository. This gives you the “best of both worlds” as you get individual copies of documents, private and public comments, collaboration on comments (replies, etc.) AND they can also live in the PDF itself for offline viewing/processing.
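
To make the shape of item 2 concrete, here is a minimal Python sketch of a synchronization model. It is an illustration only: it is not how Acrobat implements this, and the class and method names (`Repository`, `LocalCopy`, `submit`, `sync`) are assumptions invented for the example. Each person's copy carries its own comments, public ones are pushed up on request, and anyone can pull the shared set back into their copy.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    comment_id: str
    author: str
    text: str
    private: bool = False   # private comments never leave the local copy


@dataclass
class Repository:
    """Shared store of submitted comments (stand-in for a real server)."""
    comments: dict = field(default_factory=dict)


@dataclass
class LocalCopy:
    """One person's copy of the document, carrying its own comments."""
    owner: str
    comments: dict = field(default_factory=dict)

    def add(self, comment):
        self.comments[comment.comment_id] = comment

    def submit(self, repo):
        """Push this copy's non-private comments up to the repository."""
        for cid, comment in self.comments.items():
            if not comment.private:
                repo.comments[cid] = comment

    def sync(self, repo):
        """Pull every comment in the repository into this copy."""
        for cid, comment in repo.comments.items():
            self.comments.setdefault(cid, comment)


repo = Repository()
alice = LocalCopy("alice")
bob = LocalCopy("bob")

alice.add(Comment("a1", "alice", "Are the units in Table 2 correct?"))
alice.add(Comment("a2", "alice", "Note to self: reread section 4.", private=True))
alice.submit(repo)   # only a1 is shared
bob.sync(repo)       # bob's copy now holds a1 and works offline
```

After `bob.sync(repo)`, the shared comment is stored in Bob's own copy, so it remains available with no connection, while Alice's private note exists only in hers.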

 

Which also takes us to another issue, and that’s archiving (esp. long-term archiving) – which is another reason that the above solutions work well: when the document needs to be “archived off” (be it for personal use, organizational use, or submission to something like NARA or LOC – or even submission to the FDA), you already have all the necessary pieces.

Leonard

Steve Pettifer

Nov 14, 2010, 1:50:06 PM
to beyond-...@googlegroups.com
> This "in the PDF" approach is all good if you are talking only about contributions / annotations of a single person, or about something that is both completely authoritative and public. But there is a strong use case for multiple, shareable perspectives - for example within a lab or collaboration. This is why Steve with Utopia and our MGH + NIF with Annotation Framework use a standoff metadata model, and why we (speaking for myself but I believe Steve is likely to be in agreement) advocate standardizing and opening the model of metadata, which can be done using a fairly simple ontology model.

Tim, we are in total agreement! [And Leonard, thanks for the positive comments about Utopia -- I'm sorry we disagree on the rest!] 

I would be very happy indeed to see a PDF in which it is both possible (I believe much of it is already) and common practice (sadly, not currently very common) to add additional metadata ('baked in' as Tim nicely puts it). However, this metadata can only ever refer to either a) the article of record, or b) external data captured at that moment in time. For example, let's say you wished to refer from the PDF to a particular database entry: you have the choice of a) copying that database entry into the PDF in some form (which could be bloated, but would work offline and be reliable etc.), b) including a link to whatever the up-to-date version of the record may be online, and relying on this being resolved at 'read time', or c) both. If you are prepared to accept b) or c) as a sensible option, then why not use the same mechanism for accessing all the richer data / metadata associated with the article, since the infrastructure for doing b) or c) will be much the same (and, reductio ad absurdum, the only thing you need to store in the PDF is a unique ID that allows the rest to be fetched at runtime)?
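
A minimal Python sketch of options a), b) and c) for a single database reference, offered as an illustration rather than as anything Utopia actually does; the `DataReference` structure, the `resolve` function and the example URL are all assumptions invented here. The reference may carry an embedded snapshot, a link, or both. At read time the link is tried first and the snapshot is the fallback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataReference:
    """A reference stored in the document (hypothetical structure)."""
    record_id: str                   # unique ID of the external record
    url: Optional[str] = None        # option b): link resolved at read time
    snapshot: Optional[str] = None   # option a): copy captured at publication


def resolve(ref, fetch):
    """Return (content, source) for a reference.

    `fetch` is any callable that takes a URL and returns text, and that
    raises OSError when offline or when the link is dead.
    """
    if ref.url is not None:
        try:
            return fetch(ref.url), "live"
        except OSError:
            pass  # fall through to the embedded copy, if there is one
    if ref.snapshot is not None:
        return ref.snapshot, "snapshot"
    raise LookupError(f"cannot resolve {ref.record_id}: no link and no snapshot")


# Stand-in fetchers so the example runs without a network.
def online(url):
    return "record v2 (current)"


def offline(url):
    raise OSError("no network")


# Option c): both a link and an embedded snapshot.
ref = DataReference(
    "example:P12345",
    url="https://example.org/records/P12345",
    snapshot="record v1 (as published)",
)

connected = resolve(ref, online)
disconnected = resolve(ref, offline)
```

With a connection the reader sees the current record; without one, the version captured at publication. Dropping `snapshot` gives the "unique ID only" extreme, which then fails whenever the link cannot be resolved.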

> I believe that if we were to achieve agreement on a model of annotation metadata that could exist in the same form within the PDF, or outside the PDF, or both, that would be ideal.

My view is that metadata for the Article of Record goes in the PDF, size permitting, but also that links are kept to data outside the PDF, which can be resolved at 'read time' to make sure that the PDF is kept as both an Article of Record (JV's 'minutes of science') and as a 'Living Document' with links to up-to-date data, comments etc. [and referring to my previous witterings on the subject, as much as I like PDFs and I think they make an excellent 'View', I don't believe they make good vehicles for storing an article's 'Model']

Best wishes 

Steve

Leonard Rosenthol

Nov 15, 2010, 8:26:33 AM
to beyond-...@googlegroups.com

I have no problem with there being external references and other information in a PDF – but as you note, there is the related issue of standardizing where/how such information is referenced. What format? Where in the PDF? Etc. If we could all agree on what goes in and how, then we have interoperability, and that’s the most important aspect.

Today there is no question that PDF is just the ‘view’ of the MVC model. However, our goal going forward is to add the ‘model’ to that – so that not only do you have a specific view, but you also have all the necessary pieces to go back to “edit mode” with the model and perhaps even recreate a view (this may be the entire PDF or just some subsection of it). PDF already has all the necessary components (and many nice-to-have optional ones) for doing this – but it’s all about standardizing how it gets done and then building the tooling…

Leonard

Juliana Freire

Nov 15, 2010, 10:05:28 PM
to beyond-...@googlegroups.com, Juliana Freire, Claudio Silva
Hi All,

I thought I would mention some of the work we are doing on reproducible publications in the context of the VisTrails project.

Some background: VisTrails (http://www.vistrails.org) is an open-source data analysis and visualization tool that combines and extends features of scientific workflows and visualization systems. A distinguishing feature of VisTrails is its provenance infrastructure: VisTrails maintains provenance of data products (e.g., visualizations, plots), of the workflows that derive these products and their executions.

We have developed a VisTrails package that allows the publication of reproducible results. You can see a video that demonstrates some of its features at:

Essentially, as you explore data and create visualizations, VisTrails captures all the steps transparently. Once you get a result you like, you can 'publish' it in different ways. The video shows how this is done for a LaTeX document, a wiki, and a PowerPoint presentation.

We have also integrated this capability with CrowdLabs (http://www.crowdlabs.org), a social Web site where users can share not only their results, but the specifications of the analysis that derived the results and their provenance. Through CrowdLabs, it is also possible to publish mashups that allow users to interactively manipulate the results (e.g., try different parameters) without having to install and run VisTrails on their desktop. For an example, see http://www.crowdlabs.org/vistrails/medleys/details/24

VisTrails and the reproducible publication package run on Mac, Linux and Windows.

Best,
Juliana

Paul Groth

Nov 14, 2010, 2:49:26 PM
to beyond-...@googlegroups.com
I agree with Steve and Tim here: while some metadata can and should be
stored with the pdf, not all of it can be.

In particular, I'm thinking about the provenance of the work. Provenance
by its very nature goes beyond the pdf itself and can be much, much
bigger than the pdf. For example, we've done some work where we maintain
a reproducible representation of the results of an astronomy workflow by
maintaining a virtual machine image along with the workflow itself.
Obviously, this is an extreme case, but I doubt people are going to want
to embed a 3GB virtual machine image in their pdfs.

Essentially, we need both embedding in the pdf and linking to the
outside and we need some nice guidance for how to do this.

cheers,
Paul


Jill Mesirov

Nov 29, 2010, 9:13:06 AM
to beyond-...@googlegroups.com
With respect to provenance, I would encourage you all to review Atul
Butte's recent piece in Nature Biotech (Nat Biotechnol. 2010
Nov;28(11):1181-5), where he discusses leveraging cloud resources for
this process.
Best,
J