Sarah
we are moving to DOIs for myExperiment
but there is an immediate issue here with persistent URLs
Finn, Don and Sean -- this is urgent
Carole
Dear Carole,
We have just got the reviewers' comments back for our paper, and
one of them was:
'I was unable to access the workflows at the supplied URLs (ll.147 &
149) to evaluate the details of the workflows. I would suggest
providing PURLs to be able to move web-content in the future without
breaking the link between the information in the manuscript and the
website.'
Now everybody is discussing this and there does not seem to
be a consensus, see the emails below. Do you have a simple
solution for this?
Many thanks,
Sarah & Sonja
Dr. Sarah J. Bourlat
Department of Biological and Environmental Sciences
University of Gothenburg
Box 463
SE-405 30 Göteborg
Sweden
Mobile: +46 (0)702147811
Office: +46 (0)317863827
Begin forwarded message:
Subject: Re: URLs of workflows for publication
Date: November 11, 2013 11:24:00 AM GMT+01:00
We are discussing this now (as in this second) on the
myExperiment skype call.
Alan
On 11/11/2013 10:20, Robert Haines wrote:
Hi all,
I've added Alan to the CC list.
To keep my reply short for now, I think we should be looking at DOIs
for this (http://www.doi.org/).
DOI is an ISO standard, libraries use them, publications have them and
we should be treating our published workflows like published papers.
But I do mean only *published* workflows. Assigning a DOI would be part
of the formal publishing process for our workflows and would get us in
the mindset of versioning things more carefully.
The SEEK platform can (or soon will be able to) assign DOIs and
register them so we will have experience of working with them in the
myGrid team soon.
Rob
On 08/11/2013 17:04, Francisco Quevedo wrote:
Hi Sarah,
Good, constructive feedback from the reviewer, although, as I will
explain later on, I'm not sure that even PURLs can solve every problem
of this type that we might have. For that reason, I've just added Jon,
Abraham and Rob to the conversation so they can give us their opinion.
To be honest, I didn't know anything about PURLs, but after reading a
little I can see how they would let us give our users a permanent URL
(PURL) for the content we want them to access, while leaving us free to
move the place where the content really lives without changing the URL
that we gave to our users.
According to its definition:
"A /Persistent/ URL is an address on the World Wide Web that points to
other Web resources. If a Web resource changes location (and hence
URL), a PURL pointing to it can be updated. A user of a PURL always
uses the same Web address, even though the resource in question may
have moved."
http://purl.oclc.org/docs/help.htm
Technically this can be achieved by using a PURL server (basically a
URL resolver): our users send their requests to the PURL server and,
depending on the path, it automatically redirects each request to where
the content is. This type of architecture is used by, amongst others,
the U.S. Government Printing Office to provide stable URLs to online
Federal information (http://purl.fdlp.gov/docs/index.html).
As an example, imagine that we published a paper in which we say that
the BioVeL workflows can be accessed in the BioVeL portal at the
address "http://tavlite1.biovel.eu/workflows". Then, after a while, we
decide that this name is not suitable and we change it to
"http://portal.biovel.eu".
If we had then shut down the tavlite1 server, a reader of our paper
would not be able to access the workflows described in the paper,
because the URL we gave them would no longer exist. I should say that
this is not the situation we have, because we have not shut down
tavlite1. But I can see that if, instead of the specific address, we
had given them a PURL address, e.g.
http://purl.biovle.eu/portal/workflows, they would still be able to
access the workflows whether or not we had moved the machine. For
example, initially the PURL server would map that address to
"http://tavlite1.biovel.eu/workflows", so any user who types
http://purl.biovle.eu/portal/workflows would be redirected to
http://tavlite1.biovel.eu/workflows. If we change the portal URL, it is
enough to update the PURL entry, and the users still have access by
typing the same PURL address.
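As a rough illustration of the mapping described above, here is a
minimal Python sketch of what a PURL resolver does. It is not how any
real PURL server is implemented: the table, the resolve() function and
the target "http://portal.biovel.eu/workflows" after the move are all
assumptions made for the example.

```python
from typing import Optional, Tuple

# Hypothetical PURL table: persistent path -> current location of the content.
PURL_TABLE = {
    "/portal/workflows": "http://tavlite1.biovel.eu/workflows",
}


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location) for a request sent to the PURL server."""
    target = PURL_TABLE.get(path)
    if target is None:
        return 404, None  # no PURL registered for this path
    return 302, target    # redirect the user to where the content is now


# Before the move: the PURL points at tavlite1.
before = resolve("/portal/workflows")

# The portal moves. Only the table entry changes; the PURL printed in the
# paper stays the same. (The new target path is assumed for illustration.)
PURL_TABLE["/portal/workflows"] = "http://portal.biovel.eu/workflows"
after = resolve("/portal/workflows")
```

The reader always asks for the same path; only the entry behind it is
edited when the content moves.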
However, I want to say that we have achieved something similar in the
new portals that we have deployed in Amazon, by using DNS. Jon and Rob
have set up a DNS in Amazon (Amazon Route 53) that directs the users'
requests to the exact machine where we have our applications. For
example, when we type "http://portal.biovel.eu" the DNS resolves that
name and directs the request to "https://portal1.at.biovel.eu/", which
is the real machine (well, not exactly, but for our case imagine it
is). So basically the DNS server is doing something similar to what the
PURL server would do. With this DNS it is easy to change the machine to
a different one. Imagine we move portal1 to a more powerful machine
called portal5: by changing the entry in the DNS table, the users can
keep using the URL "http://portal.biovel.eu", but this time they will
be directed to "https://portal5.at.biovel.eu/" instead of portal1.
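The same indirection can be pictured as a name lookup. The Python
sketch below is only a toy model of the DNS table, not the Route 53
API; DNS_TABLE and machine_for() are names invented for the example.
Note that, unlike a PURL server, real DNS only maps host names and
never sees the path part of a URL.

```python
# Hypothetical alias table playing the role of the DNS entries:
# public name -> machine that currently hosts the portal.
DNS_TABLE = {
    "portal.biovel.eu": "portal1.at.biovel.eu",
}


def machine_for(public_name: str) -> str:
    """Return the machine that a public name currently points to."""
    # Names without an entry are assumed to be machine names already.
    return DNS_TABLE.get(public_name, public_name)


first = machine_for("portal.biovel.eu")

# Move the portal to the more powerful machine by changing one entry.
DNS_TABLE["portal.biovel.eu"] = "portal5.at.biovel.eu"
second = machine_for("portal.biovel.eu")
```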
Anyway, why did I say at the beginning of this email that even with
PURLs or DNS we cannot solve all the problems? Well, I don't know the
specific case of the URLs (ll.147 & 149) in the paper. What were they?
But I can imagine the following scenario in which a URL becomes
inaccessible even if we have used PURLs or DNS: basically, when the
resource is deleted instead of moved. Imagine the following:
1) Renato publishes version 18 of his ENM workflow in myExperiment,
and after the evaluation process (a process that we haven't fully
defined yet) the workflow passes in myExperiment from the BioVeL
internal group to the BioVeL group, making it publicly available on the
23rd of October. At that point, the workflow is accessible at
http://www.myexperiment.org/workflows/3355.html for anyone who wants to
see it or download it.
2) Then let's say that on the 25th of October Matthias adds this
workflow from myExperiment to the BioVeL portal as one of his private
workflows. Every time a new workflow is added to the portal, the portal
gives it a unique workflow id. Let's suppose the portal gives the new
workflow the id 87, so the workflow can be reached at
"https://portal1.at.biovel.eu/workflows/87" or its equivalent
"permanent" address "http://portal.biovel.eu/workflows/87". Matthias
then spends some days testing the workflow in the portal and, let's
say, on the 1st of November he decides to make it public. Now any user
who enters the portal will see that workflow in the list of public
workflows and can run it. Around that time Matthias also writes a paper
in which he says that the results shown in that paper were obtained by
executing the workflow "https://portal1.at.biovel.eu/workflows/87".
Here is my first point: the URL we should have written in the paper is
the "permanent" one, "http://portal.biovel.eu/workflows/87", and not
the "temporary" one, "https://portal1.at.biovel.eu/workflows/87",
because if tomorrow, for example, we move the machine from portal1 to
portal5, then once the DNS entry (or the PURL entry, if we used PURLs)
is updated, the "permanent" URL will still be valid whereas the
temporary one will not. But let's suppose that we have written the
"permanent" URL in the paper.
3) Then imagine that on the 15th of December Renato releases a new
version of the ENM, version 19, which has some really nice new
features, like allowing the user to set the number of cross-validations
to be made depending on the number of unique occurrence points
provided. This new version is uploaded to myExperiment and after a
while it is made public. Matthias then decides to have a go with it and
test it in the portal, adding it as a new workflow in his private
workflows. He adds it as a private workflow because, until it is fully
tested, he still wants the rest of the users to see the previous
version of the ENM (v18) in the public workflows. So the portal assigns
a new id to this new workflow, let's say the id 93. After some testing,
Matthias is happy with the workflow and he makes it public. But now we
have 2 public ENM workflows in the portal, one for ENM v18
(http://portal.biovel.eu/workflows/87) and another for ENM v19
(http://portal.biovel.eu/workflows/93), so we decide to delete v18.
However, this is something we shouldn't do. Instead we should make it
private, or perhaps create a new status like "superseded" or something
like that, in which the workflow can still be run by anyone who types
its URL but is not shown in the list of public workflows. I suggest
this superseded status because I'm not sure whether, if it's a private
workflow, any user could run it by typing the URL or only its owner.
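To make the "superseded" idea concrete, here is a small Python sketch
of how such a status could behave. It is only an illustration: the
Portal and Workflow classes, the owner name and the rule that a private
workflow is visible to its owner only are assumptions, not a
description of the real portal code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    SUPERSEDED = "superseded"  # proposed status, not an existing feature


@dataclass
class Workflow:
    id: int
    title: str
    status: Status
    owner: str


class Portal:
    def __init__(self) -> None:
        self.workflows: Dict[int, Workflow] = {}

    def add(self, wf: Workflow) -> None:
        self.workflows[wf.id] = wf

    def public_list(self) -> List[int]:
        """Ids shown in the list of public workflows."""
        return sorted(
            w.id for w in self.workflows.values() if w.status is Status.PUBLIC
        )

    def fetch(self, wf_id: int, user: str) -> Optional[Workflow]:
        """Workflow reachable at /workflows/<id> for this user, or None."""
        wf = self.workflows.get(wf_id)
        if wf is None:
            return None  # deleted or never existed: nothing to point to
        if wf.status is Status.PRIVATE and wf.owner != user:
            return None  # assumption: private means owner-only
        return wf


portal = Portal()
portal.add(Workflow(87, "ENM v18", Status.PUBLIC, "matthias"))
portal.add(Workflow(93, "ENM v19", Status.PUBLIC, "matthias"))

# Instead of deleting v18, mark it as superseded.
portal.workflows[87].status = Status.SUPERSEDED
```

With this, workflow 87 disappears from the public list but the URL
written in the paper still leads to ENM v18 rather than to nothing or
to a different version.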
Anyway, if we delete the workflow from the portal, there is no way for
the users to access that workflow again, even if we use DNS or PURLs,
simply because it has been deleted and not moved, so we cannot point to
it. OK, we could add an entry in the DNS or in the PURLs so that if the
user wants workflow 87 (ENM v18) we give them workflow 93 (ENM v19),
but I think this is wrong and shouldn't be done. What happens if the
paper says that the workflow gave 10 outputs and now it only gives 7
because the new version processes them differently?
My question is: is that the situation with the URLs in the paper? Is it
because the workflow has been deleted, or because the machine has been
changed?
I have been talking with Abraham about all of this, and he thinks, and
I agree with him, that perhaps what we should reference in a paper is
where the workflow is in myExperiment, or the service in BioCatalogue,
rather than the portal, mainly for 2 reasons:
a) The portal was conceived as a pilot project to show how to run
workflows easily in a web browser environment, and although it can keep
different versions of workflows, its goal was not to act as a
repository where the workflows can be found forever. That is the aim of
myExperiment and BioCatalogue (for workflows and services
respectively).
b) MyExperiment and BioCatalogue, apart from being repositories, also
offer long-term support. In other words, after the BioVeL project
finishes we have no commitment to keep the server running for longer
than is specified in the DoW, whereas myExperiment and BioCatalogue
should still be there longer than that.
So perhaps the best thing is to say that the results of the paper were
obtained using version X of workflow M, which can be found at
http://www.myexperiment.org/workflows/XXXX.html, and that it was run
using the BioVeL portal at "http://portal.biovel.eu", and perhaps also
mention a wiki page that describes how to run it in the portal or in
the workbench. This way a reviewer or a reader of our papers will
always be able to get the workflow and, if they want, follow the
documentation and run the workflow in the environment they prefer.
However, this implies that workflows shouldn't be deleted from
myExperiment under any circumstances, unless we are 100% sure they are
not referenced anywhere. Anyway, this is only a suggestion.
To conclude, I don't know if after this long text I have managed to
clarify your doubts about PURLs or have created new ones. However, just
to sum up: by using PURLs or our current DNS system, we can partially
solve the problem of the URLs for the workflows in the BioVeL portal,
as long as the resource is moved and not deleted.
Best wishes,
Fran
On 08/11/2013 10:01, Sarah Bourlat wrote:
Dear Fran,
We just got the reviewers' comments back for one of our papers on niche
modelling, which we need to return within 2 months. One comment was:
'I was unable to access the workflows at the supplied URLs (ll.147 &
149) to evaluate the details of the workflows. I would suggest
providing PURLs to be able to move web-content in the future without
breaking the link between the information in the manuscript and the
website.'
What is a PURL, and how can we provide one to avoid breaking the link
between the information in the manuscript and the website every time a
page gets updated?
Many thanks for your help,
Sarah
Dr. Sarah J. Bourlat
Department of Biological and Environmental Sciences
University of Gothenburg
Box 463
SE-405 30 Göteborg
Sweden
Mobile: +46 (0)702147811
Office: +46 (0)317863827
Email: sarah....@bioenv.gu.se
http://www.bioenv.gu.se/english/staff/sarah-bourlat/
www.mg4u.eu
www.biovel.eu
--
Professor Carole Goble FREng FBCS CITP
School of Computer Science
University of Manchester
Manchester, UK
tel: +44 161 275 6195
email: carole...@manchester.ac.uk