progress update... and some suggestions re. SWORD


Katherine Fletcher

Sep 16, 2012, 5:21:47 PM
to dataflo...@googlegroups.com, dataflo...@googlegroups.com

Hello!

 

I would like to publicly thank our developers for finding time to keep working towards a release: it has been a rough few weeks for the DataFlow team, with changed priorities for us all.  Thanks for keeping at it, guys.

 

We're getting there... please see this message on the dataflow-devel mailing list for a progress update from Thursday’s hackathon. 

 

In the meantime, the irreplaceable Ben O’Steen has written in with an update (and some tips) on SWORD-related questions.

 

Thanks,

 

Katherine… and now over to Ben:

 

 

I'm not sure what the root of the current SWORD errors is. I know that there are issues with the authentication libraries, but this may have nothing to do with the errors you are experiencing. Richard or Anusha (when she is back) may be able to tell you more.

 

My first task for Oxford is to migrate the DataBank codebase to use Django instead of Pylons. I've hit a dead end experimenting with handling large uploads with Pylons and can't see a real fix. Add in that the community and momentum around Django are comparatively huge, and it makes a lot of sense to do this. As part of this migration, I'll be (internally) re-plumbing in the SWORD interface. I've allocated days ahead to this (around 0.6FTE as I have other work on at the moment), and project that in around 5 weeks, I'll be done on both these tasks. It should be ready for testing in 3 to 4 weeks depending.

 

As for the use of the SWORD2 interface, you might find some of the HTTP tests useful, at least as examples, rather than explanations for why certain URIs are used [1]. I'll try to keep it short:

 

1 - https://github.com/swordapp/python-client-sword2/blob/master/tests/http/test_sss.py

 

Basics: A SWORD interface exposes a service document, that lists the workspaces the service manages and the collections within those workspaces. Given acceptable authentication/authorisation, a user can create, amend, read and delete containers (think 'labelled bags of content') in these collections. A container is represented by an Atom document - essentially a list of resources, with some top-level metadata (title, attributions, provenance, etc) about the container. You can add, read, delete and update the resources held by this container and typically these resources are called payloads in the python client. These are the files in your dataset for example.
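To make the workspace/collection layout concrete, here is a minimal sketch that lists the collections found in a service document using only the standard library. The sample XML, its titles and its example.org URLs are made up for illustration; a real server's document will carry more elements, and this is not how the python client itself parses it.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Atom Publishing Protocol service documents.
APP = "{http://www.w3.org/2007/app}"
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical service document, invented for this example.
SAMPLE_SERVICE_DOC = """
<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Main Site</atom:title>
    <collection href="http://example.org/col/MyCollection">
      <atom:title>MyCollection</atom:title>
    </collection>
    <collection href="http://example.org/col/Other">
      <atom:title>Other</atom:title>
    </collection>
  </workspace>
</service>
""".strip()


def list_collections(service_doc_xml):
    """Return {workspace title: [(collection title, collection href), ...]}."""
    root = ET.fromstring(service_doc_xml)
    result = {}
    for ws in root.findall(APP + "workspace"):
        ws_title = ws.findtext(ATOM + "title", default="")
        result[ws_title] = [
            (col.findtext(ATOM + "title", default=""), col.get("href"))
            for col in ws.findall(APP + "collection")
        ]
    return result


collections = list_collections(SAMPLE_SERVICE_DOC)
for ws_title, cols in collections.items():
    for col_title, href in cols:
        print(ws_title, "/", col_title, "->", href)
```

The workspace and collection titles here are the same kind of names that are later passed as 'workspace' and 'collection' when creating a container.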

 

So, long story short (w/ python code):

 

- get the service document from the server

 
 
 

from sword2 import Connection, Entry

conn = Connection("http://example.org/service-doc")
conn.get_service_document()  # fetch and parse the service document

 

- Create an Entry (the Atom entry for the container)

 

e = Entry(title="My Dataset Title", id="IMPORTANTID", dcterms_appendix="blah blah", dcterms_title="Dataset Title")

 

(NB parameters here correspond to the Atom namespace - title, author, id, etc. Those prefixed with 'dcterms_' are put into the dcterms XML namespace as part of the Entry document.)

 

Additional fields can be added later, and from additional namespaces (using the same underscore syntax as before):

 

 
 
e.register_namespace("oxds", "http://databank.ox.ac.uk/terms/")
e.add_field("oxds_whatever", "whatever")
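As an illustration of the underscore convention only (not the sword2 library's actual implementation), the sketch below shows how 'prefix_name' keys could be turned into namespaced XML elements, with unprefixed keys landing in the Atom namespace. The build_entry helper and the NAMESPACES table are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Prefix -> namespace URI. The 'oxds' URI is the one registered above.
NAMESPACES = {
    "dcterms": "http://purl.org/dc/terms/",
    "oxds": "http://databank.ox.ac.uk/terms/",
}


def build_entry(**fields):
    """Build an Atom <entry>; 'prefix_name' keys go into that prefix's namespace."""
    entry = ET.Element("{%s}entry" % ATOM_NS)
    for key, value in fields.items():
        prefix, sep, local = key.partition("_")
        if sep and prefix in NAMESPACES:
            tag = "{%s}%s" % (NAMESPACES[prefix], local)
        else:
            tag = "{%s}%s" % (ATOM_NS, key)  # plain Atom field such as title or id
        ET.SubElement(entry, tag).text = value
    return entry


entry = build_entry(title="My Dataset Title", id="IMPORTANTID",
                    dcterms_title="Dataset Title", oxds_whatever="whatever")
print(ET.tostring(entry, encoding="unicode"))
```

Note that the Atom title and the dcterms title end up as two distinct elements, which is why both can be given in the same Entry.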
 

 

- Upload the Entry with an attached payload (your dataset for example)

 

with open(PACKAGE, "rb") as pkg:  # binary mode, since the payload is sent as raw bytes
    resp = conn.create(payload=pkg,
                       metadata_entry=e,
                       mimetype="MIMETYPE/HERE+PLEASE",
                       filename="FILENAME_HERE.PLEASE",
                       packaging='http://purl.org/net/sword/package/Binary',
                       workspace='Main Site',
                       collection='MyCollection',
                       in_progress=True)

assert resp.code == 201  # check that the deposit created a new resource
 

 

For a zipped package, the mimetype would be 'application/zip' and, I believe, the packaging would be 'http://purl.org/net/sword/package/SimpleZip', although I do not know if this is the correct form for the DataStage->Databank 'BagIt' upload. Anusha or Richard will have to clarify this step.
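For the zipped case, a package file has to exist before it can be uploaded. The following is a rough sketch of producing one from a dataset directory with the standard library; the make_zip_package helper, directory layout and file names are invented for the example, and whether a plain zip like this matches what Databank expects for a BagIt upload is exactly the open question above.

```python
import os
import tempfile
import zipfile


def make_zip_package(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder, _subdirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                zf.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path


# Demo on a throwaway directory with made-up file names.
work_dir = tempfile.mkdtemp()
dataset_dir = os.path.join(work_dir, "dataset")
os.makedirs(os.path.join(dataset_dir, "data"))
with open(os.path.join(dataset_dir, "data", "readings.csv"), "w") as f:
    f.write("a,b\n1,2\n")
with open(os.path.join(dataset_dir, "README.txt"), "w") as f:
    f.write("example dataset\n")

# The zip is written next to the dataset directory, not inside it,
# so the archive never tries to include itself.
package_path = make_zip_package(dataset_dir, os.path.join(work_dir, "dataset.zip"))
with zipfile.ZipFile(package_path) as zf:
    package_names = sorted(zf.namelist())
print(package_names)
```

The resulting file would then be opened in binary mode and passed as the payload, with the mimetype and packaging values mentioned above.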

 

 

 

 

Ben O'Steen

Sep 17, 2012, 7:11:02 AM
to Katherine Fletcher, dataflo...@googlegroups.com, dataflo...@googlegroups.com


Hi,

To all those on the list, the included message on the previous email was from me to a private group of developers - the section on SWORD2 was off the cuff and not checked by others who are actively working on that side of the development. As such, it really should be taken with a *large* pinch of salt until Richard Jones or Anusha reviews it to see if it tallies with reality.

Also, the message included a number of out of context replies by me as well as the formatting being completely broken. I have a draft post ready and it will be blogged once I get it checked and this post would be the preferred version.

Ben

PS I have not yet begun the previously mentioned migration from Pylons to Django, and I do not like to talk publicly about code before I've started writing it (which I am due to do tomorrow). I'll still post to the list as I was going to, once I'm underway and have a better handle on how and when (testable) features might appear.

tl;dr I do not like to "announce vapourware"

--
 
 

Richard Jones

Oct 3, 2012, 5:57:50 AM
to Ben O'Steen, Katherine Fletcher, dataflo...@googlegroups.com, dataflo...@googlegroups.com
Hi Folks,

Playing catch-up on this thread ...

I need to back-trace through the emails, and find the actual errors
that you are experiencing. If it's an issue with authentication, then
we should be able to suitably separate this from being just a SWORDv2
issue, and focus on figuring out the access control issues.

>> My first task for Oxford is to migrate the DataBank codebase to use
>> Django, instead of Pylons. I've hit a dead-end experimenting with handling
>> large uploads with pylons and can't see a real fix.

What is the issue here? I tested initially with multi-gigabyte files
over a local network without any problems.

This all looks correct, with the exception that the package type
should be the databank bagit format URI, which I'd have to check the
dox for, but which is probably something like:
http://databank.ox.ac.uk/package/DataBankBagIt.

Cheers,

Richard


--

Richard Jones,

Founder, Cottage Labs
t: @richard_d_jones, @cottagelabs
w: http://cottagelabs.com

Graham Klyne

Oct 4, 2012, 7:41:40 AM
to Richard Jones, Ben O'Steen, Katherine Fletcher, dataflo...@googlegroups.com, dataflo...@googlegroups.com
Richard,

We isolated a test case for the problem we have been seeing.

https://github.com/dataflow/DataStage/blob/ecbb856b5dddf350c40345191724aece99148c25/test/FileShare/tests/TestDatasetSubmission.py

It would help if
(a) you can tell us if we're doing anything wrong in our use of the SWORD client
library, and
(b) indicate whether or not the test case is something you can use to isolate
the problem.

Thanks.

#g
--


On 03/10/2012 10:57, Richard Jones wrote:
> Hi Folks,
>
> Playing catch-up on this thread ...
>
> On 17 September 2012 12:11, Ben O'Steen <bos...@gmail.com> wrote:
>>
>>> mailing list for a progress update from Thursday’s hackathon.
>>>
>>>
>>>
>>> In the meantime, the irreplaceable Ben O’Steen has written in with an
>>> update (and some tips) on SWORD-related questions.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Katherine… and now over to Ben:
>>>
>>>
>>>
>>>
>>>
>>> I'm not sure what the root of the current SWORD errors are. I know that
>>> there are issues with the authentication libraries but this may have nothing
>>> to do with the errors you are experiencing. Richard or Anusha (when she is
>>> back) may be able to tell you more.
>
> I need to back-trace through the emails, and find the actual errors
> that you are experiencing. If it's an issue with authentication, then
> we should be able to suitably separate this from being just a SWORDv2
> issue, and focus on figuring out the access control issues.
>
>>> My first task for Oxford is to migrate the DataBank codebase to use
>>> Django, instead of Pylons. I've hit a dead-end experimenting with handling
>>> large uploads with pylons and can't see a real fix.
>
> What is the issue here? I tested initially with multi-gigabyte files
> over a local network without any problems.
>

Richard Jones

Oct 5, 2012, 4:37:59 AM
to Graham Klyne, Ben O'Steen, Katherine Fletcher, dataflo...@googlegroups.com, dataflo...@googlegroups.com
Hi Graham,

Thanks for this. The use of the client library looks correct, so I'd
need to see if I can replicate this problem locally. I'll try to do
this over the next few days and let you know how I get on. From what
I can gather from the conversation, though, this is a
server-side issue rather than a client issue, and we need to figure
out why retrieving service documents fails after some number of
successful attempts. To me this sounds like either an environmental
issue or some subtlety of the relationship between the sword code and
the pylons framework.

I'll let you know what I find out as soon as I can.
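One way the failure mode described above (service document retrieval failing after some number of successful attempts) could be probed is with a small repeat-until-failure harness. This is only a sketch: first_failure and FlakyServer are hypothetical names, and the stand-in server simply fails after a fixed count. In real use the fetch callable would be whatever call retrieves the service document from the server under test.

```python
def first_failure(fetch, attempts):
    """Call fetch() up to `attempts` times.

    Return (attempt number, exception) for the first call that raises,
    or None if every call succeeds.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch()
        except Exception as exc:
            return attempt, exc
    return None


class FlakyServer:
    """Stand-in for a server that fails after a fixed number of good responses."""

    def __init__(self, good_responses):
        self.remaining = good_responses

    def get_service_document(self):
        if self.remaining <= 0:
            raise IOError("service document unavailable")
        self.remaining -= 1
        return "<service/>"


server = FlakyServer(good_responses=3)
result = first_failure(server.get_service_document, attempts=10)
print(result)
```

Recording the attempt number at which the first failure occurs across several runs would show whether the count is stable, which may help separate an environmental limit from a framework issue.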

Cheers,

Richard