Using Dataverse as a Middleware for Data Management


Goktug

Mar 17, 2022, 11:38:42 AM
to Dataverse Users Community
Hey,

I've asked this question to Dataverse support, but it may benefit from a free discussion here as well. I've been asked to employ a middleware that organizes microscopy images (OME / OME-TIFF) and then pushes them to an OMERO server, and I'm asked to utilize Dataverse for this purpose. I'm aware that Dataverse is a data repository at its core, but is there an API capability that would allow Dataverse to push data into a database?

Thanks,
 

Philip Durbin

Mar 18, 2022, 4:52:44 PM
to dataverse...@googlegroups.com
Dataverse has a feature called "workflows" that allows you to run a script or other code when datasets are published: https://guides.dataverse.org/en/5.10/developers/workflows.html

So in that sense, yes, you could write a script to push data into a database when a dataset is published in Dataverse.

I hope this helps,

Phil
 


Goktug

Mar 22, 2022, 12:26:10 PM
to Dataverse Users Community
Hi Philip,

Thanks, this is exactly what I was searching for. The documented workflow implementations for pushing data are aimed at archivers (Archivematica, DuraCloud / Chronopolis). Are there workflow classes defined for other cloud storage services, such as OpenStack/Swift?

Best Regards,

James Myers

Mar 22, 2022, 12:45:54 PM
to dataverse...@googlegroups.com

FWIW: The archiver workflows all transfer zipped Bags containing all of the data files and an OAI-ORE JSON-LD metadata file. There is an S3Archiver class in development now.

 

The broader workflow mechanism isn't limited to working with a Bag, though. For example, the http/sr and http/authext steps can make a call to an external application, which can then use the Dataverse API to retrieve metadata, get individual files, etc. The authext step also gives your application a key that can be used to access the Dataverse API and, for example, add metadata pointing to your external image source before the dataset version is published. (Your workflow can run before publication proceeds or after publication is done.)
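As a hypothetical sketch of what the external application might do with that key: it could pull the draft dataset's metadata through the native API, sending the key in the X-Dataverse-key header. The base URL and DOI below are placeholders.

```python
import json
import urllib.parse
import urllib.request

DATASET_ENDPOINT = "/api/datasets/:persistentId/"

def build_metadata_url(base_url, persistent_id):
    # Construct the native-API URL for a dataset's metadata.
    query = urllib.parse.urlencode({"persistentId": persistent_id})
    return base_url.rstrip("/") + DATASET_ENDPOINT + "?" + query

def fetch_dataset_metadata(base_url, persistent_id, api_token):
    # The authext step hands the external app a key; it is sent in the
    # X-Dataverse-key header on every native-API call.
    req = urllib.request.Request(
        build_metadata_url(base_url, persistent_id),
        headers={"X-Dataverse-key": api_token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]
```

The same pattern (different endpoint paths) covers fetching individual files or adding metadata before publication.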

 

-- Jim

Philip Durbin

Mar 22, 2022, 2:34:59 PM
to dataverse...@googlegroups.com
> Are there workflow classes that are defined for other cloud storage services, such as OpenStack/Swift?

No, but I'd like to re-emphasize what Jim was saying about the flexibility of the workflow mechanism. When you say "classes" I assume you're referring to the ":ArchiverClassName" database setting that references a Java file. You could certainly contribute a Java file for OpenStack or Swift that fits in with the archiver use case, but you don't have to use Java at all.

Originally, workflows were added because a user wanted to move files around on a filesystem when datasets were published. He built a RESTful webservice in Python (that does the actual moving) and created a workflow in Dataverse to send information across the wire on publish (the dataset title, DOI, etc.). You can see the full workflow at https://github.com/IQSS/dataverse/blob/v5.10/scripts/api/data/workflows/internal-httpSR-workflow.json but I'll paste below just the part about communication with the other service to give you an idea of how it works:

"parameters": {
    "url":"http://localhost:5050/dump/${invocationId}",
    "method":"POST",
    "contentType":"text/plain",
    "body":"${invocationId}\ndataset.id=${dataset.id} /\ndataset.identifier=${dataset.identifier} /dataset.globalId=${dataset.globalId} /\ndataset.displayName=${dataset.displayName} /\ndataset.citation=${dataset.citation} /\nminorVersion=${minorVersion} /\nmajorVersion=${majorVersion} /\nreleaseCompleted=${releaseStatus} /",
    "expectedResponse":"OK.*",
    "rollbackUrl":"http://localhost:5050/dump/${invocationId}",
    "rollbackMethod":"DELETE"
}

You can read more about this workflow at https://guides.dataverse.org/en/5.10/developers/big-data-support.html#repository-storage-abstraction-layer-rsal but I should warn you that the docs aren't especially readable; they read more like notes on how to set it up for testing.
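For illustration, the receiving end of such an http/sr step could be a small stand-in service like the sketch below. The payload parser assumes the " /"-separated key=value format from the "body" template above, and the port matches the localhost:5050 URL in the example; both are assumptions about that particular setup.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_workflow_body(body):
    # First line is the invocation id; the rest are "key=value" pairs
    # separated by " /" (per the "body" template in the workflow JSON).
    first, _, rest = body.partition("\n")
    fields = {}
    for chunk in rest.split(" /"):
        chunk = chunk.strip()
        if "=" in chunk:
            key, value = chunk.split("=", 1)
            fields[key.strip()] = value.strip()
    return first.strip(), fields

class DumpHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Dataverse POSTs the rendered body template on publish.
        length = int(self.headers.get("Content-Length", 0))
        invocation_id, fields = parse_workflow_body(
            self.rfile.read(length).decode())
        # ... move files, notify another service, etc., using e.g.
        # fields["dataset.globalId"] ...
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")  # matches expectedResponse "OK.*"

    def do_DELETE(self):
        # Hit via rollbackUrl/rollbackMethod if publication fails.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

# To run: HTTPServer(("localhost", 5050), DumpHandler).serve_forever()
```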

Anyway, the point is that workflows can be pretty flexible. If you want to create a Java archiver class, great. But you aren't limited to this.

Please keep the questions coming!

Phil
 

Goktug

Mar 25, 2022, 11:39:49 AM
to Dataverse Users Community
Hey,

Thanks for all the info you guys have shared. I'm reading through the guide and getting more familiar with it. One thing interests me: there is a Data Migration API that supports pulling data into Dataverse. Is there a counterpart to that API that does the opposite? I'm asking because it would mean building on an already established API.

Best,

James Myers

Mar 25, 2022, 1:11:07 PM
to dataverse...@googlegroups.com

Migrate works at the dataset level (which can contain many files) and is intended to let you transfer ownership of a dataset previously published elsewhere to Dataverse: keeping the original publication date, keeping the assigned DOI/PID, and having Dataverse update the DOI landing page to redirect to the copy of the dataset now in Dataverse, etc. There isn't a direct opposite in the API, but the OAI-ORE metadata export contains the metadata needed to feed the migrate API, and the archival Bags we've talked about before are basically the dataset serialized into a zip file that could then be used (e.g. via DVUploader and the migrate API) to recreate the original dataset in a new Dataverse instance. (Dataverse also supports Harvesting, which allows one Dataverse to list datasets that are maintained in a different repository; more info in the Guides.)
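As a sketch of the export side Jim mentions, the OAI-ORE metadata map for a dataset can be pulled from the metadata export endpoint; the base URL and DOI here are placeholders.

```python
import json
import urllib.parse
import urllib.request

def build_ore_export_url(base_url, persistent_id):
    # Metadata export endpoint; exporter=OAI_ORE selects the JSON-LD
    # map that pairs with the archival Bag / migration workflow.
    query = urllib.parse.urlencode(
        {"exporter": "OAI_ORE", "persistentId": persistent_id})
    return base_url.rstrip("/") + "/api/datasets/export?" + query

def fetch_ore_map(base_url, persistent_id):
    with urllib.request.urlopen(
            build_ore_export_url(base_url, persistent_id)) as resp:
        return json.load(resp)
```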

 

At the file level, if the workflow mechanism doesn't do what you need, you might be interested in Dataverse's internal concept of 'stores' and the work in progress to create a store that can reference datafiles that are remote/managed in a separate system. The 'store' concept originally allowed Dataverse to have one upload mechanism but then redirect files for storage in different types of underlying stores, i.e. a file system, or any number of different types of systems accessible via the S3 object-store protocol. The S3 store has since been upgraded with options such that, when you initiate a file upload to Dataverse, Dataverse can provide you/your browser/app with a direct (signed) URL to the S3 store so that the transfer does not need to go through the Dataverse server itself. (Same for downloads: the client can be redirected to get the file directly from the S3 bucket.) The work in progress adds a new type of store where the datafile remains at a remote location, accessible via URL; upload is assumed to have been done independently, and Dataverse will redirect to the URL for download.
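With an S3 store configured for direct upload, a client first asks Dataverse for the signed URL(s) before sending any bytes. A rough sketch of that request, assuming the direct-upload API's uploadurls endpoint (check the guides for your version; values below are placeholders):

```python
import urllib.parse

def build_upload_urls_request(base_url, persistent_id, size_bytes):
    # Asks Dataverse for presigned S3 URL(s) for a file of the given
    # size; large files may get multipart-upload URLs instead of one.
    query = urllib.parse.urlencode(
        {"persistentId": persistent_id, "size": size_bytes})
    return (base_url.rstrip("/")
            + "/api/datasets/:persistentId/uploadurls?" + query)
```

The client then PUTs the file bytes to the returned URL(s) directly, bypassing the Dataverse server.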

 

If OMERO supports S3, you might be able to configure Dataverse to use it via the S3 store. If you have a way to upload files to OMERO separately, you might be able to use the planned 'HTTPOverlay' store to let Dataverse reference the file rather than uploading it into Dataverse. Other variants could be possible, e.g. users upload to Dataverse, you use a workflow (as discussed previously) or another mechanism to transfer the relevant files to OMERO, and you use the planned HTTPOverlay store to reference the remote copy (possibly deleting the copy in Dataverse). (Developing a native OMERO store for Dataverse could be a possibility as well, but that would be significantly more involved.)

 

Another fairly lightweight possibility would be to look at Dataverse's External Tools framework/Previewers. These are registered per MIME type. A simple example is a video player: a Previewer for mpeg can be registered which adds a button in the Dataverse UI to 'preview' that type of file. Clicking the button redirects you to a separate app, in this case one that embeds the HTML5 video player and pipes the video file to it from Dataverse so you can watch the video. The Previewers developed to date don't keep any state, i.e. they retrieve the file from Dataverse every time someone invokes the previewer, but an OMERO previewer could do something like copy the file to OMERO on the first view and refer to that copy on all subsequent views. (Assuming OMERO provides a web interface to view a file, a previewer might do nothing more than make sure OMERO has a copy and then redirect to OMERO for viewing.)
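A previewer of that kind is registered with a small JSON manifest. The sketch below shows the general shape; the tool name, URL, and content type are hypothetical, and the field names should be checked against the external-tools guide for your Dataverse version.

```python
import json

# Hypothetical OMERO previewer manifest. {fileId}, {siteUrl}, and
# {apiToken} are reserved words that Dataverse substitutes when the
# tool is launched; everything else here is a placeholder.
OMERO_PREVIEWER = {
    "displayName": "View in OMERO",
    "description": "Open this image in an OMERO viewer.",
    "scope": "file",
    "type": "preview",
    "toolUrl": "https://example.org/omero-previewer",
    "contentType": "image/tiff",
    "toolParameters": {
        "queryParameters": [
            {"fileid": "{fileId}"},
            {"siteUrl": "{siteUrl}"},
            {"key": "{apiToken}"},
        ]
    },
}

# An admin would POST this JSON once to /api/admin/externalTools.
print(json.dumps(OMERO_PREVIEWER, indent=2))
```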

 

As you can see, there are lots of options, and which one is best probably depends on the details of your use case: whether you want to use Dataverse's existing upload mechanisms, whether you want OMERO to hold a copy or be the only store for the relevant files, etc.

Goktug

Mar 28, 2022, 3:31:50 AM
to Dataverse Users Community
Hey Jim,

Thanks for the detailed answer. OMERO is currently experimenting with S3, if I'm not mistaken. The following links, especially the one showing an OME-Zarr image stored on S3 being viewed in Fiji, may be relevant. I will take a more thorough look into this, as it might be a more elegant way to move data into OMERO.


Goktug

Mar 28, 2022, 4:39:12 AM
to Dataverse Users Community
I was mistaken, as there's an OMERO on AWS deployment: 

goktug

Apr 1, 2022, 5:58:40 AM
to dataverse...@googlegroups.com
I'd like to ask another question: is it possible to manipulate the contents of a .csv file that has been uploaded to and ingested by Dataverse? I've seen the sample population data Python script and I'm wondering if it can be extended.


Philip Durbin

Apr 1, 2022, 3:13:32 PM
to dataverse...@googlegroups.com
Hi, I'm confused. When you say you want to edit the contents of a .csv file, it reminds me of this "editing text files" issue at https://github.com/IQSS/dataverse/issues/3104 . Are you looking for an in-browser spreadsheet editor, sort of like Google Sheets, but for Dataverse files? We don't have anything like this, but I suppose someone could build a "configure" (as opposed to "explore") external tool for it: https://guides.dataverse.org/en/5.10/api/external-tools.html

In that issue I explained the supported approach, which is to edit the .csv file locally and then use the "Replace File" feature in Dataverse to replace it: https://guides.dataverse.org/en/5.10/user/dataset-management.html#replace-files
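The same replacement can also be scripted: the file replace API takes the new file bytes plus a jsonData form field. A small sketch of building that field (the description value is a placeholder; forceReplace is needed when the new file's content type differs from the old one):

```python
import json

def build_replace_json_data(description=None, force_replace=False):
    # The jsonData form field sent alongside the new file bytes in a
    # POST to /api/files/$FILE_ID/replace.
    payload = {"forceReplace": force_replace}
    if description is not None:
        payload["description"] = description
    return json.dumps(payload)
```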

If you're talking about something else, please set me straight!

Also, what sample population data Python script? Can you please provide a link?

Thanks,

Phil

Goktug

Apr 4, 2022, 3:35:17 AM
to Dataverse Users Community
Hi Phil,

My question was similar to the GitHub issue that you've shared, but as I read through I came across the Direct DataFile Upload/Replace API, as you've mentioned. We will be deploying Dataverse on an S3 bucket, so that should work. OMERO has an image annotation structure that requires the user to create a .csv file, which matches the dataset with the image filenames. Later on, another .csv is required as a "key-pairing" that provides the microscopy metadata. These processes are tedious, so I was looking for a way to automate them. My proposal is to load the images as a draft, use the Dataset Curation Label API to label the files, and have that trigger the Direct DataFile Upload/Replace API.
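For the curation-label step of that pipeline, the label is set with a PUT to the dataset's curation status endpoint. A rough sketch of the URL construction (base URL, DOI, and label value are placeholders; the label must come from the installation's configured allowed set, and the exact parameter names should be checked against the API guide):

```python
import urllib.parse

def build_curation_label_url(base_url, persistent_id, label):
    # A PUT to this URL assigns a curation label to the draft version.
    query = urllib.parse.urlencode(
        {"persistentId": persistent_id, "label": label})
    return (base_url.rstrip("/")
            + "/api/datasets/:persistentId/curationStatus?" + query)
```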


Best,

Philip Durbin

Apr 4, 2022, 10:59:45 AM
to dataverse...@googlegroups.com
Sounds good. Obviously, if you hit any stumbling blocks, please let us know.

I believe Harvard Medical School is using OMERO so your work may be of interest to them. You can see an old issue about this here: https://github.com/IQSS/dataverse/issues/2247



Goktug

Apr 10, 2022, 10:03:12 AM
to Dataverse Users Community
Well, this is stretching Dataverse a bit, but the research group would like to connect OMERO to the central user identification system over LDAP. Is it possible to pass these user identities from Dataverse over to OMERO, instead of maintaining two different user management setups? There is talk of using Shibboleth for OMERO, but apparently it hasn't been tested before and they are not very enthusiastic about it.

Philip Durbin

Apr 11, 2022, 10:17:16 AM
to dataverse...@googlegroups.com
If you need to connect Dataverse to an LDAP server, I believe the best way is to run your own Identity Management (IDM) server that connects to the LDAP server and connects to Dataverse over OIDC: https://guides.dataverse.org/en/5.10.1/installation/oidc.html
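For reference, registering an OIDC provider with Dataverse is a one-time POST of a JSON payload to /api/admin/authenticationProviders; the sketch below uses placeholder issuer and client values.

```python
import json

# Placeholder issuer/client values; the factoryData string follows
# the "key: value | key: value" format described in the OIDC guide.
OIDC_PROVIDER = {
    "id": "idm-oidc",
    "factoryAlias": "oidc",
    "title": "Institutional login",
    "subtitle": "",
    "factoryData": ("type: oidc"
                    " | issuer: https://idm.example.org/realms/main"
                    " | clientId: dataverse"
                    " | clientSecret: CHANGE_ME"),
    "enabled": True,
}
print(json.dumps(OIDC_PROVIDER, indent=2))
```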

Of course, sometimes people say LDAP when they mean Microsoft or ADFS. It's possible to connect Dataverse to these as well:

- Microsoft: https://guides.dataverse.org/en/5.10.1/installation/oauth2.html

I hope this helps,

Phil