Cloud Dataverse


Esther Dzale

Jun 8, 2018, 5:03:12 AM
to Dataverse Users Community
Hi,
could someone tell me the current status of the Cloud Dataverse effort? Have you implemented the federated Dataverse within the MOC, where data from any Dataverse can be replicated? Is it already possible to upload data directly to Swift storage and then publish a dataset from Swift to Dataverse?
Esther

Philip Durbin

Jun 8, 2018, 9:22:15 AM
to dataverse...@googlegroups.com
Hi Esther,

Thanks for asking. I believe we're trying to get away from the term "Cloud Dataverse," though I know we've used it in the past. There's a nice doc called "Cloud Dataverse User Stories"[1] that hasn't been touched in over a year but captures some of the ideas you're talking about. For example, "As a Curator on the Harvard Dataverse, I want to select a published dataset for replication in the cloud so that computations can be run on it."

There was some early work at https://github.com/IQSS/dataverse/pull/3239 but it was never merged.

That said, I don't mean to give the impression that nothing is happening. There's a new feature in Dataverse 4.9, documented at http://guides.dataverse.org/en/4.9/installation/config.html#setting-up-compute : if you are using Swift for storage and have all the other prerequisites in place, the "Compute" button now has three options:

- Compute on a single dataset
- Compute on multiple datasets
- Compute on a single datafile

I still consider all of this quite experimental. I believe the only installation of Dataverse that's set up this way is https://dataverse.massopen.cloud . Still, the link above explains how to set up an environment like this if you want to play with it and give feedback.
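If you want to experiment, the visible part of the setup is a couple of database settings on top of a Swift-backed storage configuration. A minimal sketch, assuming your installation already uses Swift as its storage driver; the guide linked above has the full list of prerequisites, and both values below are placeholders for your own environment:

    # Name the cloud environment shown to users, then point the
    # "Compute" button at your compute environment's base URL.
    curl -X PUT -d 'Massachusetts Open Cloud' \
      http://localhost:8080/api/admin/settings/:CloudEnvironmentName
    curl -X PUT -d 'https://compute.example.edu' \
      http://localhost:8080/api/admin/settings/:ComputeBaseUrl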

There were also experiments going on with Spark recently that I wrote about in the "Spark and Dataverse (Big Data Containers, computation)" thread at https://groups.google.com/d/msg/dataverse-community/P4llZSssZ2Q/zvhGltLpAQAJ

It's funny that you mention uploading data directly to Swift storage and then publishing a dataset, because there's a related system, also somewhat experimental, the Data Capture Module (DCM), that works something like this. http://guides.dataverse.org/en/4.9/developers/big-data-support.html is going to be way too hard for anyone who isn't close to this project to understand, but if you look for "/api/batch/jobs/import/datasets/files" on the page, this is the step where files that have been uploaded outside of Dataverse (via rsync) are made visible within Dataverse as a "package". I guess I'm wondering if this concept could be extended to Swift.
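To make the shape of that workflow concrete, here's a hypothetical sketch of the import step: after the files have been rsync'ed into the upload folder, you tell Dataverse to register them against a dataset as a package. $SERVER, $API_TOKEN, and $DATASET_DB_ID are placeholders, and the guide above documents the exact parameters the endpoint expects:

    # Register files already uploaded outside of Dataverse (via rsync)
    # as a "package" file on the given dataset. Placeholder values;
    # see the big data support guide for the real parameters.
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      "$SERVER/api/batch/jobs/import/datasets/files/$DATASET_DB_ID"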

I sure hope you're coming to the community meeting next week because I'd love to pick your brain! :)

Phil





Jannik Lévesque

Jun 11, 2018, 1:03:22 PM
to Dataverse Users Community
Hi!

I think my question is related. If I understand the documentation correctly, we can configure Dataverse to create a new Swift container to store the files we upload, but we can't create a dataset based on an existing Swift container?

We are looking for a way to link external sources to datasets. For instance, if I have an existing Swift container where my files are stored, I would like to link it to a dataset instead of creating a new container and uploading the files. Also, if I added new files to the container, they would appear in Dataverse too. I think that was Esther's last point.

We are interested in the data catalogue aspect of Dataverse and sometimes would like to link data that is already stored somewhere else. Do you think this is feasible? If so, do you have any idea of the best way to achieve it?

Philip Durbin

Jun 11, 2018, 3:49:51 PM
to dataverse...@googlegroups.com
Yes, you and Esther are right that currently a new Swift container would be created when users upload files to Dataverse.

The main model Dataverse has for storing files elsewhere is harvesting: when datasets are "harvested" via OAI-PMH, the metadata is recorded in Dataverse but the files stay remote. So if the system in which you want to store the data can be harvested via OAI-PMH, that would be a way to keep the files outside of your installation of Dataverse while still having them be discoverable within Dataverse.
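OAI-PMH is just HTTP, so you can quickly check whether a given system is harvestable from the command line. For example, against a Dataverse installation's own OAI server (any other OAI-PMH provider will expose its own endpoint):

    # Ask the server to identify itself, then list records as Dublin Core.
    curl "https://demo.dataverse.org/oai?verb=Identify"
    curl "https://demo.dataverse.org/oai?verb=ListRecords&metadataPrefix=oai_dc"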

Speaking of Swift, I just got off the phone with Kevin and Amaz from Scholars Portal and we were talking about Swift. I'd like to get all of you Swift users (including https://dataverse.massopen.cloud ) talking to each other.

Phil


Guillaume Moutier

Jun 11, 2018, 11:16:31 PM
to Dataverse Users Community
Hi Philip,

I'm jumping in on this discussion after Jannik (she's from my team at Laval University, Quebec City). Some quick context: we are setting up a big data / data analysis platform that looks really similar to the "Cloud Dataverse" presented at the OpenStack Summit last year in Boston. The only difference is that instead of on-demand computing resources based on OpenStack, we'll try to go directly to containers with OpenShift. We already have recipes to spin up Jupyter notebook + Spark environments directly in containers. I had a chance to talk with people from Red Hat involved in the MOC three weeks ago, and there will surely be further talks.
So, two quick things:
- In the early Cloud Dataverse presentations the data lake seemed to be based on Ceph storage (which is what we are deploying), but since then it's always Swift that's mentioned. Was Ceph abandoned later on, or is Ceph's Swift API being used?
- As you mentioned, I'd really like to get in touch with people with the same (or a similar) setup. If there are any talks/meetings/groups for this, could you point me in the right direction?

Thanks,
Guillaume.



Philip Durbin

Jun 12, 2018, 6:44:07 AM
to dataverse...@googlegroups.com
Hi, if you're interested in Jupyter notebooks, you should definitely check out the BinderHub issue at https://github.com/IQSS/dataverse/issues/4714

I went ahead and made the notes from yesterday's quick meeting with Kevin and Amaz public, but I'm sure they will be hard to follow: https://docs.google.com/document/d/1klQSXj-GkmyLD4DlJIBSX_7uaB_EM0njHtm4livtqh0/edit?usp=sharing

We plan to talk some more on IRC in about four hours, at 10:30 Eastern. All are welcome: you can join #dataverse on freenode or http://chat.dataverse.org . Logs will appear at http://irclog.iq.harvard.edu/dataverse/2018-06-12 . They're wondering how they can help make upload and download more robust, and I'm interested in hearing more about their ideas for maybe using Globus for this.

I am not personally a Swift or Globus user but I'm trying to catch up. At some point we should probably have a community call on all this stuff: https://dataverse.org/community-calls

Phil


Esther Dzale

Jun 12, 2018, 9:42:26 AM
to Dataverse Users Community
Hi Philip,
thank you for your answer and all the new insights. I will try my best to join the IRC chat today, but I am not sure I can make it. At the very least, I will go through the documents you pointed out and provide some feedback.
Esther


Ata Turk

Jun 12, 2018, 11:21:33 AM
to Dataverse Users Community
Hi Guillaume,

I am leading the group at BU/MOC that is developing the Swift-backed DV along with the addition of the Compute buttons. In answer to your two questions:

- In the early Cloud Dataverse presentations the data lake seemed to be based on Ceph storage (which is what we are deploying), but since then it's always Swift that's mentioned. Was Ceph abandoned later on, or is Ceph's Swift API being used?

We are using the Swift API from Ceph: our Swift front end is provided by our OpenStack cloud, and all of our OpenStack storage is backed by Ceph.
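For anyone deploying Ceph directly, as Guillaume is, one common way to get a Swift-compatible endpoint straight from Ceph is the Ceph Object Gateway (radosgw), which speaks both S3 and Swift. A rough sketch with placeholder names; check the Ceph docs for your release:

    # Create a gateway user, add a Swift subuser, and generate a Swift
    # secret key for authenticating against the radosgw endpoint.
    radosgw-admin user create --uid=dataverse --display-name="Dataverse storage"
    radosgw-admin subuser create --uid=dataverse --subuser=dataverse:swift --access=full
    radosgw-admin key create --subuser=dataverse:swift --key-type=swift --gen-secret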

- As you mentioned, I'd really like to get in touch with people with the same (or a similar) setup. If there are any talks/meetings/groups for this, could you point me in the right direction?

I would love to get in touch. You can email me directly if you have more questions. I also plan to attend the DV days on Thursday, so if you are around we can chat face-to-face as well.
