Backup question


Jamie Jamison

Sep 19, 2019, 2:01:08 PM
to Dataverse Users Community
The UCLA Dataverse is on AWS, with data in an S3 bucket that is cross-region replicated. For backup of the data I'm running rclone to copy it to our department box.

I'm reading through the Dataverse backup scripts. My question is how others are backing up their metadata. I need to set up a script to back up the database but wasn't sure how other people are handling this.

Thank you,

Jamie Jamison

Don Sizemore

Sep 19, 2019, 2:04:18 PM
to dataverse...@googlegroups.com
Hello,

We're dumping our PostgreSQL database nightly and pushing the resulting nightly backups into our preservation pipeline (in iRODS).

Donald


Jamie Jamison

Sep 19, 2019, 3:42:51 PM
to Dataverse Users Community
I know metadata can be exported, but just to clarify: is the dataset metadata in the Postgres database? I'm trying to set up backup and restore procedures and document where the various pieces are located. The hypothetical scenario is rebuilding our system from backups.




Don Sizemore

Sep 19, 2019, 3:54:00 PM
to dataverse...@googlegroups.com
The main pieces you want are:

• user-uploaded datafiles (formerly known as the files.dir hierarchy, now in S3 for you)
• the postgres database (which indeed contains all metadata and may be backed up via shell script)
• any customization you made beneath /usr/local/glassfish4/glassfish/domains/domain1/docroot or equivalent (in our case, none)

Our datafiles still live on local storage, so we drop our nightly database dumps alongside files.dir and push the whole thing into our preservation pipeline.
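
For illustration, a minimal nightly-dump sketch along those lines might look like the following (the paths, the role, and the database name "dvndb" are assumptions to adapt to your own installation; schedule it from cron and let rclone or your preservation pipeline pick up the output):

#!/bin/bash
# Nightly PostgreSQL dump for a Dataverse database (paths and names are assumed examples).
set -euo pipefail
BACKUP_DIR=/srv/backups/dataverse   # where dumps land before rclone/preservation picks them up
DB_NAME=dvndb                       # your Dataverse database name
STAMP=$(date +%Y-%m-%d)
mkdir -p "$BACKUP_DIR"
# Custom-format dump; restore later with pg_restore.
pg_dump -U dataverse -Fc "$DB_NAME" > "$BACKUP_DIR/${DB_NAME}_${STAMP}.dump"
# Prune dumps older than 30 days.
find "$BACKUP_DIR" -name "${DB_NAME}_*.dump" -mtime +30 -delete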

Donald


Jamie Jamison

Sep 19, 2019, 5:56:14 PM
to Dataverse Users Community
That makes things clearer. Our user-uploaded datafiles in S3 are backed up, and I've got a cron script to back up postgres.

Silly as it seems, I wasn't sure where the metadata was, but now it's more obvious.

Thank you again,

Jamie



Philip Durbin

Sep 20, 2019, 3:06:10 PM
to dataverse...@googlegroups.com
While it's certainly true that the metadata for datasets is in the database, one of the features of Dataverse is that the metadata is also written to the filesystem (or S3 or Swift) for safekeeping in a variety of standard formats such as Dublin Core, DDI, etc.

All of these files end with ".cached" and look something like this:

[pdurbin@dvnweb-vm6 ~]$ cd /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/YZGEJS
[pdurbin@dvnweb-vm6 YZGEJS]$ ls -1
16d4606c55d-e9b9fe7cdec1
16d4606ca7a-ee63a3950028
export_Datacite.cached
export_dataverse_json.cached
export_dcterms.cached
export_ddi.cached
export_html.cached
export_oai_datacite.cached
export_oai_dc.cached
export_oai_ddi.cached
export_OAI_ORE.cached
export_schema.org.cached
[pdurbin@dvnweb-vm6 YZGEJS]$

Here's an example of how the Dublin Core file looks:

[pdurbin@dvnweb-vm6 YZGEJS]$ cat export_dcterms.cached | xmllint -format -
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns="http://dublincore.org/documents/dcmi-terms/">
  <dcterms:title>Darwin's Finches</dcterms:title>
  <dcterms:identifier>https://doi.org/10.5072/FK2/YZGEJS</dcterms:identifier>
  <dcterms:creator>Finch, Fiona</dcterms:creator>
  <dcterms:publisher>Root</dcterms:publisher>
  <dcterms:issued>2019-09-18</dcterms:issued>
  <dcterms:modified>2019-09-18T20:18:32Z</dcterms:modified>
  <dcterms:description>Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.</dcterms:description>
  <dcterms:subject>Medicine, Health and Life Sciences</dcterms:subject>
  <dcterms:license>NONE</dcterms:license>
</metadata>
[pdurbin@dvnweb-vm6 YZGEJS]$

The idea here is that if, 30 years from now, nobody can figure out how to restore your PostgreSQL database to whatever newfangled technology they're using in 2049, someone should still be able to figure out how to print out the contents of flat files like XML and JSON.

All this is to say that, as part of your backup strategy (dare I say archiving and preservation strategy), you could consider backing up all these XML and JSON files. Just in case. :)
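
As a rough sketch of what that could look like on a stock filesystem-storage installation (the files directory below matches the example above, and /srv/backups/dataverse is an arbitrary example path; with S3 you would instead sync the export_*.cached objects out of the bucket with rclone or aws s3 sync):

# Bundle every cached metadata export into a dated tarball.
FILES_DIR=/usr/local/glassfish4/glassfish/domains/domain1/files
find "$FILES_DIR" -name 'export_*.cached' -print0 \
  | tar -czf /srv/backups/dataverse/metadata-exports-$(date +%Y-%m-%d).tar.gz --null -T -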

This feature is described (perhaps not as well as it could be) under "Automatic Exports" in the following way at http://guides.dataverse.org/en/4.16/admin/metadataexport.html#automatic-exports

"Publishing a dataset automatically starts a metadata export job, that will run in the background, asynchronously. Once completed, it will make the dataset metadata exported and cached in all the supported formats listed under Supported Metadata Export Formats in the Dataset + File Management section of the User Guide. A scheduled timer job that runs nightly will attempt to export any published datasets that for whatever reason haven’t been exported yet."

(The other reason we save these files to disk is for performance, so that they don't have to be regenerated on the fly from the database whenever someone clicks "Metadata" and then "Export Metadata".)
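
The same guide page also covers batch export calls against the admin API, which can be handy if you ever need to regenerate the cached files (for example after restoring from backups); roughly along these lines, though double-check the exact endpoints for your version:

# Export any published datasets that have never been exported:
curl http://localhost:8080/api/admin/metadata/exportAll
# Force a fresh export of every published dataset:
curl http://localhost:8080/api/admin/metadata/reExportAll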

We don't talk about this feature at https://dataverse.org/software-features and I'm not sure what to call it. Ideas are welcome. :)

I hope this helps.

Thanks,

Phil





Janet McDougall - Australian Data Archive

Sep 22, 2019, 9:04:07 PM
to Dataverse Users Community
Great conversation here, and Phil, this metadata preservation information is very useful. I didn't realise that the metadata 'export' reads from this cached store rather than the database, though I can understand why, as you have described.

Will the Data Curation Tool's 'variable metadata' also be part of the metadata export when it's in production?

thanks
Janet

Philip Durbin

Sep 22, 2019, 9:39:02 PM
to dataverse...@googlegroups.com
My understanding of the Data Curation Tool is that it enriches variable metadata stored in the database, which (as before) is then exported into DDI, for example. I think one of the big things is being able to add weights to variables, but it's probably best to consult Victoria Lubitch's slides about the tool: https://osf.io/a2wtk/


Philipp at UiT

Sep 23, 2019, 7:30:31 AM
to Dataverse Users Community
Just to be sure: the .cached files are stored in the file system we use as part of our Dataverse installation, in addition to the metadata being stored in the PostgreSQL database? I wasn't aware of this. I think this is very useful when it comes to long-term preservation efforts. Do preservation trails in e.g. Archivematica relate to these files in any way?

Best, Philipp



Philip Durbin

Sep 23, 2019, 10:08:23 AM
to dataverse...@googlegroups.com
Yes. Out of the box, the "cached" files are stored on the filesystem but they'll be on S3 or Swift if you use those alternate file storage options (I'm 90% sure of this). I do agree that this is helpful for long term preservation. I've heard Jon Crabtree from Odum talk about the importance of this feature. I'm still interested in a good, short way to talk about this at https://dataverse.org/software-features so if any wordsmiths out there have any suggestions, please let me know. :) I keep thinking I should give a talk called "hidden features of Dataverse". :)

I'm not sure what "preservation trails" are, but my reading of https://www.archivematica.org/en/docs/archivematica-1.10/user-manual/transfer/dataverse/#dataverse-mets-file is that the "native JSON" format produced by Dataverse (export_dataverse_json.cached in my example above) is transformed by Archivematica into a DDI-based METS file. I'm sorry if I have any of this wrong. Someone else out there knows the details better than I do.

I hope this helps,

Phil


Philipp at UiT

Sep 23, 2019, 10:19:28 AM
to Dataverse Users Community
Thanks, Phil. The term "preservation trail" is probably not used anywhere in any preservation framework, but was coined by me on the fly in an attempt to distract from my ignorance on this issue. ;-) We are not using Archivematica or any similar tool yet, so like Phil, I'd be interested in any more details.

Philipp



Philip Durbin

Sep 23, 2019, 10:32:22 AM
to dataverse...@googlegroups.com
Thanks. My understanding of http://guides.dataverse.org/en/4.16/installation/config.html#bagit-export is that the "export_OAI_ORE.cached" file is included in the BagIt file (maybe under a different name).


James Myers

Sep 23, 2019, 10:53:23 AM
to dataverse...@googlegroups.com

and the datacite.xml file as well.

 

Right now though, I think the database is the only complete record, and the only one that would allow you to restore a Dataverse instance. Some of the export formats are partial by design, i.e. in that they only include the subset of metadata that can be mapped to a particular schema/format. The json and ore exports are conceptually different – they are nominally intended to be complete (at least the ore map is since I was the one with the intention) – but, in practice, I don’t think they include everything needed to round-trip yet. For the ORE map, I know that it was in development before the variable metadata editing was merged and before provenance (text and file) was done, so it doesn’t include everything needed to restore a dataset yet. It’s definitely a goal of the ORE/BagIt effort to make the archive file sufficient to restore a dataset into a working Dataverse instance, and the Dataverse API is sufficient/close to sufficient to export/re-ingest everything so tools like pyDataverse could become a back-up tool. But for now, I think the cached metadata files are more of a preservation option than a back-up one.
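
For anyone who wants to experiment with API-driven export along the lines Jim describes, the per-dataset metadata export endpoint behind those cached files can be scripted directly; a sketch, reusing the DOI from Phil's example and a placeholder server name:

# Fetch the native JSON export for one dataset.
SERVER=https://dataverse.example.edu
PID=doi:10.5072/FK2/YZGEJS
curl "$SERVER/api/datasets/export?exporter=dataverse_json&persistentId=$PID" \
     -o darwins-finches_dataverse_json.json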

 

  -- Jim

Janet McDougall - Australian Data Archive

Sep 23, 2019, 10:34:19 PM
to Dataverse Users Community
Hi Jim and all,
Thanks for the further clarification on the ORE/BagIt status. I'm still not exactly clear on how/where the variable-level metadata will be captured, though. Will it be available as part of the cached metadata?

How are others preserving their data if it is not ingested through Dataverse as tabular data? Do you all follow preservation steps such as those Archivematica would perform with its microservices?

ADA usually processes data in SPSS, and uses Stat Transfer to export SPSS data as .dat and .sps (syntax) files so the data can be reconstructed; in the past we used Nesstar to export data-file and variable metadata as DDI XML. This is why the metadata cache is really useful to understand.

Janet



James Myers

Sep 24, 2019, 3:41:39 PM
to dataverse...@googlegroups.com

Janet,

 

FWIW: the ORE map being generated is stored as a cached file and is available from the 'export metadata' menu. That said, the main purpose in creating it was to include it in the BagIt file along with all the content (and the other files required by the BagIt spec and, more recently, recommended by RDA), and to be able to transfer those zipped Bags to long-term storage. And, eventually, to pull them back into Dataverse. For that reason, the initial code doesn't archive things like variable-level metadata that would be regenerated on import. However, that changes now that the variable metadata is editable.

 

So far, code exists to create the ORE map and the Bag, and to send the Bag to Google's Cloud, to DuraCloud (previously the way into DPN and currently a way to send things to Chronopolis), or to the file system (created by Odum as a way to then manage the Bags in iRODS) – with creation triggered manually as an admin or automatically as part of publication. The initial concept is that the archive represents the state of the Dataset in Dataverse and that it is stored but not further processed à la Archivematica.

 

Open issues:

• There is not yet a way to read the Bag automatically back into a Dataverse instance.

• While I believe the Bags with their ORE map files were sufficient to recreate a Dataset in Dataverse (i.e. all the data and metadata that users see would be the same after a round-trip) this is no longer true given the recent work to make variable-level metadata editable (so what Dataverse would re-derive is not the same as the edited original) and to allow uploaded provenance files, neither of which are included in the current Bags.

• As said above, the original use case was for archiving/future restoration rather than preservation, so the Bags currently don't include metadata/data that is automatically derived by Dataverse.

 

The first two are definitely things that we (I/QDR/GDCC/…) want to/plan to address in the ~near term. This would include adding the edited variable-level metadata to the ORE map and adding any provenance file to the Bag (with a reference in the ORE map). (The Dataverse Uploader already has some placeholder code for reading a Bag, so we have a starting point.)

 

The latter item – thinking about the Bag in support of preservation – is something that could be done, e.g. a switch to add all derivable data/metadata, including all of the cached metadata export files, to the Bag. I guess one question for the community is whether that's useful. (Should all the export formats be included? Just specified ones? If the ORE map includes the variable-level metadata (and potentially the derived file formats), are the exported metadata files also needed?)

 

  --Jim


Janet McDougall - Australian Data Archive

Sep 25, 2019, 12:37:53 AM
to Dataverse Users Community
Thanks Jim, I will need some time to digest that, but it's really useful to consider around preservation and restoration. I will include these notes when we get to this point.
Janet

