Migrating data from DSpace to Dataverse


Joerg Messer

Jan 17, 2013, 8:02:13 PM
to dataverse...@googlegroups.com
Greetings,

We're trying to determine the best way to migrate data from DSpace (v1.5.x) to DVN (v3.3?).  Rumour has it that there are some new tools available with the coming v3.3 release that could make the bulk migration process somewhat easier.  Does anyone have any pointers regarding the bulk loading of DVN?  Thanks.

//Joerg Messer @ UBC

Gustavo Durand

Jan 17, 2013, 8:12:57 PM
to dataverse...@googlegroups.com
Hi Joerg,

The new bulk process in 3.3 is for uploading a batch of files as a zip file to an existing study.

However, there has been the ability for a Network Admin to do a batch import for a long time now. Please take a look at the Import Utilities section of:


Will that help with what you are trying to accomplish? The key is to have the import directory with the correct structure (and from DSpace you should be able to export as Dublin Core for the batch import).

Let me know how this looks and if we can provide further guidance.

Gustavo

Joerg Messer

Jan 17, 2013, 8:32:36 PM
to dataverse...@googlegroups.com
Gustavo,

I understand the basic approach, which seems to be very similar to what DSpace uses with their ItemImport utility.  What I'm not sure about are the details.  The import description in the page you referenced is a little sketchy.  Would you happen to have a more detailed example featuring richer metadata and a number of files to upload?  That would be very useful.

//Joerg Messer @ UBC Library

Gustavo Durand

Jan 17, 2013, 8:39:37 PM
to dataverse...@googlegroups.com
Hi Joerg,

I hope this helps:

Basically create a directory, let's call it import/

Under this directory have 1 folder for each study so:

import/study1/
import/study2/
etc.

In each of those directories, have a study.xml file (often in DDI format, but in your case, I suspect Dublin Core) and any additional files associated with that study. You can also place these files in subdirectories, in which case the directory names will be used as the file category.

So for study1:

import/study1/study.xml
import/study1/file1
import/study1/category1/file2
import/study1/category1/file3
import/study1/category2/file4

The above would convert to 1 study in the chosen dataverse, with 4 associated files (study.xml supplies the study metadata); file1 would not have a category, while the others would.
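
The category rule described above can be sketched in a few lines of Python. This is only an illustration of the layout, not part of DVN: the function name plan_study is hypothetical, and using the first directory level as the category for deeper nesting is an assumption the thread does not confirm.

```python
from pathlib import PurePosixPath

def plan_study(study_dir, relative_files):
    """Map each data file under a study folder to its file category.

    A file directly in the study folder gets no category (None); a file in
    a subdirectory takes that subdirectory's name as its category.
    study.xml carries the study metadata, so it is skipped here.
    Assumption: for deeper nesting, only the first directory level is used.
    """
    plan = {}
    for rel in relative_files:
        path = PurePosixPath(rel)
        if path.name == "study.xml" and len(path.parts) == 1:
            continue  # metadata file, not a data file
        category = path.parts[0] if len(path.parts) > 1 else None
        plan[str(PurePosixPath(study_dir) / path)] = category
    return plan

layout = plan_study("import/study1", [
    "study.xml",
    "file1",
    "category1/file2",
    "category1/file3",
    "category2/file4",
])
for target, category in layout.items():
    print(target, "->", category)
```

Running a check like this over an exported DSpace collection before copying anything to the server is a cheap way to confirm which files will end up in which category.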


Does that help clarify things?

Gustavo

Joerg Messer

Jan 17, 2013, 11:50:15 PM
to dataverse...@googlegroups.com
Gustavo,

This is very similar to DSpace. Could you tell me if it's possible to
include a description with each uploaded file? It would also be
useful to be able to specify permissions since in many cases the
descriptive elements are public and the actual data is not. If so,
how is this metadata included?
--
J o e r g M e s s e r
1.604.708.0671 | joerg....@gmail.com

Gustavo Durand

Jan 18, 2013, 2:10:22 PM
to dataverse...@googlegroups.com
Unfortunately, it is not currently possible to include descriptions or
permissions through the batch import. These would have to be done manually
afterwards through the UI.

Also note that, since the purpose of the batch import was to
facilitate the data entry process for multiple studies, the studies
are still in "draft" state once they are imported. A curator or
admin should review each study and release it for publication.

Joerg Messer

Jan 18, 2013, 2:35:42 PM
to dataverse...@googlegroups.com
Thanks for the clarification.

Joerg Messer

Jan 22, 2013, 3:02:01 PM
to dataverse...@googlegroups.com

I thought it might be worth making one more query with respect to migrating data from DSpace to DVN.  We have about 1600 data sets in a DSpace v1.5.x repository (abacus.library.ubc.ca) and almost all the files in these data sets have detailed file-level descriptions of some kind.  Having to manually re-enter this information would be *extremely* painful.  Is there no way that we could automate the migration process so that all of our metadata, including these file descriptions, makes it over to DVN?  Are there any plans for this kind of functionality in the near future?  Or, possibly, is there some way I can insert the necessary info directly into the DVN database?  Any suggestions would be most welcome.

//Joerg Messer - UBC Library


Gustavo Durand

Jan 22, 2013, 4:56:06 PM
to dataverse...@googlegroups.com
Within the app, there is no way. Currently, we have not planned for any such functionality, though it would clearly be a useful addition.

However, as you have to have sysadmin access to do this anyway (to put the import directory in the right place), I assume you also have access to the db.

So then you could just run a batch of UPDATE queries on the filemetadata table right after the import happens. You would have to use a text editor or Excel to produce the individual queries, in the form:

UPDATE filemetadata set description = 'DESCRIPTION' where label = 'FILENAME'; 

Of course FILENAME would have to be unique across studies, so that it doesn't update more than 1 record per file. Is this the case?
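
Instead of a text editor or Excel, a short script can produce the same queries. The following is a minimal sketch, assuming the file names and descriptions have already been pulled out of DSpace as (filename, description) pairs; the helper names sql_literal and build_updates are made up for this example. It doubles single quotes so that a description such as "Interviewer's codebook" does not break the statement, and it refuses to continue if a file name repeats in the input list. It does not check the database itself, and depending on the server's string settings a backslash in a description may need extra handling.

```python
from collections import Counter

def sql_literal(text):
    """Quote a string as an SQL literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"

def build_updates(rows):
    """Return one UPDATE statement per (filename, description) pair.

    Raises ValueError when a file name repeats in the input, because the
    label-only WHERE clause would then touch more than one record.
    """
    counts = Counter(name for name, _ in rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError("file names are not unique: " + ", ".join(duplicates))
    return [
        "UPDATE filemetadata SET description = {} WHERE label = {};".format(
            sql_literal(description), sql_literal(name))
        for name, description in rows
    ]

rows = [
    ("survey_2009.tab", "Respondent-level data"),
    ("codebook.pdf", "Interviewer's codebook"),
]
statements = build_updates(rows)
print("\n".join(statements))
```

The printed statements can be saved to a file and reviewed before anything is run against the database.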


Does that help?

Gustavo

Joerg Messer

Jan 22, 2013, 5:50:12 PM
to dataverse...@googlegroups.com
Gustavo,

The file names being unique would be a long shot, but I'm assuming that we could query based on the title and then link that to the file names.  There must be some way to tie the file name to the study, which should make it unique.  Are there any further schema docs available?

//Joerg

Gustavo Durand

Jan 22, 2013, 6:12:58 PM
to dataverse...@googlegroups.com

It is of course linked to the study; that would just make the query more complicated, which I was hoping to avoid. We don't really have good documentation on the schema (this is something we need to work on ASAP), but I'm sure I can help you figure it out.

Basically the title would be stored in the metadata table. Each metadata and filemetadata is associated with a studyversion.

So filemetadata has a studyversion_id and studyversion has a metadata_id.

So something like (note, I haven't tested/verified this yet):

-- The SET target is left unqualified; PostgreSQL rejects fm.description there.
UPDATE filemetadata fm SET description = 'DESCRIPTION'
FROM studyversion sv, metadata m
WHERE fm.studyversion_id = sv.id
AND sv.metadata_id = m.id
AND m.title = 'STUDY_TITLE'
AND fm.label = 'FILENAME';
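
To see the effect of scoping the update by study, here is a small self-contained sketch that can be run without touching a DVN database. It is built on assumptions: an in-memory SQLite database stands in for PostgreSQL, the three tables contain only the columns mentioned in this thread plus an assumed id column on filemetadata, and the join is written as a subquery rather than UPDATE ... FROM so that it runs on any SQLite version. It has not been tried against a real DVN schema.

```python
import sqlite3

# Toy stand-in for the three tables discussed in the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE studyversion (id INTEGER PRIMARY KEY, metadata_id INTEGER);
    CREATE TABLE filemetadata (id INTEGER PRIMARY KEY, label TEXT,
                               description TEXT, studyversion_id INTEGER);
    INSERT INTO metadata VALUES (1, 'Study A'), (2, 'Study B');
    INSERT INTO studyversion VALUES (10, 1), (20, 2);
    -- The same file name appears in both studies.
    INSERT INTO filemetadata VALUES (100, 'data.tab', NULL, 10),
                                    (200, 'data.tab', NULL, 20);
""")

# Restrict the update to files whose study version belongs to the named study.
SCOPED_UPDATE = """
    UPDATE filemetadata SET description = ?
    WHERE label = ?
      AND studyversion_id IN (
          SELECT sv.id FROM studyversion sv
          JOIN metadata m ON sv.metadata_id = m.id
          WHERE m.title = ?)
"""
cur = conn.execute(SCOPED_UPDATE, ("Wave 1 responses", "data.tab", "Study A"))
updated = cur.rowcount
result = conn.execute(
    "SELECT id, description FROM filemetadata ORDER BY id").fetchall()
print(updated, result)
```

Only the file in 'Study A' receives the description, even though both studies contain a file called data.tab. The same idea still depends on study titles being unique, and a study with several versions would have the file updated in each of them.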


Gustavo

Joerg Messer

Jan 22, 2013, 6:37:21 PM
to dataverse...@googlegroups.com
Gustavo,

Many thanks for your help.  When I understand the batch upload process a little better I'll try some queries against the database.  I'm just doing some simple test uploads now.  I'm wondering if it's better to downgrade from the DSpace Qualified DC to the DVN Unqualified DC or upgrade to DVN DDI.  I think DDI might be less painful since UDC is not quite detailed enough for us. 

Philip Durbin

Jan 22, 2013, 6:46:43 PM
to dataverse...@googlegroups.com
On Tue, Jan 22, 2013 at 6:12 PM, Gustavo Durand
<gdu...@hmdc.harvard.edu> wrote:
> We don't really have good documentation on the schema
> (this is something we need to work on ASAP)

Please take this with a HUGE grain of salt, but I just ran SchemaSpy
on a fresh install of DVN 3.3 and this is what it showed:

http://dvn-5.hmdc.harvard.edu/tmp-schemaspy-tmp/relationships.html

I put "tmp" all over this thing because it shouldn't be considered
authoritative at all... I'm new here. :)

Phil

p.s. Here's how I generated those images:

java -jar /tmp/schemaSpy_5.0.0.jar -t pgsql -host localhost -db dvnDb \
  -u dvnApp -p secret \
  -dp /root/dvninstall/pgdriver/postgresql-8.4-703.jdbc4.jar \
  -o /tmp/out -s public

--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Joerg Messer

Jan 23, 2013, 1:40:55 PM
to dataverse...@googlegroups.com, philip...@harvard.edu

Thanks Phil.  Very useful.

Stephen Marks

Jan 24, 2013, 8:02:18 AM
to dataverse...@googlegroups.com, philip...@harvard.edu
Yes, this is great!