How to retrieve metadata of ALL datasets on demo server?

Yuzhang Han

Nov 23, 2019, 10:22:10 PM
to Dataverse Users Community
Hi there,

I am working on a research project where I need to download the metadata of ALL datasets on a server. I was wondering if it was possible to do that using any API command, or in any way?

Thanks,
Andrew

Gautier, Julian

Nov 24, 2019, 8:43:20 AM
to dataverse...@googlegroups.com
Hi Andrew,

I can think of three ways, depending on the amount of metadata you'd like for each dataset and what format you'd like it in, and I'm sure others in our community will have more insight.
  • Dataverse's Search API can get you some metadata, in JSON format, for all published datasets in a repository (server). (The recent Dataverse update, v4.18, also makes it possible to retrieve metadata of any unpublished datasets that an account you have on the repository has the right permissions on.) Each API call can retrieve up to 1000 records, so you might have to iterate through pages of results if there are more than 1000; the Search API guide includes a helpful Python script for doing this, and there's a sketch after this list. Some repositories, like UNC Dataverse, require you to use an API token, so you might have to create a repository account (or ask to have one created for you) and create an API token to get the Search API to work for that repository.
  • The Native API includes an endpoint that, for a given dataset persistent ID or database ID, will export that dataset's metadata in JSON or XML in one of several standards, including Dataverse's JSON standard (dataverse_json), which includes all of a dataset's metadata. You can get a repository's dataset PIDs by using the Search API.
  • The repository might be publishing all of its dataset metadata over OAI-PMH, and you can access those records over the web to harvest metadata in several metadata standards. (In case you're not too familiar with OAI-PMH, I like DataCite's guide. The records are paginated, like the Search API's results, so you'd have to iterate through pages.) This spreadsheet has a list of base addresses for Dataverse-based repositories.
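If it helps, here's a rough Python sketch of the first two routes (using the requests package): page through the Search API to collect dataset PIDs, then pull each dataset's metadata through the Native API's export endpoint. I'm using the demo server and no API token here; treat the details as a starting point, not a finished script.

    import requests

    SERVER = "https://demo.dataverse.org"  # swap in the repository you're harvesting
    API_TOKEN = None                       # set this if the repository requires a token

    def get_all_dataset_pids(per_page=1000):
        """Page through the Search API, collecting every published dataset PID."""
        headers = {"X-Dataverse-key": API_TOKEN} if API_TOKEN else {}
        pids, start, total = [], 0, 1
        while start < total:
            resp = requests.get(
                f"{SERVER}/api/search",
                params={"q": "*", "type": "dataset",
                        "start": start, "per_page": per_page},
                headers=headers,
            )
            resp.raise_for_status()
            data = resp.json()["data"]
            total = data["total_count"]
            pids.extend(item["global_id"] for item in data["items"])
            start += per_page
        return pids

    def export_dataset_metadata(pid, exporter="dataverse_json"):
        """Export one dataset's metadata via the Native API, in the given standard."""
        resp = requests.get(
            f"{SERVER}/api/datasets/export",
            params={"exporter": exporter, "persistentId": pid},
        )
        resp.raise_for_status()
        return resp.text

    for pid in get_all_dataset_pids():
        metadata = export_dataset_metadata(pid)
        # ...write each metadata record to a file, parse it, etc.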
If you're able to share any details, I'd love to know what you're working on and what you end up doing!

All best,
Julian

--
Julian Gautier
Product Research Specialist, IQSS

Julian Gautier

Nov 24, 2019, 1:07:39 PM
to Dataverse Users Community
Hi again,

Just realized your original post's title says "on demo server". If you mean demo.dataverse.org, it might be helpful to know that that server doesn't require an API token to use the Search API, and metadata records for all of its published datasets are available over OAI-PMH. If you go the OAI-PMH route, you can use "dataverse_json" as the metadata export format, and each record will include a "directApiCall" URL for retrieving that dataset's metadata in JSON, e.g. https://demo.dataverse.org/oai?verb=ListRecords&metadataPrefix=dataverse_json. The "oai_ddi" format is in XML and includes more metadata than the other formats (oai_dc, datacite, etc.).
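If you'd rather script the harvesting, here's a bare-bones sketch using only Python's standard library. It follows resumption tokens through the demo server's feed; the endpoint and metadata prefixes are the ones mentioned above, and everything else is just an illustration.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE = "https://demo.dataverse.org/oai"
    OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH XML namespace

    def harvest(metadata_prefix="oai_ddi"):
        """Yield every OAI-PMH <record> element, following resumption tokens."""
        url = f"{BASE}?verb=ListRecords&metadataPrefix={metadata_prefix}"
        while url:
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(f"{OAI}record"):
                yield record
            token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
            if token is not None and token.text:
                url = (f"{BASE}?verb=ListRecords"
                       f"&resumptionToken={urllib.parse.quote(token.text)}")
            else:
                url = None  # no token on the last page; we're done

    # Print the first few record identifiers as a sanity check.
    for i, record in enumerate(harvest()):
        if i >= 5:
            break
        print(record.find(f"{OAI}header/{OAI}identifier").text)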

Hope that's helpful!
Julian


Stefan Kasberger

Dec 3, 2019, 5:45:33 AM
to Dataverse Users Community
Hi,

I can also recommend pyDataverse, a Python API wrapper developed by me. :)
You would have to add the feature to retrieve all Dataverses. Once you have all Dataverse Aliases, you can easily retrieve all Datasets inside.

Cheerz, Stefan

Stefan Kasberger

Dec 3, 2019, 5:46:31 AM
to Dataverse Users Community
Here's the missing project URL on GitHub: https://github.com/AUSSDA/pyDataverse



Jamie Jamison

Dec 19, 2019, 7:49:44 PM
to Dataverse Users Community
Hi, I also need to retrieve an entire collection's metadata. I've been working through the pyDataverse examples but can't find what feature to add so I can retrieve all the dataverses. Thank you, jamie

Philip Durbin

Dec 19, 2019, 7:57:30 PM
to dataverse...@googlegroups.com
Hi Jamie, when you say "collection" can you please be a little more specific? Do you mean all of https://dataverse.ucla.edu ? Including harvested datasets?

Jamie Jamison

Dec 20, 2019, 1:15:29 PM
to Dataverse Users Community
Sorry, too many cold meds, I wasn't clear. The metadata I want is from the Social Science Data Archive, which is hosted at Harvard Dataverse, not UCLA. It's at https://dataverse.harvard.edu/dataverse/ssda_ucla. I'd like to get the metadata for all of it; we tend to think of the SSDA as a collection.



Julian Gautier

Dec 21, 2019, 1:49:41 PM
to dataverse...@googlegroups.com
Hi everyone,

I'd like to share a collection of Python 3 scripts I've been using for collecting and analyzing the metadata of datasets in Dataverse-based repositories. The scripts are published and documented at https://github.com/jggautier/dataverse-scripts/tree/master/get-dataverse-metadata, and they use pyDataverse. Please feel free to re-write (and for you Python gurus out there, improve?) any of the scripts.

I used the scripts to download the Dataverse JSON metadata of the 30k+ datasets published in the Harvard Dataverse repository and write certain metadata to CSV files. The JSON files and CSV files are published in Harvard Dataverse at https://doi.org/10.7910/DVN/DCDKZQ. The metadata is current as of December 12, 2019. Consider downloading the JSON metadata from there instead of using the scripts to re-download the JSON files from Harvard Dataverse (unless you really need the most recent metadata). I also downloaded the metadata of datasets in Scholar's Portal Dataverse and Dataverse NL and plan to add them to that dataset, too. I'm told it would be possible for repository installation system admins to get the cached JSON files straight from their servers, instead of using the Native API to download them. So I could imagine that each repository could, if they wanted to, publish their collection of dataset JSON files (or send them my way so I could add them to the dataset in Harvard Dataverse).

There are other tools that do similar things as these scripts, listed in the User Guides at http://guides.dataverse.org/en/latest/admin/reporting-tools.html, particularly TDL's reporting tool at https://github.com/TexasDigitalLibrary/dataverse-reports, but they require access to the repository's postgres database to get the metadata, so they're more for repository admins who have access to the repository database.

Lastly, Jamie, I've invited Jesus at CIMMYT, who completed a migration from Harvard Dataverse to their own Dataverse-based repository, to share his group's process here. He's agreed and will get back to us after the holidays. I'm hoping that'll help others' efforts to migrate.

Happy holidays!

Jamie Jamison

Jan 8, 2020, 4:03:56 PM
to Dataverse Users Community
Hello,

I've downloaded the scripts. Very helpful. So far I've gotten the PIDs and JSON data.

Thank you so much,

Jamie


Jamie Jamison

Oct 12, 2020, 4:14:56 PM
to Dataverse Users Community
I know I've asked this before, so apologies in advance. I've downloaded Julian Gautier's scripts, which would be helpful if I could run things from the command line. But I need to get all the metadata from the Social Science Data Archive, which is hosted at Harvard Dataverse (https://dataverse.harvard.edu/dataverse/ssda_ucla). I don't have the permission/ability to log in and run something from the command line.

So that's my question: is it possible to get metadata for an entire dataverse that I own but can't access from the command line?

Thank you,

Jamie
UCLA Dataverse

Philip Durbin

Oct 13, 2020, 11:37:02 AM
to dataverse...@googlegroups.com
Using either the Search API or the "contents" endpoint, you should be able to get a list of all the datasets' DOIs. From there you should be able to extract the metadata in whatever format you prefer (DDI/XML, Schema.org JSON-LD, etc.). Then it's a matter of creating a report.
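Something like this sketch, assuming the requests package and using your ssda_ucla collection as the example alias (the item field names are my best recollection of the contents endpoint's JSON, so double-check them):

    import requests

    SERVER = "https://dataverse.harvard.edu"

    def list_dataset_pids(alias_or_id):
        """Yield the PID of every dataset under a dataverse, recursing into sub-dataverses."""
        resp = requests.get(f"{SERVER}/api/dataverses/{alias_or_id}/contents")
        resp.raise_for_status()
        for item in resp.json()["data"]:
            if item["type"] == "dataset":
                yield f"{item['protocol']}:{item['authority']}/{item['identifier']}"
            elif item["type"] == "dataverse":
                yield from list_dataset_pids(item["id"])

    for pid in list_dataset_pids("ssda_ucla"):
        # Export each dataset's metadata in whatever format you prefer, e.g. DDI.
        resp = requests.get(
            f"{SERVER}/api/datasets/export",
            params={"exporter": "ddi", "persistentId": pid},
        )
        print(pid, resp.status_code)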

Does this help? I know this is a fair amount of work, a fair amount of scripting.

Phil


Stefan Kasberger

Nov 5, 2020, 9:48:51 AM
to Dataverse Users Community
I recommend using the get_children() function from pyDataverse's develop branch (https://pydataverse.readthedocs.io/en/develop/reference.html#pyDataverse.api.NativeApi.get_children).
It is not properly documented yet, but you only have to pass the parent id (of the Dataverse or Dataset), the parent_type ("dataverse" or "dataset"), and a list of included children_types (which can include "dataverses", "datasets", and "datafiles"). It will then recursively walk through all the children, collect the relevant metadata, and return it in a dictionary.
I've already used it in some bigger data migration projects; it's really convenient, especially the bigger the collection gets.
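For example, something like this, assuming you've installed the develop branch from GitHub (the return structure isn't finalized yet, so treat the output handling as an assumption):

    from pyDataverse.api import NativeApi

    api = NativeApi("https://dataverse.harvard.edu")

    # Recursively collect all datasets (and their datafiles) under a collection.
    children = api.get_children(
        "ssda_ucla",                  # parent id/alias of the Dataverse (or Dataset)
        parent_type="dataverse",
        children_types=["dataverses", "datasets", "datafiles"],
    )
    print(children)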

Cheerz, Stefan

danny...@g.harvard.edu

Nov 5, 2020, 10:37:31 AM
to Dataverse Users Community
Thanks, Stefan, for the response! Jamie, I hope it helps.

Jamie, can you write more about what you mean by: 

"Is it possible to get metadata for an entire dataverse that I own but can't access commandline?"

I write this as someone who is not a developer so apologies if the answer is obvious. :) 

Thanks,

Danny