Downloading metadata as CSV or Excel


Yashika Jain

Sep 30, 2021, 8:07:42 AM
to Dataverse Users Community
Hi everyone,

I'm new to Dataverse. I want to download all the metadata associated with a particular search query as an Excel or CSV file.
Can someone please guide me on how to do this?
Any help would be appreciated.

Thank You 
Yashika Jain

Sebastian Karcher

Sep 30, 2021, 8:48:16 AM
to dataverse...@googlegroups.com
Could you say a bit more about what you are trying to do and what tools you are comfortable using? Natively, Dataverse metadata comes as JSON, which doesn't translate unambiguously to a table.
My approach would be to use the R or python library to get the metadata and then get it into the desired format, but there are (less flexible) options if you don't want to code at all.
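To illustrate why the JSON doesn't map onto a table in only one way, here is a minimal sketch. The record below is a hypothetical, simplified stand-in (real Dataverse JSON is nested more deeply), and the function names are made up for illustration. A multi-valued field such as keywords can either be joined into one cell or spread over several rows, and both are reasonable tables.

```python
import csv
import io

# Hypothetical, simplified record; not the actual Dataverse JSON layout.
record = {
    "title": "Example climate dataset",
    "keywords": ["climate", "temperature"],
}


def to_wide_row(rec, sep="; "):
    """One row per dataset; the multi-valued field is joined into one cell."""
    return {"title": rec["title"], "keywords": sep.join(rec["keywords"])}


def to_long_rows(rec):
    """One row per (dataset, keyword) pair; the title repeats on every row."""
    return [{"title": rec["title"], "keyword": kw} for kw in rec["keywords"]]


def rows_to_csv(rows):
    """Serialize a non-empty list of dicts that share the same keys to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


wide_csv = rows_to_csv([to_wide_row(record)])
long_csv = rows_to_csv(to_long_rows(record))
```

The wide form is easier to read in a spreadsheet; the long form is easier to filter and join. Which one is right depends on what you plan to do with the table.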
Sebastian

Sent from my phone

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/3ae4c0c2-f12f-4374-b2b5-2726c73ca9fbn%40googlegroups.com.

Yashika Jain

Sep 30, 2021, 9:17:59 AM
to Dataverse Users Community
Thank you for the message.
I'm trying to extract all metadata related to "Climate" for a project and I need that data in a tabular format. 
I'm comfortable in Python, though I have never used it for scraping data from the web. Neither have I worked with APIs.
I'm open to learning any library or anything which could help me in getting this task done.

Philip Durbin

Sep 30, 2021, 9:55:12 AM
to dataverse...@googlegroups.com
If you like Python you might enjoy trying to add the functionality to pyDataverse.

pyDataverse already provides access to the Dataverse Search API: https://pydataverse.readthedocs.io/en/latest/reference.html#pyDataverse.api.SearchApi

The next step would be to take the search results and transform them into a CSV.
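As a rough sketch of that step, the code below calls the Search API over plain HTTP with the standard library rather than through pyDataverse, pages through the results, and writes a few fields to a CSV. The function names, the column selection, and the output file name are assumptions for illustration; the `q`, `type`, `start`, and `per_page` parameters and the `total_count` and `items` keys follow the Search API response format as I understand it. Note that search results carry only a handful of fields per dataset, not the full metadata record.

```python
import csv
import json
import urllib.parse
import urllib.request

# Assumed column choice; search result items carry other keys as well.
COLUMNS = ["global_id", "name", "published_at", "description"]


def fetch_page(base_url, query, start, per_page):
    """Request one page of dataset search results and return its 'data' object."""
    params = urllib.parse.urlencode(
        {"q": query, "type": "dataset", "start": start, "per_page": per_page}
    )
    with urllib.request.urlopen(f"{base_url}/api/search?{params}") as response:
        return json.load(response)["data"]


def search_all(base_url, query, per_page=100, fetch=fetch_page):
    """Yield every search result item, requesting pages until none are left."""
    start = 0
    while True:
        data = fetch(base_url, query, start, per_page)
        items = data["items"]
        yield from items
        start += len(items)
        if not items or start >= data["total_count"]:
            break


def write_csv(items, path):
    """Write the selected columns of each item; missing values become empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for item in items:
            writer.writerow(item)
```

Something like `write_csv(search_all("https://demo.dataverse.org", "climate"), "climate_datasets.csv")` would then produce the file, assuming the repository leaves its Search API open to anonymous requests.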

Hope this helps,

Phil




Julian Gautier

Sep 30, 2021, 11:05:10 AM
to Dataverse Users Community
Hey all,

Phil, can the Search API be used to get all of the metadata of each dataset? I've always seen that the Search API returns a few metadata fields (description, subject, keywords) but definitely not everything. See https://demo.dataverse.org/api/search?q=*&type=dataset.

Yashika Jain, are there particular Dataverse repositories whose "climate" datasets you're interested in? Many of the known Dataverse repositories make most of the needed APIs completely publicly available, but some block them off, even the Search API, with varying levels of access (some are available if you have a repository account, but some of those repositories don't let just anyone create an account).

Another thing I think you'll need to consider is which metadata fields you need. Most Dataverse repositories use the same metadata fields, and you can find information about those fields in the Dataverse User Guide at https://guides.dataverse.org/en/latest/user/appendix. Some repositories are customized to use additional fields. I collect info about those fields, for as many known Dataverse repositories as I can, and share it in the CSV file at https://dataverse.harvard.edu/file.xhtml?fileId=4965245&version=10.0.

As for using the APIs with Python, my usual workflow has three steps. I use the Dataverse Search API to get the persistent IDs of the datasets I'm interested in. I then use other API endpoints to get the JSON metadata of each of those datasets, plus info about the Dataverse repository's metadatablocks (basically the structure of the metadata fields). Finally, I run other Python scripts that use the metadatablock info to transform each dataset's JSON metadata file into CSV files, one CSV file for each field, which I then join as needed. All of these scripts are in my GitHub repo at https://github.com/jggautier/dataverse_scripts. (I'm not a developer, so the scripts could probably be better, but I use them a lot and I do my best to comment the code =)
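The JSON-to-CSV step can be sketched roughly as below. This is not the code from the repository linked above; it is a simplified illustration with made-up function names. It assumes the input is the version object from a dataset's JSON metadata, where `metadataBlocks` maps each block name to a list of `fields`, each field has `typeName`, `typeClass`, `multiple`, and `value`, and the children of a compound field are single-valued. Unlike the scripts described above, it reads the structure from the field entries themselves rather than from separate metadatablock info.

```python
from collections import defaultdict


def iter_field_rows(persistent_id, version_json):
    """Yield one flat row per value (or per subfield value for compound fields)."""
    for block_name, block in version_json.get("metadataBlocks", {}).items():
        for field in block.get("fields", []):
            # Wrap single values in a list so both cases share one loop.
            values = field["value"] if field.get("multiple") else [field["value"]]
            for index, value in enumerate(values):
                base = {
                    "persistent_id": persistent_id,
                    "block": block_name,
                    "field": field["typeName"],
                    "value_index": index,
                }
                if field.get("typeClass") == "compound":
                    # A compound value is a dict of child fields keyed by name.
                    for child_name, child in value.items():
                        yield {**base, "subfield": child_name, "value": child["value"]}
                else:
                    yield {**base, "subfield": "", "value": value}


def group_by_field(rows):
    """Group rows by field name, giving one list of rows (one table) per field."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["field"]].append(row)
    return dict(grouped)
```

Each list returned by `group_by_field` can be written to its own CSV file with `csv.DictWriter`, and the `persistent_id` column is what lets you join the per-field files back together afterwards.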

A while back I also wrote a script for the administrators of a data collection in the Harvard Dataverse Repository (which doesn't block any of the needed APIs) and shared it in the Google Colab notebook at https://colab.research.google.com/drive/1T33ERYLjVopaBz3c4FqGizpgHpCplsH1. That script returns CSV files for the three metadata fields they need to track ("dataset title", "subject", and "keyword") for the datasets in their collection. The script could be adjusted to use the Search API to find "climate" datasets, then return CSV files for each of the fields you're interested in.

Lastly, there was some development work done recently to make the Dataverse software provide dataset metadata in a JSON structure that's more "flat", and so much, MUCH easier to parse (such as for converting into CSV files). It's described in the GitHub issue at https://github.com/IQSS/dataverse/issues/6497, and the maintainer of the pyDataverse module has expressed interest in it, too, although I don't know the status of that. Because it's new, I think it's available only in Dataverse repositories using the latest version of the software (v5.6), and it might also be "superuser" only (so available only to administrators of the repository). Others in the community could share more info about this.

I hope this is all helpful!

Philip Durbin

Sep 30, 2021, 12:03:41 PM
to dataverse...@googlegroups.com
All of Julian's suggestions sound spectacular and should be investigated with all speed.

To respond to his question of "can the Search API return all the metadata fields?" the answer is no, BUT there is a pull request that seems to add this functionality (I haven't tried it myself) at https://github.com/IQSS/dataverse/pull/7942

Thanks,

Phil


Paul Boon

Oct 4, 2021, 9:39:12 AM
to dataverse...@googlegroups.com

Hi all,

It might be interesting to have a ‘download metadata as CSV’ option on the webpage showing the search results. That would allow non-programmers to get that metadata and inspect it in a spreadsheet application.

If more people want this functionality, maybe someone could create an issue for it.

Cheers,
Paul

Julian Gautier

Oct 4, 2021, 10:15:48 AM
to Dataverse Users Community
Agreed! The info and discussion in the issue at https://github.com/IQSS/dataverse/issues/6471 are also related.

Yashika Jain

Oct 4, 2021, 11:36:21 AM
to Dataverse Users Community
Hello Everyone,

Thank you so much for responding.
Thank you, Julian, for sharing the GitHub repository. But I am getting some errors while running the "get_dataset_PIDs.py" script from the https://github.com/jggautier/dataverse_scripts repository (as per the attached screenshot). I am a beginner and I am not able to resolve them.
Can someone please help me with this?

Thank you!
Best,
Yashika





Gautier, Julian

Oct 4, 2021, 11:53:47 AM
to dataverse...@googlegroups.com
I'd be glad to help. It'll be helpful to know: which Dataverse repository are you trying to get dataset metadata from?



--
Julian Gautier
Product Research Specialist, IQSS

Yashika Jain

Oct 4, 2021, 12:10:29 PM
to Dataverse Users Community
Sorry, I'm really unsure about what you mean by "repository".
I'm just trying to extract the metadata associated with the keyword "Climate".
Thank you

Julian Gautier

Oct 4, 2021, 12:40:57 PM
to Dataverse Users Community
Not a problem =)

By "repository" I mean the websites that you can visit to see the data from the different universities and other types of organizations that are using the Dataverse software. The Harvard Dataverse Repository at https://dataverse.harvard.edu is one of those repositories. The Université catholique de Louvain in Belgium has a separate repository at https://dataverse.uclouvain.be.

We know of about 70 repositories using the Dataverse software to preserve and share many types of data. The map at the bottom of https://dataverse.org shows their rough locations. And there's a list at https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo of the known repositories.

If I had to guess, I'd say that it doesn't matter to you which repository the data comes from, as long as it's related to climate research. Is that right?

The scripts I wrote, including the one you tried to use, use the Dataverse software APIs and let you search in specific repositories, so you have to know which repository you'd like to search in. If you'd like to search across all known Dataverse repositories, we'll have to think about how to do that, since the dataset metadata of all 70 Dataverse repositories isn't in any one system or search engine.

Gautier, Julian

Oct 4, 2021, 1:53:31 PM
to dataverse...@googlegroups.com
Hi again,

Another member of the Dataverse community, Sherry Lake, pointed out that you might be referring to Microsoft Dataverse, which is a completely different thing.

If you've already done some searching, could you share the website you're looking at?

Thanks!

Julian 

Yashika Jain

Oct 4, 2021, 3:08:21 PM
to dataverse...@googlegroups.com
Hi Julian,

Thank you for your reply!
I am looking to extract the metadata from The Harvard Dataverse Repository.




Julian Gautier

Oct 4, 2021, 4:09:23 PM
to Dataverse Users Community
Ah okay. Well, that makes things a little simpler! :) I was going to reply with a lot of suggestions, and I know the community could benefit from learning more about your use case, but I think some real-time communication would be better to help get the info you need for your project.

Sometime this week, could we talk over Zoom or your preferred voice chatting app? If so, I can email you directly to arrange a time.

Thanks!
Julian

Yashika Jain

Oct 5, 2021, 6:12:37 AM
to dataverse...@googlegroups.com
Hi Julian, 
Sure, we can discuss over Zoom.
Thank you for your time! 
