Harvest data sets from other repositories (Dataverse)

Manuel Podetti

Apr 24, 2024, 3:53:02 PM
to Dataverse Users Community
Hi everyone. From our repository we are trying to harvest datasets from other Dataverse repositories, but we are having trouble finding one whose OAI server is open. In this group we found the "List of Dataverse installation OAI-PMH (Harvesting) URLs and sets", but the URLs we tried either return an error or the OAI server does not appear to be open. Does anyone know of a Dataverse installation with an open OAI server that we could use for a test?

Thank you very much, regards

--
Manuel Podetti
SERVICIOS DIGITALES

Av. Italia 6201 - Edificio Los Nogales
Montevideo, Uruguay
T. (598) 26004411 - Int. 266

www.anii.org.uy

Sherry Lake

Apr 24, 2024, 4:45:01 PM
to dataverse...@googlegroups.com
I'm not sure about your version of the Dataverse software; there have been problems, and Jim or Phil may know more.

But at least UVA's and Harvard's OAI servers are working. Our Dataverse repository (UVA) harvests from Harvard's with no problem.


--
Sherry Lake

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/eef5881c-346f-4f81-8ec5-292f4bbf2286n%40googlegroups.com.

Philip Durbin

Apr 24, 2024, 5:09:14 PM
to dataverse...@googlegroups.com
Yes, Sherry's list is a good starting point.

On the map of installations we show harvesting sets that each installation has advertised. Most don't advertise any but some do. I'm putting a curl command below (look for non-null) to show the advertised sets.

By the way, if any installation would like to advertise a set (or change what's in the data below), please open an issue at https://github.com/IQSS/dataverse-installations

I hope this helps,

Phil

$ curl -s https://iqss.github.io/dataverse-installations/data/data.json | jq -r '.installations[] | [.hostname, .harvesting_sets]'
[
  "abacus.library.ubc.ca",
  [
    "abacus_open"
  ]
]
[
  "dataverse.theacss.org",
  null
]
[
  "dataverse.ada.edu.au",
  null
]
[
  "dadosdepesquisa.fiocruz.br",
  null
]
[
  "dataverse.asu.edu",
  null
]
[
  "data.aussda.at",
  [
    "all_published"
  ]
]
[
  "awf.rodbuk.pl",
  null
]
[
  "dmportal.biodata.pt",
  null
]
[
  "bonndata.uni-bonn.de",
  null
]
[
  "borealisdata.ca",
  [
    "sp_dataverse"
  ]
]
[
  "dataverse.bhp.org.bw",
  null
]
[
  "data.brin.go.id",
  null
]
[
  "dataverse.cbpf.br",
  null
]
[
  "opendata.cesa.edu.co",
  null
]
[
  "dataverse.cidacs.org",
  null
]
[
  "data.cifor.org",
  [
    "cifor_general"
  ]
]
[
  "data.cimmyt.org",
  [
    "cimmytdatadvn",
    "cimmytswdvn",
    "iwypdvn"
  ]
]
[
  "dataverse.cirad.fr",
  null
]
[
  "science-data.hu",
  null
]
[
  "dataverse.csuc.cat",
  null
]
[
  "datasets.coronawhy.org",
  null
]
[
  "data.crossda.hr",
  null
]
[
  "researchdata.cuhk.edu.hk",
  null
]
[
  "dados.ipb.pt",
  null
]
[
  "archaeology.datastations.nl",
  null
]
[
  "lifesciences.datastations.nl",
  null
]
[
  "phys-techsciences.datastations.nl",
  null
]
[
  "ssh.datastations.nl",
  null
]
[
  "dare.uol.de",
  null
]
[
  "dataverse.dartmouth.edu",
  null
]
[
  "darus.uni-stuttgart.de",
  null
]
[
  "dataverse.ird.fr",
  null
]
[
  "data.sciencespo.fr",
  null
]
[
  "dataportal.ing.pan.pl",
  null
]
[
  "datarepositorium.sdum.uminho.pt",
  null
]
[
  "dataspace.ust.hk",
  null
]
[
  "edatos.consorciomadrono.es",
  [
    "openaire_data"
  ]
]
[
  "dataverse.nl",
  null
]
[
  "dataverse.no",
  [
    "dataverseno"
  ]
]
[
  "dataverse.rhi.hi.is",
  null
]
[
  "dorel.univ-lorraine.fr",
  null
]
[
  "researchdata.ntu.edu.sg",
  null
]
[
  "dunas.ua.pt",
  null
]
[
  "edmond.mpdl.mpg.de",
  null
]
[
  "dataverse.fgv.br",
  null
]
[
  "dataverse.fiu.edu",
  null
]
[
  "dvn.fudan.edu.cn",
  null
]
[
  "dataverse.orc.gmu.edu",
  null
]
[
  "data.univ-gustave-eiffel.fr",
  null
]
[
  "data.goettingen-research-online.de",
  null
]
[
  "dataverse.harvard.edu",
  [
    "IQSS"
  ]
]
[
  "heidata.uni-heidelberg.de",
  [
    "heidata"
  ]
]
[
  "repositoriopesquisas.ibict.br",
  null
]
[
  "dataverse.icrisat.org",
  [
    "icrisat"
  ]
]
[
  "dataverse.mpi-sws.org",
  null
]
[
  "dataverse.iza.org",
  null
]
[
  "dataverse.ifdc.org",
  null
]
[
  "datasets.iisg.amsterdam",
  null
]
[
  "indata.cedia.edu.ec",
  null
]
[
  "dataverse.pushdom.ru",
  null
]
[
  "data.cipotato.org",
  null
]
[
  "dataverse.ipgp.fr",
  null
]
[
  "dataverse.iit.it",
  null
]
[
  "archive.data.jhu.edu",
  [
    "jhuda_all"
  ]
]
[
  "dataverse.jpl.nasa.gov",
  null
]
[
  "data.fz-juelich.de",
  null
]
[
  "keen.zih.tu-dresden.de",
  null
]
[
  "rdr.kuleuven.be",
  null
]
[
  "dataverse.lib.virginia.edu",
  [
    "UVA-Libra-Data"
  ]
]
[
  "lida.dataverse.lt",
  null
]
[
  "dataverse.acg.maine.edu/dvn",
  null
]
[
  "data.mel.cgiar.org",
  null
]
[
  "researchdata.nie.edu.sg",
  null
]
[
  "dataverse.nioz.nl",
  null
]
[
  "dataverse.lib.nycu.edu.tw",
  null
]
[
  "portal.odissei.nl",
  null
]
[
  "dataverse.uclouvain.be",
  null
]
[
  "dataverse.openforestdata.pl",
  null
]
[
  "osnadata.ub.uni-osnabrueck.de",
  null
]
[
  "papyrus-datos.co",
  null
]
[
  "opendata.pku.edu.cn",
  null
]
[
  "datos.pucp.edu.pe",
  null
]
[
  "data.qdr.syr.edu",
  [
    "qdr_whole"
  ]
]
[
  "entrepot.recherche.data.gouv.fr",
  [
    "ALL"
  ]
]
[
  "redape.dados.embrapa.br",
  null
]
[
  "redata.anii.org.uy",
  null
]
[
  "dataverse.unr.edu.ar",
  null
]
[
  "datos.uchile.cl",
  null
]
[
  "datos.unlp.edu.ar",
  null
]
[
  "research-data.urosario.edu.co",
  null
]
[
  "datav.udec.cl",
  null
]
[
  "repositoriodedados.unifesp.br",
  null
]
[
  "dataverse.ufabc.edu.br",
  null
]
[
  "dataverse.ileel.ufu.br",
  null
]
[
  "repositorio.polen.fccn.pt",
  null
]
[
  "dadosabertos.rnp.br",
  null
]
[
  "soildata.mapbiomas.org/",
  null
]
[
  "rodbuk.pl",
  null
]
[
  "agh.rodbuk.pl",
  null
]
[
  "pk.rodbuk.pl",
  null
]
[
  "uek.rodbuk.pl",
  null
]
[
  "uj.rodbuk.pl",
  null
]
[
  "dataverse.rsu.lv",
  null
]
[
  "data.scielo.org",
  null
]
[
  "sodha.be",
  null
]
[
  "datahub.tec.mx",
  null
]
[
  "dataverse.tdl.org",
  [
    "TDR"
  ]
]
[
  "planetary-data-portal.org",
  null
]
[
  "dataverse.ucla.edu",
  null
]
[
  "uken.rodbuk.pl",
  null
]
[
  "dataverse.lib.unb.ca",
  null
]
[
  "dataverse.unc.edu",
  [
    "odum_all"
  ]
]
[
  "dataverse.lib.umanitoba.ca",
  null
]
[
  "dataverse.unimi.it",
  null
]
[
  "dataverse.vtti.vt.edu",
  null
]
[
  "data.worldagroforestry.org",
  null
]
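
For anyone who prefers to do the "look for non-null" step in a script, here is a minimal Python sketch. It assumes the same structure the jq filter relies on (a top-level "installations" list whose entries have "hostname" and "harvesting_sets"). The sample data is a small excerpt copied from the output above, and the function name advertised_sets is just an illustrative choice.

```python
import json

# Small excerpt of the data.json structure, copied from the output above.
SAMPLE = json.loads("""
{"installations": [
  {"hostname": "abacus.library.ubc.ca", "harvesting_sets": ["abacus_open"]},
  {"hostname": "dataverse.theacss.org", "harvesting_sets": null},
  {"hostname": "data.cimmyt.org",
   "harvesting_sets": ["cimmytdatadvn", "cimmytswdvn", "iwypdvn"]},
  {"hostname": "redata.anii.org.uy", "harvesting_sets": null}
]}
""")


def advertised_sets(data):
    """Return {hostname: [set names]} for installations that advertise sets.

    Entries whose harvesting_sets is null (None) or an empty list are skipped.
    """
    return {
        inst["hostname"]: inst["harvesting_sets"]
        for inst in data.get("installations", [])
        if inst.get("harvesting_sets")
    }


if __name__ == "__main__":
    for host, sets in advertised_sets(SAMPLE).items():
        print(host, ", ".join(sets))
```

To run it against the live file, the same function could be given the parsed contents of the data.json URL used in the curl command.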



--

juan...@gmail.com

Apr 25, 2024, 5:36:19 AM
to Dataverse Users Community
Hi Manuel.

  I think the OAI servers are open, but Dataverse returns a 500 error when you access an OAI URL without an OAI verb.
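
  As a rough illustration, the sketch below builds a test URL that always includes a verb. It assumes the OAI endpoint lives at /oai on the installation's hostname, as in the URLs quoted in this thread; Identify and ListSets are standard OAI-PMH verbs, and the helper name oai_url is hypothetical.

```python
from urllib.parse import urlencode


def oai_url(hostname, verb="Identify"):
    """Build an OAI-PMH request URL that always carries a 'verb' argument.

    Assumes the endpoint is https://<hostname>/oai, which may not hold
    for every installation.
    """
    base = "https://" + hostname.rstrip("/") + "/oai"
    return base + "?" + urlencode({"verb": verb})


if __name__ == "__main__":
    print(oai_url("demo.dataverse.org"))
    print(oai_url("edatos.consorciomadrono.es", "ListSets"))
```

  Opening either printed URL in a browser or with curl is a quick way to check whether a server answers OAI requests.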

 
 Regards,

Juan

Manuel Podetti

Apr 25, 2024, 1:05:57 PM
to Dataverse Users Community
Thank you very much, Sherry, Phil, and Juan, for the responses.

We are already running the tests.

Regards

Philip Durbin

Apr 25, 2024, 4:34:27 PM
to dataverse...@googlegroups.com
I just wanted to follow up on Juan's comment about going to https://edatos.consorciomadrono.es/oai with no arguments and seeing a 500 error. This was a bug that we fixed in Dataverse 6.2 with this pull request: https://github.com/IQSS/dataverse/pull/10205

The demo server has already been upgraded to 6.2 so if you go to https://demo.dataverse.org/oai you should see this more meaningful error ("badVerb"):

<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2024-04-25T20:31:45Z</responseDate>
  <request>https://demo.dataverse.org/oai</request>
  <error code="badVerb">No argument 'verb' found</error>
</OAI-PMH>
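
If you are scripting your tests, a response like the one above can be checked for OAI-PMH errors. This is only a sketch using Python's standard library; the function name oai_errors is illustrative, and the sample text is the response copied from this message.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses (taken from the response above).
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2024-04-25T20:31:45Z</responseDate>
  <request>https://demo.dataverse.org/oai</request>
  <error code="badVerb">No argument 'verb' found</error>
</OAI-PMH>
"""


def oai_errors(xml_text):
    """Return a list of (code, message) pairs for each <error> element."""
    root = ET.fromstring(xml_text)
    return [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall("oai:error", OAI_NS)
    ]


if __name__ == "__main__":
    for code, message in oai_errors(RESPONSE):
        print(code, "-", message)
```

An empty list means the response carried no OAI-PMH error element.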

Thanks,

Phil

Manuel Podetti

Apr 26, 2024, 11:25:14 AM
to Dataverse Users Community
Thanks, Phil, for the updates.

We are already running tests based on the information you all gave us.

Julian Gautier

Apr 26, 2024, 2:25:36 PM
to Dataverse Users Community
Hi Manuel,

Dataverse is also able to harvest from repositories without needing to specify a harvesting set, in which case all published datasets are harvested. Harvard Dataverse harvests from 5 other Dataverse-based repositories, such as DataverseNL, without specifying a set.

So if we use the "harvesting_sets" column in the data that Phil mentioned to see which Dataverse repositories make their datasets harvestable, we miss these cases where repositories have enabled harvesting and haven't created harvesting sets.

The Google Sheet at https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo is another way to see the data that Phil's curl command is querying.

About testing harvesting: Harvard Dataverse tries to harvest from 21 other repositories that use Dataverse and we've been working through technical issues that prevent the repository from harvesting many or all of the datasets from about half of those repositories.

And we've also been considering policies to help us mitigate these technical issues, such as harvesting less rich metadata formats, like oai_dc, from repositories that are using less-recent versions of Dataverse.
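
At the protocol level, two of the points above (harvesting without a set, and falling back to oai_dc) both come down to the arguments of an OAI-PMH ListRecords request: metadataPrefix is required and set is optional. Here is a minimal Python sketch, assuming the endpoint is at /oai on each host as in the URLs earlier in this thread. The helper name list_records_url is hypothetical, and this only illustrates the request itself, not how Dataverse's harvesting client is configured.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_name=None):
    """Build a ListRecords URL; the 'set' argument is added only if given."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name is not None:
        # With no set, the request covers everything the server exposes.
        params["set"] = set_name
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # No set specified (assumed /oai endpoint for DataverseNL).
    print(list_records_url("https://dataverse.nl/oai"))
    # Restricted to an advertised set from the list earlier in the thread.
    print(list_records_url("https://dataverse.harvard.edu/oai", set_name="IQSS"))
```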

These policy discussions reminded me of cases where managers of Harvard Dataverse either entered into more formal agreements to harvest metadata, like the Data-PASS project, or were encouraged to enter into a more formal agreement. For example, the folks who manage the Survey Research Data Archive asked us about signing a memorandum of understanding where we would agree to maintain the technologies that let us share each other's metadata. We thought that wasn't necessary, but I've started to see the merit in it.

Lastly, the community's been talking in our Zulip chat about making it easier for repository administrators to detect and troubleshoot harvesting issues. I think your testing could help inform that discussion :)

Julian

Julian Gautier (he/him)
Product Research Specialist, IQSS
Interested in helping test Dataverse? Sign up for usability testing

Manuel Podetti

May 3, 2024, 4:02:13 PM
to Dataverse Users Community
Hi Julian

Thanks for the detailed information on harvesting from other Dataverse installations. We appreciate you taking the time to explain the complexities involved.

We understand the challenges of establishing policy in a diverse community. Establishing agreements on paper provides institutional support, but, at least in my experience, it does not guarantee that these actions will be carried out. Still, I think agreements are important, especially to facilitate continuity when the people in charge of managing the repositories change.

Regarding the conversation in the Zulip chat, we have not been able to finish documenting the process. Our developer is currently sick and we have paused the work, but as soon as they recover I will try to summarize it and share the information there.

Thanks again for your recommendations

Manuel