Issues with Google Dataset Search metadata harvesting from Dataverse-powered repositories

Philipp Conzett

Jul 6, 2023, 2:17:42 AM
to Dataverse Users Community
I just discovered that DataverseNO metadata doesn't seem to be harvested by Google Dataset Search anymore.

I've run four searches, and none of the expected DataverseNO datasets were on the result list:

spatial changes sediment fram strait
Expected result: https://doi.org/10.18710/GUX2O8

genusvariasjon
Expected result:  https://doi.org/10.18710/MTEQYP

oil leakage from sub-marine Arctic reservoirs
Expected result:  https://doi.org/10.18710/I3L0BQ

carboniferous rocks kongsfjorden
Expected result:  https://doi.org/10.18710/APGAWL


I ran similar searches for datasets published in the Harvard Dataverse. In two cases, I found the dataset on the result list, whereas in the other two cases, I could not find the dataset on Google Dataset Search:

Turkana Food
Expected result:  https://doi.org/10.7910/DVN/CU69YZ
Not found in Google Dataset Search.

Anti-Corruption Campaigns
Expected result:  https://doi.org/10.7910/DVN/40UUKA
Not found in Google Dataset Search.

Anacostia River nutrient
Result: https://doi.org/10.7910/DVN/IOAHBP
Found in Google Dataset Search.

Gentrification pioneer businesses
Result: https://doi.org/10.7910/DVN/WPAQNJ
Found in Google Dataset Search.

Does anyone have an explanation of this behavior? Is there a way we can improve metadata harvesting by Google Dataset Search?

Thanks!
Philipp

Geneviève Michaud

Jul 6, 2023, 3:19:41 AM
to dataverse...@googlegroups.com
Hi Philipp,

I've had mixed results with this search engine before.
data.sciencespo is currently on the same Dataverse version. What I can say is that some datasets are found (even recently published ones), while others are not.

found:
travail décent

not found:
Monde d’Avant, Monde d’Après

found:
pratiques numériques on data.sciencespo

not found:
radicalité religieuse

Somehow DataCite Commons seems to be a good predictor for the search. If a dataset is not found there, then Google Dataset Search does not find it either ...
I'm afraid I've got no clue, but I would definitely be interested in this topic.

Geneviève

Geneviève Michaud
CDSP - UAR 828 Sciences Po - CNRS

Centre de Données Socio-Politiques

27, rue Saint-Guillaume
75337 Paris cedex 07
Téléphone : +33 (0)1 45 49 72 83




Vaidas Morkevičius

Jul 6, 2023, 3:22:23 AM
to dataverse...@googlegroups.com
We at LiDA (lida.dataverse.lt) have also noticed this strange behavior. In our case, some datasets are present and others are not, and we have not found a pattern or explanation for which datasets appear in Google Dataset Search.

Best,
Vaidas

Philip Durbin

Jul 13, 2023, 4:26:04 PM
to dataverse...@googlegroups.com
If you go to https://search.google.com/test/rich-results and type in the URL for your dataset, what do you see?

I'm seeing "blocked by robots.txt" for this URL, for example: https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/GUX2O8

(You have to click the arrow next to "crawl failed" to see the details.)

I'm attaching a screenshot.

Thanks,

Phil

p.s. Thanks to a Googler I emailed for this tip!
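
The "blocked by robots.txt" check described above can also be approximated locally. The following is a minimal sketch using Python's standard-library urllib.robotparser; it is not part of Dataverse or of Google's tooling. The two robots.txt texts are hypothetical examples, not the actual DataverseNO file, and the helper name is_crawlable is made up for illustration. Python's parser applies the first matching rule, which may differ from how Googlebot resolves conflicting rules, so the Rich Results Test remains the authoritative check.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks every crawler from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that allows dataset landing pages and blocks the rest.
ALLOW_DATASETS = """\
User-agent: *
Allow: /dataset.xhtml
Disallow: /
"""

# Dataset URL taken from the message above.
DATASET_URL = "https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/GUX2O8"

if __name__ == "__main__":
    print(is_crawlable(BLOCK_ALL, DATASET_URL))
    print(is_crawlable(ALLOW_DATASETS, DATASET_URL))
```

To check a live installation instead of a pasted text, RobotFileParser also offers set_url() and read(), which fetch the robots.txt file over the network before can_fetch() is called.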



--
Screen Shot 2023-07-13 at 4.18.28 PM.png

Philipp Conzett

Jul 14, 2023, 1:04:21 AM
to Dataverse Users Community
Thanks Phil and the Googler! I had my suspicions about the robots.txt file and just found out we had the same issue back in 2019... We simply forgot to include updating the robots.txt file in our upgrade script; see https://guides.dataverse.org/en/latest/installation/config.html#letting-search-engines-crawl-your-installation.

Best, Philipp