The new Google Dataset Search


Mercè Crosas

Sep 5, 2018, 1:18:31 PM
to dataverse...@googlegroups.com
Of interest to the Dataverse community:


Best,
Merce

----------
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard University

Shu Wen Chew

Sep 6, 2018, 1:30:23 AM
to dataverse...@googlegroups.com
Hi Mercè,

Many thanks for alerting us to this development. This new Dataset Search will definitely encourage more faculty to share their data now that it's more discoverable on Google.

The datasets from Dataverse do show up nicely in Dataset Search. However, I noticed that author names are missing from the dataset records indexed from our Dataverse installation at NTU, whereas they are available on the dataset records from Figshare.

It seems like the issue is due to how the authors are being presented in the JSON-LD schema:

Dataverse:

image.png

Figshare:

image.png

Are there plans to update the JSON-LD schema for Dataverse so that the dataset creators' / authors' names could be displayed in the Google Dataset Search's records?

Cheers,
Shu Wen

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/CAPAYmDM6g%3DoxHJ-ePnzXZyJzLYE7PpwriUrXRznu3myWayf4Gg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Philip Durbin

Sep 6, 2018, 7:00:33 AM
to dataverse...@googlegroups.com
Originally we had "@type": "Person" in the JSON-LD output but in Dataverse it's possible to have organizations as authors ("Gallup Organization", "Geological Survey (U.S.)", etc.) so we took it out. Please see discussion in these two places (I'll attach screenshots):


Shu Wen (or anyone else who is interested), a good way to help would be to create a new GitHub issue about this. I *think* the solution is to provide a way to indicate if an author is a person or an organization. Thanks for bringing this to our attention!
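
A minimal sketch of that idea, written in Python rather than taken from Dataverse's actual export code: each author carries an explicit person/organization flag, and the JSON-LD entry gets "@type": "Person" or "@type": "Organization" accordingly. The `is_organization` flag and both helper functions are hypothetical names for illustration, and using the `creator` property is an assumption about what Dataset Search reads.

```python
import json


def creator_entry(name, is_organization=False):
    """Build one schema.org creator entry with an explicit @type.

    `is_organization` is a hypothetical flag; the thread notes that
    Dataverse did not record this distinction at the time.
    """
    return {
        "@type": "Organization" if is_organization else "Person",
        "name": name,
    }


def dataset_jsonld(title, authors):
    """Build a small Dataset record; authors is a list of (name, is_organization)."""
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": title,
        "creator": [creator_entry(name, org) for name, org in authors],
    }


doc = dataset_jsonld(
    "Example dataset",
    [("Doe, Jane", False), ("Gallup Organization", True)],
)
print(json.dumps(doc, indent=2))
```

The point of the flag is that the type is never guessed from the name string, so an organization such as "Gallup Organization" is not mislabeled as a person.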

Phil

Screen Shot 2018-09-06 at 6.51.12 AM.png
Screen Shot 2018-09-06 at 6.50.42 AM.png

Mercè Crosas

Sep 6, 2018, 8:17:01 AM
to dataverse...@googlegroups.com
Hi Shu Wen,

Thanks for bringing up the subject. I'd like to review this with the team to see if it makes sense to change it in a way that recognizes authors in the search records. I do think that this is important for authors, and therefore it's important for us.

Merce

----------
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard University


Sherry Lake

Sep 6, 2018, 8:45:09 AM
to Dataverse Users Community
I've been watching Twitter and a few Slack channels about this new search from Google. According to Google's info page, Google Dataset Search can use schema.org metadata, but so far, at least, it seems to be using DataCite metadata. I have not done an exhaustive test, but this seems to be the consensus on Twitter. I have no idea how it manages duplicate sources for the "same" dataset.

And another thing about the Google Dataset Search "sources": there is NO provenance. A highly cited dataset from Stanford appears in the search as being from "Kaggle". It's only on Kaggle because the original source data was used on a project there.

I want to keep my eye out to see how things shake out.

--
Sherry

Mercè Crosas

Sep 6, 2018, 9:55:43 AM
to dataverse...@googlegroups.com
That's a good point, Sherry. I think that you are right, this might be the case for now, but we'll explore more. One of the things we need to improve is how much metadata we send to DataCite at the time of minting a DOI and publishing a dataset.


----------
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard University


Sebastian Karcher

Sep 6, 2018, 11:16:01 AM
to dataverse...@googlegroups.com
Hi Sherry,

At least for our entries, it picks up the schema.org metadata directly from our Dataverse installation[1] (without authors, so I agree that'd be good to fix) and then from DataCite for some old (pre-DV) items, which it doesn't dedupe correctly. According to Martin Fenner, DataCite is only used as a fallback option. Since Google also uses the schema.org metadata there, and does get authors, it shouldn't be hard to check what they do.

And yes, Merce, I'd love to improve the metadata sent to DataCite from DV. It's on my list to look at when I come back from leave.

Sebastian


--
Sebastian Karcher, PhD
www.sebastiankarcher.com

Mercè Crosas

Sep 6, 2018, 11:18:32 AM
to dataverse...@googlegroups.com
Yes, I checked with Martin Fenner too this morning and this seems to be correct.


Pete Meyer

Sep 6, 2018, 11:40:50 AM
to Dataverse Users Community
Hi everyone,

One open question I've had about schema.org is how to validate the JSON-LD produced by all the datasets in a repository prior to deployment. The only suggestions I'm aware of are copying and pasting into Google's structured data tool (which doesn't seem to scale very well) or going live and checking what search engines see (which is post-deployment). Is anyone aware of better options?
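
One possible direction, sketched here under assumptions rather than offered as an established tool: render the dataset pages on a staging instance, pull out each JSON-LD block, and run structural checks in bulk. The `REQUIRED` key list and the `check_page` helper are hypothetical, and the checks only cover structure (parseable JSON, expected keys, typed creators), not full schema.org validity.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so concatenate.
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


# Assumed minimum for a Dataset record; adjust to your own policy.
REQUIRED = ("@context", "@type", "name", "description")


def check_page(html):
    """Return a list of problems found in one dataset page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ["no JSON-LD block found"]
    problems = []
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append(f"invalid JSON: {err}")
            continue
        if not isinstance(doc, dict):
            problems.append("top-level JSON-LD value is not an object")
            continue
        problems += [f"missing {key}" for key in REQUIRED if key not in doc]
        creators = doc.get("creator", [])
        if isinstance(creators, dict):
            creators = [creators]
        for creator in creators:
            if isinstance(creator, dict) and "@type" not in creator:
                problems.append(f"creator without @type: {creator.get('name')}")
    return problems
```

Looping `check_page` over the HTML of every dataset page on a test server would give a pre-deployment report without pasting pages into a web tool one at a time.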

I'm on board with sending more metadata to the DOI system too - it's been on our list for a while as well.

Best,
Pete



Philip Durbin

Sep 7, 2018, 7:36:23 AM
to dataverse...@googlegroups.com
I just copied my comment about "@type": "Person" into the issue that Shu Wen just opened (thanks!): https://github.com/IQSS/dataverse/issues/5029 - Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's records


julian...@g.harvard.edu

Sep 17, 2018, 10:33:14 AM
to Dataverse Users Community
Here's a list of known Dataverse installations whose dataset pages have or should have schema.org (some sites are down so I couldn't check) and info about:
  • if their robots.txt is blocking Google from crawling their sites
  • if any of their datasets are indexed in Google's Dataset Search
  • if Google's Dataset Search isn't picking up author names
No author names are displayed for any of the installations' datasets. Some installations' datasets are missing the dataset PIDs (see the notes column).

Of the nine installations with robots.txt files telling Google not to crawl, seven don't seem to be indexed by Dataset Search. (Datasets from the other two installations, CIFOR and Libra Data, are indexed and Google is finding these installations' schema.org metadata somehow. I searched for "cifor" and "university of virginia dataverse".)
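
For anyone who wants to repeat the robots.txt part of this check, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the `dataverse.example.edu` URL are made-up examples, not taken from any real installation. Note that Python's parser applies the first matching rule in file order, which is an approximation of how Google evaluates rules, so treat the result as a hint rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url):
    """Report whether the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# Hypothetical robots.txt that blocks every crawler from the whole site.
blocking = "User-agent: *\nDisallow: /\n"

# Hypothetical robots.txt that exposes dataset pages and blocks the rest.
open_to_datasets = "User-agent: *\nAllow: /dataset.xhtml\nDisallow: /\n"

url = "https://dataverse.example.edu/dataset.xhtml?persistentId=doi:10.5072/FK2/ABCDEF"

print(googlebot_allowed(blocking, url))
print(googlebot_allowed(open_to_datasets, url))
```

To test a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the file over the network.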

A GitHub issue to continue work on schema.org metadata is being updated (https://github.com/IQSS/dataverse/issues/4371).

julian...@g.harvard.edu

Sep 20, 2018, 1:23:34 PM
to dataverse...@googlegroups.com
Hey Pete,

Someone at a meeting reviewing what JSON-LD and schema.org are just shared a link to a project that I think seeks to do what you're talking about: https://github.com/chharvey/schemaorg-jsd

Pete Meyer

Sep 20, 2018, 2:01:41 PM
to Dataverse Users Community
Hi Julian,



Thanks! It looks like it might take a little work to integrate, but this looks like exactly what I've been looking for.

Best,
Pete
 

James Turitto

Sep 21, 2018, 2:20:27 AM
to Dataverse Users Community
In addition to the lack of author names mentioned above, we've noticed that not all of the datasets we have posted in the Harvard installation appear in Google Dataset Search. When I first searched last week, none of our datasets appeared, but this week a number of them appear, though not all. Have others had this same issue?

Philip Durbin

Sep 21, 2018, 9:29:36 AM
to dataverse...@googlegroups.com
We're hoping that adding a sitemap to Dataverse installations will help. https://github.com/IQSS/dataverse/issues/4261 is in the current sprint. We discussed sitemaps briefly during a sprint planning meeting on Wednesday and realized that we don't have a lot of in-house knowledge about sitemaps, so if there are people in the community who have implemented them before, please get in touch. If not, we'll figure it out. :)
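
For readers who haven't worked with sitemaps, here is a minimal sketch of what generating one involves, following the sitemaps.org protocol (a `urlset` of `url` entries, each with a `loc` and an optional `lastmod`). This is not how Dataverse implements it; `build_sitemap`, the example URL, and the date are all hypothetical. It needs Python 3.8 or later for the `xml_declaration` argument.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as UTF-8 bytes.

    entries: iterable of (url, last_modified) pairs, with last_modified
    given as a YYYY-MM-DD string.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical published dataset on a made-up installation.
example_entries = [
    (
        "https://dataverse.example.edu/dataset.xhtml?persistentId=doi:10.5072/FK2/ABCDEF",
        "2018-09-21",
    ),
]

print(build_sitemap(example_entries).decode("utf-8"))
```

The resulting file is typically served from the site root and referenced from robots.txt with a `Sitemap:` line so crawlers can find every dataset page without following links.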

Anyway, using Google Dataset Search I can't find the dataset I deposited in Harvard Dataverse either ( https://doi.org/10.7910/DVN/TJCLKP ), so I feel your pain. I'm glad to hear that some of your datasets are now appearing that weren't before.

Phil


danny...@g.harvard.edu

Sep 21, 2018, 12:47:34 PM
to Dataverse Users Community
Hi James,

We adjusted the robots.txt file on dataverse.harvard.edu late last week to align it with what we recommend in the Dataverse Guides. The incorrect robots.txt file was likely the reason that datasets were not initially appearing and is likely the reason that they are now appearing bit by bit. I'd expect them to all appear soon, as Google's crawlers do their work. The sitemap that Phil mentions will be helpful here as well. 

Thanks,

Danny

James Turitto

Sep 24, 2018, 11:32:28 PM
to Dataverse Users Community
Thanks Phil and Danny. This is helpful. 