Anyone having recent trouble with google indexing? (robots.txt)


sebastiank...@u.northwestern.edu

Dec 10, 2023, 4:37:20 PM12/10/23
to Dataverse Users Community
We're having mysterious issues with Google claiming our datasets are blocked from indexing by robots.txt. We originally had the default robots.txt active when this happened; we have since added

Disallow: /dataset.xhtml*jsessionid=

to remove some of the noise in the crawling (the nofollow recently added to facets should also help here).
Our full robots.txt is here:

Here's one of the URLs Google refused to crawl yesterday (for the 2nd time, so this doesn't appear to be a glitch):
https://data.qdr.syr.edu/dataset.xhtml?persistentId=doi%3A10.5064%2FF6BUAX58&version=&q=&fileTypeGroupFacet=%22Data%22&fileAccess=&tagPresort=true&folderPresort=true

When we test this URL against the live robots.txt with Google's own robots.txt tester, Google shows it as allowed for crawling. This has led to every single one of our datasets being unlisted from Google Dataset Search (they have flawless JSON-LD and were listed previously).
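Outside Google's own tester, robots.txt rules can also be sanity-checked locally with Python's standard library. A minimal sketch (host and paths are made up); one caveat worth knowing: the stdlib parser does plain prefix matching and does not implement Google's `*` wildcard extension, so a wildcard rule like the jsessionid one above cannot be tested this way.

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules from a string and test URLs locally.
# Caveat: urllib.robotparser does plain prefix matching only; it does NOT
# implement Google's '*' wildcard extension.
rules = """\
User-agent: *
Disallow: /search
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A dataset landing page (hypothetical host) matches no Disallow prefix...
allowed = rp.can_fetch("Googlebot", "https://repo.example.org/dataset.xhtml?persistentId=doi:10.1234/ABC")
# ...while a search page does.
blocked = rp.can_fetch("Googlebot", "https://repo.example.org/search?q=data")
print(allowed, blocked)  # True False
```

This only checks rule matching, of course; it says nothing about what Googlebot actually saw when it fetched the page.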

Is this just us, or are others affected? And does anyone have suggestions?

Thanks so much in advance,
Sebastian

Kris Dekeyser

Dec 11, 2023, 5:10:49 AM12/11/23
to Dataverse Users Community
Hi Sebastian,

Our robots.txt looks basically the same (https://rdr.kuleuven.be/robots.txt), and that seems to work. We do have a sitemap reference in the robots file, and every night we regenerate the sitemap and resubmit it (if it has changed).
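For anyone wanting to replicate this: Dataverse exposes an admin endpoint to regenerate the sitemap (see the installation guide's search engine optimization section for your version), so the nightly update can be a simple cron job. A sketch, assuming the admin API is reachable on localhost:8080:

```
# Hypothetical crontab entry: regenerate the Dataverse sitemap at 02:00 nightly.
# POST /api/admin/sitemap asks Dataverse to rebuild sitemap.xml.
0 2 * * * curl -s -X POST http://localhost:8080/api/admin/sitemap
```

Resubmitting to Google is a separate step, done via Search Console or its API.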

Kris

Don Sizemore

Dec 11, 2023, 7:46:37 AM12/11/23
to dataverse...@googlegroups.com
Sebastian,

I don't know about recent trouble, but you're echoing my experience.
At one time I shamelessly stole Leonid's robots.txt from Harvard, and Googlebot still complained it was disallowed after each index.
I have a hazy memory of un-encoded quotes or brackets in the URL parameters returned by Dataverse causing errors back when Google _did_ index us.
My personal feeling is that the bulk of our robots.txt troubles could be avoided if Googlebot would simply respect Crawl-delay.
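For reference, the directive goes in robots.txt like this; Google has documented that Googlebot ignores Crawl-delay (some other crawlers, such as Bing's, honor it), which is exactly the frustration here:

```
User-agent: *
# Ask crawlers to wait ~10 seconds between requests (ignored by Googlebot)
Crawl-delay: 10
```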

Best of luck and let us know how it goes,
Don


Sherry Lake

Dec 11, 2023, 8:46:17 AM12/11/23
to dataverse...@googlegroups.com
I haven't gotten any Google dashboard errors, or actually any emails from Google, in a while... it might be since we migrated to the cloud?

Anyway, I just checked our datasets on Google Dataset Search and see only one!

Not sure where the rest went. Our sitemap is accessible, but I just noticed there is nothing there for "2023"? And I am sure most of my datasets were in our Google Dataset Search results earlier...
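A quick way to check what a sitemap actually contains (for example, whether any entries carry a 2023 lastmod) is to parse it locally. A sketch with a made-up two-entry sitemap; in practice you would feed this the sitemap.xml fetched from your own installation:

```python
import xml.etree.ElementTree as ET

# Parse a sitemap and summarize it: entry count and newest <lastmod>.
# The two-entry XML below is invented for illustration.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://repo.example.org/dataset.xhtml?persistentId=doi:10.1234/AAA</loc><lastmod>2023-01-15</lastmod></url>
  <url><loc>https://repo.example.org/dataset.xhtml?persistentId=doi:10.1234/BBB</loc><lastmod>2023-11-02</lastmod></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)
lastmods = [u.findtext("sm:lastmod", namespaces=ns) for u in root.findall("sm:url", ns)]
print(len(lastmods), max(lastmods))  # 2 2023-11-02
```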

Off to investigate.

--
Sherry

Sherry Lake

Dec 11, 2023, 9:58:37 AM12/11/23
to dataverse...@googlegroups.com
One problem (there might be more, but first things first) is that my sitemap is not up to date. Reading this issue probably explains why (I never looked at the logs to see if the sitemap generation had completed):

Then I'll step through the other parts; it has been a while since I set Google Dataset Search up.

--
Sherry

sebastiank...@u.northwestern.edu

Dec 14, 2023, 1:34:48 PM12/14/23
to Dataverse Users Community
To close this out from our end: this did not end up being directly Dataverse-related. We do a couple of redirects to the IdP to check login status when you go to a Dataverse page, and Googlebot didn't like that. We've now disabled the redirects for bots and related Google user agents, and things are back to normal.
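For others hitting the same thing, the shape of the fix is just a user-agent check in front of the IdP redirect. A minimal sketch (the bot list and function name are illustrative, not QDR's actual code):

```python
import re

# Skip the IdP login-status redirect when the request comes from a known
# crawler; ordinary browsers still get redirected. The pattern below is a
# hypothetical, non-exhaustive bot list.
BOT_PATTERN = re.compile(r"googlebot|google-inspectiontool|bingbot|duckduckbot", re.I)

def should_redirect_to_idp(user_agent):
    """Redirect ordinary browsers to the IdP; never redirect crawlers."""
    return not BOT_PATTERN.search(user_agent or "")

print(should_redirect_to_idp("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
print(should_redirect_to_idp("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"))  # True
```

The same check is often done at the web-server layer (Apache/nginx) instead of in application code.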

Sherry Lake

Dec 14, 2023, 3:48:19 PM12/14/23
to dataverse...@googlegroups.com
I thought I had things set up correctly, but Google Search Console doesn't like the URL of the sitemap:

(which is what I have in my robots.txt, as well as everyone else I copied from ;-) )

When I use that sitemap URL in Google Search Console, it can't be fetched.

it's found and read. Maybe things changed with the URL because we are on AWS now?

So I am off to change the sitemap URL in my robots.txt... I assume that's what I need to do?

--
Sherry Lake

James Myers

Dec 15, 2023, 5:52:40 PM12/15/23
to dataverse...@googlegroups.com
On behalf of the community, the GDCC is pleased to announce the v1.4 release of the community-maintained Dataverse Previewers.

This release has two important changes:

More Previewers! Since the v1.3 release, the community has added previewers for:

  • Geospatial content (GeoJSON),
  • ZIP files (.zip, .eln),
  • NetCDF and HDF files and NcML metadata,
  • Markdown-formatted text,
  • Shapefiles,
  • GeoTIFF,
  • Rich HTML (HTML containing scripts that must run for proper display, e.g. for content generated with Plotly), and
  • RO-Crate files.

(Please read the notes in the example configuration command pages for details about these previewers and their Dataverse version compatibility and limitations.)
Support for Signed URLs. As of Dataverse v6.1, signed URLs are supported as an alternative to sending an API key to external tools, including previewers. (Signed URLs were introduced in Dataverse v5.13 but do not work when datasets are accessed via Private URLs until v6.1). Signed URLs are short-lived and are specific to the API calls Previewers need to read and display dataset metadata and file contents.
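For those new to the concept, the general idea behind signed URLs is that instead of handing an external tool an API key, the server appends an expiry time and an HMAC over the URL, making the link short-lived and tamper-evident. A generic illustration of that idea (NOT Dataverse's actual signing scheme or parameter names):

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Illustrative signed-URL sketch: parameter names and key are hypothetical.
SECRET = b"server-side-secret"  # held only by the server, never sent out

def sign_url(base, expires_at):
    # Sign the full URL including its expiry, so neither can be altered.
    unsigned = f"{base}?{urlencode({'until': expires_at})}"
    token = hmac.new(SECRET, unsigned.encode(), hashlib.sha256).hexdigest()
    return f"{unsigned}&token={token}"

url = sign_url("https://demo.example.org/api/datasets/42/metadata", expires_at=1700000300)
print(url)
```

The server validates a request by recomputing the HMAC over the unsigned portion and checking that the expiry has not passed.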

Updating Your Dataverse Installation:

There are multiple ways to install/update your Previewers. The Dataverse Previewers repository has detailed instructions in the README file. In brief,

If you are using a local copy, you should download and install the latest release and then update your local Dataverse configurations. If you are using the Previewers served from gdcc.github.io, you only need to update your configurations.

To update your configurations, you can either delete the configurations for the v1.3 (or earlier) previewers you are using and run the example curl commands to add the v1.4 tools or you can update your database directly.

If you wish to continue to use API keys, use the curl commands in the Dataverse 5.2+ examples file in the repository.

If you wish to switch to Signed URLs (recommended), use the curl commands in the Dataverse 6.1+ examples file in the repository.
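For concreteness, these configuration updates go through the Dataverse admin API. The commands below are a schematic sketch (the tool id and manifest filename are made up); the exact JSON to upload is in the examples files in the Dataverse Previewers repository:

```
# List registered external tools to find the ids of old v1.3 previewers:
curl -s http://localhost:8080/api/admin/externalTools
# Delete an old previewer by id (42 is a made-up id):
curl -s -X DELETE http://localhost:8080/api/admin/externalTools/42
# Register a v1.4 previewer from its manifest (filename hypothetical):
curl -s -X POST -H 'Content-type: application/json' \
  http://localhost:8080/api/admin/externalTools --upload-file previewer.json
```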
