Hello, I recently worked on getting our Dataverse content indexed by Google, and I have some questions and feedback.
I am new to the community; I work at a French institution that runs a Dataverse instance, and I still lack some functional knowledge of the product.
I was wondering whether something is missing for URLs that match the file.xhtml?persistentId= pattern: they fall under the Disallow: / rule in robots.txt, and they are also absent from the sitemap.xml generated via the API.
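For context, our robots.txt follows the allow-list approach, roughly like this (a simplified sketch; the exact paths and rules in our file may differ):

    User-agent: *
    Allow: /$
    Allow: /dataset.xhtml
    Allow: /dataverse/
    Disallow: /

Since file.xhtml is never explicitly allowed, those URLs fall under the final Disallow: /.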
Is the content of file.xhtml not relevant for indexing?
If not, why not add <meta name="robots" content="noindex" /> to the HTML of file.xhtml to prevent any indexing? robots.txt alone does not block every kind of indexing; as Google's documentation puts it, "While Google won't crawl or index the content blocked by a robots.txt file, we might still find and index a disallowed URL if it is linked from other places on the web." The idea may seem a bit devious, but we lost our robots.txt in production for a couple of months, so I need information to deal with the consequences.
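Concretely, the rendered head of the file page would only need something like this (a sketch, not the actual file.xhtml markup):

    <head>
      <!-- existing title and metadata ... -->
      <meta name="robots" content="noindex" />
    </head>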
Also, our sitemap contains more than 50k URLs, which the Googlebot crawler does not accept, so we built a small workaround that splits the sitemap into multiple files referenced by a sitemap index (see the sketch below). A suggestion could be to generate split sitemaps directly through the API.
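For reference, our workaround produces a sitemap index along these lines (the host name and file names here are just illustrative, not something Dataverse generates):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://data.example.org/sitemap/sitemap1.xml</loc>
      </sitemap>
      <sitemap>
        <loc>https://data.example.org/sitemap/sitemap2.xml</loc>
      </sitemap>
    </sitemapindex>

Each child sitemap stays under the 50,000-URL limit of the sitemaps.org protocol.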
Thank you for your time.
Ludovic.