Hello, I recently worked on getting our Dataverse content indexed by Google, and I have some questions and feedback.
I am new to the community; I work at a French institution that runs a Dataverse instance, and I still lack some functional knowledge of the product.
I was wondering whether something is missing for URLs that match the file.xhtml?persistentId= pattern: they fall under the Disallow: / rule in robots.txt, and they are also absent from the sitemap.xml generated via the API.
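For context, our robots.txt follows the allow-list approach, roughly like this (a simplified sketch; the exact paths and rules in our file may differ):

    User-agent: *
    Allow: /$
    Allow: /dataset.xhtml
    Allow: /dataverse/
    Disallow: /

Since file.xhtml is never explicitly allowed, those URLs fall under the final Disallow: /.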
Is the content of file.xhtml not relevant for indexing?
If not, why not add <meta name="robots" content="noindex" /> to the HTML of file.xhtml to prevent any indexing? robots.txt alone does not block every kind of indexing; as Google's documentation puts it, "While Google won't crawl or index the content blocked by a robots.txt file, we might still find and index a disallowed URL if it is linked from other places on the web." The idea may seem a bit devious, but we lost our robots.txt in production for a couple of months, so I need information to deal with the consequences.
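Concretely, the rendered head of the file page would only need something like this (a sketch, not the actual file.xhtml markup):

    <head>
      <!-- existing title and metadata ... -->
      <meta name="robots" content="noindex" />
    </head>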
Also, our sitemap contains more than 50k URLs, which the Googlebot crawler does not accept, so we built a small workaround that splits the sitemap into multiple files referenced by a sitemap index (see the sketch below). A suggestion could be to generate split sitemaps directly through the API.
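For reference, our workaround produces a sitemap index along these lines (the host name and file names here are just illustrative, not something Dataverse generates):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://data.example.org/sitemap/sitemap1.xml</loc>
      </sitemap>
      <sitemap>
        <loc>https://data.example.org/sitemap/sitemap2.xml</loc>
      </sitemap>
    </sitemapindex>

Each child sitemap stays under the 50,000-URL limit of the sitemaps.org protocol.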
Thank you for your time.
Ludovic.