jsessionid exclusion and sitemap.xml in robots.txt


Eunice Soh

Apr 7, 2022, 1:51:50 AM
to Dataverse Users Community
Hi,


jsessionid exclusion

I later realised from Google Search Console that it is indexing pages with jsessionid in the URL.

[Screenshot: Google Search Console showing indexed URLs containing jsessionid]


Is there a way to exclude jsessionid URLs via robots.txt? I wonder if anyone has experience with a rule like:


Disallow: /*;jsessionid

sitemap.xml


The sitemap.xml is live, but Google Search Console reports that it couldn't be fetched. Any idea whether it needs to be specified in robots.txt?

[Screenshot: Google Search Console showing the sitemap could not be fetched]

Kind regards,
Eunice

James Myers

Apr 7, 2022, 8:08:27 AM
to dataverse...@googlegroups.com

Eunice,

A couple thoughts:

A jsessionid showing in the URL is, AFAIK, an indication that cookies are being blocked or not working correctly – it doesn't show up at most Dataverse sites. There are some old community emails on this topic – in one thread, for example, I commented that I'd seen this caused by a load balancer. It may be worth figuring out and fixing why jsessionid is appearing, unless you've configured it deliberately for some specific reason. (There is also a switch in the web.xml file, commented out by default, that forces cookie-only session tracking. If something in your configuration is blocking cookies, forcing this could break normal use, but it could be worth a try.)
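For reference, that switch is the standard servlet session-config block; in web.xml it looks roughly like this (a sketch – check your Dataverse version's web.xml for the exact commented-out block):

<session-config>
    <!-- Tells the container to track sessions via cookies only, so
         jsessionid is never written into URLs (but sessions will break
         if cookies don't work at all) -->
    <tracking-mode>COOKIE</tracking-mode>
</session-config>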

 

In terms of indexing, I'm not sure what the best approach would be if you can't eliminate the jsessionids. The Google URL Parameters tool guidance for parameters in the "No: Doesn't affect page content" category seems like it would avoid duplicates – I think that would tell Google to index only one jsessionid variant (if it doesn't already – the image in your email doesn't show any duplicates, so perhaps Google already understands it doesn't need to track URLs with different jsessionids). URL rewriting could also make sense, but if you drop all jsessionids for all users while cookies aren't working, you'd probably be blocking logins as well. In any case, I don't recall any info being shared about this in the community while I've been around.
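As a sketch of that rewrite approach – assuming an Apache httpd front end, and untested here – a rule like the following would redirect any URL carrying a jsessionid path parameter to the clean URL. Per the caveat above, only do this once you're sure cookies work, or you will break logins:

RewriteEngine On
# Capture the path before and after the ;jsessionid=... segment
RewriteCond %{REQUEST_URI} ^(.*);jsessionid=[^;?]+(.*)$ [NC]
# Redirect permanently to the same path with the segment removed
RewriteRule .* %1%2 [R=301,L]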

 

Re: sitemap.xml – just looking a little, it appears that the sitemap is available at both /sitemap.xml and /sitemap/sitemap.xml, and the latter should be visible to robots via the

Allow: /sitemap/

line in the suggested robots.txt. Hopefully, if you tell Google to retrieve /sitemap/sitemap.xml instead, it will work. That said, I'm not sure why Google shouldn't be able to read /sitemap.xml; now that I look, the guides say that other search engines may use that copy, but the robots.txt example Disallows / and so blocks it. So it appears there's some inconsistency in allowing other engines to access the sitemap that could/should be figured out. Others probably know more of the history here.
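For context, the relevant part of the suggested robots.txt is roughly the following (paraphrased – see the installation guide for the current full example):

User-agent: *
Allow: /dataverse.xhtml
Allow: /dataset.xhtml
Allow: /sitemap/
Disallow: /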

 

-- Jim


Philip Durbin

Apr 7, 2022, 9:46:44 AM
to dataverse...@googlegroups.com
I have a few more thoughts on jsessionid to add to Jim's excellent response.

It's easy to reproduce seeing jsessionid in URLs if you use a fresh browser or a private/incognito window. Here are steps:

- open a private/incognito window
- click "Log In"
- observe that you will see a jsessionid in the URL

The interesting thing is that if you then click *another* link (let's say the "Sign Up" link), the jsessionid will not be present in the URL. This is true of all subsequent clicks. That is to say, you only see the jsessionid after the first click, and only if you have a fresh browser or a private/incognito window. That's been my experience anyway.
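You can also see this from the command line by fetching a page without sending any cookies and grepping the HTML for rewritten links (the hostname below is a placeholder – use your own installation):

# Fetch the page with no cookies and list any links rewritten to carry a session ID
curl -s https://your.dataverse.example/ | grep -o ';jsessionid=[^"]*' | head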

https://github.com/IQSS/dataverse/issues/3254 was originally about preventing the jsessionid from appearing in URLs. (Like Jim said, it seems like web.xml is the place to do that.) What we ended up working on was changing session IDs on the backend when the user changes (PR #7111). That's a nice improvement, but please feel free to open a fresh issue about not having the jsessionid in the URL, if that's what you want.

Thanks,

Phil






Eunice Soh

Apr 7, 2022, 10:18:45 PM
to Dataverse Users Community
Thanks, Jim, for your input on the jsessionid issue and your advice on indexing.
Thanks, Phil, for the instructions on replicating the issue and for the GitHub issue link.

For now, following Phil's steps in an Incognito window (i.e. starting with no cookies), the jsessionid doesn't appear after logging in.
That said, I'm not sure why Google's indexing is picking up those jsessionid URLs. Could they be older pages? Not sure.

Regarding the URL Parameters tool: it's helpful to see the query params listed. Note that the tool is going away at the end of April 2022, and Google recommends using robots.txt to ignore query params instead (https://developers.google.com/search/blog/2022/03/url-parameters-tool-deprecated).
I wonder whether jsessionid even counts as a query param, since it appears after ";" rather than "?" (see the example below). It doesn't appear on the URL Parameters tool page, and nothing shows up after adding it as a parameter.
I'll try blocking jsessionid with robots.txt; hopefully, once a recrawl is done, the jsessionid pages will be removed.
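To illustrate with a made-up DOI, in a URL like

/dataset.xhtml;jsessionid=1A2B3C4D5E?persistentId=doi:10.5072/FK2/EXAMPLE

the ";jsessionid=..." part is a path parameter attached to the path segment, while "?persistentId=..." is the query string – which is presumably why the URL Parameters tool doesn't recognise it.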

Regarding the sitemap: /sitemap/sitemap.xml returns the same "General HTTP error"; /sitemap.xml and /sitemap/sitemap.xml are both live. I'll try adding "Allow: /sitemap/" to robots.txt and see if it helps.



James Myers

Apr 8, 2022, 7:30:50 AM
to dataverse...@googlegroups.com

If you figure out a good practice with jsessionid, it would be great to feed it back into the guides.

 

Re: sitemaps – I just added a comment on #8329 suggesting that we might want to add "Allow: /sitemap.xml" to the sample robots.txt (which already allows /sitemap/); that would cover your case below. It sounds like you can also add a Sitemap: directive to robots.txt to help indexers find it. In any case, if you find out more, please comment/link info there. (For Google, you can also just retract the /sitemap.xml you submitted and add /sitemap/sitemap.xml, which should be allowed by the /sitemap/ entry in the sample robots.txt – but that doesn't help robots that expect the default /sitemap.xml location.)
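Together, those two additions would look something like this in robots.txt (the hostname is a placeholder for your own installation):

Allow: /sitemap.xml
# The Sitemap directive takes a full URL and may appear anywhere in the file
Sitemap: https://your.dataverse.example/sitemap/sitemap.xml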

 

Thanks,

Pierre Le Corre

Jun 4, 2024, 7:40:36 AM
to Dataverse Users Community
This is an old discussion, but we ran into the same issue with jsessionid. The exact reason why jsessionid is present in Google's crawls of our setup is unclear, but we found a fix for the robots.txt file that makes sure those URLs do not get indexed.

The fix is, assuming you have the same Allow rules as the default Dataverse robots.txt:
Disallow: /*;jsessionid 
Disallow: /dataset.xhtml;jsessionid=
Disallow: /javax.faces.resource/*.xhtml;jsessionid=
Disallow: /api/datasets/:persistentId/thumbnail;jsessionid=

"Disallow: /*;jsessionid" is not enough by itself because other Allow rules are more specific (specificity in a robots.txt = number of characters). For every Allow rule that is more specific (ie longer), we need to add another Disallow rule. Here for instance, because we have "Allow: /dataset.xhtml", we specify "Disallow: /dataset.xhtml;jsessionid="

Pierre
