Googlebot re-encoding facet URLs

Eben English

Oct 28, 2014, 2:47:15 PM
to blacklight-...@googlegroups.com
Has anyone else run into the problem of Googlebot crawling your site and producing a bunch of errors in your logs because it's attempting to encode an already-encoded URL prior to following a link? This seems to happen most often with facet links that have a lot of URL-encoded parameters.

For example, in the HTML output for a catalog#index search results page where multiple facets have already been selected, you might have a link like:

<a class="facet_select" href="/catalog?f%5Bfacet_field_1%5D%5B%5D=foo&amp;f%5Bfacet_field_2%5D%5B%5D=bar">bar</a>

(The "user" has already selected the "foo" value for facet_field_1, and this is a link which adds the "bar" value for facet_field_2.)

Looking at our Apache access logs, what I'm seeing is that when Googlebot attempts to crawl this link, it's encoding the URL first, which results in double-encoding:

66.249.67.103 - - [27/Oct/2014:07:44:32 -0400] "GET /catalog?f%255Bfacet_field_1%255D%255B%255D=foo&f%255Bfacet_field_2%255D%255B%255D=bar HTTP/1.1" 200 5266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

...which then produces parameters passed to CatalogController like:

--- !ruby/hash:ActionController::Parameters
f%5Bfacet_field_1%5D%5B%5D: foo
f%5Bfacet_field_2%5D%5B%5D: bar
action: index
controller: catalog

...which of course returns no results.
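To make the double-encoding concrete, here is a minimal Ruby sketch using only the standard library. It assumes the server decodes the query string exactly once, which matches the parameters dump above. The variable names are only for illustration.

```ruby
require 'uri'

# The facet parameter name as Rails expects to see it after decoding.
key = 'f[facet_field_1][]'

# Encoded once: this is what appears in the href of the facet link.
encoded_once = URI.encode_www_form_component(key)
# => "f%5Bfacet_field_1%5D%5B%5D"

# Encoded a second time: this is what appears in the Apache access log.
encoded_twice = URI.encode_www_form_component(encoded_once)
# => "f%255Bfacet_field_1%255D%255B%255D"

# The server decodes once, so the key still contains literal "%5B" and "%5D"
# and is never parsed as the nested f[facet_field_1][] parameter.
decoded_key = URI.decode_www_form_component(encoded_twice)
# => "f%5Bfacet_field_1%5D%5B%5D"

puts encoded_once
puts encoded_twice
puts decoded_key
```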

The problem is even worse when Googlebot attempts to access URLs that have "sort" params in them, because it ends up passing a string like this to Solr:

"sort"=>"title_info_primary_ssort asc%2C date_start_dtsi asc"

...which causes Solr to throw an error:

ERROR - 2014-10-27 10:05:03.686; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Can't determine a Sort Order (asc or desc) in sort spec 'title_info_primary_ssort asc%2C date_start_dtsi asc'

...and an error in the Blacklight app's log/production.log as well:

E, [2014-10-28T14:38:56.988338 #29835] ERROR -- : RSolr::Error::Http - 400 Bad Request

Our Tomcat, Solr, and Rails logs are filling up with these errors, to the point where they bury everything else and make it very difficult to diagnose other problems.

Has anyone else run across this situation? If so, how did you fix it?

I'm thinking I'm going to have to put some rewrite rule into Apache? Blocking Googlebot from crawling the site isn't an acceptable option.
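As an alternative to an Apache rewrite rule, the same cleanup could in principle be done inside the app with a small Rack middleware. The following is only a sketch under assumptions: the class name UndoDoubleEncoding is hypothetical (it is not part of Blacklight or Rack), it rewrites only QUERY_STRING and leaves other env keys such as REQUEST_URI alone, and it would also alter a legitimate query whose value really does contain a literal string like "%5B".

```ruby
# Hypothetical Rack middleware (illustrative name, not a Blacklight or Rack
# class). It collapses one level of double-encoding in the query string
# before the request reaches the rest of the app.
class UndoDoubleEncoding
  # "%25" followed by two hex digits is a percent sign that was itself
  # encoded, e.g. "%255B" where "%5B" was intended.
  DOUBLE_ENCODED = /%25(?=[0-9A-Fa-f]{2})/

  def initialize(app)
    @app = app
  end

  def call(env)
    query = env['QUERY_STRING'].to_s
    if query.match?(DOUBLE_ENCODED)
      # Copy the env rather than mutating the caller's hash.
      env = env.merge('QUERY_STRING' => query.gsub(DOUBLE_ENCODED, '%'))
    end
    @app.call(env)
  end
end

# Stand-in downstream app so the sketch runs without Rails or the rack gem.
echo_app = ->(env) { [200, { 'Content-Type' => 'text/plain' }, [env['QUERY_STRING']]] }
middleware = UndoDoubleEncoding.new(echo_app)

_status, _headers, body = middleware.call(
  'QUERY_STRING' => 'f%255Bfacet_field_1%255D%255B%255D=foo&sort=title_info_primary_ssort+asc%252C+date_start_dtsi+asc'
)
puts body.first
# => f%5Bfacet_field_1%5D%5B%5D=foo&sort=title_info_primary_ssort+asc%2C+date_start_dtsi+asc
```

In a Rails app, a class like this would typically be registered with config.middleware.use; whether rewriting or redirecting is the better response for a crawler is a separate judgment call.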

Thanks in advance,

Eben English
Boston Public Library


Jonathan Rochkind

Oct 28, 2014, 3:59:39 PM
to blacklight-...@googlegroups.com
I have not run into this, and I have been carefully following my logs
for errors over the past month or so.

I suspect Googlebot is not actually re-encoding your URLs -- but that
something somewhere on the web has, or had, those improperly encoded
URLs, and Googlebot scraped that place, got the URLs, and then tried
to follow them.

That something could have been a previous version of your app, or it
could be some HTML version of a log or analytics file you have
somewhere, or who knows.

I did find instances of Googlebot asking for things that caused 500
errors -- in most cases, I was able to figure out that it was a past
version of my app that was producing those links (that now returned
errors), that Google still clearly had in its index somehow.

In some cases it was mysterious to me where they were coming from, but
based on the identified ones, I figured they probably had similar origins.

In all cases, if _any_ HTTP request you can throw at a Blacklight (or
any other) app produces an uncaught exception and a 500, I consider it a
bug in the app. I filed PRs and/or made local changes to my app to
resolve some of the ones I did find, so they would no longer result in
uncaught exceptions. In some/many cases the _appropriate_ response to a
weird request is a 0-results page, or a 404 error, however.

I stopped worrying about where Google was getting these malformed URLs,
but did make changes to my app and/or Blacklight to make sure none of
them resulted in uncaught exceptions and 500 errors anymore.
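
For the sort case described earlier in the thread, one way to sketch the "no uncaught exceptions" idea is to check the sort spec before it ever reaches Solr and fall back to a default when it does not look valid. This is not a Blacklight API: safe_sort and SORT_CLAUSE are hypothetical names, the 'score desc' default is an assumption, and the pattern deliberately accepts only plain "<field> asc|desc" clauses, so it would reject more elaborate sort expressions.

```ruby
# Hypothetical helper, not part of Blacklight: accept a sort spec only if
# every comma-separated clause looks like "<field> asc" or "<field> desc".
SORT_CLAUSE = /\A[A-Za-z0-9_]+ (asc|desc)\z/

def safe_sort(raw, default)
  clauses = raw.to_s.split(',').map(&:strip)
  return default if clauses.empty?

  # A leftover "%2C" keeps the whole spec as one clause, which fails the
  # pattern, so the default is used instead of sending a bad spec to Solr.
  clauses.all? { |clause| clause.match?(SORT_CLAUSE) } ? clauses.join(', ') : default
end

puts safe_sort('title_info_primary_ssort asc%2C date_start_dtsi asc', 'score desc')
# => score desc
puts safe_sort('title_info_primary_ssort asc, date_start_dtsi asc', 'score desc')
# => title_info_primary_ssort asc, date_start_dtsi asc
```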

Jonathan