Response not gzipped specifically for Twitterbot requests?


Taengoo Taengstagram

Jun 29, 2015, 6:27:04 AM
to google-a...@googlegroups.com
I've noticed that when Twitterbot crawls my app on GAE, the response does not appear to be gzipped (judging by the response size in bytes in the GAE logs). I've reproduced this with other apps deployed on *.appspot.com, for example https://ga-dev-tools.appspot.com/.

To illustrate, I'm using a test user agent "Twitterbot/9.0", although the actual Twitter user agent is "Twitterbot/1.0".

# Test case 1: With a generic Mozilla useragent Mozilla/9.0 + gzip headers, response returned is gzipped
$ curl 'https://ga-dev-tools.appspot.com/' -H 'Accept-Encoding: gzip, deflate, sdch' --compressed -A 'Mozilla/9.0' -i

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Cache-Control: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Date: Mon, 29 Jun 2015 10:11:35 GMT
Server: Google Frontend
Alternate-Protocol: 443:quic,p=1
Transfer-Encoding: chunked

# Test case 2: With a Twitterbot useragent Twitterbot/9.0 + gzip headers, response returned is not gzipped
$ curl 'https://ga-dev-tools.appspot.com/' -H 'Accept-Encoding: gzip, deflate, sdch' --compressed -A 'Twitterbot/9.0' -i

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Cache-Control: no-cache
Date: Mon, 29 Jun 2015 10:12:06 GMT
Server: Google Frontend
Content-Length: 7956
Alternate-Protocol: 443:quic,p=1

# Test case 3: With a generic Mozilla useragent Mozilla/9.0 + no gzip headers, response returned is not gzipped
$ curl 'https://ga-dev-tools.appspot.com/' -A 'Mozilla/9.0' -i

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Cache-Control: no-cache
Date: Mon, 29 Jun 2015 10:13:17 GMT
Server: Google Frontend
Content-Length: 7956
Alternate-Protocol: 443:quic,p=1


You will notice that GAE returns identical responses for test #2 (Twitterbot, gzip requested) and test #3 (no gzip requested). This is unexpected and rather puzzling. Any idea why?
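For anyone who wants to repeat the comparison without reading header dumps by eye, here is a minimal sketch. The function names `encoding_of` and `probe` are made up for illustration, and `probe` needs network access. It only looks for a Content-Encoding: gzip response header, nothing more.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the names are not from any tool.

# Read raw HTTP response headers on stdin and print "gzip" if a
# Content-Encoding: gzip header is present, otherwise "identity".
encoding_of() {
  if tr -d '\r' | grep -qi '^content-encoding:[[:space:]]*gzip'; then
    echo gzip
  else
    echo identity
  fi
}

# Request a URL with the given User-Agent while asking for gzip,
# discard the body, and classify the response headers.
# Requires network access.
probe() {
  local url=$1 ua=$2
  curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' -A "$ua" "$url" | encoding_of
}

# Example (not run here):
#   probe 'https://ga-dev-tools.appspot.com/' 'Mozilla/9.0'
#   probe 'https://ga-dev-tools.appspot.com/' 'Twitterbot/9.0'
```

Running `probe` once per user agent against the same URL makes the difference between test #1 and test #2 visible in a single word per request.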


Nick (Cloud Platform Support)

Jun 29, 2015, 3:27:59 PM
to google-a...@googlegroups.com, taengs...@gmail.com
Hey Taengoo,

It seems you may have stumbled on a valid Feature Request in the making. The docs explain that responses are served with Content-Encoding: gzip based on a combination of the User-Agent and Accept-Encoding request headers. However, it appears that the Twitterbot UA string doesn't pass the test.

Attached is a .tar.gz containing an example app you can deploy and a script you can use to test this behaviour on App Engine. Change the application id in app.yaml inside the app/ directory, then deploy the app. At that point, you'll want to run:

./curl-uas.sh 1.testheaders.APPID.appspot.com

where APPID is your actual app id.

This script runs through the user agents in user-agents.txt, which contains the most statistically popular UA strings on the web at the moment, along with several test values. You'll notice that your observations are replicated for Twitterbot-style UA strings, while the special User-Agent "gzip", as explained in the docs, can force compression.
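For readers without the attachment, the general idea can be sketched as below. This is not the content of curl-uas.sh; the function names `fetch_size` and `for_each_ua` are assumptions made for illustration, and the sketch takes a full URL rather than a bare hostname. It reports how many body bytes were transferred per user agent, so a compressed response shows up as a smaller number.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the contents of the attached curl-uas.sh.

# Print the number of body bytes transferred for one User-Agent.
# Requires network access. Without --compressed, curl does not decode
# the body, so a gzipped response yields a smaller byte count.
fetch_size() {
  local url=$1 ua=$2
  curl -s -o /dev/null -w '%{size_download}' \
    -H 'Accept-Encoding: gzip' -A "$ua" "$url"
}

# Run a measuring command once per line of a user-agent file and print
# "<result><TAB><user agent>" for each entry.
#   $1 = file with one UA string per line
#   $2 = URL to request
#   $3 = measuring command (defaults to fetch_size)
for_each_ua() {
  local file=$1 url=$2 measure=${3:-fetch_size} ua
  while IFS= read -r ua || [ -n "$ua" ]; do
    [ -z "$ua" ] && continue   # skip blank lines
    printf '%s\t%s\n' "$("$measure" "$url" "$ua")" "$ua"
  done < "$file"
}

# Example (not run here):
#   for_each_ua user-agents.txt 'https://ga-dev-tools.appspot.com/'
```

Adding a line containing only `gzip` to the user-agent file is an easy way to check the forced-compression case alongside the Twitterbot entries.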

I think you should open a Feature Request thread in the public issue tracker to either have the Twitterbot UA included in the list of those which can accept gzip if they request it via Accept-Encoding, or to simply have the Accept-Encoding header be respected.

If possible, you could modify your Twitterbot to use UA "gzip", in order to simply get it working today.

Best wishes,

Nick
test-ua-content-encoding.tar.gz

Taengoo Taengstagram

Jun 30, 2015, 12:06:14 AM
to google-a...@googlegroups.com, taengs...@gmail.com
I've logged it as issue #12104 https://code.google.com/p/googleappengine/issues/detail?id=12104

Thanks for pointing out the presence of a whitelist. This explains why I've seen uncompressed responses in the logs to possibly lesser known mobile useragents such as custom embedded webviews. This is unfortunate when it is precisely these mobile devices which will stand to gain the most from compressed content.

Also to note, application/site owners are rarely in a position to request that crawlers/users modify their user agent string to comply with such a specific requirement for GAE.

Nick (Cloud Platform Support)

Jun 30, 2015, 12:31:56 PM
to google-a...@googlegroups.com, taengs...@gmail.com
Hey Taengoo,

Glad to hear that. I've processed the issue and should update that thread shortly with a special number identifying the feature request so that the thread can be updated when progress is made. 

I also appreciate that it's not always possible to set User-Agent: gzip, so point taken there. I look forward to seeing where this goes, since as you say, compressed content is one of the most important performance benefits one can implement.

Best wishes,

Nick