FakeWeb - Bug with Mechanize ??

Arco

unread,

Apr 7, 2009, 8:55:22 PM4/7/09

to FakeWeb

For me, the Mechanize 'get' method often fails when working with
FakeWeb.

Here is a script that I use for testing in IRB. I tested using
different URL's and found:

- http://www.google.com:80/ WORKS
- http://google.com FAILS
- http://google.com:80 FAILS

Why do some URL's fail?
Does anyone know how to make all URL's work??

require 'fakeweb'
require 'mechanize'
x = WWW::Mechanize.new
url = "http://www.google.com:80/"
cf = 'out.htm'
`curl -is #{url} > #{cf}`
FakeWeb.register_uri(:get, url, :response => cf)
FakeWeb.allow_net_connect = false
FakeWeb.registered_uri?(:get, url)
x.get url

Chris Kampmeier

unread,

Apr 8, 2009, 11:52:56 AM4/8/09

to fakewe...@googlegroups.com

On Apr 7, 2009, at 5:55 PM, Arco wrote:
> For me, the Mechanize 'get' method often fails when working with
> FakeWeb.

Hi, can you tell me what versions of fakeweb and mechanize you're
using? Oh, and your version of ruby, including patchlevel? It'd be
easier for me to try to reproduce your problem, that way.

Thanks,
Chris

Arco

unread,

Apr 10, 2009, 11:09:52 AM4/10/09

to FakeWeb

I am using fakeweb (1.2.0) and mechanize (0.9.2)

On cygwin, here are the results of my tests:
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
FAILS >> http://google.com
FAILS >> http://www.google.com
WORKS >> http://www.google.com:80/
FAILS >> http://www.google.com/news
FAILS >> http://google.com:80
FAILS >> http://google.com:80/

On linux, here are the results of my tests:
ruby 1.8.6 (2007-09-24 patchlevel 111) [i486-linux]
FAILS >> http://google.com
WORKS >> http://www.google.com
WORKS >> http://www.google.com:80/
FAILS >> http://www.google.com/news
FAILS >> http://google.com:80
FAILS >> http://google.com:80/

Here is the revised test script I used.
require 'rubygems'

require 'fakeweb'
require 'mechanize'
x = WWW::Mechanize.new

%w(http://google.com http://www.google.com http://www.google.com:80/
http://www.google.com/news http://google.com:80 http://google.com:80/).each
do |url|

cf = 'out.htm'
`curl -is #{url} > #{cf}`
FakeWeb.register_uri(:get, url, :response => cf)
FakeWeb.allow_net_connect = false
FakeWeb.registered_uri?(:get, url)

begin
x.get url
puts " WORKS >> #{url}"
rescue
puts " FAILS >> #{url}"
end
end

Chris Kampmeier

unread,

Apr 11, 2009, 11:44:50 PM4/11/09

to FakeWeb

OK, I did a little investigation:

FakeWeb had a bug where trailing slashes were considered significant
for requests to the root of a domain, so e.g. http://example.com/ and
http://example.com were considered different URLs. This has been
fixed: http://github.com/chrisk/fakeweb/commit/c98f1d8f8643449e035ce0204d8788adc9defbd9
That'll make it into the next gem release. Thanks for bringing it to
my attention!

After that change, your test script works for me with two exceptions,
which seem legitimate. First, Google responds with a 301 to GET
http://google.com, with the Location header set to http://www.google.com/.
Mechanize tries to follow that redirect, but you hadn't registered
that URI yet, so you get a FakeWeb::NetConnectNotAllowedError.

Second, Google seems to check the user-agent (or perhaps something
more complicated) for requests to http://www.google.com/news. It
returns a 403 Forbidden for your curl request, and then Mechanize
raises a WWW::Mechanize::ResponseCodeError when handling the faked
response.

I don't have a Windows setup to test with, so I'd be interested in
hearing some more info about the disparity between your results with
Linux. I do test FakeWeb on multiple versions of Ruby, though
(including 1.8.7 p72 and 1.8.6 p114), so I don't think that's it.

Chris

On Apr 10, 8:09 am, Arco <akl...@gmail.com> wrote:
> I am using fakeweb (1.2.0) and mechanize (0.9.2)
>
> On cygwin, here are the results of my tests:
> ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
> FAILS >>http://google.com
> FAILS >>http://www.google.com
> WORKS >>http://www.google.com:80/
> FAILS >>http://www.google.com/news
> FAILS >>http://google.com:80
> FAILS >>http://google.com:80/
>
> On linux, here are the results of my tests:
> ruby 1.8.6 (2007-09-24 patchlevel 111) [i486-linux]
> FAILS >>http://google.com
> WORKS >>http://www.google.com
> WORKS >>http://www.google.com:80/
> FAILS >>http://www.google.com/news
> FAILS >>http://google.com:80
> FAILS >>http://google.com:80/
>
> Here is the revised test script I used.
> require 'rubygems'
> require 'fakeweb'
> require 'mechanize'
> x = WWW::Mechanize.new
> %w(http://google.comhttp://www.google.comhttp://www.google.com:80/

> http://www.google.com/newshttp://google.com:80http://google.com:80/).each

Arco

unread,

Apr 12, 2009, 10:13:03 PM4/12/09

to FakeWeb

Thanks for looking at this Chris. I spent some time on this today and
found that my problems boiled down to a couple of issues:

>> Following Redirects
- I found out how to use the curl '-L' option to follow redirects
- This option generates 'extra' 301 headers that cause FakeWeb to fail
- The non-200 headers can be removed, and FakeWeb will work
(hint - use "string.gsub(/^[\s\S]+^HTTP\/1\.0 200/,'HTTP/1.0 200')")
- I also configured Mechanize to follow redirects, just to be safe

>> User Agent
- I found that some websites don't work if they don't see a "User
Agent" header
- I used the curl option '-A "Mac Safari"' and that did the trick
- I also configured Mechanize to use the same User Agent header

>> FakeWeb Bug
- I think FakeWeb crashes if it sees a URL that ends in ':80/'
- I couldn't figure out how to work around this

>> It works!
- I manually applied your github commit and I made the changes above
- Then everything worked on both cygwin and linux

Looks like I don't have a way to attach my revised test script.
So I will paste it here - apologies in advance for non-brevity...

#!/usr/bin/env ruby

require 'rubygems'
require 'fakeweb'
require 'mechanize'

class FwTest
def initialize
@domains = %w(google bbcnews nytimes yahoo slashdot fandango
vitalist)
@agent = WWW::Mechanize.new do |a|
a.user_agent_alias = 'Mac Safari'
a.follow_meta_refresh = true
end
end

def test_urls
@domains.inject([]) do |output, dom|
output << "http://#{dom}.com"
output << "http://#{dom}.com/"
output << "http://#{dom}.com:80"
# this causes fakeweb to crash
#output << "http://#{dom}.com:80/}"
output << "http://www.#{dom}.com"
output << "http://www.#{dom}.com/"
output << "http://www.#{dom}.com:80"
# this causes fakeweb to crash
#output << "http://www.#{dom}.com:80/}"
end
end

def test_without_fw
FakeWeb.allow_net_connect = true
test_urls.each do |url|
begin
@agent.get url
puts " w/o FakeWeb:WORKS> #{url}"
rescue
puts " w/o FakeWeb:FAILS> #{url}"
end
end
end

def test_with_fw
FakeWeb.allow_net_connect = false
cf = 'cache.htm'
cc = 1
test_urls.each do |url|
cache_data = strip_non_200_headers(`curl -is -A 'Mac Safari' -L #
{url}`)
File.open(cf, 'w') {|out| out.puts cache_data}

FakeWeb.register_uri(:get, url, :response => cf)

FakeWeb.registered_uri?(:get, url)
begin
@agent.get url
puts "With FakeWeb:WORKS> #{url}"
rescue
datfile = "#{cc}_#{cf}"
File.rename(cf, datfile)
puts "With FakeWeb:FAILS> #{url.ljust(25)} (Cache File: #
{datfile})"
cc += 1
end
end
end

def strip_non_200_headers(string)
string.gsub(/^[\s\S]+^HTTP\/1\.0 200/,'HTTP/1.0 200')
end

end

if $0 == __FILE__
x = FwTest.new
#x.test_without_fw
x.test_with_fw
end

On Apr 11, 8:44 pm, Chris Kampmeier <ChrisGKampme...@gmail.com> wrote:
> OK, I did a little investigation:
>
> FakeWeb had a bug where trailing slashes were considered significant

> for requests to the root of a domain, so e.g.http://example.com/andhttp://example.comwere considered different URLs. This has been
> fixed:http://github.com/chrisk/fakeweb/commit/c98f1d8f8643449e035ce0204d878...

> That'll make it into the next gem release. Thanks for bringing it to
> my attention!
>
> After that change, your test script works for me with two exceptions,

> which seem legitimate. First, Google responds with a 301 to GEThttp://google.com, with the Location header set tohttp://www.google.com/.

> Mechanize tries to follow that redirect, but you hadn't registered
> that URI yet, so you get a FakeWeb::NetConnectNotAllowedError.
>
> Second, Google seems to check the user-agent (or perhaps something