Catching HTTPErrors and URLErrors


Tom

Aug 30, 2012, 2:03:39 PM
to beauti...@googlegroups.com
Hello,
I have an issue with my code crashing every time it encounters an HTTPError or URLError. Basically, my code scrapes a page and searches for a certain URL; if the page has that URL, I want it to open the URL and scrape the following page. However, some of these targeted URLs are blank or malformed, so my code crashes. Ultimately, when my code encounters an error, I want it to skip that bad/malformed URL and continue with the loop.
Here is the snippet of code where I try to catch the bad URLs. Does anyone have any suggestions for catching these, passing them by, and continuing on with the loop?


for link in row.find_all('a', limit=1):
    y = (link.get('href'))
    time.sleep(15)
    try:
        data = urllib2.urlopen(y).read()
    except HTTPError, e:
        print "The server could not fulfill the request."
        print "Error code: ", e.code
        time.sleep(100)
    except URLError, e:
        print "We failed to reach a server."
        print "Reason: ", e.reason
    soup1 = BeautifulSoup(data, "html5lib", from_encoding="utf-8")

Thanks,
Tom


Paul Walker

Aug 31, 2012, 6:50:24 AM
to beauti...@googlegroups.com
On 30 August 2012 19:03, Tom <boo...@gmail.com> wrote:

> have any suggestions for catching these, passing them by, and continuing on
> with the loop?
>
>
> for link in row.find_all('a', limit=1):
>     y = (link.get('href'))
>     time.sleep(15)
>     try:
>         data = urllib2.urlopen(y).read()
>     except HTTPError, e:
>         print "The server could not fulfill the request."
>         print "Error code: ", e.code
>         time.sleep(100)
>     except URLError, e:
>         print "We failed to reach a server."
>         print "Reason: ", e.reason
>     soup1 = BeautifulSoup(data, "html5lib", from_encoding="utf-8")

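Two suggestions:

* Use a continue in the except clauses, so that a failed URL is skipped.
* Alternatively, move "soup1 = BeautifulSoup(data, "html5lib",
  from_encoding="utf-8")" inside the try, in which case the continue
  probably isn't necessary.

As it stands, after a failed urlopen the code still reaches the
BeautifulSoup call with a "data" that was never assigned (or is left over
from the previous link). It may even say that in the backtrace if you
check it. :-)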
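
A minimal sketch of the two suggestions combined, written in Python 3 syntax (urllib.request and urllib.error instead of the urllib2 module used in the snippet above). The function name fetch_pages and its opener parameter are hypothetical, added only so the loop can be tried without network access. The extra ValueError clause is an assumption about the "blank or malformed" URLs: urlopen raises ValueError for a string with no recognizable scheme.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_pages(urls, opener=urlopen):
    """Return a list of (url, raw_bytes) pairs, skipping URLs that fail."""
    pages = []
    for url in urls:
        try:
            data = opener(url).read()
        except HTTPError as e:
            # HTTPError is a subclass of URLError, so it has to be caught first.
            print("The server could not fulfill the request. Error code:", e.code)
            continue
        except URLError as e:
            print("We failed to reach a server. Reason:", e.reason)
            continue
        except ValueError as e:
            # Raised by urlopen for blank or scheme-less URLs such as "".
            print("Malformed URL:", e)
            continue
        # Only reached after a successful download, so data is always defined.
        # The BeautifulSoup(data, ...) call would go here.
        pages.append((url, data))
    return pages
```

Because every except clause ends in continue, the parsing step can never run with a missing or stale "data".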

--
Paul

Tom

Sep 4, 2012, 12:52:33 PM
to beauti...@googlegroups.com
Hello,
I added both the continues and moved the "soup1 = BeautifulSoup(data, "html5lib", from_encoding="utf-8")" line inside the try/except block. I keep getting similar errors, such as "HTTPError not defined".
Any ideas on how to get around this?

Thanks,
Tom

Paul Walker

Sep 5, 2012, 6:37:57 AM
to beauti...@googlegroups.com
> HTTPError not defined

What defines the HTTPError exception? If it's in a module, you need to
either import the name or prepend the module name. HTTPError and URLError
both live in urllib2, e.g.

except urllib2.HTTPError, e:
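
For reference, a short sketch of the two ways to make the exception names visible, again in Python 3 spelling, where the classes live in urllib.error (in Python 2 they are urllib2.HTTPError and urllib2.URLError). The helper name describe is hypothetical and exists only for illustration.

```python
import urllib.error
from urllib.error import HTTPError, URLError


def describe(exc):
    """Return a short label for an exception raised while fetching a URL."""
    # Fully qualified name: needs only the plain "import urllib.error".
    if isinstance(exc, urllib.error.HTTPError):
        return "http error %d" % exc.code
    # Bare name: defined only because of the "from ... import" line above.
    if isinstance(exc, URLError):
        return "url error: %s" % exc.reason
    return "other error"
```

Without one of those two import lines, a bare HTTPError in an except clause raises NameError as soon as Python has to evaluate it, which is the "not defined" message.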

--
Paul

Brian L Cartwright

May 31, 2015, 1:05:08 PM
to beauti...@googlegroups.com
I’m already getting the values inside the <a></a> block, but I’d like to grab the customkey as well. How do I isolate it?
 
<td sorttable_customkey='Chapman,Denz`l'><a href='../players/profile.asp?P=denzl-chapman'>Denz'l Chapman</a></td>
 
thanks
Brian