BeautifulSoup does not see links that exist in page source

Zeynel

Nov 29, 2009, 11:08:13 AM
to beautifulsoup
Please see this thread in StackOverflow:
http://stackoverflow.com/questions/1814750/how-can-i-translate-this-xpath-expression-to-beautifulsoup

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

But if you look at the page source at
http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName=
you will see that "/cabel" is there. Can this be fixed?

Thanks.

Aaron DeVore

Nov 29, 2009, 3:05:14 PM
to beauti...@googlegroups.com
Got it! This is an error in the web page itself. Specifically, this
attribute (search for it in a text editor to find the tag):

onMouseOver="MM_swapImage('alumni','','/FCWSite/Img/alumni.gif',1);

The onMouseOver attribute isn't closed by a quote mark. sgmllib (the
underlying parser for Beautiful Soup 3.0) mangles the attribute, but
is able to recover. Firefox does the same thing. HTMLParser instead
dies instantly and silently.

By the way, the best query in this case is:

soup.find('a', href="/cabel")

The 'a' allows Beautiful Soup to skip attribute matching on tags that
aren't 'a'. Taking out the regular expression removes the overhead of
regular expression matching.
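
A minimal sketch of the difference, assuming the newer bs4 package (the
thread itself concerns Beautiful Soup 3.0) and a made-up three-tag
document rather than the real page. The div with an href is hypothetical
and exists only to show what the tag-name filter skips:

```python
import re
from bs4 import BeautifulSoup  # assumption: bs4, not the BS3 package from the thread

# Hypothetical sample markup, not the real White & Case page.
html = """
<div href="/cabel">not a link</div>
<a href="/cabel">Cabel</a>
<a href="/diversity/committee">Committee</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name plus exact string: only 'a' tags are checked, no regex involved.
exact = soup.find("a", href="/cabel")

# Regex with no tag name: every tag's href is tested, so the first hit
# in document order is the div, not the link.
by_regex = soup.find(href=re.compile("/cabel"))

print(exact)
print(by_regex.name)
```

Naming the tag also makes the query more specific, which matters whenever
a non-link tag happens to carry a matching attribute.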

Cheers!
Aaron DeVore

Aaron DeVore

Nov 29, 2009, 3:25:48 PM
to beauti...@googlegroups.com
By the way, I just contacted White & Case (the law firm with the web
page in question) about the missing quote. It was slightly affecting
the rendering of their web site.

- Aaron DeVore

Zeynel

Nov 29, 2009, 4:50:57 PM
to beautifulsoup
Thanks for the answer! But I think I still need to use a regex
because I want to find all the names, not only /cabel. By the way, I
used the Scrapy shell with XPath and there was no problem. Thanks again.
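
As a rough sketch of that use case, a regex can still be combined with
the 'a' filter suggested earlier in the thread. This assumes the bs4
package, and it assumes that profile links are a single lowercase path
segment like /cabel; the pattern, the /dsmith link, and the sample
markup are all hypothetical, not taken from the real page:

```python
import re
from bs4 import BeautifulSoup  # assumption: bs4, not the BS3 package from the thread

# Hypothetical sample markup standing in for the attorney list page.
html = """
<link href="/styles">
<a href="/cabel">Cabel, Example</a>
<a href="/diversity/committee">Committee</a>
<a href="/dsmith">Smith, Example</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Assumed URL shape: a slash followed by lowercase letters only.
profile = re.compile(r"^/[a-z]+$")

# Restricting to 'a' tags means the regex is only run against links.
links = soup.find_all("a", href=profile)
names = [a.get_text() for a in links]
print(names)
```

Anchoring the pattern with ^ and $ keeps multi-segment paths such as
/diversity/committee out of the results.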


Aaron DeVore

Nov 30, 2009, 12:58:54 PM
to beauti...@googlegroups.com
On Sun, Nov 29, 2009 at 1:50 PM, Zeynel <azey...@gmail.com> wrote:
> Thanks for the answer! But I think I still need to use a regex
> because I want to find all the names, not only /cabel. By the way, I
> used the Scrapy shell with XPath and there was no problem. Thanks again.

Yeah, no problem. Well, a problem, but an interesting one. :)

- Aaron DeVore