url inside html problem

Demonsbook

Sep 17, 2009, 5:20:40 AM
to beautifulsoup
Hi, I have a small problem.
I'm parsing a page that has an anchor whose link text is the same
address the href points to. This works most of the time without problems, but
when the address contains GET parameters it suddenly acts funny.

An example:

from BeautifulSoup import BeautifulSoup

doc = ['<html>',
       '<body>',
       '<a href="http://something.com/somethingelse.php?val1=first&val2=second">',
       'http://something.com/somethingelse.php?val1=first&val2=second',
       '</a>',
       '</body>',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify(), '\n'
print soup.a['href']
print soup.a.find(text=True)

returns

<html>
<body>
<a href="http://something.com/somethingelse.php?val1=first&amp;val2=second">
http://something.com/somethingelse.php?val1=first&val2;=second
</a>
</body>
</html>

http://something.com/somethingelse.php?val1=first&val2=second
http://something.com/somethingelse.php?val1=first&val2;=second


it seems as if BeautifulSoup is trying to fix something inside the
text itself, and this way the two basically identical addresses end up
different.

Any way of solving this?

Thanks in advance
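[One possible workaround, a sketch not from the thread: the helper name below is made up, and it uses Python 3's html.entities in place of Python 2's htmlentitydefs. The idea is to strip the inserted semicolon after parsing, but only when the name before it is not a defined entity:]

```python
import re
from html.entities import entitydefs  # maps defined entity names like 'amp' to characters

def undo_spurious_semicolons(text):
    # BeautifulSoup turns "&val2" into "&val2;". Strip that trailing
    # semicolon again, but only when the name is NOT a defined entity,
    # so real references like "&amp;" survive untouched.
    def repl(match):
        name = match.group(1)
        return match.group(0) if name in entitydefs else "&" + name
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

mangled = "http://something.com/somethingelse.php?val1=first&val2;=second"
print(undo_spurious_semicolons(mangled))
# http://something.com/somethingelse.php?val1=first&val2=second
```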

Jakub Jagiełło

Sep 17, 2009, 11:58:04 AM
to beauti...@googlegroups.com
2009/9/17 Demonsbook <demon...@gmail.com>:

In fact, the code you're trying to parse is not valid, as &val2; is not
a defined entity.
But this is an error in your code, not in the parser (of course you can
change the parser code to handle such invalid markup the way you want).
If you want to use & in a URL you should encode it as &amp; (both in the href
and inside the tag).

Joan

Sep 21, 2009, 6:57:27 PM
to beautifulsoup


On 17 Sep, 17:58, Jakub Jagiełło <jab...@gmail.com> wrote:

>
> In fact, the code you're trying to parse is not valid, as &val2; is not
> a defined entity.
> But this is an error in your code, not in the parser (of course you can
> change the parser code to handle such invalid markup the way you want).
> If you want to use & in a URL you should encode it as &amp; (both in the href
> and inside the tag).

But why is it an error? These are perfectly valid (and heavily used)
URLs, so if your page has them, what can you do?

I am having exactly the same problem, and it took me a long time to
figure out that the problem was this extra semicolon that
BeautifulSoup inserts, because it thinks that val2 is an entity
reference that someone forgot to close with a semicolon.

Is there a way to prevent BeautifulSoup from doing this correction? I
think that this is what the original poster needs (and so do I).

I have done a quick and dirty fix, and it seems to be working for me,
but it may be incomplete, since I had never looked at the inner
workings of BeautifulSoup before. At the bottom of handle_entityref in
BeautifulSoup.py, there is a set of lines that do:

if not data:
    # This case is different from the one above, because we
    # haven't already gone through a supposedly comprehensive
    # mapping of entities to Unicode characters. We might not
    # have gone through any mapping at all. So the chances are
    # very high that this is a real entity, and not a
    # misplaced ampersand.
    data = "&%s;" % ref
self.handle_data(data)

I have removed the semicolon from the data=... line, so it reads:
    data = "&%s" % ref

But there should be a more elegant way to do it, I suppose.

Joan
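[A slightly less blunt version of that patch, sketched here as a standalone function: reconstruct_entityref is a made-up stand-in for the tail of handle_entityref, and entitydefs is Python 3's name for the htmlentitydefs table BeautifulSoup already imports. It re-appends the semicolon only when the name really is a defined entity:]

```python
from html.entities import entitydefs  # table of defined HTML entity names

def reconstruct_entityref(ref):
    # Rebuild the raw text for an unrecognized "&ref" reference.
    # Keep the semicolon for real entities, drop it otherwise, so a
    # bare ampersand in a URL query string is passed through as-is.
    if ref in entitydefs:
        return "&%s;" % ref
    return "&%s" % ref

print(reconstruct_entityref("val2"))  # &val2  (URL parameter, left alone)
print(reconstruct_entityref("nbsp"))  # &nbsp; (defined entity, kept intact)
```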

Jakub Jagiełło

Sep 22, 2009, 8:21:04 AM
to beauti...@googlegroups.com
IMHO the problem is that BeautifulSoup does not check whether an entity is
defined, and acts as if all entity-like substrings were entities.
I think a check for whether the entity is defined should be added, and the
correction should be performed only then.

But raw ampersands in URLs are still dangerous. Imagine a script that
uses a GET variable called "amp":

http://jaboja.pl/?sth=0&amp=1

And a page with a non-entity-encoded link to it:

<html><body>
<a href="http://jaboja.pl/?sth=0&amp=1">http://jaboja.pl/?sth=0&amp=1</a>
</body></html>

When I display the above HTML document in Firefox, it interprets both the <a>
tag content and the href as http://jaboja.pl/?sth=0&=1
In fact it is parsed this way by all major web browsers (but only for
defined entities!).
When I parse it with BeautifulSoup, it interprets &amp in the content as a
malformed entity,
but does not do so for the href (which differs from the browsers). See:

<html><body>
<a href="http://jaboja.pl/?sth=0&amp;amp=1">http://jaboja.pl/?sth=0&amp;=1</a>
</body></html>

--
Jakub Jagiełło
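[That browser behaviour is now codified in HTML5: only a legacy subset of named references is recognized without a trailing semicolon. For comparison (this is Python 3's standard library, not BeautifulSoup), html.unescape applies the same rules to text content:]

```python
import html

# "&amp" is a defined (legacy) entity, so it is decoded even
# without the trailing semicolon, just as Jakub observed in Firefox:
print(html.unescape("http://jaboja.pl/?sth=0&amp=1"))
# http://jaboja.pl/?sth=0&=1

# "&val2" matches no defined entity, so the URL is left untouched:
print(html.unescape("http://something.com/somethingelse.php?val1=first&val2=second"))
# http://something.com/somethingelse.php?val1=first&val2=second
```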

Joan Creus

Sep 23, 2009, 6:11:06 AM
to beauti...@googlegroups.com


2009/9/22 Jakub Jagiełło <jab...@gmail.com>


> IMHO the problem is that BeautifulSoup does not check whether an entity is
> defined, and acts as if all entity-like substrings were entities.
> I think a check for whether the entity is defined should be added, and the
> correction should be performed only then.

Rather than making BeautifulSoup more intelligent, I would make this optional. It can be argued that ampersands in URLs are dangerous, but they are certainly ubiquitous. If your goal is to generate completely unambiguous XML pages, then BeautifulSoup may be right to add semicolons.

But if your goal is just to parse the content of a page and get a snapshot of its current state, in order to make just a few minor modifications, then adding semicolons arbitrarily is VERY dangerous and results in hard-to-find bugs. The URLs in the original example, for instance, would stop working if they got replaced by BeautifulSoup's idea of "what the URL syntax should be".

And, last but not least, this "feature" should be heavily documented, IMHO.

Joan

jaboja

Sep 24, 2009, 12:03:48 PM
to beauti...@googlegroups.com
To keep websites from breaking, making BeautifulSoup parse the code the
way a browser would should be enough (IMHO).