Dealing with html inside of <script> tags

LunkRat

Apr 19, 2012, 3:13:17 PM
to beautifulsoup
I am scraping a document that has a large number of tags inside of a
<script> tag. It seems that by default the < and > inside of <script>
tags are converted to &lt; and &gt; respectively.

How can I turn this off so that tags inside of <script> are navigable
like the rest of the tree?

Thanks

Leonard Richardson

Apr 19, 2012, 3:21:11 PM
to beauti...@googlegroups.com
> I am scraping a document that has a large amount of tags inside of a
> <script> tag. Seems that by default the < and > inside of <script>
> tags are converted to &lt and &gt respectively.

They're not really converted to &lt; and &gt;. All the HTML parsers
treat the contents of a <script> tag as a string, and Beautiful Soup
converts < and > to &lt; and &gt; on *output* so as to keep the HTML
valid. If you look at the content, you'll see the data as it is:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<script>foo<bar>baz</script>")

>>> soup.script
<script>foo&lt;bar&gt;baz</script>

>>> soup.script.string
u'foo<bar>baz'

If the contents of a <script> tag happen to be HTML markup, you can
parse them by passing the string into a second BeautifulSoup object:

>>> soup2 = BeautifulSoup(soup.script.string)
>>> soup2.bar
<bar>baz</bar>
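
The same two-pass idea can be sketched with only the standard library's
html.parser, for readers who want to see what the underlying parser reports.
This is an illustration, not Beautiful Soup code: the names EventRecorder and
parse_events are made up for this example.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record a (kind, value) pair for every tag and text chunk seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Parse markup with a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# First pass: the markup inside <script> is reported as text, not as tags.
outer = parse_events("<script>foo<bar>baz</script>")
script_text = "".join(value for kind, value in outer if kind == "data")

# Second pass: feed that text to a fresh parser to reach the tags inside.
inner = parse_events(script_text)
```

In the first pass no start event for bar is recorded; it only shows up once
the script's text is parsed on its own.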

Leonard

Leonard Richardson

Nov 13, 2012, 8:23:47 AM
to Leandro Costa, beautifulsoup
There are a couple of ways. The simplest is to tell Beautiful Soup not to
use a formatter when producing output:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters

Example:

from bs4 import BeautifulSoup
data = '<a>a < b</a> <script>c < d</script>'
soup = BeautifulSoup(data, "html.parser")
print soup.encode(formatter=None)
# <a>a < b</a> <script>c < d</script>

Or, you can define a custom formatter that only omits the formatting
if the string being formatted is within a <script> tag. Example:

from bs4.dammit import EntitySubstitution
def unformatted_within_script(x):
    if x.find_parent("script") is None:
        return EntitySubstitution.substitute_html(x)
    else:
        return x

print soup.encode(formatter=unformatted_within_script)
# <a>a &lt; b</a> <script>c < d</script>
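
The rule "escape text everywhere except inside <script>" can also be sketched
without Beautiful Soup, using only the standard library's html.parser. This is
not Beautiful Soup's formatter mechanism; SelectiveEscaper and reserialize are
hypothetical names for this example, and comments and doctype declarations
are simply dropped to keep the sketch short.

```python
from html import escape
from html.parser import HTMLParser


class SelectiveEscaper(HTMLParser):
    """Re-serialize markup, escaping text except inside <script>."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        # Reuse the start tag exactly as it appeared in the source.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Script contents pass through untouched; other text is escaped.
        if self.in_script:
            self.out.append(data)
        else:
            self.out.append(escape(data, quote=False))


def reserialize(markup):
    """Return markup with text escaped everywhere except within <script>."""
    parser = SelectiveEscaper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


result = reserialize('<a>a < b</a> <script>c < d</script>')
```

Here result keeps the < inside the script as written and escapes the one
inside the <a> tag.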

Leonard

On Tue, Nov 13, 2012 at 7:27 AM, Leandro Costa <leandr...@gmail.com> wrote:
> So, what would be the best way to deal with this problem if I'm prettifying
> an HTML document with a lot of <script/> tags with JavaScript code inside?
> Is there a way to disable this conversion for <script/> tags?
>
> Regards
>
> On Thursday, April 19, 2012 at 4:21:11 PM UTC-3, Leonard Richardson
> wrote: