This seems like a bug (beta 6)


Bruce Eckel

Feb 16, 2012, 8:51:17 PM
to beauti...@googlegroups.com
Here are three different approaches to doing what I think should be the same thing:

import bs4
print bs4.__version__
from bs4 import BeautifulSoup

html = """
<html>
 <body> 
  <p class="c1">
   <span class="c5">
    1 &nbsp; // Values.scala
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    2 &nbsp;
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    3 &nbsp; val anInteger: Int = 11
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    4 &nbsp; val aDouble: Double = 1.4
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    5 &nbsp; // true or false:
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    6 &nbsp; val trueOrFalse: Boolean = true
   </span>
  </p>
 </body>
</html>
"""

print '1' * 40
soup = BeautifulSoup(html)
for s in soup._all_strings():
    if s.strip():
        s.string.replace_with("Something else")
print(soup.prettify())

print '2' * 40
soup = BeautifulSoup(html)
for s in [s for s in soup._all_strings() if s.strip()]:
    s.string.replace_with("Something else")
print(soup.prettify())

print '3' * 40
soup = BeautifulSoup(html)
for s in reversed([s for s in soup._all_strings() if s.strip()]):
    s.string.replace_with("Something else")
print(soup.prettify())

-------------------------------------------------------------------------------------------
I think all three approaches should replace all the strings, and that the list comprehension should produce the same result as the basic for loop. But in the output, the for loop replaces only the first string, whereas the list comprehensions (regardless of direction) replace all of them:

4.0.0b6
1111111111111111111111111111111111111111
<html>
 <body>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    2
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    3   val anInteger: Int = 11
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    4   val aDouble: Double = 1.4
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    5   // true or false:
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    6   val trueOrFalse: Boolean = true
   </span>
  </p>
 </body>
</html>
2222222222222222222222222222222222222222
<html>
 <body>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
 </body>
</html>
3333333333333333333333333333333333333333
<html>
 <body>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
  <p class="c1">
   <span class="c5">
    Something else
   </span>
  </p>
 </body>
</html>




-- Bruce Eckel
www.Reinventing-Business.com
www.MindviewInc.com

Leonard Richardson

Feb 17, 2012, 12:06:31 AM
to beauti...@googlegroups.com
Modifying a list while iterating over it often gives unpredictable behavior:

>>> a_list = ["a","b","c"]
>>> for i in a_list:
...     a_list.remove(i)
...
>>> a_list
['b']

>>> for i in a_list:
...     a_list.append("d")

{infinite loop}
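
A minimal sketch of the usual workaround for the plain-list case above: loop over a snapshot (a copy) and mutate the original. The variable names are illustrative only.

```python
# Iterate over a copy so that mutating the original list
# cannot disturb the loop.
a_list = ["a", "b", "c"]
for item in list(a_list):   # list(...) makes a shallow copy
    a_list.remove(item)
print(a_list)

b_list = ["a", "b", "c"]
for item in b_list[:]:      # slicing makes a shallow copy too
    b_list.append("d")
print(b_list)
```

Because the loop walks the copy, the first loop visits every original item and the second one terminates after three appends.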

In the case you have here, once you remove the current item from the
tree, its pointer to the next item is destroyed, and the iteration
stops. Since this falls within the realm of "weird things happen when
you modify a data structure while iterating over it", I don't really
consider this a bug.

Nor is there any general way to fix it--there's no way of knowing in
advance what the next item out of the generator "should" be, since you
might modify the entire tree in between calls to the generator.
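
The effect described above can be reproduced with a toy linked structure. This is only a sketch under stated assumptions: `Node`, `build`, `walk`, `replace`, and `values` are hypothetical names invented for illustration, and the code does not reflect how Beautiful Soup is actually implemented. It shows a generator that follows `next` pointers stopping early once the current node is detached, and a snapshot taken with `list()` avoiding that.

```python
class Node:
    """Hypothetical stand-in for a tree element (not a Beautiful Soup class)."""
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def build(items):
    """Chain the items behind a sentinel root and return the root."""
    root = Node(None)
    tail = root
    for item in items:
        node = Node(item)
        tail.next, node.prev = node, tail
        tail = node
    return root


def walk(root):
    """Lazily yield nodes by following .next, as a traversal generator might."""
    node = root.next
    while node is not None:
        yield node
        node = node.next  # read only after the caller has run its loop body


def replace(old, value):
    """Splice a new node in place of old, then detach old completely."""
    new = Node(value)
    new.prev, new.next = old.prev, old.next
    if old.prev is not None:
        old.prev.next = new
    if old.next is not None:
        old.next.prev = new
    old.prev = old.next = None  # the detached node forgets its neighbours
    return new


def values(root):
    return [node.value for node in walk(root)]


# Modifying during the live traversal: the loop ends after one item.
root = build(["1", "2", "3"])
for node in walk(root):
    replace(node, "Something else")
live_result = values(root)
print(live_result)

# Collecting the nodes first: every item is replaced.
root = build(["1", "2", "3"])
for node in list(walk(root)):
    replace(node, "Something else")
snapshot_result = values(root)
print(snapshot_result)
```

In this model the live loop replaces only the first item, which mirrors the first output in the original report, while the snapshot loop replaces all three.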

Leonard


Bruce Eckel

Feb 17, 2012, 1:06:37 AM
to beauti...@googlegroups.com
OK. So you have to capture all the elements first, as I did with the list comprehension.

I guess it makes sense, but the two forms look so similar that it seems subtle.

Bruce Eckel

Feb 17, 2012, 8:29:11 AM
to beauti...@googlegroups.com
After giving this some more thought I realized that it's NOT terribly obvious that changing the text strings in your nodes causes the traversal to break (although it does explain some of the other hiccups in my code). I think it needs a paragraph or two in the docs; perhaps with recommended approaches such as:

Don't do this if you are changing *anything* while you are traversing:

for tag in soup.<any traversal operation>:
    <modify tag>

The modification on the tag will break the traversal.

Instead, collect all the tags *first* in a comprehension, and then traverse the stored tags:

for tag in [tag for tag in soup.<any traversal operation>]:
    <modify tag>

(This does make me wonder whether there are any other gotchas I'm stumbling into.)