Using find() after append() not working

Tye Shutty

Oct 25, 2021, 7:26:51 AM
to beautifulsoup

Since I posted that question and answer, I have encountered additional problems where find() fails to find elements, so ideally I would like a robust find() or find() alternative. I need a way to find an element that meets the search criteria regardless of the sequence and method of adding new elements.

facelessuser

Oct 26, 2021, 1:10:56 AM
to beautifulsoup

I don’t think the problem is find() but the way the document is being constructed. Beautiful Soup doesn’t seem to be linking the elements properly on append. If they aren’t linked properly, find() can’t traverse them. I suspect there may be a bug in Beautiful Soup in this regard. Maybe it isn’t recursively inserting the children of the new tag properly, so their next_element linkage is all wrong.

If you append to the cell before appending the cell itself, and then walk the next_element chain, you get None:

>>> print(doc.find('table').next_element.next_element.next_element.next_element)
None

but if you append to cell after you append cell, you get the element:

>>> print(doc.find('table').next_element.next_element.next_element.next_element)
<tr id="row_1"></tr>
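
For anyone who wants to inspect the linkage without chaining .next_element by hand, here is a minimal sketch. The helper name walk_next_elements and the tiny table document are made up for illustration; they are not the document from the Stack Overflow question. This version builds the row with Tag.append() at every step, so the chain should stay intact even though the cell gets its child before it is attached.

```python
from bs4 import BeautifulSoup, Tag


def walk_next_elements(start):
    """Yield start and every node reachable by following .next_element."""
    node = start
    while node is not None:
        yield node
        node = node.next_element


doc = BeautifulSoup("<table></table>", "html.parser")
table = doc.find("table")

# Build the row bottom-up, using Tag.append() at every step.
row = doc.new_tag("tr", id="row_1")
cell = doc.new_tag("td")
cell.append("hello")  # child added before the cell itself is attached
row.append(cell)
table.append(row)

# Names of the tags reachable from <table> through the next_element chain.
tag_chain = [node.name for node in walk_next_elements(table) if isinstance(node, Tag)]
print(tag_chain)  # ['table', 'tr', 'td']
```

If the chain stops early on your own document, that is a hint that some node was attached without its links being updated.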

Tye Shutty

Oct 31, 2021, 9:36:14 AM
to beautifulsoup
I found another workaround: simply recreate the document using prettify(). Adapted to my Stack Overflow problem, the solution looks like this:


doc = BeautifulSoup(doc.prettify(), 'html.parser')
print(doc.find(id='row_1'), '\n')

I'm guessing there's a dual data structure here, where prettify() accesses the string and find() accesses the object.

leonardr

Dec 21, 2021, 1:30:36 PM
to beautifulsoup
I filed bug #1948661 to track this problem, investigated, and discovered that the issue happens when you manipulate the Tag.contents list directly, rather than using methods like Tag.append(), which take care to keep the tree in a consistent state. The code given in the Stack Overflow example uses Tag.append() and Tag.contents.append() as if they were interchangeable, but they're not. Tag.contents.append() creates a disconnected tree, which causes the problems seen later.
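
A minimal sketch of the difference, assuming a one-table document invented for illustration (not the code from the Stack Overflow question). It only prints things that follow directly from how each call works: Tag.append() sets the parent link, while a plain list append on Tag.contents puts the row in the list and updates nothing else.

```python
from bs4 import BeautifulSoup

# Supported path: Tag.append() updates parent, sibling and next_element links.
good = BeautifulSoup("<table></table>", "html.parser")
good_row = good.new_tag("tr", id="row_1")
good.table.append(good_row)
print(good_row.parent.name)                # table
print(good.find(id="row_1") is good_row)   # True

# Direct list manipulation: the row lands in the list, but no links are set.
bad = BeautifulSoup("<table></table>", "html.parser")
bad_row = bad.new_tag("tr", id="row_1")
bad.table.contents.append(bad_row)
in_list = any(child is bad_row for child in bad.table.contents)
print(in_list)          # True
print(bad_row.parent)   # None
```

With the second document, bad.find(id="row_1") is the kind of lookup that can come back empty, because find() follows the links that were never set.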

Ideally Tag.append() and Tag.contents.append() would be interchangeable, and I explored the possibility of making direct manipulation of Tag.contents do the right thing, but that's a very sensitive piece of the code (since methods like Tag.append() themselves work by directly manipulating Tag.contents) and I wasn't confident in the quality of the result, so I shelved the work, at least for now.

If I were building this from scratch, I would probably make Tag.contents read-only, but doing that now would create unpredictable backwards compatibility issues. I might do something lighter, like make Tag.contents issue a warning the first time you try to modify it directly -- that's probably the wrong move but I can't say for sure.

I'm interested in hearing what behavior other people would like to see here.

Leonard

facelessuser

Dec 21, 2021, 2:11:06 PM
to beautifulsoup

I’m surprised I never noticed it was appending to contents. I guess that is the answer then: use Tag.append() to modify an element’s contents, not contents.append() directly.

As there are various, diverse ways someone could modify the internals, it would be a pain to create a special content object to warn a user of every possible way they may break the linkage.
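
To give a feel for how many entry points such an object would have to cover, here is a rough standard-library sketch. GuardedContents is a hypothetical name and nothing like it exists in Beautiful Soup; it is a plain list subclass that warns once, the first time any mutating method is called.

```python
import warnings


class GuardedContents(list):
    """Hypothetical list subclass that warns once on direct mutation."""

    # Every list method that changes the list has to be covered.
    MUTATORS = ("append", "extend", "insert", "remove", "pop", "clear",
                "sort", "reverse", "__setitem__", "__delitem__",
                "__iadd__", "__imul__")

    def __init__(self, *args):
        super().__init__(*args)
        self._warned = False

    def _warn_once(self):
        if not self._warned:
            self._warned = True
            warnings.warn("modify the tree through Tag methods, "
                          "not the contents list", UserWarning, stacklevel=3)


def _guard(name):
    """Wrap the list method called name so it warns before running."""
    original = getattr(list, name)

    def wrapper(self, *args, **kwargs):
        self._warn_once()
        return original(self, *args, **kwargs)

    wrapper.__name__ = name
    return wrapper


for _name in GuardedContents.MUTATORS:
    setattr(GuardedContents, _name, _guard(_name))

contents = GuardedContents(["a"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    contents.append("b")
    contents += ["c"]
print(len(caught), contents)  # 1 ['a', 'b', 'c']
```

Even this toy version needs twelve wrapped methods, and the library's own methods would still need some way around the guard, since they modify the same list.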

Honestly, making it immutable…or maybe not immutable, but at least making direct access private is probably the most reasonable thing. At the very least, I’d document that it shouldn’t be modified directly.

If you ever did a major release (Beautiful Soup 5), aside from removing all the deprecated duplicate methods, this may be something worth considering. I’d consider moving contents to _contents and exposing a method that returns a generator of the contents, not unlike what etree does with getchildren().

Honestly, a safer get_contents() method could be exposed now that just yields the items in contents, and it could be encouraged instead of using contents directly. I imagine you could deprecate contents by making it a deprecated property that returns whatever is in _contents.
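
Roughly what I have in mind, sketched with a toy class. Node is a made-up stand-in, not Beautiful Soup's Tag, and get_contents() is only the name proposed above, not an existing method.

```python
import warnings


class Node:
    """Hypothetical stand-in for a tree node; not Beautiful Soup's Tag."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self._contents = []  # private storage, changed only by methods

    def append(self, child):
        """Attach child and keep its parent link consistent."""
        child.parent = self
        self._contents.append(child)

    def get_contents(self):
        """Yield the children without handing out the underlying list."""
        yield from self._contents

    @property
    def contents(self):
        """Deprecated accessor that still returns the raw list."""
        warnings.warn("contents is deprecated; use get_contents()",
                      DeprecationWarning, stacklevel=2)
        return self._contents


table = Node("table")
table.append(Node("tr"))
names = [child.name for child in table.get_contents()]
print(names)  # ['tr']
```

The deprecated property still hands back the mutable list, so it only discourages direct edits; the generator is the part that actually removes the temptation.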
