copy.copy(soup) takes longer than expected

Chris Papademetrious

Apr 28, 2024, 9:20:07 PM
to beautifulsoup
Hello fellow Soupers!

I have a 76 MB HTML file containing about 1 million tags. It takes 23 seconds to parse:

# 23 seconds
import bs4
soup = bs4.BeautifulSoup(html_doc, "lxml")

but it takes 56 seconds to copy:

# 56 seconds
import copy
soup2 = copy.copy(soup)

Does this seem right?

It only takes 1.1 seconds to compare them (the comparison returned True):

# 1.1 seconds
print(soup == soup2)

which is surprising given that just serializing it to a string takes 11 seconds:

# 11 seconds
text = str(soup)
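
(For reference, a minimal timing sketch along these lines is how numbers like the above can be measured; the file path is just a placeholder for the actual 76 MB document.)

import time
import copy
import bs4

with open("big_file.html", encoding="utf-8") as f:  # placeholder path
    html_doc = f.read()

t0 = time.perf_counter()
soup = bs4.BeautifulSoup(html_doc, "lxml")   # parse
t1 = time.perf_counter()
soup2 = copy.copy(soup)                      # copy
t2 = time.perf_counter()
text = str(soup)                             # serialize
t3 = time.perf_counter()
print(f"parse:     {t1 - t0:.1f} s")
print(f"copy.copy: {t2 - t1:.1f} s")
print(f"str():     {t3 - t2:.1f} s")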

 - Chris


Jonn Doe

Apr 29, 2024, 7:21:18 AM
to beautifulsoup
from bs4 import BeautifulSoup

txt = "<p>hello</p>"  # placeholder document; any HTML string works here

# a, b, and soup are three names bound to the same BeautifulSoup object
a = b = soup = BeautifulSoup(txt, 'html5lib')
print("a:", a)
print("b:", b)
print("a == b:", a == b)
# rebinding b to a different object does not affect a or soup
b = 5
print("b:", b)
print("a == b:", a == b)

Chris Papademetrious

May 5, 2024, 8:30:46 AM
to beautifulsoup
Hi Jonn,

I'm not quite sure what you're suggesting here.

 - Chris

Carlos

May 6, 2024, 5:47:11 AM
to beautifulsoup
Hello Chris, I've tested what you described with a 31 MB HTML file (806,952 tags) and I also get huge times for Python to copy the soup object compared to the time spent creating it initially: 8.7 seconds to parse the document and 21.35 seconds to copy it.
Then I profiled the copy.copy() call, which goes through the Tag class's __copy__ method and then recursively through its __deepcopy__ and _clone methods. The stats I got are the following:

156807709 function calls (156534794 primitive calls) in 101.433 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  101.433  101.433 {built-in method builtins.exec}
        1    0.000    0.000  101.433  101.433 <string>:1(<module>)
        1    0.000    0.000  101.433  101.433 copy.py:66(copy)
        1    0.000    0.000  101.433  101.433 element.py:1346(__copy__)
 271891/1    2.268    0.000  101.433  101.433 element.py:1318(__deepcopy__)
   271890    2.240    0.000   80.649    0.000 element.py:1352(_clone)
   543780    1.339    0.000   76.944    0.000 element.py:1605(__getattr__)
   543780    0.904    0.000   75.117    0.000 element.py:1987(find)
   543780    1.334    0.000   74.213    0.000 element.py:2013(find_all)
   543780   13.864    0.000   72.738    0.000 element.py:792(_find_all)
 13952142   20.373    0.000   39.910    0.000 element.py:2303(search)
   806952    0.879    0.000   11.870    0.000 element.py:488(append)
 14495922    5.239    0.000   11.314    0.000 {built-in method builtins.next}
   806952    6.919    0.000   10.828    0.000 element.py:406(insert)
 55613431    9.934    0.000   10.472    0.000 {built-in method builtins.isinstance}
  4580416    7.013    0.000    9.175    0.000 element.py:2240(search_tag)
 15302875    5.975    0.000    6.375    0.000 element.py:2062(descendants)
   543780    3.260    0.000    5.954    0.000 element.py:2155(__init__)
 16857871    4.481    0.000    4.481    0.000 {built-in method builtins.hasattr}
  1056301    1.990    0.000    4.213    0.000 element.py:1783(_event_stream)
  1087560    1.107    0.000    2.482    0.000 element.py:2203(_normalize_search_value)
   535062    0.503    0.000    2.325    0.000 element.py:958(__deepcopy__)
   535062    0.810    0.000    1.822    0.000 element.py:943(__new__)
  9160832    1.696    0.000    1.696    0.000 element.py:1586(__bool__)
  1843796    1.200    0.000    1.505    0.000 element.py:387(_last_descendant)
  1056297    0.696    0.000    1.441    0.000 element.py:1641(__ne__)
   806952    0.923    0.000    1.248    0.000 <frozen importlib._bootstrap>:1207(_handle_fromlist)
   271891    0.801    0.000    1.112    0.000 element.py:1199(__init__)
   806953    0.844    0.000    0.844    0.000 element.py:156(setup)
  1056297    0.585    0.000    0.745    0.000 element.py:1624(__eq__)
3782358/3781334    0.723    0.000    0.724    0.000 {built-in method builtins.len}
   543780    0.600    0.000    0.600    0.000 element.py:2422(__init__)

There are a bunch of calls to the find and find_all methods on the original soup object, which may add some overhead compared to constructing the parse tree directly from the original HTML document. Take this with a grain of salt, because I'm not really familiar with the bs4 implementation; it's just a hypothesis. I have created a repository on Launchpad with the HTML document and the code I used to test this.
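
For reference, a profile like the one above can be collected with a minimal cProfile sketch along these lines (the file name and parser choice are placeholders, not necessarily what I used):

import copy
import cProfile
import pstats
import bs4

with open("document.html", encoding="utf-8") as f:  # placeholder path
    soup = bs4.BeautifulSoup(f.read(), "lxml")

# Profile only the copy, then print the hottest entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
soup2 = copy.copy(soup)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)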

Carlos

May 6, 2024, 6:05:16 AM
to beautifulsoup
Also, as you mentioned, it takes less time to convert the soup object to a string. Not only that, but creating a new soup object with
    soup2 = bs4.BeautifulSoup(str(soup), "lxml")

takes 14 seconds for me, somewhat less time than copying it.

Chris Papademetrious

May 15, 2024, 11:04:41 AM
to beautifulsoup
Hi phoenix,

Very cool profiling information! I've never done that before - I will definitely check out your code.

Based on your findings, I suspect that the SoupStrainer construction in the find() and find_all() methods is the cause.

 - Chris

Chris Papademetrious

May 16, 2024, 7:49:30 AM
to beautifulsoup
phoenix,

Thanks to you teaching me about profiling, I have determined that the sourceline and sourcepos calls in the Tag _clone() method are the cause of the numerous find*() calls (and thus the runtime):

    def _clone(self):
        """Create a new Tag just like this one, but with no
        contents and unattached to any parse tree.

        This is the first step in the deepcopy process.
        """
        clone = type(self)(
            None, None, self.name, self.namespace,
            self.prefix, self.attrs, is_xml=self._is_xml,
            sourceline=self.sourceline, sourcepos=self.sourcepos,
            can_be_empty_element=self.can_be_empty_element,
            cdata_list_attributes=self.cdata_list_attributes,
            preserve_whitespace_tags=self.preserve_whitespace_tags,
            interesting_string_types=self.interesting_string_types
        )
        for attr in ('can_be_empty_element', 'hidden'):
            setattr(clone, attr, getattr(self, attr))
        return clone

I will file an enhancement request to improve the copy.copy() runtime and mention this finding.

Thanks for your help!

 - Chris

Chris Papademetrious

May 16, 2024, 7:53:20 AM
to beautifulsoup
Actually, looking at the code, I can't see how that is the cause. They are just numeric positions in a file...

I will investigate further.

 - Chris

Chris Papademetrious

May 16, 2024, 8:51:04 AM
to beautifulsoup
Yes, it turns out that they are the cause. If I uncomment the debug print() statement in the __getattr__() method and move it here:

    def __getattr__(self, tag):
        """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
        if len(tag) > 3 and tag.endswith('Tag'):
            # BS3: soup.aTag -> "soup.find("a")
            tag_name = tag[:-3]
            warnings.warn(
                '.%(name)sTag is deprecated, use .find("%(name)s") instead. If you really were looking for a tag called %(name)sTag, use .find("%(name)sTag")' % dict(
                    name=tag_name
                ),
                DeprecationWarning, stacklevel=2
            )
            return self.find(tag_name)
        # We special case contents to avoid recursion.
        elif not tag.startswith("__") and not tag == "contents":
            print("Getattr %s.%s" % (self.__class__, tag))
            return self.find(tag)
        raise AttributeError(
            "'%s' object has no attribute '%s'" % (self.__class__, tag))

then I see output like this:

Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos
Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos
Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos
Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos

so I think the runtime is coming from the SoupStrainer construction for that find().
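
Whether the cost is mostly the SoupStrainer construction itself or the descendant scan that each find() then performs, here is a hypothetical, completely untested workaround sketch: if the slow path really is the __getattr__() fallback for the missing sourceline/sourcepos attributes, pre-setting those attributes on every tag should let _clone() read them directly and skip the find() calls (the file path is a placeholder):

import copy
import bs4

with open("big_file.html", encoding="utf-8") as f:  # placeholder path
    soup = bs4.BeautifulSoup(f.read(), "lxml")

# Pre-set the attributes that _clone() reads. vars(tag) is the instance
# __dict__, so checking it does not itself trigger __getattr__().
for tag in [soup, *soup.find_all(True)]:
    if "sourceline" not in vars(tag):
        tag.sourceline = None
    if "sourcepos" not in vars(tag):
        tag.sourcepos = None

soup2 = copy.copy(soup)  # the per-tag find() fallback should no longer fire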

Now I will file the issue.  :)

 - Chris

Chris Papademetrious

May 21, 2024, 9:01:19 AM
to beautifulsoup
For reference, here is the issue I filed:


 - Chris