copy.copy(soup) takes longer than expected

Chris Papademetrious

Apr 28, 2024, 9:20:07 PM
to beautifulsoup
Hello fellow Soupers!

I have a 76 MB HTML file containing about 1 million tags. It takes 23 seconds to parse:

# 23 seconds
import bs4
soup = bs4.BeautifulSoup(html_doc, "lxml")

but it takes 56 seconds to copy:

# 56 seconds
import copy
soup2 = copy.copy(soup)

Does this seem right?

It only takes 1.1 seconds to compare them (the comparison returned True):

# 1.1 seconds
print(soup == soup2)

which is surprising given that just serializing it to a string takes 11 seconds:

# 11 seconds
text = str(soup)
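
(For reference, a minimal timing sketch along these lines is how numbers like the above can be measured; the file path is just a placeholder for the actual 76 MB document.)

import time
import copy
import bs4

with open("big_file.html", encoding="utf-8") as f:  # placeholder path
    html_doc = f.read()

t0 = time.perf_counter()
soup = bs4.BeautifulSoup(html_doc, "lxml")   # parse
t1 = time.perf_counter()
soup2 = copy.copy(soup)                      # copy
t2 = time.perf_counter()
text = str(soup)                             # serialize
t3 = time.perf_counter()
print(f"parse:     {t1 - t0:.1f} s")
print(f"copy.copy: {t2 - t1:.1f} s")
print(f"str():     {t3 - t2:.1f} s")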

 - Chris


Jonn Doe

Apr 29, 2024, 7:21:18 AM
to beautifulsoup
from bs4 import BeautifulSoup

txt = "<p>hello</p>"  # placeholder document; any HTML string works here

# a, b, and soup are three names bound to the same BeautifulSoup object
a = b = soup = BeautifulSoup(txt, 'html5lib')
print("a:", a)
print("b:", b)
print("a == b:", a == b)
# rebinding b to a different object does not affect a or soup
b = 5
print("b:", b)
print("a == b:", a == b)

Chris Papademetrious

May 5, 2024, 8:30:46 AM
to beautifulsoup
Hi Jonn,

I'm not quite sure what you're suggesting here.

 - Chris

Carlos

May 6, 2024, 5:47:11 AM
to beautifulsoup
Hello Chris, I've tested what you described with a 31 MB HTML file (806,952 tags) and I also get huge times for Python to copy the soup object compared to the time spent creating it initially: 8.7 seconds to parse the document and 21.35 seconds to copy it.
Then I profiled the copy.copy() call, which goes through the Tag class's __copy__ method and then recursively through its __deepcopy__ and _clone methods. The stats I got are the following:

156807709 function calls (156534794 primitive calls) in 101.433 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  101.433  101.433 {built-in method builtins.exec}
        1    0.000    0.000  101.433  101.433 <string>:1(<module>)
        1    0.000    0.000  101.433  101.433 copy.py:66(copy)
        1    0.000    0.000  101.433  101.433 element.py:1346(__copy__)
 271891/1    2.268    0.000  101.433  101.433 element.py:1318(__deepcopy__)
   271890    2.240    0.000   80.649    0.000 element.py:1352(_clone)
   543780    1.339    0.000   76.944    0.000 element.py:1605(__getattr__)
   543780    0.904    0.000   75.117    0.000 element.py:1987(find)
   543780    1.334    0.000   74.213    0.000 element.py:2013(find_all)
   543780   13.864    0.000   72.738    0.000 element.py:792(_find_all)
 13952142   20.373    0.000   39.910    0.000 element.py:2303(search)
   806952    0.879    0.000   11.870    0.000 element.py:488(append)
 14495922    5.239    0.000   11.314    0.000 {built-in method builtins.next}
   806952    6.919    0.000   10.828    0.000 element.py:406(insert)
 55613431    9.934    0.000   10.472    0.000 {built-in method builtins.isinstance}
  4580416    7.013    0.000    9.175    0.000 element.py:2240(search_tag)
 15302875    5.975    0.000    6.375    0.000 element.py:2062(descendants)
   543780    3.260    0.000    5.954    0.000 element.py:2155(__init__)
 16857871    4.481    0.000    4.481    0.000 {built-in method builtins.hasattr}
  1056301    1.990    0.000    4.213    0.000 element.py:1783(_event_stream)
  1087560    1.107    0.000    2.482    0.000 element.py:2203(_normalize_search_value)
   535062    0.503    0.000    2.325    0.000 element.py:958(__deepcopy__)
   535062    0.810    0.000    1.822    0.000 element.py:943(__new__)
  9160832    1.696    0.000    1.696    0.000 element.py:1586(__bool__)
  1843796    1.200    0.000    1.505    0.000 element.py:387(_last_descendant)
  1056297    0.696    0.000    1.441    0.000 element.py:1641(__ne__)
   806952    0.923    0.000    1.248    0.000 <frozen importlib._bootstrap>:1207(_handle_fromlist)
   271891    0.801    0.000    1.112    0.000 element.py:1199(__init__)
   806953    0.844    0.000    0.844    0.000 element.py:156(setup)
  1056297    0.585    0.000    0.745    0.000 element.py:1624(__eq__)
3782358/3781334    0.723    0.000    0.724    0.000 {built-in method builtins.len}
   543780    0.600    0.000    0.600    0.000 element.py:2422(__init__)

There are a bunch of calls to the find and find_all methods on the original soup object, which may add some overhead compared to constructing the parse tree directly from the original HTML document. Take this with a grain of salt, because I'm not really familiar with the bs4 implementation; it's just a hypothesis. I have created a repository on Launchpad with the HTML document and the code I used to test this.
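
For reference, a profile like the one above can be collected with a minimal cProfile sketch along these lines (the file name and parser choice are placeholders, not necessarily what I used):

import copy
import cProfile
import pstats
import bs4

with open("document.html", encoding="utf-8") as f:  # placeholder path
    soup = bs4.BeautifulSoup(f.read(), "lxml")

# Profile only the copy, then print the hottest entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
soup2 = copy.copy(soup)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)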

Carlos

May 6, 2024, 6:05:16 AM
to beautifulsoup
Also, as you mentioned, it takes less time to convert the soup object to a string. Not only that, but creating a new soup object with
    soup2 = bs4.BeautifulSoup(str(soup), "lxml")

takes 14 seconds for me, somewhat less time than copying it.

Chris Papademetrious

May 15, 2024, 11:04:41 AM
to beautifulsoup
Hi phoenix,

Very cool profiling information! I've never done that before - I will definitely check out your code.

Based on your findings, I suspect that the SoupStrainer construction in the find() and find_all() methods is the cause.

 - Chris

Chris Papademetrious

May 16, 2024, 7:49:30 AM
to beautifulsoup
phoenix,

Thanks to you teaching me about profiling, I have determined that the sourceline and sourcepos calls in the Tag _clone() method are the cause of the numerous find*() calls (and thus the runtime):

    def _clone(self):
        """Create a new Tag just like this one, but with no
        contents and unattached to any parse tree.

        This is the first step in the deepcopy process.
        """
        clone = type(self)(
            None, None, self.name, self.namespace,
            self.prefix, self.attrs, is_xml=self._is_xml,
            sourceline=self.sourceline, sourcepos=self.sourcepos,
            can_be_empty_element=self.can_be_empty_element,
            cdata_list_attributes=self.cdata_list_attributes,
            preserve_whitespace_tags=self.preserve_whitespace_tags,
            interesting_string_types=self.interesting_string_types
        )
        for attr in ('can_be_empty_element', 'hidden'):
            setattr(clone, attr, getattr(self, attr))
        return clone

I will file an enhancement request to improve the copy.copy() runtime and mention this finding.

Thanks for your help!

 - Chris

Chris Papademetrious

May 16, 2024, 7:53:20 AM
to beautifulsoup
Actually, looking at the code, I can't see how that is the cause. They are just numeric positions in a file...

I will investigate further.

 - Chris

Chris Papademetrious

May 16, 2024, 8:51:04 AM
to beautifulsoup
Yes, it turns out that they are the cause. If I uncomment the debug print() statement in the __getattr__() method and move it here:

    def __getattr__(self, tag):
        """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
        if len(tag) > 3 and tag.endswith('Tag'):
            # BS3: soup.aTag -> "soup.find("a")
            tag_name = tag[:-3]
            warnings.warn(
                '.%(name)sTag is deprecated, use .find("%(name)s") instead. If you really were looking for a tag called %(name)sTag, use .find("%(name)sTag")' % dict(
                    name=tag_name
                ),
                DeprecationWarning, stacklevel=2
            )
            return self.find(tag_name)
        # We special case contents to avoid recursion.
        elif not tag.startswith("__") and not tag == "contents":
            print("Getattr %s.%s" % (self.__class__, tag))
            return self.find(tag)
        raise AttributeError(
            "'%s' object has no attribute '%s'" % (self.__class__, tag))

then I see output like this:

Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos
Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos
Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos
Getattr <class 'bs4.element.Tag'>.sourceline
Getattr <class 'bs4.element.Tag'>.sourcepos

so I think the runtime is coming from the SoupStrainer construction for that find().
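
Whether the cost is mostly the SoupStrainer construction itself or the descendant scan that each find() then performs, here is a hypothetical, completely untested workaround sketch: if the slow path really is the __getattr__() fallback for the missing sourceline/sourcepos attributes, pre-setting those attributes on every tag should let _clone() read them directly and skip the find() calls (the file path is a placeholder):

import copy
import bs4

with open("big_file.html", encoding="utf-8") as f:  # placeholder path
    soup = bs4.BeautifulSoup(f.read(), "lxml")

# Pre-set the attributes that _clone() reads. vars(tag) is the instance
# __dict__, so checking it does not itself trigger __getattr__().
for tag in [soup, *soup.find_all(True)]:
    if "sourceline" not in vars(tag):
        tag.sourceline = None
    if "sourcepos" not in vars(tag):
        tag.sourcepos = None

soup2 = copy.copy(soup)  # the per-tag find() fallback should no longer fire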

Now I will file the issue.  :)

 - Chris

Chris Papademetrious

May 21, 2024, 9:01:19 AM
to beautifulsoup
For reference, here is the issue I filed:


 - Chris