do large tags break when extracted from soup?

Sumeet Sandhu

Sep 7, 2016, 3:01:21 PM
to beautifulsoup
I originally posted this under the SUBJECT: using soup.tag.extract() to remove large tags and speed up soup - counterintuitive behavior

My PROBLEM was: when I extracted a tag from soup and processed it with a function, it was drastically slower than processing the tag while it was still attached to its soup. Profile results for both approaches are attached below.

The SOLUTION turned out to be: I extract the tag, serialize it with unicode(tag), then build a fresh tree with BeautifulSoup(unicode(tag), 'lxml') and process that instead. This takes almost the same time as processing the tag while it is still in the soup.
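For reference, a minimal sketch of that workaround. It assumes Python 3, where str(tag) plays the role of Python 2's unicode(tag), and it uses the built-in 'html.parser' so it runs without lxml (the original used 'lxml'). The sample markup and the names big, fresh_soup and fresh are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the real one was much larger.
html = (
    "<html><body>"
    "<div id='big'><p>One. Two.</p><p>Three.</p></div>"
    "<p>rest</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Step 1: detach the large tag from the original soup.
big = soup.find("div", id="big").extract()

# Step 2: serialize the detached tag and parse it again as a fresh tree.
# str(tag) in Python 3 corresponds to unicode(tag) in Python 2.
fresh_soup = BeautifulSoup(str(big), "html.parser")
fresh = fresh_soup.find("div", id="big")

# Step 3: process the freshly parsed tag instead of the detached one.
paragraphs = [p.get_text() for p in fresh.find_all("p")]
print(paragraphs)
```

The re-parse costs one extra pass over the markup, which the profile for the in-soup case suggests is cheap compared with the slow path.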

What is it about tag.extract() that breaks the extracted tag? The documentation says:
"At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted."

What nuances am I missing?
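To make the quoted sentence concrete, here is a small sketch of what "two parse trees" looks like after extract(). The markup and variable names are made up for illustration, and 'html.parser' is used only so the snippet runs without lxml. It shows the detachment itself and does not explain the slowdown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one tag we want to pull out.
soup = BeautifulSoup(
    "<div><section id='big'><p>text</p></section><p>other</p></div>",
    "html.parser",
)

big = soup.find("section", id="big")
extracted = big.extract()  # returns the same tag, now detached

# Tree 1: the extracted tag is its own root and has no parent.
detached = extracted.parent is None
inner = extracted.find("p").get_text()

# Tree 2: the original soup no longer contains the tag.
still_in_soup = soup.find("section", id="big") is not None
remaining = str(soup)

print(detached, still_in_soup, inner, remaining)
```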

regards
Sumeet
----------------------------
cProfile results for the two approaches: (1) mark sentences in a large tag while it is still inside the original soup; (2) extract the large tag from the soup, then mark sentences. Approach #2 is disastrously slow (90.007 seconds versus 1.677 seconds); see the top 4-5 function calls in its profile.
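A profile in this format can be collected roughly as follows. This is only a sketch: mark_sentences is a hypothetical stand-in for my sentence-marking function, and the report is sorted by internal time and limited to 10 rows to match the listings below.

```python
import cProfile
import io
import pstats


def mark_sentences(text):
    # Hypothetical stand-in for the real sentence-marking function.
    return [s.strip() for s in text.split(".") if s.strip()]


# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = mark_sentences("One. Two. Three.")
profiler.disable()

# Sort by internal time (tottime) and keep the top 10 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(10)
report = buffer.getvalue()
print(report)
```

Running the same harness once on the tag in the soup and once on the extracted tag gives two reports that can be compared line by line.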

PROCESS TAGS INSIDE SOUP

Tue Sep  6 16:48:26 2016    tagSent

        2138697 function calls (2106434 primitive calls) in 1.677 seconds

  Ordered by: internal time
  List reduced from 144 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
15641/223    0.139    0.000    0.433    0.002 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:1065(decode)
     514    0.121    0.000    0.980    0.002 {method 'feed' of 'lxml.etree._FeedParser' objects}
   18093    0.105    0.000    0.524    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/builder/_lxml.py:136(start)
   54793    0.088    0.000    0.312    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/__init__.py:287(endData)
   83496    0.088    0.000    0.090    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:191(setup)
   19640    0.069    0.000    0.176    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:783(__init__)
15641/223    0.069    0.000    0.426    0.002 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:1164(decode_contents)
   32956    0.062    0.000    0.062    0.000 {built-in method __new__ of type object at 0x100188900}
  315646    0.060    0.000    0.071    0.000 {isinstance}
   57801    0.053    0.000    0.055    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}


EXTRACT TAGS FROM SOUP, THEN PROCESS

Tue Sep  6 16:49:59 2016    tagSent

        178885528 function calls (171646797 primitive calls) in 90.007 seconds

  Ordered by: internal time
  List reduced from 125 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
14425177   11.030    0.000   15.582    0.000 {hasattr}
58030351   10.841    0.000   26.147    0.000 {isinstance}
10812650    9.615    0.000   75.500    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:1639(search)
 7211540    9.528    0.000   15.306    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/abc.py:128(__instancecheck__)
    2279    6.745    0.003   88.108    0.039 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:506(_find_all)
 3605513    6.433    0.000   32.307    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:1665(_matches)
 3605513    6.096    0.000   49.311    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:1598(search_tag)
 3610071    5.019    0.000   14.571    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:1562(_normalize_search_value)
14422566    4.877    0.000    4.877    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_weakrefset.py:70(__contains__)
 7207137    4.552    0.000    4.552    0.000 /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py:704(__getattr__)
