Configuring BeautifulSoup to disable lxml's global dictionary
Jack Poulson

Dec 13, 2020, 6:26:24 PM
to beautifulsoup
Hi Everyone,

I am pretty new to BeautifulSoup but managed to get a scraper for NLRB filings [1] working after realizing that the lxml parser was required to properly parse the tables.

Unfortunately, I found that there is a prohibitive memory leak in the lxml calls when I parse thousands of documents in a loop in a single script.

I believe that this leak is due to lxml's global dictionary described in this blog post:
https://benbernardblog.com/tracking-down-a-freaky-python-memory-leak-part-2/

Is there a way to configure BeautifulSoup to pass 'collect_ids=False' when it constructs the lxml parser, trading a bit of runtime for lower memory usage?
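One workaround I have been experimenting with (a sketch, not an official BeautifulSoup option): BeautifulSoup's constructor accepts a `builder` argument, and the lxml XML tree builder creates its parser in a `default_parser` method, so a subclass can override that method to pass `collect_ids=False`. Two caveats: `collect_ids` is an option of `etree.XMLParser` only, so this applies when the documents can be parsed as XML (the "lxml-xml" path), not with lxml's HTMLParser; and the class name `NoIDDictTreeBuilder` below is just something I made up.

```python
from bs4 import BeautifulSoup
from bs4.builder import LXMLTreeBuilderForXML
from lxml import etree

class NoIDDictTreeBuilder(LXMLTreeBuilderForXML):
    """Hypothetical tree builder that disables lxml's per-document ID table."""

    def default_parser(self, encoding):
        # Same arguments bs4 normally uses, plus collect_ids=False so lxml
        # skips building its hash table of XML IDs while parsing.
        return etree.XMLParser(
            target=self,
            strip_cdata=False,
            recover=True,
            encoding=encoding,
            collect_ids=False,
        )

# Example document; in my case this would be one of the scraped filings.
markup = "<filings><filing id='1'>NLRB-01-CA-000001</filing></filings>"
soup = BeautifulSoup(markup, builder=NoIDDictTreeBuilder)
print(soup.find("filing").get_text())
```

I have not verified that this eliminates the leak described in the blog post (the global interned-string dictionary is a separate structure from the per-document ID table), so I would welcome corrections.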

With Thanks,
Jack Poulson