Configuring BeautifulSoup to disable lxml's global dictionary
Jack Poulson

Dec 13, 2020, 6:26:24 PM
to beautifulsoup
Hi Everyone,

I am pretty new to BeautifulSoup but managed to get a scraper for NLRB filings [1] working after realizing that the lxml parser was required to properly parse the tables.

Unfortunately, I found that there is a prohibitive memory leak in the lxml calls when I parse thousands of documents in a loop in a single script.

I believe that this leak is due to lxml's global dictionary described in this blog post:
https://benbernardblog.com/tracking-down-a-freaky-python-memory-leak-part-2/

Is there a way to configure BeautifulSoup to pass 'collect_ids=False' when it constructs the lxml parser, trading a bit of runtime for lower memory usage?
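One workaround I have been experimenting with (a sketch, not an official BeautifulSoup option): BeautifulSoup's constructor accepts a `builder` argument, and the lxml XML tree builder creates its parser in a `default_parser` method, so a subclass can override that method to pass `collect_ids=False`. Two caveats: `collect_ids` is an option of `etree.XMLParser` only, so this applies when the documents can be parsed as XML (the "lxml-xml" path), not with lxml's HTMLParser; and the class name `NoIDDictTreeBuilder` below is just something I made up.

```python
from bs4 import BeautifulSoup
from bs4.builder import LXMLTreeBuilderForXML
from lxml import etree

class NoIDDictTreeBuilder(LXMLTreeBuilderForXML):
    """Hypothetical tree builder that disables lxml's per-document ID table."""

    def default_parser(self, encoding):
        # Same arguments bs4 normally uses, plus collect_ids=False so lxml
        # skips building its hash table of XML IDs while parsing.
        return etree.XMLParser(
            target=self,
            strip_cdata=False,
            recover=True,
            encoding=encoding,
            collect_ids=False,
        )

# Example document; in my case this would be one of the scraped filings.
markup = "<filings><filing id='1'>NLRB-01-CA-000001</filing></filings>"
soup = BeautifulSoup(markup, builder=NoIDDictTreeBuilder)
print(soup.find("filing").get_text())
```

I have not verified that this eliminates the leak described in the blog post (the global interned-string dictionary is a separate structure from the per-document ID table), so I would welcome corrections.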

With Thanks,
Jack Poulson