Delete dictionary keys during an iteration


KP

Nov 29, 2020, 7:52:33 AM
to beautifulsoup

When trying to use BS4 I got the following error:

Traceback (most recent call last):
  File "script.py", line 16, in <module>
    mycall()
  File "/home/user/mycode.py", line 63, in downloadDataset
    soup = BeautifulSoup(r, "html5lib")
  File "/home/user/.local/lib/python3.8/site-packages/bs4/__init__.py", line 348, in __init__
    self._feed()
  File "/home/user/.local/lib/python3.8/site-packages/bs4/__init__.py", line 434, in _feed
    self.builder.feed(self.markup)
  File "/home/user/.local/lib/python3.8/site-packages/bs4/builder/_html5lib.py", line 87, in feed
    doc = parser.parse(markup, **extra_kwargs)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 223, in parse
    self._parse(stream, innerHTML=False, encoding=encoding,
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 187, in mainLoop
    new_token = phase.processStartTag(new_token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 468, in processStartTag
    return self.startTagHandler[token["name"]](token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 1258, in startTagMath
    self.parser.adjustForeignAttributes(token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 339, in adjustForeignAttributes
    for originalName in token["data"].keys():
RuntimeError: dictionary keys changed during iteration


I checked the function "adjustForeignAttributes" and it does indeed change the dictionary's keys while iterating over it. I tried to replicate this in a separate snippet and it throws a different error, so I guess doing this is problematic.
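
For illustration, here is a minimal sketch of the pattern described above: renaming keys while looping over the dictionary's live key view, next to a variant that loops over a snapshot of the keys instead. The function names and the REPLACEMENTS table are made up for this example and are not html5lib's actual code. On recent CPython versions the first variant should raise a RuntimeError; the exact message can differ depending on whether the dictionary's size changes between iterations.

```python
# Hypothetical rename table, for illustration only (not html5lib's real table).
REPLACEMENTS = {"xmlns": "renamed-xmlns"}


def rename_keys_unsafe(attrs, replacements):
    """Rename keys while iterating over the live key view (the problematic pattern)."""
    for name in attrs.keys():
        if name in replacements:
            attrs[replacements[name]] = attrs[name]
            del attrs[name]
    return attrs


def rename_keys_safe(attrs, replacements):
    """Same renaming, but iterate over a snapshot of the keys taken up front."""
    for name in list(attrs):
        if name in replacements:
            attrs[replacements[name]] = attrs.pop(name)
    return attrs


if __name__ == "__main__":
    attrs = {"xmlns": "http://www.w3.org/1998/Math/MathML"}

    try:
        rename_keys_unsafe(dict(attrs), REPLACEMENTS)
    except RuntimeError as exc:
        print("unsafe version failed:", exc)

    print("safe version result:", rename_keys_safe(dict(attrs), REPLACEMENTS))
```

Taking the snapshot with list(attrs) costs one extra list of keys, but it means the loop no longer depends on the dictionary staying unchanged.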

I'm a bit surprised, because I was parsing something similar (maybe even the same markup) before and did not get this error. Also, I haven't changed my Python version for a while (3.8.6). If I disable this function (by commenting out line 1259), parsing behaves as desired. Questions:

1) Is calling this function necessary? What does it do in practice?
2) Why do I get this error? Why are the dictionary keys being changed?

Thanks

leonardr

Nov 29, 2020, 7:53:30 AM
to beautifulsoup
What's the markup you're using? html5lib has a lot of edge cases that only turn up with really specific markup.

Leonard

KP

Nov 29, 2020, 9:25:03 AM
to beautifulsoup
After experimenting with the particular website for a while, I was able to extract a minimal example that reproduces the error. Parsing the following fails:

<!DOCTYPE html>
<html>
<body>
    <math xmlns="http://www.w3.org/1998/Math/MathML"></math>
</body>
</html>


However, both of these variants work:

1) <asd xmlns="http://www.w3.org/1998/Math/MathML"></asd>

2) <math asd="http://www.w3.org/1998/Math/MathML"></math>

(the link itself seems to be irrelevant)

leonardr

Nov 29, 2020, 2:34:40 PM
to beautifulsoup
My guess is that you're encountering a bug in html5lib which was fixed by this commit in 2016. That commit removes the line that caused the error in your traceback, and the goal of that commit -- "Preserve attribute order when parsing" -- sounds like it could fix the sort of problem you're having.

Check what version of html5lib you're using and see if upgrading fixes the problem.
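
One possible way to check the installed version from Python is sketched below. It relies only on the standard library's importlib.metadata (available from Python 3.8); the helper name installed_version is made up for this example, and the distribution names "html5lib" and "beautifulsoup4" are assumed to be the names the packages were installed under.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them if your install differs.
    for name in ("html5lib", "beautifulsoup4"):
        print(name, installed_version(name))
```

If the reported html5lib version is old, "pip install --upgrade html5lib" is the usual way to upgrade. Since the traceback shows html5lib under /usr/lib/python3/dist-packages, it may have come from a system package, in which case upgrading through the system package manager could be the better route.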

Leonard