extract() function may cause recursive error

xiaoqi li

Jan 5, 2018, 7:12:36 AM
to beautifulsoup
Hi all,
I use beautifulsoup4 (4.6.0) with Python 2.7.11 on CentOS.

I use the extract() function to remove the script and style tags from the tree.

After extracting the script and style tags, I walk the remaining tree recursively,
but some tags are missing.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('E2y4zvik1V.html'), 'html5lib')

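[s.extract() for s in soup('script')]  # remove every <script> tag from the tree
[s.extract() for s in soup('style')]   # remove every <style> tag from the tree

for tag in soup.recursiveChildGenerator():
    tag_name = getattr(tag, "name", None)  # text nodes may have no name
    print(tag_name)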

The content of E2y4zvik1V.html is:
<html>
<head>
<title>JFolder ������ͷ���޸İ�</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<style>
body {
font-size: 14px;
font-family: ����;
background-color: #CCCCCC;
}
td {
font-size: 14px;
font-family: ����;
}

input.textbox {
border: black solid 1;
font-size: 12px;
height: 18px;
}

input.button {
font-size: 12px;
font-family: ����;
border: black solid 1;
}

td.datarows {
font-size: 14px;
font-family: ����;
height: 25px;
}

textarea {
border: black solid 1;
}
.inputLogin {font-size: 9pt;border:1px solid lightgrey;background-color: lightgrey;}
.table1 {BORDER:gray 0px ridge;}
.td2 {BORDER-RIGHT:#ffffff 0px solid;BORDER-TOP:#ffffff 1px solid;BORDER-LEFT:#ffffff 1px solid;BORDER-BOTTOM:#ffffff 0px solid;BACKGROUND-COLOR:lightgrey; height:18px;}
.tr1 {BACKGROUND-color:gray }
</style>
<script language="JavaScript">
<!--
function ltrim(str) {
while (str.indexOf(0) == " ")
str = str.substring(1);

return str;
}

function changeAction(obj) {
obj.submit();
}
//-->
</script>
<body>

<table align="center" width="600" border="0" cellpadding="2" cellspacing="0">
<form name="form1" method="get">
<tr bgcolor="#CCCCCC">
<td id="title"><!--[������ҳ]--></td>
<td align="right">
<select name="action" onChange="javascript:changeAction(document.form1)">
<option value="main">������ҳ</option>
<option value="filesystem">�ļ�ϵͳ</option>
<option value="command">ϵͳ����</option>
<option value="database">��ݿ�</option>
<option value="config">��������</option>
<option value="about">���ڳ���</option>
<option value="exit">�˳����</option>
</select>
<script language="JavaScript">
var action = "main"

var sAction = document.form1.action;
for (var i = 0; i < sAction.length; i ++) {
if (sAction[i].value == action) {
sAction[i].selected = true;
//title.innerHTML = "[" + sAction[i].innerHTML + "]";
}
}
</script>
</td>
</tr>
</form>
</table>

<table align="center" width="600" cellpadding="2" cellspacing="1" border="0" bgcolor="#CCCCCC">
<tr bgcolor="#FFFFFF">
<td colspan="2" align="center">��������Ϣ</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="300" align="center" class="datarows">��������</td>
<td align="center" class="datarows">tomcat4.ruletest.int.baidu.com</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="300" align="center" class="datarows">������˿�</td>
<td align="center" class="datarows">80</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="300" align="center" class="datarows">����ϵͳ</td>
<td align="center" class="datarows">Linux 3.10.0_1-0-0-8 amd64</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="300" align="center" class="datarows">��ǰ�û���</td>
<td align="center" class="datarows">root</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="300" align="center" class="datarows">��ǰ�û�Ŀ¼</td>
<td align="center" class="datarows">/root</td>
</tr>
</table>

</body>
</html>


The "tr" and "td" tags are gone.
When I call soup.find_all("td"), it returns [].
 

Jim Tittsler

Jan 5, 2018, 1:48:06 PM
to beautifulsoup
On Thu, Jan 4, 2018 at 11:18 PM, xiaoqi li <liziqi...@gmail.com> wrote:
> After extract script tags and style tag.
> I recursive the remaind tree.
> But some tags is gone.

Your HTML file looks invalid, and it might confuse even the lenient
parsing of html5lib. If you have a number of these files and they all
have the same inconsistencies, it might be possible to do some string
processing on the files first to turn them into valid HTML, and then
produce 'soup' from that intermediate result.
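
As a rough illustration of that kind of string preprocessing, here is a minimal sketch that drops whole script and style blocks from the raw markup before it is parsed. The helper name strip_script_and_style is made up for this example and is not part of Beautiful Soup, and a regular expression like this is crude (it would be fooled by a script that contains the text "</script>" inside a string), so treat it as a starting point only.

```python
import re

# Hypothetical helper, not part of Beautiful Soup: matches a complete
# <script>...</script> or <style>...</style> block, in any letter case,
# including blocks that span several lines.
_BLOCK_RE = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>',
                       re.IGNORECASE | re.DOTALL)


def strip_script_and_style(markup):
    """Return markup with every script and style block removed."""
    return _BLOCK_RE.sub('', markup)


sample = ('<head><style>td {height: 25px;}</style></head>'
          '<body><script>var a = 1;</script>'
          '<table><tr><td>80</td></tr></table></body>')
cleaned = strip_script_and_style(sample)
```

The cleaned string could then be passed to BeautifulSoup(cleaned, 'html5lib') in place of the open file, so that no extract() calls are needed afterwards.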

xiaoqi li

Jan 7, 2018, 10:07:19 AM
to beautifulsoup
In my Python code, if I swap line 5 ([s.extract() for s in soup('script')]) and line 6 ([s.extract() for s in soup('style')]),
the result is correct. The order of the extract() calls influences the result.
So I don't think this is a problem with html5lib's parsing.

Another observation: if I print the soup object directly after the extraction, it looks fine.
But when I use recursiveChildGenerator, the output is wrong.
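
One way to narrow down that difference is to walk the tree through each node's .contents list and compare the result with what recursiveChildGenerator yields. The sketch below is only a diagnostic idea, not a confirmed fix; walk_contents and tag_names are hypothetical helper names introduced here, and the code assumes nothing about the nodes beyond the .contents and .name attributes that Beautiful Soup tags provide.

```python
def walk_contents(node):
    """Yield every descendant of node, depth first, using only .contents."""
    # Text nodes have no .contents, so the empty default ends the recursion.
    for child in getattr(node, 'contents', ()):
        yield child
        for descendant in walk_contents(child):
            yield descendant


def tag_names(root):
    """Return the names of all tags under root, skipping unnamed nodes."""
    names = []
    for node in walk_contents(root):
        name = getattr(node, 'name', None)
        if name is not None:
            names.append(name)
    return names
```

Calling tag_names(soup) after the two extract() lines and checking whether 'tr' and 'td' appear in the result would show whether the tags are still in the tree and only the recursive generator skips them.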

On Saturday, January 6, 2018 at 2:48:06 AM UTC+8, Jim Tittsler wrote: