Can't parse url, but diagnose shows no problem

paul

Nov 6, 2021, 12:22:42 PM

to beautifulsoup

Hi, I am trying to parse this URL:

http://www.pbc.gov.cn/zhengcehuobisi/125207/125213/125431/125475/4380459/index.html

With BeautifulSoup(data, 'html.parser'), all of the content looks like this:

/Mw==','w6xpw6bCnMOmfsKr','wobCrcKu','w6nDrMKAZWU=','w4HDkhDDvsKeYcOj','ccOwUVbDkQ==','RVobC8OHPQw=','wpJGfMOn','w4HDuz

But when I downloaded the file to my local drive to run the diagnose function, it worked fine:

Diagnostic running on Beautiful Soup 4.10.0
Python version 3.8.3 (default, Jul 2 2020, 11:26:31) [Clang 10.0.0 ]
Found lxml version 4.6.3.0
Found html5lib version 1.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="./Announcement on Open Market Operations No.216 [2021]_files/default.css" id="lhgdialoglink" rel="stylesheet"/>
<title>
Announcement on Open Market Operations No.216 [2021]
</title>
<meta content="2021-11-05 11:00:11" name="页面生成时间"/>
<meta content="2021-10-21 17:38:26" name="缓存清理时间"/>
<meta content="7.9.5" name="easysite版本"/>
<meta content="Announcement on Open Market Operations No.216 2021" name="keywords"/>
<meta content="Announcement on Open Market Operations No.216 2021 初始化频

I was hoping to find out why I couldn't read the page directly from the URL, and whether there is a way to resolve that.

Thank you.

leonardr

Nov 6, 2021, 6:35:39 PM

to beautifulsoup

On Saturday, November 6, 2021 at 12:22:42 PM UTC-4 paul wrote:

Hi, I am trying to parse this URL:

http://www.pbc.gov.cn/zhengcehuobisi/125207/125213/125431/125475/4380459/index.html

With BeautifulSoup(data, 'html.parser'), all of the content looks like this:

/Mw==','w6xpw6bCnMOmfsKr','wobCrcKu','w6nDrMKAZWU=','w4HDkhDDvsKeYcOj','ccOwUVbDkQ==','RVobC8OHPQw=','wpJGfMOn','w4HDuz

This is the content as it is served from the pbc.gov.cn server. You can verify this with this curl command:

curl http://www.pbc.gov.cn/zhengcehuobisi/125207/125213/125431/125475/4380459/index.html

Or with this Python script which uses no Beautiful Soup code:

from urllib.request import urlopen

print(urlopen("http://www.pbc.gov.cn/zhengcehuobisi/125207/125213/125431/125475/4380459/index.html").read())

When you download this page with a web browser, you end up with the HTML that's generated after the browser runs a lot of JavaScript to decode an obfuscated web page. That is why the copy saved to your local drive parses fine while the raw server response does not. To get that final page in a form you can access from Python, I recommend using a browser automation tool like Selenium, which drives a real browser that runs the JavaScript for you.
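
For illustration, here is a minimal sketch of that approach. It assumes Selenium 4 with Chrome and a matching driver installed. The function names (fetch_rendered_html, make_headless_chrome) and the fixed five-second wait are my own placeholders, not anything this site requires, and I have not tuned this against the page in question.

```python
import time


def make_headless_chrome():
    """Start a headless Chrome session (assumes Selenium 4 and Chrome are installed)."""
    # Imported here so the rest of the sketch does not need Selenium at import time.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_chrome, wait_seconds=5):
    """Load url in a browser and return the page source after scripts have run.

    driver_factory and wait_seconds are hypothetical knobs for this sketch;
    a fixed sleep is crude, and an explicit wait on a known element is more robust.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        # Give the page's JavaScript time to decode and replace the content.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()


# Example usage (not executed here, since it launches a browser):
#
#   from bs4 import BeautifulSoup
#   url = "http://www.pbc.gov.cn/zhengcehuobisi/125207/125213/125431/125475/4380459/index.html"
#   soup = BeautifulSoup(fetch_rendered_html(url), "html.parser")
#   print(soup.title)
```

The string returned by driver.page_source is the browser's view of the document after the scripts have run, so it can be handed to BeautifulSoup the same way as the file you saved locally.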