What is the encoding format on this website?


李哲

Aug 30, 2016, 5:33:27 AM
to scrapy-users
http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=



I want to scrape this website. To do so, I need to fill in &q1="something" or &q2="something" with an encoded Chinese character,

but the query value %C0%EF does not seem to be UTF-8 or standard URL encoding (I have tried both). How can I tell which encoding is used here?


Paul Tremberth

Aug 31, 2016, 9:15:47 AM
to scrapy-users
Hello,

You can fetch this URL with scrapy shell and check the response encoding:

2016-08-31 15:05:58 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
(...)
2016-08-31 15:05:59 [scrapy] DEBUG: Crawled (200) <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=> (referer: None)
(...)
In [1]: response.encoding
Out[1]: 'gb18030'


Note: in the examples below, I'm using Python 3.5

You can also verify the encoding of the URL's query string using parse_qs[1] (in Python 3 you can pass the encoding):

In [2]: from urllib.parse import parse_qs

In [3]: parse_qs('q1=%C0%EF&q2=&q3=&q4=', encoding='gb18030')
Out[3]: {'q1': ['里']}
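
If you don't have a response to inspect, one way to narrow down the encoding is to unescape the value to raw bytes and try a few candidate codecs. This is only a rough sketch using the standard library: the helper name try_decodings and the candidate list are my own assumptions, and a decode that succeeds does not prove the codec is the right one, so you still need to check that the resulting text makes sense.

```python
from urllib.parse import unquote_to_bytes


def try_decodings(escaped, candidates=("utf-8", "gb18030", "big5")):
    """Hypothetical helper: map each codec to the decoded text, or None on failure."""
    raw = unquote_to_bytes(escaped)  # percent-escapes -> raw bytes
    results = {}
    for codec in candidates:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec
            results[codec] = None
    return results


decoded = try_decodings("%C0%EF")
for codec, text in decoded.items():
    print(codec, repr(text))
```

Here UTF-8 fails outright and gb18030 gives 里, which matches what response.encoding reported above.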



When building Request objects from URLs that contain Chinese characters,
 you need to either pass safe URL strings encoded with gb18030 (here I pass the same 里 in all 4 parameters, the first one already percent-escaped; this is obviously just an example),

In [4]: from w3lib.url import safe_url_string

In [5]: safe_url_string('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=里&q3=里&q4=里', encoding=response.encoding)




or pass the encoding parameter to the Request constructor. Otherwise, UTF-8 is used to encode query parameters before percent-escaping:

In [9]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里')

In [10]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里', encoding='gb18030')
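
To see the difference the encoding makes without running Scrapy, here is a small sketch that builds the query string with the standard library's urlencode, which also accepts an encoding argument in Python 3. The names BASE_URL and params are just for illustration, and this is not what Scrapy does internally, only the same percent-escaping idea.

```python
from urllib.parse import urlencode

BASE_URL = "http://chengyu.t086.com/chaxun.php"  # URL from the question
params = {"q1": "里", "q2": "", "q3": "", "q4": ""}

# Encode the text as gb18030 bytes first, then percent-escape those bytes
gb_query = urlencode(params, encoding="gb18030")

# The same parameters encoded as UTF-8, which this site does not expect
utf8_query = urlencode(params, encoding="utf-8")

url = BASE_URL + "?" + gb_query
print(url)
print(utf8_query)
```

The gb18030 version reproduces the %C0%EF from the original URL, while the UTF-8 version produces %E9%87%8C for the same character.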


Hope it helps.

Regards,
Paul.

李哲

Sep 1, 2016, 1:49:34 AM
to scrapy-users
Thanks, I will read the docs carefully.

On Wednesday, August 31, 2016 at 9:15:47 PM UTC+8, Paul Tremberth wrote: