What is the encoding format on this website?

李哲

Aug 30, 2016, 5:33:27 AM

to scrapy-users

http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=

I want to scrape this site. To do so I need to fill in &q1="something" or &q2="something" with an encoded Chinese character,

but the query value %C0%EF does not appear to be UTF-8 (I tried both UTF-8 and plain URL encoding). How can I tell which encoding is used here?

Paul Tremberth

Aug 31, 2016, 9:15:47 AM

to scrapy-users

Hello,

you can fetch this URL with scrapy shell and check the response encoding:

$ scrapy shell "http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4="
2016-08-31 15:05:58 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
(...)
2016-08-31 15:05:59 [scrapy] DEBUG: Crawled (200) <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=> (referer: None)
(...)
In [1]: response.encoding
Out[1]: 'gb18030'

Note: in the examples below, I'm using Python 3.5.

You can also verify the encoding of the URL using parse_qs [1] (in Python 3 you can pass the encoding):

In [2]: from urllib.parse import parse_qs

In [3]: parse_qs('q1=%C0%EF&q2=&q3=&q4=', encoding='gb18030')
Out[3]: {'q1': ['里']}
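As a rough illustration of why the UTF-8 attempt in the question failed, here is a minimal sketch using only the standard library. It unescapes %C0%EF to raw bytes and tries to decode them with a list of candidate encodings. The helper name try_decode and the candidate list are assumptions made for this example, not part of Scrapy or w3lib.

```python
from urllib.parse import unquote_to_bytes

# Percent-unescape the query value into raw bytes, without assuming an encoding.
raw = unquote_to_bytes('%C0%EF')


def try_decode(data, encodings):
    """Hypothetical helper: map each candidate encoding to the decoded text,
    or to None when the bytes are not valid in that encoding."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


decoded = try_decode(raw, ['utf-8', 'gb18030'])
print(raw)
print(decoded)
```

A successful decode only shows that the bytes are valid in that encoding, so it is worth checking that the resulting character is the one you expect (here, 里).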

When you build Request objects from URLs that contain Chinese characters,

you'll need to either pass safe URL strings encoded with gb18030 (here all four parameters carry the same 里; this is just an example, obviously):

In [4]: from w3lib.url import safe_url_string

In [5]: safe_url_string('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=里&q3=里&q4=里', encoding=response.encoding)
Out[5]: 'http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF'

In [6]: from scrapy import Request

In [7]: Request('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF')
Out[7]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF>

or pass the encoding parameter to the Request constructor. Otherwise, UTF-8 is used to encode query parameters before percent-escaping:

In [9]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里')
Out[9]: <GET http://chengyu.t086.com/chaxun.php?q1=%E9%87%8C&q2=%E9%87%8C>

In [10]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里', encoding='gb18030')
Out[10]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF>
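If you would rather build the escaped query string yourself before creating the Request, one possible sketch with the standard library's urlencode follows. The helper build_search_url and its keyword defaults are hypothetical names chosen for illustration; only the base URL and the q1 to q4 parameter names come from the thread.

```python
from urllib.parse import urlencode

BASE_URL = 'http://chengyu.t086.com/chaxun.php'


def build_search_url(q1='', q2='', q3='', q4='', encoding='gb18030'):
    """Hypothetical helper: percent-escape the four query parameters using
    the given encoding and append them to the base URL."""
    params = [('q1', q1), ('q2', q2), ('q3', q3), ('q4', q4)]
    # urlencode encodes str values with `encoding` before percent-escaping.
    return BASE_URL + '?' + urlencode(params, encoding=encoding)


url = build_search_url(q1='里')
print(url)

# For comparison: with UTF-8 the same character escapes differently.
utf8_url = build_search_url(q1='里', encoding='utf-8')
print(utf8_url)
```

The resulting string is already percent-escaped, so it can be passed to Request as-is, like the pre-escaped URL in In [7] above.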

Hope it helps.

Regards,

Paul.

[1] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.parse_qs

李哲

Sep 1, 2016, 1:49:34 AM

to scrapy-users

Thanks, I will read the docs carefully.

On Wednesday, August 31, 2016 at 9:15:47 PM UTC+8, Paul Tremberth wrote:
