Scrape Chinese sites


simon

Sep 21, 2009, 1:18:11 AM
to scrapy-users
Hi All,
I'm trying to use Scrapy to crawl some Chinese websites.
But all the Chinese characters in the response seem to be encoded, like
\xc8\xb8\xbd\xfd
I tried asking Google but didn't have any luck.

Could anyone help me out?
What I want is for the extracted response to show up in Chinese when I
write it to a file.

Thank you in advance!

-Simon

Pablo Hoffman

Sep 21, 2009, 12:00:20 PM
to scrapy-users
Simon,

Looks like you're having some kind of encoding issue (unlikely to be a
Scrapy bug). Can you provide the steps to reproduce the problem?

Pablo.

Haisheng Wu

Sep 22, 2009, 1:22:40 PM
to scrapy...@googlegroups.com
Hi, Pablo,
 Here are the steps to reproduce the problem (with the Scrapy shell):
 a. python scrapy-ctl.py shell http://www.cncrk.com/downinfo/11390.html
     (The page is basically a software download page with lots of ads.)
 b. Select the software name:
     >>> hxs.select("//div[@id='softtitle']/span/text()").extract()
You'll see that the result is not in Chinese.
 The same thing happens when the pipeline writes the result to a CSV file (csvwriter).

(I'll publish the 'study' project later, but its idea is very similar to these steps.)

Thanks for your help.

-Simon

simon

Sep 23, 2009, 8:28:08 AM
to scrapy-users
The Scrapy project is also available at
http://code.google.com/p/personal-study/source/browse/#svn/trunk/python-work/cncrk

Thanks.
-Simon

Pablo Hoffman

Sep 23, 2009, 8:38:12 AM
to scrapy...@googlegroups.com
Simon,

When you type in the shell:

>>> hxs.select("//div[@id='softtitle']/span/text()").extract()
[u'\u5c04\u624b\u5f71\u97f3\u64ad\u653e\u5668\u524d\u536b\u7248 V2.3.709 \u7b80\u4f53\u4e2d\u6587\u7eff\u8272\u514d\u8d39\u7248 ']

You're actually seeing the repr() of the object. That's a convention used by
all Python consoles.

What you actually want is to print the contents rather than see their Python
representation. So you can use "print" instead:

>>> print hxs.select("//div[@id='softtitle']/span/text()").extract()[0]
射手影音播放器前卫版 V2.3.709 简体中文绿色免费版
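
For anyone reading this later on Python 3, here is a minimal sketch of the same difference. It is not Scrapy code: the `extracted` list is a hypothetical stand-in for what `.extract()` returned above. Note that Python 3's repr() shows printable non-ASCII characters directly, so `ascii()` is used here to get the escaped form that the Python 2 console displayed.

```python
# The title from the shell session above, written with the same escapes.
title = (u'\u5c04\u624b\u5f71\u97f3\u64ad\u653e\u5668\u524d\u536b\u7248 V2.3.709 '
         u'\u7b80\u4f53\u4e2d\u6587\u7eff\u8272\u514d\u8d39\u7248 ')

# Hypothetical stand-in for the list returned by .extract().
extracted = [title]

# ascii() escapes every non-ASCII character, much like the Python 2
# console did when it echoed the repr() of a unicode string.
escaped = ascii(extracted)

# The element itself is ordinary text; nothing in it is "encoded".
text = extracted[0]

print(escaped)  # escaped representation of the whole list
print(text)     # the actual Chinese text
```

The escapes are only a way of displaying the string; `text` holds the real characters in both cases.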

In both cases, the scraped data is the same. And you can write it to a file,
store it in a database, etc.
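
As a rough sketch of the "write it to a file" part, assuming Python 3 and UTF-8 output (the file names and the `name` column are made up for illustration, and this is plain standard-library code rather than a Scrapy pipeline):

```python
import csv
import os
import tempfile

# Same title as in the shell session, shortened for the example.
title = "\u5c04\u624b\u5f71\u97f3\u64ad\u653e\u5668\u524d\u536b\u7248 V2.3.709"

# Hypothetical output locations in a temporary directory.
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "title.txt")
csv_path = os.path.join(out_dir, "titles.csv")

# Plain text file: pass the encoding explicitly so the result does not
# depend on the platform default.
with open(txt_path, "w", encoding="utf-8") as f:
    f.write(title + "\n")

# CSV file: newline="" is what the csv module documentation recommends.
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name"])
    writer.writerow([title])

# Read the text file back to confirm the characters survived.
with open(txt_path, "r", encoding="utf-8") as f:
    read_back = f.read().strip()
```

Whatever opens the file afterwards (editor, spreadsheet) has to read it as UTF-8 too, otherwise the text will look garbled even though the data is fine.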

Pablo.

Haisheng Wu

Sep 23, 2009, 8:47:13 AM
to scrapy...@googlegroups.com
Pablo, I appreciate your quick response.
That works for me.

-Simon