some help for regex with scrapy


peter zhu

Sep 3, 2016, 1:04:35 PM
to scrapy-users
Hey, guys!
The page is http://www.nowdl.cn/all.html
My steps:
1. scrapy shell http://www.nowdl.cn/all.html
2. response.xpath('/html/body/div[3]/ul/li/a').extract()
I want to extract the part of each URL that comes just before the ".php" suffix.
For example, from:
u'<a href="http://www.nowdl.cn/city/beijing/beijing.php" target="_blank">\u5317\u4eac</a>',
I need "beijing", and I want the Unicode escape "\u5317\u4eac" displayed as "北京".
My questions are:
1. How do I use a regex to extract the content I need?
2. How do I convert the Unicode escapes to Chinese characters?
Thanks for any suggestions!

Artem Utin

Sep 4, 2016, 12:47:23 AM
to scrapy-users
Hello.

I'd recommend relying on selectors as much as possible before diving into regexes, especially if you're not comfortable with them.
So, you can use response.xpath('/html/body/div[3]/ul/li/a/@href').extract() to extract the anchors' hrefs, and response.xpath('/html/body/div[3]/ul/li/a/text()').extract() to extract the anchors' text (this is covered in the docs, by the way).
Afterwards, you can try out a regex for extracting the city names at pythex.
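
To make the regex step concrete, here is a minimal sketch using only Python's re module. It assumes the hrefs and link texts have already been pulled out with the two XPath queries above; the sample lists, the second (Shanghai) entry, and the names SLUG_RE and extract_slug are made up for illustration and are not from the site or from Scrapy.

```python
# -*- coding: utf-8 -*-
import re

# Sample values shaped like the output of the two XPath queries above.
# Hypothetical data: only the Beijing entry appears in this thread.
hrefs = [
    u'http://www.nowdl.cn/city/beijing/beijing.php',
    u'http://www.nowdl.cn/city/shanghai/shanghai.php',
]
names = [u'\u5317\u4eac', u'\u4e0a\u6d77']

# Capture the last path segment, without the ".php" suffix.
SLUG_RE = re.compile(r'/([^/]+)\.php$')


def extract_slug(href):
    """Return the file name before '.php', or None if the URL does not match."""
    match = SLUG_RE.search(href)
    return match.group(1) if match else None


slugs = [extract_slug(h) for h in hrefs]
cities = list(zip(slugs, names))

for slug, name in cities:
    # Printing the string itself (not the list's repr) shows the characters.
    print(u'%s %s' % (slug, name))
```

On the second question, no conversion should be needed: the \u5317\u4eac form is just how Python 2 displays a unicode string inside a list, and printing an individual item shows the Chinese characters, provided the terminal can render them. Scrapy selectors also have a .re() method, so something like response.xpath('/html/body/div[3]/ul/li/a/@href').re(r'/([^/]+)\.php$') should return the captured names directly in the shell.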

peter zhu

Sep 4, 2016, 8:29:12 AM
to scrapy-users
Artem Utin, thanks very much!