some help for regex with scrapy

peter zhu

Sep 3, 2016, 1:04:35 PM

to scrapy-users

Hey, guys!
http://www.nowdl.cn/all.html
My steps:
1. scrapy shell http://www.nowdl.cn/all.html
2. response.xpath('/html/body/div[3]/ul/li/a').extract()
I want to extract the part of the URL that comes just before the ".php" suffix.
For example:
u'<a href="http://www.nowdl.cn/city/beijing/beijing.php" target="_blank">\u5317\u4eac</a>',
Here I need "beijing", and I want to change the Unicode escape "\u5317\u4eac" --> "北京市".
My questions are:
1. How do I use a regex to extract the content I need?
2. How do I change the Unicode escapes to Chinese?
Thanks for any suggestions!

Artem Utin

Sep 4, 2016, 12:47:23 AM

to scrapy-users

Hello.

I'd recommend using selectors as much as possible before diving into regexes, especially if you're not comfortable with them.

So, you can use response.xpath('/html/body/div[3]/ul/li/a/@href').extract() to extract the anchors' hrefs, and response.xpath('/html/body/div[3]/ul/li/a/text()').extract() to extract the anchors' text (this is mentioned in the docs, by the way).

Afterwards, you can try out a regex for extracting the city names at pythex.
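
To make that concrete, here is a minimal sketch using only Python's standard re module. The hrefs and names lists are hard-coded stand-ins for what the two XPath queries above would return, and city_slug and SLUG_RE are hypothetical names chosen for this example. It also touches on the second question: "\u5317\u4eac" is just how the shell displays the string inside a list; the value already holds the Chinese characters, and printing it shows them.

```python
import re

# Stand-ins for the results of the @href and text() queries above.
# In the Scrapy shell these would come from response.xpath(...).extract().
hrefs = ["http://www.nowdl.cn/city/beijing/beijing.php"]
names = ["\u5317\u4eac"]  # the same string the shell shows as an escape

# Capture the last path segment, without the ".php" suffix.
# (Hypothetical pattern name; adjust if the site's URLs differ.)
SLUG_RE = re.compile(r"/([^/]+)\.php$")


def city_slug(href):
    """Return the file name before '.php', or None if the URL does not match."""
    match = SLUG_RE.search(href)
    return match.group(1) if match else None


slugs = [city_slug(h) for h in hrefs]

# Printing a string shows the real characters rather than the \u escapes.
for slug, name in zip(slugs, names):
    print(slug, name)
```

Scrapy selectors also have a .re() method that applies a regex directly to the selected values, so the same pattern could probably be passed to response.xpath('.../a/@href').re(...) instead of looping by hand.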

peter zhu

Sep 4, 2016, 8:29:12 AM

to scrapy-users

Artem Utin, thanks very much!
