xpath and specific sign

d4v1d

Jan 26, 2014, 4:35:16 PM

to scrapy...@googlegroups.com

Hello

Is it possible to search the page at a given URL for specific text without having to specify a tag?

For example, I would like to find all text made of the digits 0 to 9 and ".", appearing before or after the $ sign.

It is probably possible with a regex, but I don't know how to use that kind of tool in Scrapy.

Thanks for your help

Regards

d4v1d

Jan 27, 2014, 4:33:05 PM

to scrapy...@googlegroups.com

Is something like this a step in the right direction?

item['price'] = hxs.select('/html').re('[0-9]')
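For reference, a pattern like '[0-9]' returns single digits only, with no link to the currency sign. The sketch below uses only Python's standard `re` module, not Scrapy, to show one way to pick up amounts that sit directly before or after a "$" anywhere in a page. The sample HTML and the names `PRICE_RE` and `find_prices` are made up for illustration.

```python
import re

# Hypothetical page source; a real spider would use the response body instead.
html = "<div><p>Now only $15.23!</p><span>was 19.99$</span><b>3 items</b></div>"

# Digits with an optional decimal part, with "$" directly before or after.
PRICE_RE = re.compile(r"\$\s?(\d+(?:\.\d+)?)|(\d+(?:\.\d+)?)\s?\$")


def find_prices(text):
    """Return every amount adjacent to a dollar sign, in document order."""
    # findall yields one (leading-$ group, trailing-$ group) tuple per match;
    # exactly one of the two groups is non-empty.
    return [lead or trail for lead, trail in PRICE_RE.findall(text)]


prices = find_prices(html)
print(prices)
```

Numbers with no "$" next to them, such as the "3" in "3 items", are ignored.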

Mikołaj Roszkowski

Jan 27, 2014, 5:06:21 PM

to scrapy...@googlegroups.com

You want to check the whole page's html content and then grab values with numbers?

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.

David LANGLADE

Jan 28, 2014, 5:11:34 AM

to scrapy...@googlegroups.com

Hello

Thanks for your feedback

Not really. I want to crawl the whole page to find a specific symbol plus a numeric sequence (for example 15.23€) and return that value.

Regards

Mikołaj Roszkowski

Jan 28, 2014, 6:15:36 AM

to scrapy...@googlegroups.com

It's hard to say without seeing the page's source code. The usual approach to this task is to scrape the necessary nodes with XPath and then process the scraped items in an item pipeline to extract the values. http://doc.scrapy.org/en/latest/topics/item-pipeline.html
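As a rough illustration of the pipeline approach, here is a minimal sketch of a pipeline class that pulls a price out of already-scraped text. Scrapy calls `process_item(item, spider)` on each enabled pipeline; everything else here (the class name `PricePipeline`, the `raw_price` and `price` fields, the euro pattern) is an assumption made for the example, and a plain dict stands in for the item.

```python
import re

# Amount with an optional decimal part (comma or dot), followed by "€".
PRICE_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*€")


class PricePipeline:
    """Hypothetical pipeline: turns item['raw_price'] into item['price']."""

    def process_item(self, item, spider):
        match = PRICE_RE.search(item.get("raw_price", ""))
        if match:
            # Normalise the decimal comma so the value parses as a float.
            item["price"] = float(match.group(1).replace(",", "."))
        else:
            item["price"] = None
        return item


# Stand-alone usage with a dict in place of a Scrapy item.
item = PricePipeline().process_item({"raw_price": "Prix : 12,76 €"}, spider=None)
print(item["price"])
```

In a real project the class would be listed in the `ITEM_PIPELINES` setting so that Scrapy runs it on every item the spider returns.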

d4v1d

Jan 28, 2014, 4:44:54 PM

to scrapy...@googlegroups.com

hello

Yes, you are right, my explanations were not clear.

My objective is to find the price on a web page. I assume the price is formatted like this: 12,76 €

I have the different URLs in a database, so I test each URL and search for the price with a specific regex, but it didn't accept the € symbol.

Maybe I have to specify that item['price'] is in UTF-8, but I don't know how?

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        item['price'] = hxs.select('//span/text()').re('([0-9]+(?:[,.][0-9])?)\s')

        cur = self.db.cursor()
        cur.execute("select url from urls")
        for j in range(len(item['price'])):
            cursor = self.db.cursor()
            sql = "update urls set price_%s = '%s' where url = '%s'" % (j, item['price'][j], response.url)
            cursor.execute(sql)
            self.db.commit() 
        return item

I hope that is clearer.

thanks in advance

regards
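Two details of the snippet above are separate from the € question. The pattern allows only one digit after the decimal separator, so on "12,76 €" it matches "76" and not "12,76". The UPDATE statement is also built by string formatting, which breaks on quotes in the values. The sketch below shows one possible adjustment. It uses the standard-library `sqlite3` module with an in-memory table as a stand-in, because the database driver used in the thread is not known, and the helper name `store_prices` is hypothetical.

```python
import re
import sqlite3

# "[0-9]+" after the separator accepts any number of decimals,
# so "12,76" is captured whole.
PRICE_RE = re.compile(r"([0-9]+(?:[,.][0-9]+)?)\s")


def store_prices(db, url, prices):
    """Write prices into columns price_0, price_1, ... of the row for `url`."""
    for index, price in enumerate(prices):
        # Column names cannot be bound as parameters; `index` is an int
        # produced by enumerate, so formatting it in is safe. The values
        # themselves are passed as bound parameters.
        sql = "UPDATE urls SET price_%d = ? WHERE url = ?" % index
        db.execute(sql, (price, url))
    db.commit()


# Stand-in database: assumed schema with two price columns.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (url TEXT, price_0 TEXT, price_1 TEXT)")
db.execute("INSERT INTO urls (url) VALUES ('http://example.com/a')")

prices = PRICE_RE.findall("12,76 € then 8.5 € ")
store_prices(db, "http://example.com/a", prices)
row = db.execute("SELECT price_0, price_1 FROM urls").fetchone()
print(row)
```

The placeholder style ("?" here) depends on the driver; other DB-API drivers may expect a different marker.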

Rolando Espinoza La Fuente

Jan 28, 2014, 4:57:15 PM

to scrapy...@googlegroups.com

You can use the euro symbol in your regex. Under the hood, Scrapy uses the re.UNICODE flag, which allows you to do that. See:

In [33]: text = u"<span>12,76 €</span>"

In [34]: sel = Selector(text=text)

In [35]: sel.xpath('//span/text()').re(u'(\d+,\d+) €')

Out[35]: [u'12,76']

David LANGLADE

Jan 29, 2014, 6:26:21 AM

to scrapy...@googlegroups.com

Thanks for your help

I just have a problem with the encoding:

SyntaxError: Non-ASCII character '\x80' in file...

but no encoding declared; see http://www.python.org/peps/pep-0263.html

How can I declare this encoding in Scrapy?

Regards

Paul Tremberth

Jan 29, 2014, 6:54:07 AM

to scrapy...@googlegroups.com

You could declare the encoding of the Python script containing this "€" character, for example with

#!/usr/bin/env python
# -*- coding: utf-8 -*-

at the top (adapt it to the encoding used by your code editor).

Or, safer but less readable, use the Python Unicode escape for the "€" character:

>>> text = u"<span>12,76 €</span>"

>>> [text]
[u'<span>12,76 \u20ac</span>']

so the regex becomes

sel.xpath('//span/text()').re(u'(\d+,\d+) \u20ac')

/Paul.
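A small stand-alone check of the two options, using only the standard `re` module instead of a Scrapy selector. It assumes Python 3, where string literals are Unicode without the u prefix and source files default to UTF-8, so the coding line is kept only to mirror the declaration shown above. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import re

text = "<span>12,76 €</span>"

# The escape and the literal character are the same one-character string.
same_character = "\u20ac" == "€"

# Option 1: literal euro sign, which needs a correctly declared file encoding.
literal_pattern = re.compile(r"(\d+,\d+) €")

# Option 2: pure-ASCII source; the re module resolves \u20ac in the pattern.
escaped_pattern = re.compile(r"(\d+,\d+) \u20ac")

literal_result = literal_pattern.findall(text)
escaped_result = escaped_pattern.findall(text)
print(same_character, literal_result, escaped_result)
```

Both patterns capture the same amount; the escaped form just keeps the source file free of non-ASCII bytes.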

David LANGLADE

Jan 29, 2014, 4:16:38 PM

to scrapy...@googlegroups.com

OK, it works perfectly. Thanks!
