xpath and specific sign


d4v1d

Jan 26, 2014, 4:35:16 PM
to scrapy...@googlegroups.com
Hello,
Is it possible to search the page at a URL for specific text without having to specify a tag?
For example, I would like to find all text made of the digits 0 to 9 and ".", found before and after the $ sign.
It is probably possible with a regex, but I don't know how to use this kind of tool in Scrapy.
Thanks for your help.
Regards


d4v1d

Jan 27, 2014, 4:33:05 PM
to scrapy...@googlegroups.com
Is something like this in the right direction?

item['price'] = hxs.select('/html').re('[0-9]€')

Mikołaj Roszkowski

Jan 27, 2014, 5:06:21 PM
to scrapy...@googlegroups.com
You want to check the whole page's html content and then grab values with numbers? 


2014-01-27 d4v1d <lang...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.

David LANGLADE

Jan 28, 2014, 5:11:34 AM
to scrapy...@googlegroups.com
Hello,
Thanks for your feedback.
Not really. I want to scan the whole page for a specific symbol next to a numeric sequence (for example 15.23€) and return that value.
Regards








Mikołaj Roszkowski

Jan 28, 2014, 6:15:36 AM
to scrapy...@googlegroups.com
It's hard to say without seeing the page's source code. The usual approach to this task is to scrape the necessary nodes with XPath and then process the scraped items in an item pipeline to extract the values. http://doc.scrapy.org/en/latest/topics/item-pipeline.html
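
The pipeline approach described above might look like the following minimal sketch. It assumes the spider stores the raw scraped strings under a hypothetical `raw_price` key and that prices look like `12,76 €` or `15.23€`. The class name `PricePipeline` and the pattern are illustrative and do not come from the thread. A pipeline only needs a `process_item(item, spider)` method, so the sketch runs without Scrapy installed and uses a plain dict in place of an Item.

```python
import re

# Assumed price shape: digits, an optional decimal part with "," or ".",
# optional whitespace, then the euro sign (written as an escape so the
# source file stays pure ASCII).
PRICE_RE = re.compile(u"(\\d+(?:[.,]\\d+)?)\\s*\u20ac")


class PricePipeline(object):
    """Hypothetical pipeline: turns raw text fragments into price strings."""

    def process_item(self, item, spider):
        prices = []
        # "raw_price" is an assumed field holding the strings found by XPath.
        for fragment in item.get("raw_price", []):
            prices.extend(PRICE_RE.findall(fragment))
        # With a real Scrapy Item, "price" must be a declared field.
        item["price"] = prices
        return item


# Stand-alone usage with a plain dict instead of a Scrapy Item.
item = PricePipeline().process_item(
    {"raw_price": [u"Prix : 12,76 \u20ac", u"15.23\u20ac TTC", u"no price"]},
    spider=None,
)
print(item["price"])
```

In a real project the class would also have to be listed in the `ITEM_PIPELINES` setting, as the linked documentation describes.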



d4v1d

Jan 28, 2014, 4:44:54 PM
to scrapy...@googlegroups.com
Hello,
Yes, you are right, my explanations were not clear.
My objective is to find the price on a web page. I assume the price is formatted like this: 12,76 €
I have the different URLs in a database, so I test each URL and search for the price with a specific regex, but it does not accept the € symbol.
Maybe I have to specify that item['price'] is in UTF-8, but I don't know how?

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        item['price'] = hxs.select('//span/text()').re('([0-9]+(?:[,.][0-9])?)\s')

        cur = self.db.cursor()
        cur.execute("select url from urls")

        for j in range(len(item['price'])):
            cursor = self.db.cursor()
            sql = "update urls set price_%s = '%s' where url = '%s'" % (j, item['price'][j], response.url)
            cursor.execute(sql)

            self.db.commit()

        return item


I hope this is clearer.
Thanks in advance.
Regards

Rolando Espinoza La Fuente

Jan 28, 2014, 4:57:15 PM
to scrapy...@googlegroups.com
You can use the euro symbol in your regex. Under the hood, Scrapy uses the re.UNICODE flag, which allows you to do that. See:

In [33]: text = u"<span>12,76 €</span>"

In [34]: sel = Selector(text=text)

In [35]: sel.xpath('//span/text()').re(u'(\d+,\d+) €')
Out[35]: [u'12,76']
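
The same pattern can also be run over a whole page's text without naming a tag, which is what the original question asked for. The sketch below uses only the standard `re` module and assumes Python 3, where string literals are already Unicode. The sample text and the helper name `find_prices` are made up, and the pattern is loosened slightly to accept `.` as the decimal separator and optional whitespace before the sign.

```python
import re

# Same idea as the selector's .re() call, applied to plain text.
# re.UNICODE is passed explicitly to mirror the flag mentioned above;
# for str patterns in Python 3 it is already the default.
PRICE = re.compile(r"(\d+[.,]\d+)\s*€", re.UNICODE)


def find_prices(text):
    """Return every decimal number that is followed by a euro sign."""
    return PRICE.findall(text)


# Made-up page content: two prices and one number that is not a price.
sample = "<span>12,76 €</span><span>15.23€</span><span>42 items</span>"
print(find_prices(sample))
```

Matching against the raw HTML like this is cruder than selecting nodes first, since it will also pick up prices inside scripts or attributes.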

David LANGLADE

Jan 29, 2014, 6:26:21 AM
to scrapy...@googlegroups.com
Thanks for your help.
I just have one problem with the encoding:

Syntax-Error : Non-ASCII character '\x80' in file...
but no encoding declared; see http://www.python.org/peps/pep-0263.html

How can I declare this encoding in Scrapy?
Regards




Paul Tremberth

Jan 29, 2014, 6:54:07 AM
to scrapy...@googlegroups.com
You could declare the encoding of the Python script containing this "€" character,
with for example
#!/usr/bin/env python
# -*- coding: utf-8 -*-

at the top (adapt to the encoding used by your code editor)

or, safer but less readable, use the Python Unicode escape for the "€" character
>>> text = u"<span>12,76 €</span>"
>>> [text]
[u'<span>12,76 \u20ac</span>']

so the regex becomes
sel.xpath('//span/text()').re(u'(\d+,\d+) \u20ac')

/Paul.
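
A small sketch of why the escape form is the safer choice: the euro sign is stored as different bytes depending on the encoding the editor uses, while `\u20ac` is plain ASCII in the source file. The `'\x80'` in the error message above is how the sign is encoded in Windows-1252, which suggests (as an assumption, since the file itself is not shown) that the script was saved in that encoding rather than UTF-8. The variable names are illustrative.

```python
import re

EURO = u"\u20ac"  # escape form: independent of the file's encoding

# The same character becomes different bytes depending on the encoding.
print(repr(EURO.encode("utf-8")))   # three bytes
print(repr(EURO.encode("cp1252")))  # a single byte

# The escaped pattern from the reply above, applied to plain text.
pattern = re.compile(u"(\\d+,\\d+) \u20ac")
print(pattern.findall(u"<span>12,76 \u20ac</span>"))
```

If the file does contain a literal "€", the `coding` declaration has to name the encoding the file is really saved in, otherwise the bytes are decoded to the wrong character.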

David LANGLADE

Jan 29, 2014, 4:16:38 PM
to scrapy...@googlegroups.com
ok it works perfectly thanks

