Xpath function re:test does not work

VivekG

Apr 29, 2012, 11:32:10 AM
to scrapy-users
I have a very simple HTML doc as follows:

<html>
<body>
<table><tr><td>
<div>12/4/2012</div>
<div>something</div>
<div>another thing</div>
</td></tr></table>
</body> </html>

When I execute the following command in shell:

>>> hxs.select('//td/div/text()[re:test(., "\d{1,2}/\d{1,2}/\d{2,4}", "i")]').extract()
[u'12/4/2012', u'something', u'another thing']

It selects text from all three divs. It should have selected only the
first div.

If I use contains(), it seems to work fine:
>>> hxs.select('//td/div/text()[contains(., "12")]').extract()
[u'12/4/2012']

I am unable to figure out what I am doing wrong with the 're:test'
expression. Thanks very much for your help!

VivekG

May 9, 2012, 2:54:05 PM
to scrapy-users
Any help on why Scrapy is not evaluating the expression correctly?
The regex by itself works fine when I try it at http://regexpal.com/

Steven Almeroth

May 10, 2012, 6:35:36 PM
to scrapy...@googlegroups.com
I don't recognize the re:test() function; maybe it is an extension for some other system? Have you tried testing it at http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm or in Firefox using the Firepath addon?

Try using Scrapy's XPathItemLoader:

>>> from scrapy.contrib.loader import XPathItemLoader
>>> l = XPathItemLoader(response=response)
>>> l.get_xpath('//td/div')
[u'<div>12/4/2012</div>', u'<div>something</div>', u'<div>another thing</div>']
>>> l.get_xpath('//td/div', re=r'\d{1,2}/\d{1,2}/\d{2,4}')
[u'12/4/2012']
>>> l = XPathItemLoader(response=response, item={'date': None})
>>> l.add_xpath('date', '//td/div', re=r'\d{1,2}/\d{1,2}/\d{2,4}')
>>> l.get_output_value('date')
[u'12/4/2012']
>>> l.load_item()
{'date': [u'12/4/2012']}

Vivek Gupta

May 10, 2012, 8:02:32 PM
to scrapy...@googlegroups.com
Hi Steven,

I learned about re:test from this tutorial: http://manual.calibre-ebook.com/xpath.html
The website you refer to seems to give me a blank page when I try to load a sample file to test.

I will use XPathItemLoader instead. Thanks very much for your suggestion!

Regards
Vivek


Pablo Hoffman

May 10, 2012, 9:52:12 PM
to scrapy...@googlegroups.com, Steven Almeroth
You could use the XPathSelector.re() method if re:test() doesn't work:
http://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions
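
For readers without a Scrapy shell at hand, here is a rough sketch of the same idea (select the nodes with a plain XPath first, then filter with a regular expression in Python). It uses only the standard library as a stand-in, so it is not Scrapy's API: the names HTML, DATE_RE and select_dates are made up for illustration, and it assumes the sample page parses as well-formed XML, which real HTML often will not.

```python
import re
import xml.etree.ElementTree as ET

# Sample document from the original question; it happens to be well-formed XML.
HTML = """<html>
<body>
<table><tr><td>
<div>12/4/2012</div>
<div>something</div>
<div>another thing</div>
</td></tr></table>
</body> </html>"""

# Same date pattern as in the question, written as a raw string.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")


def select_dates(markup):
    """Select every td/div text, then keep only the regex matches."""
    root = ET.fromstring(markup)
    texts = [div.text or "" for div in root.findall(".//td/div")]
    # Flatten the matches from each text node into one list,
    # roughly what a selector-level regex call would give back.
    return [match for text in texts for match in DATE_RE.findall(text)]


dates = select_dates(HTML)
print(dates)  # ['12/4/2012'] under Python 3
```

The point of splitting the work this way is that the XPath stays simple enough for any engine, and the regex runs in Python's own re module, so it does not matter whether the XPath backend understands re:test().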