wildcards in classes/ids?


bangers...@gmail.com

Mar 2, 2015, 12:04:02 PM
to scrapy...@googlegroups.com
Is there any way to use contains() (or any other function) to extract divs whose class contains a certain word? Something like:

   //div[contains(@class, "content")]  #<-- not even sure that works, but you get the idea,right?

The idea is that it would capture classes like main-content, body-content, blog-content, etc. I am looking for any kind of wildcard or functionality beyond a simple match. Thank you.

jeff.

Paul Tremberth

Mar 2, 2015, 6:05:39 PM
to scrapy...@googlegroups.com
Your contains() expression does work. Interestingly, XPath 1.0 has starts-with() but does not have ends-with().

You might be interested in CSS3 "Substring matching attribute selectors".

For example, you can try div[class$=content], which matches a class attribute whose whole value ends with "content".

Check this sample python shell session:

>>> import scrapy
>>> html = """
... <div class="main-content">main content</div>
... <div class="main-content-2">main content 2</div>
... <div class="content-first">content first</div>
... <div class="blog-content">blog content</div>
... <div class="content">content</div>
... """
>>> selector = scrapy.Selector(text=html)
>>> selector.xpath('//div[contains(@class, "content")]').extract()
[u'<div class="main-content">main content</div>', u'<div class="main-content-2">main content 2</div>', u'<div class="content-first">content first</div>', u'<div class="blog-content">blog content</div>', u'<div class="content">content</div>']
>>> selector.css('div[class$="-content"]').extract()
[u'<div class="main-content">main content</div>', u'<div class="blog-content">blog content</div>']
>>> selector.css('div[class$="content"]').extract()
[u'<div class="main-content">main content</div>', u'<div class="blog-content">blog content</div>', u'<div class="content">content</div>']
>>> 
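
If you want real shell-style wildcards rather than substring tests, one option is to filter in plain Python. The following is only a rough sketch using the standard library (html.parser and fnmatch), not anything Scrapy provides: ClassWildcardFinder and find_div_classes are names made up for this example, and the extra "wide" class on the blog div is added here to show why matching per class token can matter. Both contains() and the CSS substring selectors look at the whole attribute string, so div[class$="-content"] would miss class="blog-content wide".

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser


class ClassWildcardFinder(HTMLParser):
    """Hypothetical helper: record the class attribute of every <div>
    that has at least one class token matching a shell-style pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # A missing or valueless class attribute becomes an empty string.
        class_value = dict(attrs).get("class") or ""
        # The class attribute is a whitespace-separated list of tokens,
        # so test each token on its own rather than the raw string.
        if any(fnmatchcase(token, self.pattern) for token in class_value.split()):
            self.matches.append(class_value)


def find_div_classes(html, pattern):
    """Return the class attributes of all divs with a token matching pattern."""
    finder = ClassWildcardFinder(pattern)
    finder.feed(html)
    finder.close()
    return finder.matches


# Same sample as above, except for the extra "wide" token (an assumption
# added for illustration).
html = """
<div class="main-content">main content</div>
<div class="main-content-2">main content 2</div>
<div class="content-first">content first</div>
<div class="blog-content wide">blog content</div>
<div class="content">content</div>
"""

print(find_div_classes(html, "*-content"))
print(find_div_classes(html, "content*"))
```

With this sample, the first call prints ['main-content', 'blog-content wide'] and the second prints ['content-first', 'content']. Inside a spider you would more likely keep the selectors shown above and only fall back to this kind of filtering when the pattern cannot be expressed in XPath or CSS.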