Code to remove text from Scrapy output


VR Tech

Dec 9, 2015, 5:24:18 AM
to scrapy-users

Below is a sample piece of HTML code that I want to scrape with scrapy.


<body>
<h2 class="post-title entry-title">Sample Header</h2>
    <div class="entry clearfix">
        <div class="sample1">
            <p>Hello</p>
        </div>
        <!--start comment-->
        <div class="sample2">
            <p>World</p>
        </div>
        <!--end comment-->
    </div>
<ul class="post-categories">
<li><a href="123.html">Category1</a></li>
<li><a href="456.html">Category2</a></li>
<li><a href="789.html">Category3</a></li>
</ul>
</body>



Right now I am using the following working Scrapy code:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost'),
    ]

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//h2[@class="post-title entry-title"]/text()').extract()[0]
        item['tag'] = hxs.select('//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='entry clearfix']").extract()[0]
        return item



It gives me the following xml output:


<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <article_html>
            <div class="entry clearfix">
                <div class="sample1">
                    <p>Hello</p>
                </div>
                <!--start comment-->
                <div class="sample2">
                    <p>World</p>
                </div>
                <!--end comment-->
            </div>
        </article_html>
        <tag>Category1</tag>
        <title>Sample Header</title>
    </item>
</items>



I want to know how to achieve the following output:


<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <article_html>
            <div class="entry clearfix">
                <div class="sample1">
                    <p>Hello</p>
                </div>
                <!--start comment-->
                <!--end comment-->
            </div>
        </article_html>
        <tag>Category1,Category2,Category3</tag>
        <title>Sample Header</title>
    </item>
</items>


Note: The number of categories depends on the post. The example above has 3 categories, but there could be more or fewer.

Help would be much appreciated. Cheers.

Steven Almeroth

Jan 17, 2016, 3:47:37 PM
to scrapy-users
If you want to get all categories in "tag", you can remove the "take-first" predicate [1], that is, the [1] in li[1]. If you want to ignore all markup between the two comment tags, then you might want to do that with Python, not XPath. Also, CrawlSpider has moved out of "contrib" in Scrapy (it is now in scrapy.spiders); same with the link extractors (now in scrapy.linkextractors).

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = (
        Rule(LinkExtractor(allow=r'page/\d+')),
        Rule(LinkExtractor(allow=r'\w+'), callback='parse_blogpost'),
    )

    def parse_blogpost(self, response):
        item = IsBullshitItem()
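        # Response objects expose .xpath() (not .select()) as the selector shortcut.
        item['title'] = response.xpath('//h2[@class="post-title entry-title"]/text()').extract_first()
        # No [1] predicate on li: extract every category and join them with commas.
        item['tag'] = ','.join(response.xpath('//ul[@class="post-categories"]/li/a/text()').extract())
        item['article_html'] = response.xpath("//div[@class='entry clearfix']").extract_first()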

        return item
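
For the second part (dropping the markup between the two comment tags in Python rather than XPath), here is a minimal sketch using only the standard library. The helper name strip_between_comments and its default marker texts are my own assumptions based on the sample HTML above; they are not part of Scrapy. It keeps both comment markers and removes whatever sits between them.

```python
import re


def strip_between_comments(html, start="start comment", end="end comment"):
    """Remove the markup between <!--start--> and <!--end-->, keeping both markers.

    Hypothetical helper: the marker texts default to the ones in the sample HTML.
    """
    pattern = re.compile(
        r"(<!--\s*{}\s*-->).*?(<!--\s*{}\s*-->)".format(
            re.escape(start), re.escape(end)
        ),
        re.DOTALL,  # let '.' match newlines so multi-line blocks are covered
    )
    # Put the two markers back and drop the non-greedy match in between.
    return pattern.sub(r"\1\2", html)


sample = """<div class="entry clearfix">
    <div class="sample1">
        <p>Hello</p>
    </div>
    <!--start comment-->
    <div class="sample2">
        <p>World</p>
    </div>
    <!--end comment-->
</div>"""

cleaned = strip_between_comments(sample)
print(cleaned)
```

In parse_blogpost you could then pass the extracted string for item['article_html'] through this function before returning the item. A regular expression is probably fine here because the markers are fixed comment strings; if the comments could be nested or malformed, a real HTML parser would be the safer choice.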