Shadow,
take a look at the solution to this problem.
The end result is a .json output that can be saved.
It's a simple 'hack' that gets the job done.
I'm sure there are cleaner, more sophisticated solutions, but check it out and let me know your thoughts.
Remember, this solution is specific to the website being crawled in this problem, but the end result
is what you were looking for (json output).
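(To actually write the .json file, I just lean on Scrapy's built-in feed exports when running the crawl; on the Scrapy version I used, the command looks like this, with whatever filename you want:)
scrapy crawl dmoz_solution -o scraped_sites.json -t json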
Walk-Through:
# items_solution.py
from scrapy.item import Item, Field

class DirbotItem(Item):
    name = Field()
    description = Field()

class Website(DirbotItem):
    url = Field()

class Category(Item):
    name = Field()
    url = Field()
    description = Field()
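If Items are new to you: they behave like dicts with a fixed set of allowed keys, which is why the spider and pipeline below can index into item['description'] and friends. A quick interactive check (the values here are made up):
>>> w = Website(name=[u'Site A'], url=[u'http://example.com'], description=[u'a page'])
>>> w['name']
[u'Site A']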
# I use XPathItemLoader() to pack the items
# the_spider.py
from scrapy.contrib.loader import XPathItemLoader
from scrapy.spider import BaseSpider
from dirbot.items_solution import Website
from scrapy import log

class DmozSpider(BaseSpider):
    name = "dmoz_solution"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Personal_Pages/"
    ]
    def parse(self, response):
        sites = XPathItemLoader(item=Website(), response=response)
        log.msg('Adding SITES', level=log.INFO)
        sites.add_xpath('name', "//ul[contains(@class, 'directory-url')]/li/a/text()")
        sites.add_xpath('description', "//ul[contains(@class, 'directory-url')]/li/text()")
        sites.add_xpath('url', '//ul[contains(@class, "directory-url")]/li/a/@href')
        log.msg('Loading Items', level=log.INFO)
        # dictionary = { 'name' : list[] }
        items = sites.load_item()
        # specific to this example:
        # eliminate empty '' scraped descriptions (rebuilding the list avoids
        # the skipped-element bug you get from popping while iterating)
        items['description'] = [d for d in items['description'] if d.strip() != '']
        # if you want to see the items going into the pipelines file:
        # print items
        log.msg("returning items", level=log.INFO)
        return items
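For reference, XPathItemLoader() collects every matched node into a list by default, so the loaded item is a dictionary of parallel lists rather than one item per site; roughly (values invented for illustration):
items = {
    'name': [u'Site A', u'Site B'],
    'url': [u'http://site-a.example', u'http://site-b.example'],
    'description': [u'- A weblog about things ', u'- Another personal page '],
}
That's the 'dictionary-list' the pipeline below expects, with the same index in each list belonging to the same site.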
# the pipeline is specific to this problem and needs to be recreated for other crawling projects
# pipelines.py
from dirbot.items_solution import Category
from scrapy import log

class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""
    log.msg('Checking the pipes', level=log.INFO)
    # put all words in lowercase
    # this example can only take one word to filter; if you want more words,
    # you must edit process_item() (see the sketch after the pipeline code)
    words_to_filter = ['weblog']
    '''
    The process_item() here is specific to this example.
    Depending on what you want to do, you will have to adjust it accordingly.
    In this case we want to process the dictionary of lists produced by the
    XPathItemLoader() call, then produce .json-ready output from the filtered results.
    '''
    def process_item(self, item, spider):
        log.msg('Items in Pipe: %s' % item, level=log.INFO)
        # set of indices that will be used to mark the items that pass the filter
        keep = set(range(0, len(item['description'])))
        # construct lists to pass to the final dictionary output
        description = []
        url = []
        name = []
        cat = Category()
        for word in self.words_to_filter:
            # identify indices to filter out that contain the word
            # make a set to figure out which ones are out
            to_pop = set([i for i, j in enumerate(item['description']) if word in unicode(j).lower()])
            # select which items to keep
            keep = list(to_pop.symmetric_difference(keep))
        log.msg('keeping: %s' % keep, level=log.INFO)
        for i in keep:
            description.append(item['description'][i])
            url.append(item['url'][i])
            name.append(item['name'][i])
        # populate Category()
        cat['description'] = description
        cat['url'] = url
        cat['name'] = name
        log.msg('Returning Item from Pipe', level=log.INFO)
        log.msg('%s' % cat, level=log.INFO)
        return cat
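A caveat on the set logic, tied to the 'one word only' comment above: symmetric_difference() toggles membership, so with two filter words an index removed by the first word would be added back whenever the second word also matched it. If you ever need several words, plain set subtraction is the safer operation; a sketch of the replacement loop (same variable names as above):
        keep = set(range(len(item['description'])))
        for word in self.words_to_filter:
            matched = set(i for i, d in enumerate(item['description']) if word in unicode(d).lower())
            keep -= matched  # subtraction only removes indices, so multiple words stack correctly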
Notice that in the pipeline I import the Category class from items_solution.py.
It acts just like the other item classes in the spider file.
As the data packet flows through the pipeline, each item['description'] is checked for the filter word
and the surviving entries are repackaged into the Category() instance cat.
The output is then a dictionary of lists, just like you wanted, which can be saved as .json.
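One wiring detail that is easy to forget: the pipeline only runs if it is enabled in settings.py. On the Scrapy version I used, that is a plain list of class paths (adjust the path if your project layout differs):
# settings.py
ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']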
What I learned is that the pipeline is specific to the kind of output you are after.
Since you want a specific output, the pipeline has to be constructed accordingly.
Item loaders and pipelines are still a bit confusing to me, so this is what it is: a 'hack'.