Is there a simpler way than that to access all scraped items at once in a Scrapy item pipeline?


Salvad0r

Apr 10, 2016, 6:04:44 AM
to scrapy-users

I would like to act on all items at once; in other words, collect all items and write them to a file in one go, adding a header that wraps all the items. The pipelines seem to be a good place for this, but there items are only handled one by one.


I found this solution: "How to access all scraped items in Scrapy item pipeline?" https://stackoverflow.com/questions/12768247/how-to-access-all-scraped-items-in-scrapy-item-pipeline


But that approach seems more complex than necessary. Is there a smarter, shorter, easier, or more elegant way? THX!

Jakob de Maeyer

Apr 12, 2016, 5:17:36 AM
to scrapy-users
Hey Salvador,

If you don't scrape too many items (too many as in "cannot fit into your memory"), just save the items in an attribute of the pipeline and write them all out when the spider closes:

import json


class MyPipeline(object):

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Save self.items -- for example, dump everything as one JSON
        # array (the file name here is just an example):
        with open('items.json', 'w') as f:
            json.dump([dict(i) for i in self.items], f)
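(To make Scrapy actually run the pipeline, it also has to be enabled in the project settings; a minimal sketch, assuming your project package is called myproject:)

# settings.py -- enable the pipeline (the number sets its order
# among pipelines, lower runs first)
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}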


Cheers,
-Jakob

Dimitris Kouzis - Loukas

Apr 14, 2016, 4:59:05 PM
to scrapy-users
By the way, there's almost certainly a very nice, elegant, and cool way to do what you want without keeping the entire data stream in memory (sketches/streaming algorithms, Chapter 4 here).
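For illustration, here is a minimal sketch of one such streaming technique, reservoir sampling, written as a pipeline (the class name and sample size are made up for the example, not from any library): it keeps a fixed-size uniform random sample plus a running count, never the whole stream.

import random


class ReservoirSamplePipeline(object):

    def __init__(self):
        self.sample_size = 100  # example value
        self.seen = 0
        self.reservoir = []

    def process_item(self, item, spider):
        self.seen += 1
        if len(self.reservoir) < self.sample_size:
            self.reservoir.append(item)
        else:
            # Replace a random slot with probability sample_size / seen,
            # which leaves every item equally likely to be in the sample.
            j = random.randrange(self.seen)
            if j < self.sample_size:
                self.reservoir[j] = item
        return item

    def close_spider(self, spider):
        # Save the count and the sample instead of all items.
        spider.logger.info('%d items seen, %d sampled',
                           self.seen, len(self.reservoir))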



Salvad0r

Apr 18, 2016, 4:22:12 PM
to scrapy-users
Thanks a lot! That's what I consider "elegant". :) Could not figure it out... THX.