Hi,
How should two pipelines be defined if each of them runs completely
different SQL queries?
As two classes in the same pipelines.py file?
I've tried:
    # Both ScanFirst and ScanSecond are SQLAlchemy mappings to existing
    # DB tables.
    from .tables.scanfirst import ScanFirst
    from .tables.scansecond import ScanSecond
    from test.items import FirstItem
    from test.items import SecondItem
Later on, when a spider activates SecondPipeline like this:
    class FirstSpider(CrawlSpider):
        settings.overrides['ITEM_PIPELINES'] = ['test.pipelines.SecondPipeline']
        ... (rest of spider code)
then I see that the code for FirstPipeline runs. Why?
On Wed, Nov 24, 2010 at 08:19:43PM -0800, vitsin wrote:
> then I see that code for FirstPipeline activated, why?
> 10x,
> --vs
> --
> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
> To post to this group, send email to scrapy-users@googlegroups.com.
> To unsubscribe from this group, send email to scrapy-users+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.
On Nov 25, 9:28 am, Pablo Hoffman <pablohoff...@gmail.com> wrote:
> Hi vitsin,
> You can't override settings in your spiders the way your code does:
>     class FirstSpider(CrawlSpider):
>         settings.overrides['ITEM_PIPELINES'] = ...
> And you can't customize the item pipelines per spider.
> What you could do is check the spider in the process_item() method of your
> pipeline, and skip the spiders that pipeline should not handle. For example:
>     def process_item(self, item, spider):
>         if spider.name not in ['myspider1', 'myspider2', 'myspider3']:
>             return item
> Hope this helps,
> Pablo.
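The name check suggested above can be tried out without a running crawl. What follows is only a minimal sketch: `SqlPipeline`, its `stored` list (a stand-in for the real SQL writes) and `FakeSpider` are hypothetical names invented for illustration, and the spider names are the placeholders from the reply.

```python
class SqlPipeline:
    """Hypothetical pipeline that only handles items from selected spiders."""

    # Placeholder spider names taken from the reply; use your own spider names.
    handled_spiders = {"myspider1", "myspider2", "myspider3"}

    def __init__(self):
        self.stored = []  # stand-in for rows written through SQLAlchemy

    def process_item(self, item, spider):
        if spider.name not in self.handled_spiders:
            return item  # not our spider: pass the item through untouched
        self.stored.append(item)  # the real SQL work would go here
        return item


class FakeSpider:
    """Minimal stand-in for a spider; only the `name` attribute is used."""

    def __init__(self, name):
        self.name = name


pipeline = SqlPipeline()
pipeline.process_item({"url": "a"}, FakeSpider("myspider1"))
pipeline.process_item({"url": "b"}, FakeSpider("otherspider"))
print(pipeline.stored)
```

Only the item from `myspider1` ends up in `stored`; the item from the other spider is returned unchanged so that later pipelines can still see it.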
But there are some nice alternatives for achieving that functionality. For example, you can choose a spider attribute to define which pipelines will be enabled for each spider, and then check that attribute in your pipelines.
Here's how your spiders would look:
    class SomeSpider(CrawlSpider):
        pipelines = ['first']
    class AnotherSpider(CrawlSpider):
        pipelines = ['first', 'second']
And your pipelines:
    class FirstPipeline(object):
        def process_item(self, item, spider):
            if 'first' not in getattr(spider, 'pipelines', []):
                return item
            # ... pipeline code here ...
    class SecondPipeline(object):
        def process_item(self, item, spider):
            if 'second' not in getattr(spider, 'pipelines', []):
                return item
            # ... pipeline code here ...
Btw, this code can easily be made more performant by using sets instead of lists for the pipelines attribute, and by caching the pipelines per spider.
Pablo.
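The closing remark about sets and caching could look roughly like this. It is a sketch under assumptions, not code from the thread: the base class `SpiderSwitchedPipeline`, its `pipeline_key` attribute, the `handle()` hook and the `seen_by` item field are all invented for illustration, and plain classes stand in for the CrawlSpider subclasses.

```python
class SpiderSwitchedPipeline:
    """Hypothetical base class: runs only for spiders that list `pipeline_key`."""

    pipeline_key = None  # subclasses set this, e.g. "first"

    def __init__(self):
        self._enabled = {}  # cache: spider name -> bool

    def is_enabled(self, spider):
        try:
            return self._enabled[spider.name]
        except KeyError:
            # Computed once per spider; the set gives a constant-time lookup.
            names = set(getattr(spider, "pipelines", ()))
            enabled = self.pipeline_key in names
            self._enabled[spider.name] = enabled
            return enabled

    def process_item(self, item, spider):
        if not self.is_enabled(spider):
            return item  # pipeline not enabled for this spider
        return self.handle(item, spider)

    def handle(self, item, spider):
        raise NotImplementedError


class FirstPipeline(SpiderSwitchedPipeline):
    pipeline_key = "first"

    def handle(self, item, spider):
        item["seen_by"] = item.get("seen_by", []) + ["first"]
        return item


class SecondPipeline(SpiderSwitchedPipeline):
    pipeline_key = "second"

    def handle(self, item, spider):
        item["seen_by"] = item.get("seen_by", []) + ["second"]
        return item


class SomeSpider:  # stand-in for a CrawlSpider subclass
    name = "some"
    pipelines = {"first"}


class AnotherSpider:  # stand-in for a CrawlSpider subclass
    name = "another"
    pipelines = {"first", "second"}


first, second = FirstPipeline(), SecondPipeline()


def run(spider):
    """Push one empty item through both pipelines, in order."""
    item = {}
    for pipeline in (first, second):
        item = pipeline.process_item(item, spider)
    return item


print(run(SomeSpider()))
print(run(AnotherSpider()))
```

Keeping the check in one base class means each concrete pipeline only declares its key and its own work, and the per-spider decision is looked up in the cache after the first item.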
On Thu, Nov 25, 2010 at 07:14:12AM -0800, vitsin wrote:
> hi,
> are you perhaps planning to add support for a custom pipeline per spider?
> 10x,
> --vs
So should the same exclusion check that you did in process_item() also be
done in open_spider() and close_spider(), or is it enough to have it in
process_item()?
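One way to think about this question, offered only as a sketch: if open_spider() and close_spider() do per-spider work such as opening a database session, it seems reasonable to guard them with the same check so that nothing is set up for spiders the pipeline ignores. The `_enabled()` helper, the `key` attribute, the `connections` dict (a stand-in for real DB sessions) and the `log` list are hypothetical and not part of the thread.

```python
class SecondPipeline:
    """Hypothetical pipeline that guards all three hooks with one check."""

    key = "second"

    def __init__(self):
        self.connections = {}  # spider name -> stand-in for a DB session
        self.log = []

    def _enabled(self, spider):
        return self.key in getattr(spider, "pipelines", ())

    def open_spider(self, spider):
        if not self._enabled(spider):
            return  # nothing to set up for spiders we ignore
        self.connections[spider.name] = []
        self.log.append("open " + spider.name)

    def process_item(self, item, spider):
        if not self._enabled(spider):
            return item
        self.connections[spider.name].append(item)
        return item

    def close_spider(self, spider):
        if not self._enabled(spider):
            return  # nothing was opened, so nothing to tear down
        self.connections.pop(spider.name)
        self.log.append("close " + spider.name)


class SomeSpider:  # stand-in spider that does not enable 'second'
    name = "some"
    pipelines = ["first"]


class AnotherSpider:  # stand-in spider that enables 'second'
    name = "another"
    pipelines = ["first", "second"]


pipeline = SecondPipeline()
for spider in (SomeSpider(), AnotherSpider()):
    pipeline.open_spider(spider)
    pipeline.process_item({"id": 1}, spider)
    pipeline.close_spider(spider)
print(pipeline.log)
```

If the two hooks hold no per-spider state, the check in process_item() alone may be enough; the guard mainly matters when they acquire resources.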
On Nov 25, 10:32 am, Pablo Hoffman <pablohoff...@gmail.com> wrote:
> But there are some nice alternatives for achieving that functionality. For
> example, you can choose a spider attribute to define which pipelines will be
> enabled for each spider, and then check that attribute in your pipelines.