Only able to see a few items

ignorant

Oct 27, 2016, 12:35:38 AM

to scrapy-users

Hi there,

I am a noob and am trying to test Scrapy on different product grids, but I am not able to get more than a few (6 to 8) items per page.

For example,

import scrapy


class NordstromSpider(scrapy.Spider):
    name = "nordstrom"
    start_urls = [
        'http://shop.nordstrom.com/c/womens-dresses-new?origin=leftnav&cm_sp=Top%20Navigation-_-New%20Arrivals'
    ]


    def parse(self, response):
        for dress in response.css('article.npr-product-module'):
            yield {
                'src': dress.css('img.product-photo').xpath('@src').extract_first(),
                'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
            }


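    # Debugging helper: Scrapy never calls this because it is not named parse().
    # It dumps the raw response body to a file so it can be inspected by hand.
    def noparse(self, response):
        page = response.url.split("/")[-2]
        filename = 'nordstrom-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)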

This gave only 6 items, so I tried another site:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "rtr"
    start_urls = [
        'https://www.renttherunway.com/products/dress'
    ]


    def parse(self, response):
        for dress in response.css('div.cycle-image-0'):
            yield {
                'image-url': dress.xpath('.//img/@src').extract_first(),
            }

This only gave 12 items even though the page has a lot more.

I am guessing that I'm missing a setting somewhere. Any pointers are appreciated.

Thanks,

ignorant

Oct 31, 2016, 8:28:44 PM

to scrapy-users

Hi,

It seems every site has this issue. How did you get around it?

Thanks,

ignorant

Oct 31, 2016, 10:53:23 PM

to scrapy-users

Got an answer on Stack Overflow:

http://stackoverflow.com/questions/40353052/cant-crawl-more-than-a-few-items-per-page/40353344#40353344

I would recommend that newcomers post there instead of in this group.

Erik Dominguez

Nov 4, 2016, 9:48:25 PM

to scrapy-users

It is a React site, so the DOM is changed dynamically. You get 6 because, if you check the page source, there are only 6 articles with that class. Scrapy only sees the raw HTML response; what you were seeing in the browser is the DOM generated by JavaScript. As a rule of thumb, always double-check that what you find in Chrome dev tools is also present in the raw page source.
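One way to run that check without a browser is to count the matching elements in the raw HTML string itself. The following is only a rough sketch using the Python standard library; `RAW_HTML` is a made-up stand-in for the downloaded page, and `ClassCounter` and `count_in_raw_html` are hypothetical helper names, not part of Scrapy.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags with a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_in_raw_html(html, tag, css_class):
    """Return how many <tag class="... css_class ..."> elements the raw HTML has."""
    counter = ClassCounter(tag, css_class)
    counter.feed(html)
    return counter.count


# Made-up page source: two products are in the HTML, and the rest would be
# added later by JavaScript, which a plain HTTP client never runs.
RAW_HTML = """
<html><body>
<article class="npr-product-module"><img class="product-photo" src="/1.jpg"></article>
<article class="npr-product-module"><img class="product-photo" src="/2.jpg"></article>
<div id="root"></div>
</body></html>
"""

print(count_in_raw_html(RAW_HTML, "article", "npr-product-module"))  # 2
```

Inside a spider the same string is available as `response.text`, and in `scrapy shell` the quicker equivalent is `len(response.css('article.npr-product-module'))`. If that number is lower than what the browser shows, the missing items are being rendered client-side.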

I do this all the time to make sure I don't have to do a json.loads() on what will most likely be a raw string inside the HTML source that contains the data.

In my experience, tons of these sites are moving to React, so I have started looking for the data inside the <script> tags.
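Pulling the data out of a <script> tag could look roughly like the sketch below. Everything site-specific here is an assumption for illustration: the variable name `window.__INITIAL_STATE__`, the `products` key, and the sample HTML are invented, so check the real page source for the actual names. `extract_embedded_json` is a hypothetical helper, not a Scrapy function.

```python
import json
import re

# Made-up page source with the product data embedded as a JSON literal.
SAMPLE_HTML = """
<html><body>
<div id="root"></div>
<script>
window.__INITIAL_STATE__ = {"products": [
  {"src": "/img/1.jpg", "url": "/s/dress-1"},
  {"src": "/img/2.jpg", "url": "/s/dress-2"},
  {"src": "/img/3.jpg", "url": "/s/dress-3"}
]};
</script>
</body></html>
"""


def extract_embedded_json(html, marker):
    """Return the JSON value assigned to `marker` in the raw HTML, or None."""
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (here the ";" and the closing tag).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


state = extract_embedded_json(SAMPLE_HTML, "window.__INITIAL_STATE__")
products = state["products"]
print(len(products))  # 3
```

In a spider, the call would go inside `parse()` with `response.text` as the first argument, yielding one item per entry in the decoded list. This only works when the embedded value is strict JSON; a JavaScript object literal with unquoted keys or trailing commas would need different handling.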
