Only able to see a few items

ignorant

Oct 27, 2016, 12:35:38 AM
to scrapy-users
Hi there,

I am new to Scrapy and am testing it on different product grids, but I am not able to get more than a few (6 to 8) items per page.

For example, 

import scrapy


class NordstromSpider(scrapy.Spider):
    name = "nordstrom"
    start_urls = [
        'http://shop.nordstrom.com/c/womens-dresses-new?origin=leftnav&cm_sp=Top%20Navigation-_-New%20Arrivals'
    ]

    def parse(self, response):
        for dress in response.css('article.npr-product-module'):
            yield {
                'src': dress.css('img.product-photo').xpath('@src').extract_first(),
                'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
            }

    def noparse(self, response):
        page = response.url.split("/")[-2]
        filename = 'nordstrom-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)



This gave only 6 items. So I tried another site -

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "rtr"
    start_urls = [
        'https://www.renttherunway.com/products/dress'
    ]

    def parse(self, response):
        for dress in response.css('div.cycle-image-0'):
            yield {
                'image-url': dress.xpath('.//img/@src').extract_first(),
            }



This only gave 12 items even though the page has a lot more.
I am guessing that I'm missing a setting somewhere. Any pointers are appreciated.

Thanks,

ignorant

Oct 31, 2016, 8:28:44 PM
to scrapy-users
Hi,

It seems every site has this issue. How did you get around it?

Thanks,

ignorant

Oct 31, 2016, 10:53:23 PM
to scrapy-users
Got an answer on Stack Overflow -


I would recommend that newcomers post there instead of in this group.


On Thursday, October 27, 2016 at 12:35:38 AM UTC-4, ignorant wrote:

Erik Dominguez

Nov 4, 2016, 9:48:25 PM
to scrapy-users
It is a React site, so the DOM is changed dynamically. The reason you get 6 is that, if you check the page source, there are only 6 articles with that class. Scrapy only sees the raw HTML response; what you were seeing in the browser is the DOM generated by JavaScript. As a rule of thumb, always double-check that what you find in Chrome dev tools is also present in the raw page source.
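
One way to run that check without a browser, as a minimal sketch: count how many elements with a given class appear in the raw HTML that the downloader receives. The sample markup and the names ClassCounter and count_class are made up for illustration, and only the standard library is used so the snippet runs on its own.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags with a given tag name and CSS class in raw HTML."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_class(raw_html, tag, css_class):
    parser = ClassCounter(tag, css_class)
    parser.feed(raw_html)
    return parser.count


# Hypothetical raw source: only these articles are in the HTML sent by the
# server; a real page would add more of them later through JavaScript.
RAW_HTML = """
<div id="grid">
  <article class="npr-product-module">A</article>
  <article class="npr-product-module featured">B</article>
  <article class="other">C</article>
</div>
"""

print(count_class(RAW_HTML, "article", "npr-product-module"))
```

If this count matches what the spider yields but not what the browser shows, the missing items are being added by JavaScript. Inside the Scrapy shell, view(response) opens the response as Scrapy received it, which gives the same kind of comparison.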

I do this all the time to make sure I don't have to do a json.loads() on what will most likely be a raw string inside the HTML source that contains the data.

In my experience, tons of these sites are moving to React, so I have started looking for the data inside the <script> tags.
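
As a rough sketch of that approach: JavaScript-rendered pages often embed their data as a JSON literal assigned to a variable inside a <script> tag, and that literal can be cut out of the raw source and parsed with the json module. The variable name window.__INITIAL_STATE__, the sample HTML, the products key, and the helper extract_embedded_json are all assumptions for illustration; a real page will use its own names, so inspect its source first.

```python
import json
import re

# Hypothetical raw page source with the data embedded in a <script> tag.
RAW_HTML = """
<html><body>
<script>
window.__INITIAL_STATE__ = {"products": [{"id": 1, "src": "/img/a.jpg"}, {"id": 2, "src": "/img/b.jpg"}]};
</script>
</body></html>
"""


def extract_embedded_json(raw_html, variable):
    """Return the JSON value assigned to `variable` in the raw HTML, or None."""
    match = re.search(re.escape(variable) + r"\s*=\s*", raw_html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the ";" and the closing tag).
    data, _end = json.JSONDecoder().raw_decode(raw_html, match.end())
    return data


state = extract_embedded_json(RAW_HTML, "window.__INITIAL_STATE__")
image_urls = [product["src"] for product in state["products"]]
print(image_urls)
```

In a spider, the same function could be fed response.text from inside parse(), and each entry could then be yielded as an item. This only works when the embedded value is strict JSON; a JavaScript object literal with unquoted keys would need different handling.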