Scraper seems to be skipping pages of results

Scott Fallick

Sep 25, 2014, 12:53:35 PM
to web-s...@googlegroups.com
Hi, every time I scrape this site I get only about half the results I expect, and I notice that it is skipping half of the start URLs. Please take a look and tell me what you think. I also think I have some extra "lost" selectors in the sitemap, but I don't think they are affecting the data.

{"selectors":[{"parentSelectors":["element"],"type":"SelectorLink","multiple":false,"id":"group","selector":"a.biz-name","delay":""},{"parentSelectors":["group"],"type":"SelectorText","multiple":false,"id":"name","selector":"h1.biz-page-title","regex":"","delay":""},{"parentSelectors":["group"],"type":"SelectorHTML","multiple":false,"id":"stars","selector":"div.rating-info div.rating-very-large","regex":"","delay":""},{"parentSelectors":["group"],"type":"SelectorText","multiple":false,"id":"phone","selector":"span.biz-phone","regex":"","delay":""},{"parentSelectors":["group"],"type":"SelectorText","multiple":false,"id":"website","selector":"div.biz-website a","regex":"","delay":""},{"parentSelectors":["group"],"type":"SelectorLink","multiple":false,"id":"elite_1","selector":"li:nth-of-type(1) li.user-name a.user-display-name","delay":""},{"parentSelectors":["group"],"type":"SelectorLink","multiple":false,"id":"elite_2","selector":"li:nth-of-type(2) li.user-name a.user-display-name","delay":""},{"parentSelectors":["group"],"type":"SelectorLink","multiple":false,"id":"elite_3","selector":"li:nth-of-type(3) li.user-name a.user-display-name","delay":""},{"parentSelectors":["group"],"type":"SelectorLink","multiple":false,"id":"elite_4","selector":"li:nth-of-type(4) li.user-name a.user-display-name","delay":""},{"parentSelectors":["group"],"type":"SelectorLink","multiple":false,"id":"elite_5","selector":"li:nth-of-type(5) li.user-name a.user-display-name","delay":""},{"parentSelectors":["group"],"type":"SelectorLink","multiple":false,"id":"elite_6","selector":"li:nth-of-type(6) li.user-name a.user-display-name","delay":""},{"parentSelectors":["_root"],"type":"SelectorText","multiple":false,"id":"city","selector":"span.query-location","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"num","selector":"_parent_","regex":"","delay":""},{"parentSelectors":["pagination"],"type":"SelectorLink","multiple":false,"id":"pagination","selector":"a.page-option.prev-next","delay":""},{"parentSelectors":["_root"],"type":"SelectorElement","multiple":true,"id":"element","selector":"span.indexed-biz-name","delay":""}],"startUrl":"http://www.yelp.com/search?find_desc=food+truck&find_loc=Austin%2C+TX&ns=1#start=[0-27]0&cflt=foodtrucks","_id":"food_truck"}
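
For reference, here is a rough Python sketch of how many start URLs the `[0-27]` range in the sitemap's startUrl should produce, which makes it easier to check whether half of them are really being skipped. It assumes the range expands to one URL per integer from 0 to 27; `expand_start_urls` is a hypothetical helper written for this illustration, not part of Web Scraper.

```python
import re


def expand_start_urls(pattern: str) -> list[str]:
    """Expand the first [low-high] range in a URL pattern into one URL per integer.

    Assumption: the range is inclusive and each integer is substituted as-is.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]
    low, high = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(low, high + 1)]


pattern = (
    "http://www.yelp.com/search?find_desc=food+truck&find_loc=Austin%2C+TX"
    "&ns=1#start=[0-27]0&cflt=foodtrucks"
)
urls = expand_start_urls(pattern)
print(len(urls))   # number of start URLs to expect
print(urls[1])     # second page of results
print(urls[-1])    # last page of results
```

Under that assumption there are 28 start URLs, with `start` running from 00 to 270 in steps of 10. The range sits after the `#`, in the URL fragment, which the browser does not send to the server.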

Mārtiņš Balodis

Sep 29, 2014, 1:52:41 PM
to Scott Fallick, web-s...@googlegroups.com
Hi,
This site loads data dynamically after the page has loaded. By default, Web Scraper extracts data 0.5 seconds after the page loads. If the page loads its data later than that, Web Scraper might miss it. You can configure the "Page load delay" before scraping the site. I scraped the site with a 500 ms delay and some data was lost. Scraping it with a 2000 ms delay, no data was lost. Note that a higher delay might be needed on slow connections. I wanted to make Web Scraper wait for all dynamic requests to finish before extracting data, but as I recall that functionality wasn't available in the Chrome API.
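
To illustrate why a short fixed delay can miss late data, here is a minimal Python sketch of the alternative idea: polling until the content is present, with a timeout. This is not how Web Scraper works internally. `wait_for` and `results_loaded` are hypothetical names, and the "page" is simulated as a condition that becomes true 0.3 seconds after start.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout_s: float, poll_s: float = 0.01) -> bool:
    """Poll `condition` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Simulated page: the dynamic results "arrive" 0.3 s after we start watching.
start = time.monotonic()


def results_loaded() -> bool:
    return time.monotonic() - start >= 0.3


# A wait that is too short gives up before the data arrives ...
too_short = wait_for(results_loaded, timeout_s=0.05)
# ... while a longer one returns as soon as the data is there.
long_enough = wait_for(results_loaded, timeout_s=2.0)
print(too_short, long_enough)
```

A fixed "Page load delay" behaves like the timeout alone: it has to be set longer than the slowest response you expect, which is why 2000 ms worked where 500 ms did not.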
