Help - scraper pagination skips pages, then only extracts 1 entry from each page?


Nathan Shane

Aug 21, 2015, 1:42:17 PM
to Web Scraper
Hi, here are the EXACT problems I'm having. The scraper should take the selected info from each business listed on each page. However, it is only taking it from 1 business out of every 5 pages. So the pagination is working, but only for every 5 pages, and it is only scraping the info from 1 business for each of the 5 pages.

There are roughly 420,000 General Contractors whose info I'm trying to scrape, but after scraping is done, there are only about 500 records.

Attached is a screenshot of my graph as well.



{"selectors":[{"parentSelectors":["busname"],"type":"SelectorText","multiple":false,"id":"owner","selector":"div.info-list-label:nth-of-type(2) div.info-list-text","regex":"","delay":""},{"parentSelectors":["busname"],"type":"SelectorLink","multiple":false,"id":"website","selector":"a.proWebsiteLink","delay":""},{"parentSelectors":["_root","pagination"],"type":"SelectorLink","multiple":true,"id":"pagination","selector":"ul.pagination a","delay":""},{"parentSelectors":["_root","pagination"],"type":"SelectorLink","multiple":true,"id":"busname","selector":"a#_h_url_paid_pro1.pro-title","delay":""},{"parentSelectors":["busname"],"type":"SelectorText","multiple":false,"id":"phone","selector":"div.pro-contact-methods > span.pro-contact-text","regex":"","delay":""},{"parentSelectors":["busname"],"type":"SelectorText","multiple":false,"id":"address","selector":"div.info-list-label:nth-of-type(3)","regex":"","delay":""}],"startUrl":"http://www.houzz.com/professionals/general-contractor","_id":"pagenationskipping"}

scraper diagram.jpg

Mārtiņš Balodis

Aug 24, 2015, 2:44:04 PM
to Nathan Shane, Web Scraper
Hi,
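The "busname" selector was only selecting the first link on each page. That's why it was extracting only a few records. In the sitemap below, its selector is changed from "a#_h_url_paid_pro1.pro-title" to "a.pro-title" so that it matches every business link. If you are going to scrape 420,000 pages, I would suggest using our enterprise service, which can do the scraping part for you: http://webscraper.io/service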

{"selectors":[{"parentSelectors":["busname"],"type":"SelectorText","multiple":false,"id":"owner","selector":"div.info-list-label:nth-of-type(2) div.info-list-text","regex":"","delay":""},{"parentSelectors":["busname"],"type":"SelectorLink","multiple":false,"id":"website","selector":"a.proWebsiteLink","delay":""},{"parentSelectors":["_root","pagination"],"type":"SelectorLink","multiple":true,"id":"pagination","selector":"ul.pagination a","delay":""},{"parentSelectors":["_root","pagination"],"type":"SelectorLink","multiple":true,"id":"busname","selector":"a.pro-title","delay":""},{"parentSelectors":["busname"],"type":"SelectorText","multiple":false,"id":"phone","selector":"div.pro-contact-methods > span.pro-contact-text","regex":"","delay":""},{"parentSelectors":["busname"],"type":"SelectorText","multiple":false,"id":"address","selector":"div.info-list-label:nth-of-type(3)","regex":"","delay":""}],"startUrl":"http://www.houzz.com/professionals/general-contractor","_id":"pagenationskipping"}
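
The only difference between the two sitemaps is the "busname" selector: the original includes the element ID "#_h_url_paid_pro1", and an ID normally identifies a single element on a page, so a selector marked "multiple":true can still match only one link. As a rough sketch (not part of Web Scraper itself), a sitemap could be checked for this pattern before running it. The function name find_id_restricted_selectors and the trimmed-down sitemap strings are hypothetical and only for illustration; the check is a simple heuristic that would also flag a "#" appearing inside an attribute value.

```python
import json
import re

# Matches an ID component such as "#main" inside a CSS selector.
# Heuristic only: a "#" inside an attribute value would also match.
ID_PATTERN = re.compile(r"#[A-Za-z_][\w-]*")


def find_id_restricted_selectors(sitemap_json):
    """Return the ids of selectors that are marked multiple=true
    but whose CSS selector contains an element ID."""
    sitemap = json.loads(sitemap_json)
    flagged = []
    for sel in sitemap.get("selectors", []):
        if sel.get("multiple") and ID_PATTERN.search(sel.get("selector", "")):
            flagged.append(sel["id"])
    return flagged


# Trimmed-down versions of the two sitemaps from this thread (assumption:
# only the fields needed for the check are kept).
original = (
    '{"selectors":['
    '{"id":"pagination","multiple":true,"selector":"ul.pagination a"},'
    '{"id":"busname","multiple":true,"selector":"a#_h_url_paid_pro1.pro-title"}'
    ']}'
)
fixed = original.replace("a#_h_url_paid_pro1.pro-title", "a.pro-title")

print(find_id_restricted_selectors(original))  # ['busname']
print(find_id_restricted_selectors(fixed))     # []
```

A flagged selector is not necessarily wrong, but it is worth checking in the selector preview whether it highlights every item on the page or just one.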
