Extracting data from a set of tables, next, extract, next extract


Mark S

Oct 24, 2014, 5:24:53 PM
to web-s...@googlegroups.com
I would like to scrape the data from all 3545 tables from the following URL :-


I can't understand how to :-

- scrape table
- click 'Next page'
- repeat (until no more pages)

Am I missing a loop construct or perhaps a way to create a loop?

Any guidance would be very much appreciated.

-- 
Mark

Mark S

Oct 24, 2014, 5:43:52 PM
to web-s...@googlegroups.com
I was using Link, but see now that this is for scraping links rather than following them.

ElementClick seems better for my needs, and I can now get two tables, so it is the 'loop' that eludes me: I would like to get the table from each of the available pages.

My sitemap is currently :-

{"_id":"test3","startUrl":"http://spectruminfo.ofcom.org.uk/spectrumInfo/licences?service=all&submit=Submit+search&googoffset=458.1&ne=(58.950008233357046%2c+3.1640625)&nw=(58.950008233357046%2c+-12.83203125)&unit=GHz&sw=(49.66762782262192%2c+-12.83203125)&freqStop=9999999&googloc=(54.57206165565852%2c+-4.833984375)&se=(49.66762782262192%2c+3.1640625)&freqStart=0&page=1","selectors":[{"parentSelectors":["_root","next"],"type":"SelectorTable","multiple":true,"id":"gettable","selector":"table.licencesTable","tableHeaderRowSelector":"thead tr","tableDataRowSelector":"tbody tr","columns":[{"header":"Licence number","name":"Licence number","extract":true},{"header":"Sector","name":"Sector","extract":true},{"header":"Class","name":"Class","extract":true},{"header":"Licensee","name":"Licensee","extract":true},{"header":"Frequencies","name":"Frequencies","extract":true},{"header":"Location(s)","name":"Location(s)","extract":true}],"delay":""},{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"next","selector":"table.licencesTable","clickElementSelector":"a:nth-of-type(10)","clickElementUniquenessType":"uniqueHTMLText","clickType":"clickMore","discardInitialElements":false,"delay":""}]}

-- 
Mark
 

Mārtiņš Balodis

Oct 27, 2014, 9:55:32 AM
to Mark S, web-s...@googlegroups.com
Hi,
The site didn't load for me, so I couldn't test your sitemap. The Link selector works for link URL extraction and also for link following. If the URL changes after clicking a pagination link, then the Link selector should be the solution. The Element Click selector should be used if the paginated data is loaded dynamically. Here are two video tutorials on handling pagination with the Link selector and with the Element Click selector:
http://webscraper.io/tutorials


Mark S

Oct 27, 2014, 3:54:19 PM
to web-s...@googlegroups.com, markl...@gmail.com
Thanks for trying to use my sitemap Mārtiņš.

I agree that I am now using the correct selectors, but the part that stumps me is how to continue 'Element Click' through *all* of the 'Next page' links and gather the table for each (there are more than 2000 'Next page' links). Is there an implied 'loop' or iteration back to root each time? If so, how do I drive a repeat ('Element Click' on Next page, select table, 'Element Click' on Next page, select table, ...)?

-- 
Mark

Mārtiņš Balodis

Oct 29, 2014, 4:33:44 PM
to Mark S, web-s...@googlegroups.com
Hi,
You used the Element Click selector, but in this case the Link selector was needed. The Link selector can navigate to all pagination pages and extract data from them. Here is a sitemap for this site. Look at the selector graph and you should understand how it works.

{"selectors":[{"parentSelectors":["_root","pagination"],"type":"SelectorTable","multiple":true,"id":"gettable","selector":"table.licencesTable","tableHeaderRowSelector":"thead tr","tableDataRowSelector":"tbody tr","columns":[{"header":"Licence number","name":"Licence number","extract":true},{"header":"Sector","name":"Sector","extract":true},{"header":"Class","name":"Class","extract":true},{"header":"Licensee","name":"Licensee","extract":true},{"header":"Frequencies","name":"Frequencies","extract":true},{"header":"Location(s)","name":"Location(s)","extract":true}],"delay":""},{"parentSelectors":["_root","pagination"],"type":"SelectorLink","multiple":true,"id":"pagination","selector":"a:nth-of-type(n+3)","delay":""}],"_id":"test3","startUrl":"http://spectruminfo.ofcom.org.uk/spectrumInfo/licences?service=all&submit=Submit+search&googoffset=458.1&ne=(58.950008233357046%2c+3.1640625)&nw=(58.950008233357046%2c+-12.83203125)&unit=GHz&sw=(49.66762782262192%2c+-12.83203125)&freqStop=9999999&googloc=(54.57206165565852%2c+-4.833984375)&se=(49.66762782262192%2c+3.1640625)&freqStart=0&page=1"}
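
For readers wondering where the 'loop' comes from: in the sitemap above, the "pagination" selector lists itself among its own parentSelectors, so each page reached through a pagination link is again searched for the table and for further pagination links. The Python sketch below only illustrates that repetition on a made-up three-page site. The SITE dictionary and the scrape_all function are hypothetical names for this example and are not part of Web Scraper.

```python
# Hypothetical stand-in for a paginated site:
# page number -> (table rows on that page, next page number or None)
SITE = {
    1: (["row a", "row b"], 2),
    2: (["row c"], 3),
    3: (["row d"], None),
}


def scrape_all(start):
    """Collect rows from `start` onwards, following 'next' until it runs out."""
    rows = []
    visited = set()
    page = start
    while page is not None and page not in visited:
        visited.add(page)            # guard against pagination cycles
        table, next_page = SITE[page]
        rows.extend(table)           # step 1: scrape the table
        page = next_page             # step 2: follow the 'next' link
    return rows


print(scrape_all(1))
```

The visited set matters on real sites, where pagination links often point back to pages that were already scraped.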

Mark S

Nov 19, 2014, 4:57:35 PM
to web-s...@googlegroups.com, markl...@gmail.com
While clicking through the website, it finally clicked for me that my problem was the way I was selecting the link.

On the first page of results, the "Next Page" link was "a:nth-of-type(10)"; however, on the second page the "First Page" and "Previous Page" links became active, making "Next Page" "a:nth-of-type(12)". This meant webscraper was clicking the link I had specified, just not the one I wanted.

I resolved my issue by starting on the last page and working back through "Previous Page", which was consistently "p:nth-of-type(5) a:nth-of-type(2)", at least right up until the first page.

So what I was missing is how best to click a link based on its displayed text ("Next Page"), even when its relative position in the array changes.

Currently I am making do (by traversing the site backwards), but can anyone point me to how to click on a link based on the text of the link rather than its relative position?
My perfect arrangement would be to:

- gather data from the first page, then click "Next Page" ("a:nth-of-type(10)")
- gather the page displayed and then click "Next Page" ("a:nth-of-type(12)"), right through until the last page, where I just capture the table content and stop.

-- 
Mark

Mārtiņš Balodis

Nov 24, 2014, 12:39:30 PM
to Mark S, web-s...@googlegroups.com
Hi,
You can write a custom CSS selector like this: a:contains('next')
You can find more about css selectors here:
http://www.w3schools.com/cssref/css_selectors.asp
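
To illustrate why matching on link text is more robust than matching on position, here is a small Python sketch using only the standard library. The two HTML snippets, the LinkTextFinder class and the find_link_by_text function are made up for this example and have nothing to do with Web Scraper's internals. Note that :contains is a jQuery extension rather than standard CSS, and in jQuery it matches case-sensitively; the sketch below lowercases both sides, which is a choice of this example only.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> whose text contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted.lower()
        self.href = None      # href of the <a> currently open, if any
        self.text = []        # text chunks seen inside that <a>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            if self.wanted in "".join(self.text).lower():
                self.matches.append(self.href)
            self.href = None


def find_link_by_text(html, wanted):
    """Return the href of the first link whose text contains `wanted`, or None."""
    finder = LinkTextFinder(wanted)
    finder.feed(html)
    finder.close()
    return finder.matches[0] if finder.matches else None


# Hypothetical pagers: the position of "Next Page" differs between pages.
PAGE_1 = '<p><a href="?page=2">Next Page</a> <a href="?page=9">Last Page</a></p>'
PAGE_2 = ('<p><a href="?page=1">First Page</a> <a href="?page=1">Previous Page</a> '
          '<a href="?page=3">Next Page</a> <a href="?page=9">Last Page</a></p>')

print(find_link_by_text(PAGE_1, "Next Page"))
print(find_link_by_text(PAGE_2, "Next Page"))
```

On the first snippet "Next Page" is the first link and on the second it is the third, yet the same text lookup finds it in both cases.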



Brad

Jan 10, 2015, 10:17:23 AM
to web-s...@googlegroups.com, markl...@gmail.com

Mārtiņš,

First, let me say what an excellent tool you've developed, thank you! My issue is similar to Mark's. I tried the pagination described in the tutorial and ran into an issue where it would scrape pages 1 and 2, then jump to page 5, go back to 4, and then loop infinitely.
Next I tried Mark's method of working backwards. That worked great from page 18 back to 3, but then it skipped 2, went to 1, and looped between 3 and 1. Can you help with this?
I have included the sitemap for Mark's method of working backwards.

{"_id":"jegs","startUrl":"http://www.jegs.com/webapp/wcs/stores/servlet/KeywordSearchCmd?No=1530&Nty=0&catalogId=10002&Ntk=all&Jnar=0&Ne=1%202%203%2013%201147708%201500000%201500000%201500000%201500000%201500000%201500000%201500000%201500000%201500000+1500000&itemPerPage=90&langId=-1&storeId=10001&N=0&Ntt=094+","selectors":[{"parentSelectors":["_root","pagination"],"type":"SelectorElement","multiple":true,"id":"items","selector":"table.searchresults > tbody > tr > td","delay":""},{"parentSelectors":["items"],"type":"SelectorText","multiple":false,"id":"price","selector":"div.price","regex":"","delay":""},{"parentSelectors":["items"],"type":"SelectorText","multiple":false,"id":"sku","selector":"table.searchitem > tbody > tr > td.partno a","regex":"","delay":""},{"parentSelectors":["items"],"type":"SelectorText","multiple":false,"id":"description","selector":"tr:nth-of-type(5) table","regex":"","delay":""},{"parentSelectors":["_root","pagination"],"type":"SelectorLink","multiple":true,"id":"pagination","selector":"tr:nth-of-type(6) a:nth-of-type(2)","delay":""}]} 

Mārtiņš Balodis

Jan 15, 2015, 4:02:43 AM
to Brad, web-scraper, Mark S
Hi Brad,
If you wait until the end of the scraping process, you will see that Web Scraper eventually visits the pages it skipped at the beginning. It skips links at the beginning because of its underlying data structures and algorithms (LIFO). In this case I would suggest that you select all pagination links instead of selecting only one link.
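
The effect of a LIFO (last in, first out) queue on visit order can be sketched in a few lines of Python. This is only an illustration of the general idea, not Web Scraper's actual implementation; crawl_lifo and pagination_links are hypothetical names, and the pager that links to two pages on either side is an assumption made for the example.

```python
def crawl_lifo(start, links_on_page):
    """Visit pages using a LIFO stack and return the order of visits."""
    stack = [start]
    seen = {start}
    order = []
    while stack:
        page = stack.pop()              # take the most recently queued page
        order.append(page)
        for target in links_on_page(page):
            if target not in seen:      # queue each page only once
                seen.add(target)
                stack.append(target)
    return order


def pagination_links(page, last=6, window=2):
    """Hypothetical pager: links to up to `window` pages on either side."""
    return [p for p in range(page - window, page + window + 1)
            if 1 <= p <= last and p != page]


order = crawl_lifo(1, pagination_links)
print(order)
```

With these assumptions the pages are visited out of numeric order, because the newest link is always taken first, but every page is still visited exactly once by the time the stack is empty.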