Time or Data Limitation for Scraping?


Mark Stiner

Sep 16, 2014, 12:43:24 AM
to web-s...@googlegroups.com
Can you help me understand if there is a limitation on either the length of a scrape overall or the amount of data that will be scraped?

I have approximately 85 pages that I am going through and pagination seems to be working fine, but it stops after about 5 pages of scraping.  Do you happen to know why?

I would paste the export, but it is for a site that requires login and the scraping needed is specific to my account...

Thanks!
Mark

Mārtiņš Balodis

Sep 16, 2014, 4:53:31 AM
to Mark Stiner, web-s...@googlegroups.com
There is no time, page, or data size limit. The only limit is that Chrome might crash, but that tends to happen only when you scrape thousands of pages. The likely problem here is that only 5 pagination links are visible on the initial page, and your selectors are not set up to discover new pagination links on the pagination pages themselves. If that is the case, make the pagination link selector a parent of itself. There is also a video tutorial on pagination; the end of the video shows how to make a pagination selector discover new pagination links.
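
As a rough illustration of the "parent of itself" setup, here is a minimal sketch that builds such a sitemap in Python and prints JSON in the same shape as the sitemaps pasted in this thread. The start URL, the selector ids (`pagination`, `item`), and the CSS selectors are placeholders assumed for the example, not taken from any real site.

```python
import json

# Hypothetical sitemap: ids, CSS selectors and start URL are placeholders.
sitemap = {
    "_id": "pagination-example",
    "startUrl": "https://example.com/list",
    "selectors": [
        {
            "id": "pagination",
            "type": "SelectorLink",
            # Listing "pagination" as one of its own parents is what lets
            # pagination links be looked up again on each paginated page.
            "parentSelectors": ["_root", "pagination"],
            "selector": "a.page-link",
            "multiple": True,
            "delay": "",
        },
        {
            "id": "item",
            "type": "SelectorText",
            # Data selectors are attached to the start page and to every
            # page reached through the pagination selector.
            "parentSelectors": ["_root", "pagination"],
            "selector": "div.item",
            "multiple": True,
            "regex": "",
            "delay": "",
        },
    ],
}

sitemap_json = json.dumps(sitemap)
print(sitemap_json)
```

The key detail is that `"pagination"` appears in its own `parentSelectors` list; without it, only the links visible on the start page would be followed.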



Mark Stiner

Sep 16, 2014, 6:06:20 AM
to web-s...@googlegroups.com, mst...@gmail.com
Thanks for the response; this is great information to know. The nice thing about pagination on this site is that there is a Next button. Under the root, after I have traversed the page, I put another selector of type Element click (I could not use Link in this case), and it is the same CSS id to pick up on every time. I'll keep working with it and see if I can figure it out. Thanks!

Mark Stiner

Sep 16, 2014, 6:23:40 AM
to web-s...@googlegroups.com, mst...@gmail.com
I wanted to add another thing. I also have the option of using your square-bracket URL solution for pagination. I have tried adding [1-x] to the URL, where x is the number of pages, but it does not seem to work. The catch with this specific URL is that the /page/[1-x]/ segment sits right in the middle of the URL, so I can't put it at the end as you do in your documentation.

So, my question is whether this tool is currently coded to only handle this parameter at the end of the URL?  Or, should it work if used in the middle as well?

Also, even though I noted I had to be logged in on this site, it seems that you don't have to be to view the page I'm on.  You can try this URL and sitemap to see if you can replicate my issue with pagination.


{"startUrl":"https://www.fanduel.com/contest/5391808/scoring/page/1/lineup/41447572/","selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"click_teams","selector":"div.slot.even div.roster","clickElementSelector":"tr:nth-of-type(1) a.truncate, tr:nth-of-type(2) a.truncate, tr:nth-of-type(3) a.truncate, tr:nth-of-type(4) a.truncate, tr:nth-of-type(5) a.truncate, tr:nth-of-type(6) a.truncate, tr:nth-of-type(7) a.truncate, tr:nth-of-type(8) a.truncate, tr:nth-of-type(9) a.truncate, tr:nth-of-type(10) a.truncate","clickType":"clickOnce","discardInitialElements":false,"delay":"600"},{"parentSelectors":["click_teams"],"type":"SelectorHTML","multiple":true,"id":"player_html","selector":"div.roster-row","regex":"","delay":""},{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":false,"id":"next","selector":"button.button.next","clickElementSelector":"button.button.next","clickType":"clickMore","discardInitialElements":false,"delay":"200"}],"_id":"fanduel3"}

Mārtiņš Balodis

Sep 16, 2014, 2:18:16 PM
to Mark Stiner, web-s...@googlegroups.com
I added the range URL and it worked for me. It shouldn't matter where the range is placed; the only limitation is that you can use only one range in a URL. Could you confirm whether this sitemap also fails to go through pages 1 to 5?

{"selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"click_teams","selector":"div.slot.even div.roster","clickElementSelector":"tr:nth-of-type(1) a.truncate, tr:nth-of-type(2) a.truncate, tr:nth-of-type(3) a.truncate, tr:nth-of-type(4) a.truncate, tr:nth-of-type(5) a.truncate, tr:nth-of-type(6) a.truncate, tr:nth-of-type(7) a.truncate, tr:nth-of-type(8) a.truncate, tr:nth-of-type(9) a.truncate, tr:nth-of-type(10) a.truncate","clickType":"clickOnce","discardInitialElements":false,"delay":"600"},{"parentSelectors":["click_teams"],"type":"SelectorHTML","multiple":true,"id":"player_html","selector":"div.roster-row","regex":"","delay":""}],"startUrl":"https://www.fanduel.com/contest/5391808/scoring/page/[1-5]/lineup/41447572/","_id":"fanduel3-multiple-start-urls"}
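
To make the range behaviour concrete, here is a minimal sketch of how a single `[first-last]` range anywhere in a URL can be expanded into one URL per page. This only illustrates the idea and is not the extension's own code; the function name `expand_range_url` is made up for the example, and the order in which pages are actually visited is not modelled.

```python
import re

# Matches one numeric range such as [1-5] anywhere in the URL.
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_range_url(url):
    """Return one URL per number in the first [first-last] range found."""
    match = RANGE.search(url)
    if match is None:
        # No range present: the URL is used as-is.
        return [url]
    first, last = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(first, last + 1)]


urls = expand_range_url(
    "https://www.fanduel.com/contest/5391808/scoring/page/[1-5]/lineup/41447572/"
)
for u in urls:
    print(u)
```

Because the range is replaced in place, it makes no difference whether it sits at the end of the URL or in the middle.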

Mark Stiner

Sep 17, 2014, 3:19:07 PM
to web-s...@googlegroups.com, mst...@gmail.com
Thank you for the response. 1-5 does work, that's great! I tried 1-85, though, and when it tries to go to page 85, for some odd reason it goes to page 54. For page 84 it also goes to 54, 83 goes to 54, and so on. Once it gets to 53, it starts paginating correctly and goes to 53, 52, 51, and down to 1.

Can you also try a number above 54 and see if you get the same?  Thanks again for all of your great support.

Mark Stiner

Sep 17, 2014, 3:27:56 PM
to web-s...@googlegroups.com, mst...@gmail.com
I figured out why it keeps going to 54. It's not that the pagination isn't working; it's just how this site works. My team placed on the 54th page. On any page above 54, my team name is shown at the top, in the first spot of the team listing. The click_teams selector clicks that entry first, and the website then jumps to the page on which my team is listed based on my position.

I've worked around this by basically ignoring the 1st and 11th items in the list so that it never clicks on my team name and jumps pages.  This will work just fine for me.
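
One way to express that workaround is to generate the `clickElementSelector` value so that it leaves out the rows to ignore. The sketch below assumes the row pattern `tr:nth-of-type(n) a.truncate` from the sitemap earlier in the thread and assumes the entries to skip are rows 1 and 11; the helper name `click_selector` is made up for the example.

```python
def click_selector(rows, skip=()):
    """Build a comma-separated CSS selector for the given table rows,
    leaving out any row numbers listed in `skip`."""
    return ", ".join(
        f"tr:nth-of-type({n}) a.truncate" for n in rows if n not in skip
    )


# Rows 1 to 11, skipping the 1st and 11th entries (assumed positions).
selector = click_selector(range(1, 12), skip={1, 11})
print(selector)
```

The printed string can be pasted into the `clickElementSelector` field of the `click_teams` selector.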



Eric Crocker

Feb 3, 2015, 2:16:43 PM
to web-s...@googlegroups.com, mst...@gmail.com
Mark,

How did you get the extraction combined with pagination to work without the brackets in the URL? I took your sitemap and ran it against a more recent contest (https://www.fanduel.com/contest/10251842/scoring/). However, the data extracted is only from the middle 9 entries on page 3 (4 pages total in this example). I'm aware that this works if you put the page numbers in brackets in the URL, but I'm trying to avoid having to visit the URL first in order to identify the number of pages.

Martins - Alternatively to using the "Next" button repeatedly, is there a way to dynamically find the total number of pages and input that into the URL?

Appreciate the help,
Eric

Mark's original sitemap w/ updated URL:
{"selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"click_teams","selector":"div.slot.even div.roster","clickElementSelector":"tr:nth-of-type(1) a.truncate, tr:nth-of-type(2) a.truncate, tr:nth-of-type(3) a.truncate, tr:nth-of-type(4) a.truncate, tr:nth-of-type(5) a.truncate, tr:nth-of-type(6) a.truncate, tr:nth-of-type(7) a.truncate, tr:nth-of-type(8) a.truncate, tr:nth-of-type(9) a.truncate, tr:nth-of-type(10) a.truncate","clickType":"clickOnce","discardInitialElements":false,"delay":"600"},{"parentSelectors":["click_teams"],"type":"SelectorHTML","multiple":true,"id":"player_html","selector":"div.roster-row","regex":"","delay":""},{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":false,"id":"next","selector":"button.button.next","clickElementSelector":"button.button.next","clickType":"clickMore","discardInitialElements":false,"delay":"200"}],"startUrl":"https://www.fanduel.com/contest/10251842/scoring/","_id":"import_test"}

Mark Stiner

Feb 5, 2015, 2:02:04 PM
to Eric Crocker, web-s...@googlegroups.com
Hello. I ended up using the square brackets per the documentation, and it did work. Two things: because of the tricky pagination, I had to extract only 9 entries. Basically, if the scrape hit my own entry during pagination (which would show up in the first row of the table), it threw things off. To circumvent that, I went after 9 entries and figured they would give me a good enough sampling of data for my needs.

Also, I had to go to the site ahead of time and modify the sitemap each time for the number of pages.  I did not go far enough to determine an alternate route!  Good luck!
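
For anyone repeating that manual step, a small script can at least do the sitemap edit. The sketch below rewrites the `/page/.../` segment of `startUrl` to a `[1-N]` range before the sitemap is imported. It assumes the URL shape used earlier in this thread and that the page count is already known; it does not discover the count, and the helper name `set_page_range` is made up for the example.

```python
import json
import re


def set_page_range(sitemap_json, pages):
    """Return the sitemap JSON with /page/<n>/ or /page/[a-b]/ in startUrl
    replaced by /page/[1-pages]/."""
    sitemap = json.loads(sitemap_json)
    sitemap["startUrl"] = re.sub(
        r"/page/(\d+|\[\d+-\d+\])/",
        f"/page/[1-{pages}]/",
        sitemap["startUrl"],
        count=1,
    )
    return json.dumps(sitemap)


# Trimmed-down sitemap with the selectors left out for brevity.
original = json.dumps({
    "startUrl": "https://www.fanduel.com/contest/5391808/scoring/page/1/lineup/41447572/",
    "selectors": [],
    "_id": "fanduel3",
})
updated = set_page_range(original, 85)
print(updated)
```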