Resuming after a timeout


Darrell Smith

Apr 30, 2013, 10:33:16 AM
to scrap...@googlegroups.com
Hi
I've got my scraper working fine, and it resumes from the last record after a timeout.

The problem is how to make it resume automatically. At the moment I have to watch it run, wait for it to time out and then restart it.
Is there a function that can either

restart the scrape after it times out
or
run for a certain number of CPU seconds and then restart itself?


Cheers
Darrell


Aidan Hobson Sayers

Apr 30, 2013, 12:12:34 PM
to scrap...@googlegroups.com
Hi

I'm a little unclear about what you want from this - by 'timeout' are you talking about the scraper run-time limit (which you seem to be)?

For obvious reasons I doubt the ScraperWiki team are eager to support/suggest ways of bypassing this. It's an interesting problem and I can think of one way to do it, but can also easily see possible abuse.

Why do you need this? Is there a big backlog of data you're trying to get to begin with?

Aidan




Darrell Smith

Apr 30, 2013, 12:33:28 PM
to scrap...@googlegroups.com
Hi
Yeah, it's a big ask, as the timeouts are obviously there to protect the system.
My situation is that I want to scrape a large site with several hundred pages of products. Each page has a potential index of A0 through to Z300 (so 7800 potential pages, although the actual page count is around 550-600), but I don't have a record of the actual indexes, so I'm systematically going through each possible page index (A1, A2, A3 ... Z297, Z298, Z299, Z300).
I'm skipping any that don't have relevant data in them and scraping the ones that do.
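
For illustration, the candidate references described above could be enumerated as in the sketch below. The function name candidateRefs and the 1 to 300 index range are assumptions (26 letters × 300 indexes gives the 7800 figure); this is not taken from the actual scraper.

```php
<?php
// Yield every candidate property reference, A1 through Z300.
// Hypothetical helper: the 1..$maxIndex range is an assumption
// inferred from the "7800 potential pages" figure.
function candidateRefs(int $maxIndex = 300): Generator
{
    foreach (range('A', 'Z') as $letter) {
        for ($index = 1; $index <= $maxIndex; $index++) {
            yield $letter . $index;
        }
    }
}

$refs = iterator_to_array(candidateRefs(), false);
echo count($refs), "\n";              // 7800
echo $refs[0], ' ', end($refs), "\n"; // A1 Z300
```

Having the references in one flat list means a single integer position is enough to record how far a run got.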

I'm finding that I hit several timeouts before the entire site has been processed.

I know this is an awkward way to do this but without any idea of the page indexes I need I have to check all possibilities.

I've found that the timeouts occur when I'm hitting a lot of pages that have relevant data, so it's not the sheer number of requests that's timing out the scrape but the CPU cycles needed to process the page and strip the data.

I'm currently using save_var and get_var to 'save my place' between timeouts. So one possibility is to run an hourly scrape and allow each scrape to pick up from where the last one stopped.
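
A minimal sketch of that idea, assuming a self-imposed CPU budget per run: the loop stops once the budget is spent and returns the position to resume from. The names cpuSecondsUsed and runBatch are hypothetical; getrusage() is standard PHP. The scraperwiki::get_var / scraperwiki::save_var calls appear only in comments, and their exact signatures are an assumption.

```php
<?php
// CPU seconds (user + system) consumed by this process so far.
function cpuSecondsUsed(): float
{
    $u = getrusage();
    return $u['ru_utime.tv_sec'] + $u['ru_utime.tv_usec'] / 1e6
         + $u['ru_stime.tv_sec'] + $u['ru_stime.tv_usec'] / 1e6;
}

// Process $refs from position $start until the list ends or the CPU
// budget (in seconds) is used up. Returns the position to resume from.
function runBatch(array $refs, int $start, float $budget, callable $process): int
{
    $began = cpuSecondsUsed();
    for ($i = $start; $i < count($refs); $i++) {
        if (cpuSecondsUsed() - $began >= $budget) {
            break; // stop early, before the platform limit is reached
        }
        $process($refs[$i]);
    }
    return $i;
}

// Usage inside a scheduled scraper (assumed API, shown as comments):
//   $start = scraperwiki::get_var('next_position', 0);
//   $next  = runBatch($refs, $start, 40.0, 'scrapeOne');
//   scraperwiki::save_var('next_position', $next);
```

Saving the position after every page rather than once at the end would be safer if the run can be killed without warning; the budget of 40 seconds is only a placeholder.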

Any thoughts would be great

Darrell

Aidan Hobson Sayers

Apr 30, 2013, 12:50:22 PM
to scrap...@googlegroups.com
Do you have a link you can share to your scraper?

Aidan

Aidan Hobson Sayers

Apr 30, 2013, 1:28:26 PM
to scrap...@googlegroups.com
Ah, found it - https://scraperwiki.com/scrapers/php_test_2/edit/

Looking at it I would imagine you can cut down on your CPU time by rearranging

        $url = "http://www.helpfulholidays.com/property.asp?ref=".$letter.$index."&year=2013";
        $dom = new simple_html_dom();
        $html = file_get_html($url);
        $dom->load($html);

        if(fourOhfour($dom)){
            $index++;
            continue;
        }

Into

        $url = "http://www.helpfulholidays.com/property.asp?ref=".$letter.$index."&year=2013";
        $html = scraperWiki::scrape($url);
        if(fourOhfour($html)){
            $index++;
            continue;
        }
        $dom = str_get_html($html);

With a new definition of 'fourOhfour' like

function fourOhFour($html){
    if(strpos($html,"<h1>No Active 2013 Property details found for A29</h1>",10000)!==false){
        return true;
    }
    return false;
}

To summarise:
  • You were parsing the HTML into a DOM with file_get_html and then loading it into a different object with $dom->load. Now you get the HTML string and load that instead.
  • Don't bother initialising a DOM object, parsing the entire DOM, searching the entire DOM and comparing the string; just search for the string in the source.
  • Search from character 10000 because everything before that seems to be boilerplate, and it is a (very) safe distance before the value we're looking for. You could probably bump it up to 20000.
  • Make the string search case sensitive - it's going to be generated by the backend and will therefore almost certainly be the same thing every time.
  • I prefer the style of explicitly returning booleans :)
This should be faster (I can't imagine it being slower!). Whether it's usefully faster, I don't know. I'd be interested to know, having never used PHP before.


Aidan

Aidan Hobson Sayers

Apr 30, 2013, 1:52:46 PM
to scrap...@googlegroups.com
Oops, the fourOhFour search string should be "<h1>No Active 2013 Property details found for "
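
Putting that correction together with the earlier definition, a sketch might look like the following. The $offset parameter and the fallback for short pages are additions for illustration, not part of the original suggestion; 10000 is the offset proposed earlier in the thread, not a measured value.

```php
<?php
// True when the page is the "no property" placeholder rather than a listing.
function fourOhFour(string $html, int $offset = 10000): bool
{
    $marker = '<h1>No Active 2013 Property details found for ';
    // An offset beyond the end of the string raises a ValueError on PHP 8
    // (a warning on older versions), so search from the start instead.
    if ($offset > strlen($html)) {
        $offset = 0;
    }
    return strpos($html, $marker, $offset) !== false;
}
```

Because the marker no longer includes a specific reference such as A29, the same check works for every candidate page.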

Aidan

Páll Hilmarsson

Apr 30, 2013, 2:02:45 PM
to scrap...@googlegroups.com

This may be totally useless (I haven't looked at the scraper or the source), but could you get away with making a HEAD request? If that returns anything other than 200, then skip...

P

pal...@gogn.in | http://gogn.in | http://twitter.com/pallih | https://github.com/pallih

PGP: C266 603E 9918 A38B F11D 9F9B E721 347C 45B1 04E9

Darrell Smith

Apr 30, 2013, 2:29:29 PM
to scrap...@googlegroups.com
Hi Aidan
Thanks for the tips, some good points there. I admit it's a bit quick and dirty. It's part trying out ScraperWiki and part coding the scraper, and very much a work in progress :)

Originally I was working in beta ScraperWiki and did try using scraperWiki::scrape, but I kept getting errors, so this was just a copy and paste job into the main ScraperWiki. Hence the scrappy use of simple_html_dom. I'd have tidied up if I'd known you were popping round ;)



Páll


That's a fair point, but sadly it's not that easy, as the page simply has "No Active 2013 Property details found" in big letters. So it's not a true 404, just a page that contains no data I need.