Crawl updated to 500k URLs

Patrick Meenan

Dec 1, 2014, 11:01:37 AM
to httpa...@googlegroups.com
FYI, the crawl for 12/1 was expanded to crawl 500k URLs (up from 300k).  That means results will probably be a couple of days later than normal.

Thanks,

-Pat

Charlie Clark

Dec 1, 2014, 2:41:35 PM
to httpa...@googlegroups.com
On .12.2014 at 17:01, Patrick Meenan <patm...@gmail.com> wrote:

> FYI, the crawl for 12/1 was expanded to crawl 500k URLs (up from 300k).
> That means results will probably be a couple of days later than normal.

Thanks very much for the info, Patrick.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

Charlie Clark

Dec 8, 2014, 5:50:47 AM
to httpa...@googlegroups.com
On .12.2014 at 17:01, Patrick Meenan <patm...@gmail.com> wrote:

> FYI, the crawl for 12/1 was expanded to crawl 500k URLs (up from 300k).
> That means results will probably be a couple of days later than normal.

Hiya Pat,

from the New Relic dashboard it looks like the number crunching has
finished and the mobile stats are available but not the desktop ones. Are
they still being processed? Or has there been a hold-up?

Patrick Meenan

Dec 8, 2014, 8:06:36 AM
to httpa...@googlegroups.com
After the main crawl is done there are a few retry cycles for straggler URLs that failed the first time around, followed by the aggregation.  Looking at the test backlog, it may still be doing small batches of the cleanup sweeps: http://httparchive.webpagetest.org/getLocations.php

At some point it would probably be worthwhile to resubmit the failures as they are detected instead of at the end (or some other logic to minimize the retries).
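
A rough sketch of what resubmitting failures as they are detected could look like, as opposed to sweeping them up at the end. This is only an illustration and not the crawl's actual code: `crawl_with_inline_retries`, `run_test`, and `max_retries` are hypothetical names, and `run_test` stands in for whatever submits a URL for testing and reports success or failure.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple


def crawl_with_inline_retries(
    urls: Iterable[str],
    run_test: Callable[[str], bool],  # hypothetical: True if the test succeeded
    max_retries: int = 2,             # hypothetical retry limit per URL
) -> Tuple[Dict[str, int], List[str]]:
    """Test every URL, putting a failed URL back on the queue right away.

    Returns (attempts used per successful URL, URLs that never succeeded).
    """
    # Each queue entry is (url, number of failed attempts so far).
    queue: Deque[Tuple[str, int]] = deque((url, 0) for url in urls)
    succeeded: Dict[str, int] = {}
    gave_up: List[str] = []

    while queue:
        url, failures = queue.popleft()
        if run_test(url):
            succeeded[url] = failures + 1
        elif failures < max_retries:
            # Resubmit as soon as the failure is detected, so the retry is
            # interleaved with the remaining work instead of a final sweep.
            queue.append((url, failures + 1))
        else:
            gave_up.append(url)

    return succeeded, gave_up
```

With this shape the retries share the queue with first attempts, so the backlog drains once rather than in separate cleanup bumps after the main crawl.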


Charlie Clark

Dec 8, 2014, 11:06:01 AM
to httpa...@googlegroups.com
On .12.2014 at 14:06, Patrick Meenan <patm...@gmail.com> wrote:

> After the main crawl is done there are a few retry cycles for straggler
> URLs that failed the first time around and then the aggregation. Looking
> at the test backlog it looks like it may still be doing small batches of
> the cleanup sweeps: http://httparchive.webpagetest.org/getLocations.php

Looks empty now. Does it then run the stat generation?

> At some point it would probably be worthwhile to resubmit the failures as
> they are detected instead of at the end (or some other logic to minimize
> the retries).

How many is it? Likely to be exponential with the new intake of websites.

Patrick Meenan

Dec 8, 2014, 12:28:34 PM
to httpa...@googlegroups.com
Sorry, I don't know the specifics; hopefully Steve will chime in.  Looking at the New Relic graphs, there were a couple of bumps after the main crawl finished (and when I peeked a minute ago there were 20k tests).  I know Steve recently increased the retry count, but I'm not sure what it is set to or how the logic works.

Charlie Clark

Dec 8, 2014, 1:13:53 PM
to httpa...@googlegroups.com
On .12.2014 at 18:28, Patrick Meenan <patm...@gmail.com> wrote:

> Sorry, I don't know the specifics, hopefully Steve will chime in. Looking
> at the newrelic graphs it looks like there were a couple of bumps after the
> main crawl finished (and when I peeked a minute ago there were 20k tests).
> I know Steve recently increased the retry count but I'm not sure what it is
> set to or how the logic works.

Now down to about 12k, which suggests another couple of hours. It was
totally idle a couple of hours ago.

Charlie Clark

Dec 9, 2014, 3:35:23 AM
to httpa...@googlegroups.com
On .12.2014 at 19:13, Charlie Clark <charli...@clark-consulting.eu> wrote:

> Now down to about 12k, which suggests another couple of hours. It was
> totally idle a couple of hours ago.

Everything seems to have finished now. Guessing based on the New Relic
dashboard (unhelpfully not localised in region and timezone) suggests
everything finished around 00:00 GMT on 9th December.