Data from 15th March running late

Charlie Clark

Mar 25, 2015, 2:10:28 PM

to httpa...@googlegroups.com

Hiya Patrick,

From the New Relic dashboard it looks like there was a problem with the
server this month. Would this explain why the main data set from the 15th
isn't available yet?

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

Steve Souders

Mar 25, 2015, 2:24:05 PM

to httpa...@googlegroups.com

Just running slow. Should be done soon.

Patrick Meenan

Mar 25, 2015, 2:36:22 PM

to httpa...@googlegroups.com

Yeah, not entirely sure what's going on with the inconsistent bandwidth utilization that is dragging things out. The agents and server are all fine, so I'm assuming there is some connectivity flakiness, but I need to check with the hosting provider and see if it is still going on. We had some alarms go off about ping failures that correspond with some of the dips (though I could reach the site fine), which makes me think it was a peering or similar issue.

--
You received this message because you are subscribed to the Google Groups "HTTP Archive" group.
To unsubscribe from this group and stop receiving emails from it, send an email to httparchive...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Charlie Clark

Mar 26, 2015, 4:48:06 AM

to httpa...@googlegroups.com

On .03.2015, 19:36, Patrick Meenan <patm...@gmail.com> wrote:

> Yeah, not entirely sure what's going on with the inconsistent bandwidth
> utilization that is dragging things out. The agents and server are all
> fine so I'm assuming there is some connectivity flakiness but I need to
> check with the hosting provider and see if it is still going on. We had
> some alarms go off about ping failures that correspond with some of the
> dips (though I could reach the site fine) which makes me think it was a
> peering or similar issue.

Just based on the CSV export it looks like a lot of sites were missed in
this run.

Patrick Meenan

Mar 26, 2015, 9:09:34 AM

to httpa...@googlegroups.com

OK, I think I found the root cause and it's totally my fault (the 3/15 run is probably not going to have much, if any, data).

We are getting ready to deploy additional servers, and to prepare for that I created a new subnet and virtual interface on the private network so we'd have enough address space for all of the VMs. The regular private net is on eth1 and I set up the new interface as eth1:1. Unfortunately, when I configured the IP on the new subnet I had it set up as eth1 (without the :1), so the VMs could sort of talk to the server but all of the routing and NAT was messed up. It's all fixed now and I ran a bunch of one-off tests to make sure things look OK.
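To illustrate the kind of mismatch described above (an address from the new subnet ending up on eth1 instead of eth1:1), here is a minimal sketch of a sanity check. It is not the actual server configuration: the subnets, the addresses, and the function name `misplaced_addresses` are all hypothetical, and only the interface labels eth1 and eth1:1 come from the post.

```python
import ipaddress


def misplaced_addresses(plan, assignments):
    """List addresses that sit on a label whose planned subnet does not contain them.

    plan:        {label: "network/prefix"}, the intended layout (hypothetical values)
    assignments: [(label, "address/prefix"), ...], what is actually configured
    """
    problems = []
    for label, cidr in assignments:
        iface = ipaddress.ip_interface(cidr)
        planned = plan.get(label)
        if planned is None:
            # Label is not part of the intended layout at all.
            problems.append((label, str(iface.ip), "unplanned"))
            continue
        expected = ipaddress.ip_network(planned)
        if iface.ip not in expected:
            problems.append((label, str(iface.ip), str(expected)))
    return problems


# Hypothetical layout: original private net on eth1, new subnet on eth1:1.
plan = {"eth1": "10.0.0.0/24", "eth1:1": "10.0.1.0/24"}

# The new subnet's address configured on eth1 (the ":1" was left off).
broken = [("eth1", "10.0.1.1/24")]
# The intended configuration.
fixed = [("eth1", "10.0.0.1/24"), ("eth1:1", "10.0.1.1/24")]

if __name__ == "__main__":
    print(misplaced_addresses(plan, broken))
    print(misplaced_addresses(plan, fixed))
```

A check like this only compares labels against an intended plan. It says nothing about routing or NAT rules, which would still need to be verified separately.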

There's still a chance that something else was causing the issues but I think it was probably my messed up config.

Sorry about that.

-Pat


Patrick Meenan

Apr 6, 2015, 10:36:19 AM

to httpa...@googlegroups.com

TL;DR: network issues are ongoing (ticket open with the NOC to investigate), and the current crawl is also running late as a result. Turns out it wasn't related to my changes after all.

Charlie Clark

Apr 6, 2015, 10:39:43 AM

to httpa...@googlegroups.com

On .04.2015, 16:36, Patrick Meenan <patm...@gmail.com> wrote:

> TL;DR: network issues are ongoing (ticket open with the NOC to
> investigate) but the current crawl is also running late as a result.
> Turns out it wasn't related to my changes after all.

Thanks for the update: these things happen. The stats for the relevant
runs will need updating to reflect the crawls that didn't succeed.

I don't know if it's related but we noticed problems with one of the
monitors that we use (not via WPT but one of the partners) that appeared
to be related to changes at the data centre.

Patrick Meenan

Apr 6, 2015, 11:03:06 AM

to httpa...@googlegroups.com

A WPT monitor? The HA infrastructure is completely separate from WPT. It all runs out of a data center on the west coast and is on dedicated hardware. It is a stand-alone private instance that is running on hardware that archive.org owns.

Charlie Clark

Apr 6, 2015, 11:25:47 AM

to httpa...@googlegroups.com

On .04.2015, 17:02, Patrick Meenan <patm...@gmail.com> wrote:

> A WPT monitor? The HA infrastructure is completely separate from WPT.
> It all runs out of a data center on the west coast and is on dedicated
> hardware. It is a stand-alone private instance that is running on
> hardware that archive.org owns.

I understand that, so it's probably unrelated. It's a completely different
monitor but it turned out that there were problems at a data centre that
were affecting all of the monitors running there. The cause was apparently
a change in the configuration but I don't know the real details.

Patrick Meenan

Apr 6, 2015, 12:20:51 PM

to httpa...@googlegroups.com

OK, fingers crossed I actually found and fixed the root cause this time. We updated the server late last year and it appears that in the process the sysctl config settings that we use to tune the NAT were deprecated and replaced with different settings (in particular, the netfilter connection tracking limit). I just updated all of them to their newer counterparts and things are back to humming along. Watching the tracked connection count, it looks like we hover right near the 65535 default limit under normal operation, which is probably why it didn't get noticed (though I'm not sure what triggered the death spirals in the last 2 crawls).
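For anyone who wants to watch the same counter, here is a minimal sketch that compares the tracked connection count against the configured limit. It assumes a Linux host with the nf_conntrack module loaded, where the values are exposed under /proc/sys/net/netfilter/. The 0.9 alert threshold and the helper names are hypothetical, not something taken from the HA server.

```python
from pathlib import Path

# Standard procfs locations when the nf_conntrack module is loaded.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def read_int(path):
    """Return the integer stored in a procfs file, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def conntrack_usage(count, limit):
    """Fraction of the connection tracking table currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def near_limit(count, limit, threshold=0.9):
    """True when usage is at or above the (hypothetical) alert threshold."""
    return conntrack_usage(count, limit) >= threshold


if __name__ == "__main__":
    count, limit = read_int(COUNT_PATH), read_int(MAX_PATH)
    if count is None or limit is None:
        print("conntrack counters not available on this host")
    else:
        usage = conntrack_usage(count, limit)
        print(f"{count}/{limit} tracked connections ({usage:.1%})")
        if near_limit(count, limit):
            print("warning: close to the connection tracking limit")
```

Run periodically (from cron, say), a check like this would flag a table that sits close to its limit before a crawl pushes it over.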
