Migration to SN testing node urgent


Sebastian Silva

Oct 15, 2014, 12:34:05 PM
to Aleksey Lim, sugar-...@googlegroups.com, Sugar Labs Systems
Alsroot,
Greetings,
We're now seeing downtime about twice a day in the production instance of the Sugar Network central node.

Every time I have to log into jita and issue:

 sudo /etc/init.d/sugar-network stop
 ps -o pid,comm,user,thcount -u www-data | wc -l
 # ^^ gives an idea of traffic; drops to ~12 once all SN threads die a few seconds later
 sudo /etc/init.d/sugar-network start

It's probably just the traffic, but it also seems to have gotten worse since the downtime jita had some time ago.

It's stressful for editors/admins and annoying for users.

As I understand it, the new node implementation does not have this problem.

I can do the migration myself, if you provide me with some details:

* the procedure for migrating the database
* the current release tarballs/sources to put into production

I think it's even better if I do it; that way I'll get a better sense of how the clockwork ticks. I'll attempt to document as I go. Maybe we can set up some uptime monitoring this time around (cc: systems@ for this purpose).
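For the monitoring piece, even a cron'd probe would be a start. A rough sketch, where the URL and the alert address are placeholders rather than any real setup:

 #!/bin/sh
 # availability probe, meant to run from cron every few minutes
 URL="http://node.sugarlabs.org/"
 ALERT="sysadmin@example.org"   # placeholder; point at whoever should be notified
 if ! curl -fsS --max-time 10 "$URL" >/dev/null 2>&1; then
     echo "Sugar Network node unreachable at $(date -u)" | mail -s "SN node down" "$ALERT"
 fi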

It would be helpful to coordinate a time when we are both online for this task. For me a good time would be starting Friday the 17th at 21:00 (UTC-5 / Bogota), but I'm open to accommodating your timezone/schedule/convenience. That way we can test over the weekend and have a working service by Monday the 20th.

Let me know so we can announce the planned maintenance downtime.

We've gotten this far and have engaged some active users. I think there is a bright future for Sugar Network. We just need to keep rowing. Thanks for your commitment.

--
Sebastian Silva
"icarito" #sugar #somosazucar (freenode IRC)
Somos Azúcar - Fuente Libre - Sugar Labs

"Las maestras y los maestros democráticos intervenimos en el mundo a través del cultivo de la curiosidad" - P.Freire

Aleksey Lim

Oct 16, 2014, 12:25:31 AM
to Sebastian Silva, sugar-...@googlegroups.com, Sugar Labs Systems
The issue is not with the SN node in particular but with the Apache
connection pool; next time, restart Apache. Last time we decided not to
migrate node.sl.o to an intermediate code base release. So, if the only
issue is the node being unavailable, it could be started on a separate IP
outside of Apache.

If I got it right, it is possible to grant jita a new external IP,
so we need to ask Bernie. Then:

1. node.sugarlabs.org DNS should be re-pointed to the new IP

2. /srv/sugar-network/.config/sugar-network/config should be tuned:

[node]
host = <NEWIP>
port = 80

3. /etc/apache2/sites-enabled/node.sugarlabs.org should be tuned:

<VirtualHost *:80>
    ServerName node.sugarlabs.org
    ProxyPass / http://<NEWIP>:80/
    ProxyPassReverse / http://<NEWIP>:80/
</VirtualHost>

4. sugar-network-node should be restarted and Apache reloaded.
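A minimal sketch of that last step, assuming the Debian init scripts already used earlier in the thread (adjust the names if they differ):

 sudo apache2ctl configtest               # sanity-check the new vhost first
 sudo /etc/init.d/sugar-network restart   # restart the SN node on <NEWIP>
 sudo /etc/init.d/apache2 reload          # pick up the ProxyPass change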

--
Aleksey

Sebastian Silva

Oct 16, 2014, 1:06:27 AM
to Bernie Innocenti, Aleksey Lim, sugar-...@googlegroups.com, Sugar Labs Systems
Very well then,
This is a good solution.
Thanks Alsroot.

Dear systems@ and Bernie,
We need to use another IP address for the Sugar Network. Is this possible, and could you please indicate which one? I would also like to request access to update the DNS records for node.sugarlabs.org, or assistance with that step of the procedure outlined by Aleksey.

Thanks in advance for your help.
--
Sebastian Silva
"icarito" #sugar #somosazucar (freenode IRC)
Somos Azúcar - Fuente Libre - Sugar Labs

"Las maestras y los maestros democráticos intervenimos en el mundo a través del cultivo de la curiosidad" - P.Freire

Sebastian Silva

Oct 18, 2014, 2:38:40 PM
to Aleksey Lim, sugar-...@googlegroups.com, Sugar Labs Systems
Hi Aleksey,
Bernie and I were poking at jita + Apache + the Sugar Network node last night.
After analyzing the issue, Bernie concluded it's actually the Sugar Network node leaking file descriptors.
I'll try to set up a test script and monitoring this weekend.
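As a first measurement, something along these lines could show whether the descriptor count really grows over time (a sketch only; the process name passed to pgrep is an assumption, adjust it to whatever the node actually runs as):

 # watch the node's open file descriptors; run as root or the service user
 PID=$(pgrep -f sugar-network-node | head -n 1)
 while true; do
     echo "$(date -u +%H:%M:%S) $(ls /proc/"$PID"/fd | wc -l) open fds"
     sleep 60
 done
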
Hopefully with your help we can find this leak too!
Regards,
--
Sebastian Silva
"icarito" #sugar #somosazucar (freenode IRC)
Somos Azúcar - Fuente Libre - Sugar Labs

"Las maestras y los maestros democráticos intervenimos en el mundo a través del cultivo de la curiosidad" - P.Freire

On Wed, Oct 15, 2014 at 11:26 PM, Aleksey Lim <als...@sugarlabs.org> wrote:

Aleksey Lim

Oct 20, 2014, 4:12:22 PM
to Bernie Innocenti, Sebastian Silva, sugar-...@googlegroups.com, Sugar Labs Systems
On Wed, Oct 15, 2014 at 11:10:57PM -0700, Bernie Innocenti wrote:
> On 15/10/14 22:06, Sebastian Silva wrote:
> > Very well then,
> > This is a good solution.
> > Thanks Alsroot.
> >
> > Dear systems@ and Bernie,
> > We need to use another IP address for the Sugar Network. Is this
> > possible, could you please indicate which one? I also would like to
> > request access to signing the DNS records for node.sugarlabs.org or
> > assistance in this step, from the following procedure outlined by Aleksey.
> >
> > Thanks in advance for your help.
>
> We do have spare IPs, but first I'd like to understand why Apache is
> tipping over using a single IP and would work better with 2 IPs.
>
> I assume you don't have a problem of too many idle connections lingering
> around, because a single IP can take tens of thousands. So it's probably
> Apache rejecting connections when you hit some configurable limits
> (MaxClients, ServerLimit, etc.), which are meant to protect the server
> from DoS and overload conditions.
>
> If the limits are set too low, we can just increase them, but bypassing
> them altogether would be unwise. If, for example, at peak time we
> receive 1000 simultaneous connections, but the server has enough memory
> only to handle 800 connections, the system will start thrashing and
> OOMing, causing *all* users to be permanently unable to connect until
> the processes are restarted. Under some conditions, the kernel might
> even kill some vital process and require a manual reboot.
>
> A more scientific approach for tuning things would be:
>
> - Set up good graphs for memory usage, CPU usage, number of active
> connections, number of 500 errors served, etc. This can be done with Munin.
>
> - Send test traffic until the system overloads. Ideally we'd do this in
> a test environment without disrupting real traffic, but that's a bit
> complicated.
>
> - See which resource is topping: Is it memory? Is it disk I/O?
>
> - What's the maximum QPS (queries per second) you can get? Is it
> plenty more than what you get at peak time? If so, you're done.
>
> - If the QPS is not sufficient, provision the VM with more resources as
> needed. If you can't, consider sharding the service on multiple machines.
>
> Remember not to leave the limits disabled after the load test. It will
> just cripple your server on the first spike of traffic.
>
> Again, adding IPs is possible, but before doing so try figuring out
> what's causing the outage. I'm available on IRC to help debug this.
>
> Also, resist the temptation of putting application servers written in
> Python and Ruby directly on the front line. They also speak HTTP, but
> typically they're insufficiently protected against various kinds of
> attacks, they have bad support for SSL, and they're very slow at serving
> static files. Plus, you'd lose Apache's logging and monitoring
> features, which can help with debugging.

Sorry for the long delay. The issue is that the current clients in the
field and the production server suffer from a bad design decision: clients
open long-lived connections. Last time I tried tuning
MaxClients/ServerLimit/ProxyTimeout and it seemed to work. Since Apache
became unresponsive again, the idea was to run the SN node (or Apache) on
a separate IP so it would not affect the connection pool for the other
Apache sites. But since Gitorious (previously the most visited site) is
no longer relevant after all git projects moved to other hosting, I guess
it is OK to experiment with the current Apache.
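Before re-tuning, a few quick checks along these lines could show whether the pool really is exhausted; this is only a sketch and assumes a Debian-style Apache layout:

 apache2ctl -V | grep -i mpm                                      # which MPM is in use
 grep -ri 'MaxClients\|ServerLimit\|ProxyTimeout' /etc/apache2    # current limits
 netstat -tan | awk '$4 ~ /:80$/ && $6 == "ESTABLISHED"' | wc -l  # open client connections
 ps -C apache2 -o pid= | wc -l                                    # worker processes in use
 ab -n 500 -c 50 http://node.sugarlabs.org/                       # rough load probe, off-peak only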

--
Aleksey