memcache problem

r.gi...@sheffield.ac.uk

Feb 7, 2017, 12:58:10 PM
to SimpleSAMLphp
We run two load-balanced SimpleSAMLphp servers with memcache to store the session data:

    'memcache_store.servers' => array(
        array(
            array('hostname' => 'sssso1'),
        ),
        array(
            array('hostname' => 'sssso2'),
        ),
    ),

We recently needed to upgrade the servers, so we created two new ones, sssso3 and sssso4. These were added to the memcache server array in the same way as above, one in each of four server groups, with the same configuration on all four servers. For a time we had all four servers in use to make sure all sessions were stored on all servers. We then took sssso1 and sssso2 out of the load balancer, and things worked fine with all SSO traffic going to sssso3 and sssso4. Happy that everything was working, we switched off sssso1, and errors started appearing in the log file, e.g.
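
The four-group layout described above would presumably look something like the fragment below. This is a sketch reconstructed from the description, not the actual configuration file; only the hostnames come from the message, and default ports are assumed.

```php
'memcache_store.servers' => array(
    array(
        array('hostname' => 'sssso1'),
    ),
    array(
        array('hostname' => 'sssso2'),
    ),
    array(
        array('hostname' => 'sssso3'),
    ),
    array(
        array('hostname' => 'sssso4'),
    ),
),
```

With one server per group, each group is intended to hold a full copy of the session data.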

Feb  6 15:00:25 sssso3 simplesamlphp[40099]: 3 [263d69416d] SimpleSAML_Error_Exception: Error 8 - MemcachePool::get(): Server sssso1 (tcp 11211, udp 0) failed with: Connection refused (111)

with a traceback. There were no complaints from users until we turned off sssso2 fifteen minutes later, at which point people complained of not being able to log in to Google. The logs were now showing problems connecting to both of the switched-off servers, e.g.

Feb  6 15:15:33 sssso3 simplesamlphp[40452]: 3 [00187dd55e] SimpleSAML_Error_Exception: Error 8 - MemcachePool::set(): Server sssso1 (tcp 11211, udp 0) failed with: Connection timed out (110)
Feb  6 15:15:33 sssso3 simplesamlphp[40452]: 3 [00187dd55e] SimpleSAML_Error_Exception: Error 8 - MemcachePool::get(): Server sssso2 (tcp 11211, udp 0) failed with: Connection refused (111)

The failure to connect is not surprising, but I am surprised that it is reported in such a way with a traceback in the logfile, because the unavailability of a memcache server should not be a problem; it just tries the next one.

What I am really asking about, however, is the fact that users were unable to log in to Google after we turned off the second server, sssso2. Does this suggest there is a problem with memcache on the new servers sssso3 and sssso4? Things are running OK at the moment because we turned sssso1 and sssso2 back on. Does this make sense?

Thank you very much

Richard



pat...@cirrusidentity.com

Feb 9, 2017, 12:38:25 PM
to SimpleSAMLphp
What version of SSP are you running?

When I shut down all our memcache servers in test I get different messages than you do.

Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] SimpleSAML_Error_Error: MEMCACHEDOWN
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] Backtrace:
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 7 /var/simplesamlphp/lib/SimpleSAML/Memcache.php:116 (SimpleSAML_Memcache::get)
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 6 /var/simplesamlphp/lib/SimpleSAML/Store/Memcache.php:42 (SimpleSAML_Store_Memcache::get)
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 5 /var/simplesamlphp/lib/SimpleSAML/SessionHandlerStore.php:52 (SimpleSAML_SessionHandlerStore::loadSession)
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 4 /var/simplesamlphp/lib/SimpleSAML/Session.php:325 (SimpleSAML_Session::getSession)
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 3 /var/simplesamlphp/lib/SimpleSAML/Session.php:245 (SimpleSAML_Session::getSessionFromRequest)
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 2 /var/simplesamlphp/lib/SimpleSAML/Auth/Simple.php:54 (SimpleSAML_Auth_Simple::isAuthenticated)
Feb  3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19] 1 /var/simplesamlphp/modules/core/www/authenticate.php:34 (require)

What version of memcache are you using? (pecl info memcache)

Did you set memcache.allow_failover = Off in php.ini?

How long did you run all 4 memcache servers? Maybe the users that had issues had their sessions stored prior to you adding new servers?

-Patrick

Richard Gilbert

Feb 13, 2017, 9:00:38 AM
to simple...@googlegroups.com
Hi Patrick,

Thank you very much for getting back to me.

On 9 February 2017 at 17:38, <pat...@cirrusidentity.com> wrote:
> What version of SSP are you running?

1.14.7

> When I shut down all our memcache servers in test I get different
> messages than you.
>
> Feb 3 20:42:22 ip-192-168-66-56 ssp-idp[6517]: 3 [TR21b12e19]
> SimpleSAML_Error_Error: MEMCACHEDOWN

There were no MEMCACHEDOWN errors (but we didn't shut down all the
memcache servers).

> What version of memcache are you using? (pecl info memcache)

3.0.8 (beta)

> Did you set memcache.allow_failover = Off in php.ini?

No, I failed to do this. phpinfo shows it is set to 1. When I
noticed that I hadn't turned it off, I thought it wouldn't matter as
there is only one server in each server group, and that SSP is
handling the failover itself. Is that the problem?

> How long did you run all 4 memcache servers? Maybe the users that had
> issues had their sessions stored prior to you adding new servers?

Many days. The two new servers, which are now the only ones in the
load balancer for SSO, have been running for months so should both
have all sessions, if they are set up properly.

As I said, when we turned off one old server (only used in the
memcache config) nobody noticed. When we turned off the second, users
started reporting problems. I would have expected SSP to failover to
the third and fourth servers which were still running, being the
current SSP servers themselves.

So as not to clutter up the message, I have attached the log extract
for one of the sessions after the second memcache server was turned
off.
--
Richard Gilbert
Corporate Information and Computing Services
University of Sheffield, Sheffield, S10 2FN, UK
Phone: +44 114 222 3028
00187dd55e.txt

Richard Gilbert

Feb 13, 2017, 11:57:02 AM
to simple...@googlegroups.com
Hi Patrick,

Please ignore the log extract I attached to my previous message. I
don't think it illustrated the problem. I have attached one that
does.

Richard
25910351e0.txt

Patrick Radtke

Feb 13, 2017, 12:05:34 PM
to simple...@googlegroups.com
What is your value for the PHP setting 'error_reporting'?
The memcache extension raises the message you are seeing at the notice level.
So if 'error_reporting' is configured to include notices, you are going to get that message in your logs even though it isn't a real error condition. At least that is my theory.
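
A minimal sketch of that theory follows. The "Error 8" in the log lines quoted earlier matches PHP's E_NOTICE level (value 8). The handler below is hypothetical and is not SimpleSAMLphp's actual error handler; `NoticeLog` and `logReportedError` are made-up names. It only shows how the same notice is either recorded or dropped depending on `error_reporting`.

```php
<?php
// Hypothetical in-memory log, standing in for the real log file.
class NoticeLog
{
    public static $entries = array();
}

// Record an error only when its level is enabled in error_reporting.
function logReportedError($errno, $errstr)
{
    if ((error_reporting() & $errno) !== 0) {
        NoticeLog::$entries[] = "Error $errno - $errstr";
    }
    return true; // treat as handled so PHP prints nothing itself
}
set_error_handler('logReportedError');

// trigger_error() can only raise E_USER_* levels, so E_USER_NOTICE (1024)
// stands in here for the E_NOTICE (8) raised by the memcache extension.
error_reporting(E_ALL);
trigger_error('Server sssso1 failed with: Connection refused (111)', E_USER_NOTICE);

// With notices masked out, the same failure leaves no trace in the log.
error_reporting(E_ALL & ~E_USER_NOTICE);
trigger_error('Server sssso2 failed with: Connection refused (111)', E_USER_NOTICE);

echo count(NoticeLog::$entries), "\n"; // prints 1
```

If this is what is happening, excluding notices from `error_reporting` would quieten the log but would not by itself explain the failed logins.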

More thoughts below
 

> There were no MEMCACHEDOWN errors (but we didn't shut down all the
> memcache servers).

Yup. It just seemed as though SSP thought all your memcache servers were down, given that it stopped working when you removed the original memcache servers.


> Did you set  memcache.allow_failover = Off in php.ini?

> No, I failed to do this.  phpinfo shows it is set to 1.  When I
> noticed that I hadn't turned it off, I thought it wouldn't matter as
> there is only one server in each server group, and that SSP is
> handling the failover itself.  Is that the problem?

No, I think your understanding is correct. My interpretation is the same: the setting would only matter if you had multiple servers in a server group.
 

> How long did you run all 4 memcache servers? Maybe the users that had
> issues had their sessions stored prior to you adding new servers?

> Many days.  The two new servers, which are now the only ones in the
> load balancer for SSO, have been running for months so should both
> have all sessions, if they are set up properly.
>
> As I said, when we turned off one old server (only used in the
> memcache config) nobody noticed.  When we turned off the second, users
> started reporting problems.  I would have expected SSP to failover to
> the third and fourth servers which were still running, being the
> current SSP servers themselves.

Yes, that is how it behaves for us. In SSP, Memcache.php calls

$memcache->addServer($hostname, $port, true, $weight, $timeout, $timeout, true); 

and doesn't check whether true or false is returned. So in theory the 'adding' of the 3rd and 4th servers can fail and you wouldn't notice.
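
A minimal sketch of what checking that return value could look like. `addServerChecked` and `FakePool` are hypothetical names, not part of SimpleSAMLphp; the stub only stands in for a real `Memcache` object so the example runs without the extension. The argument order is copied from the call quoted above.

```php
<?php
// Hypothetical wrapper: add one server to a pool and report failure
// instead of silently discarding the return value.
function addServerChecked($pool, $hostname, $port = 11211, $weight = 1, $timeout = 2)
{
    // Same argument order as the addServer() call quoted above.
    $ok = $pool->addServer($hostname, $port, true, $weight, $timeout, $timeout, true);
    if ($ok !== true) {
        error_log("memcache: could not add server $hostname:$port");
    }
    return $ok === true;
}

// Stand-in for a Memcache object; it rejects one hostname to simulate
// a failed addServer() call. Extra arguments are simply ignored.
class FakePool
{
    public $added = array();

    public function addServer($hostname, $port)
    {
        if ($hostname === 'sssso4') {
            return false;
        }
        $this->added[] = "$hostname:$port";
        return true;
    }
}

$pool = new FakePool();
$results = array();
foreach (array('sssso3', 'sssso4') as $host) {
    $results[$host] = addServerChecked($pool, $host);
}
```

As far as I know, `addServer()` does not open the network connection until it is needed, so a `true` return would not prove that the server is reachable. It would only rule out a failure at the point of registration.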

Have you tried (when all 4 servers are configured) to manually verify a session is present on all 4? (using telnet for example)
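
The same check can be scripted rather than typed into telnet. The sketch below speaks the memcached text protocol directly (`get <key>`, answered by a `VALUE ...` line on a hit or a bare `END` on a miss). The function names are hypothetical, and the key is a placeholder: the real session key depends on how SSP builds it in your version, so look one up on a server first.

```php
<?php
// True when a memcached text-protocol reply to "get" carries a value.
function replyHasValue($reply)
{
    return strncmp($reply, 'VALUE ', 6) === 0;
}

// Ask one server for a key. Returns true or false, or null if unreachable.
function keyPresentOn($host, $key, $port = 11211, $timeout = 2)
{
    $conn = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($conn === false) {
        return null;
    }
    stream_set_timeout($conn, $timeout);
    fwrite($conn, "get $key\r\n");
    $firstLine = fgets($conn);
    fclose($conn);
    return $firstLine !== false && replyHasValue($firstLine);
}

// Print one line per host: present, missing, or unreachable.
function reportKey(array $hosts, $key)
{
    foreach ($hosts as $host) {
        $present = keyPresentOn($host, $key);
        $state = $present === null ? 'unreachable' : ($present ? 'present' : 'missing');
        echo "$host: $state\n";
    }
}

// Example call (left commented out so the sketch makes no network requests):
// reportKey(array('sssso1', 'sssso2', 'sssso3', 'sssso4'), 'PLACEHOLDER_KEY');
```

A key that shows as present on sssso1 and sssso2 but missing on sssso3 and sssso4 would suggest that writes are not reaching the new server groups.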

-Patrick

pa...@thisispads.me.uk

Jul 4, 2017, 10:17:08 AM
to SimpleSAMLphp
Hi there,

I have a similar problem. If any server is unavailable for whatever reason, a new session cannot be started, because the memcache call raises the following error:

MemcachePool::set(): Server localhost (tcp 11211, udp 0) failed with: Connection refused (111)

The only way I have found to get around this is to suppress the error (by prepending @) on these two calls:

$serializedInfo = $server->get($key);

and

$server->set($key, $savedInfoSerialized, 0, $expire);

My question is: why is it designed this way? It's a valid and likely scenario that you could lose a memcache server at the point a new user needs to log in and create a session. Shouldn't the code ignore connection problems unless no servers are available?

My SimpleSAMLphp version is 1.14.11 and memcache is 3.0.8. Here is an example config:

    'memcache_store.servers' => array(
        array(
            array('hostname' => 'localhost', 'port' => 11211),
            array('hostname' => 'localhost', 'port' => 11212),
        ),
        array(
            array('hostname' => 'localhost', 'port' => 11311),
            array('hostname' => 'localhost', 'port' => 11312),
        ),
    ),

Stopping any single local memcache server causes the error above.
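
The behaviour asked for above (ignore an individual connection problem, fail only when no server group responds) could be sketched roughly as follows. This is not SimpleSAMLphp's code: `quietGet`, `getFromAnyGroup` and `FakeGroup` are hypothetical names, and the stub raises a user-level notice to imitate the notice the memcache extension raises for a dead server. The sketch also returns the first copy it finds, where a real implementation might compare copies and pick the newest.

```php
<?php
// Call $group->get($key), turning any PHP error raised during the call
// into a "group failed" flag instead of passing it to the global handler.
function quietGet($group, $key, &$failed)
{
    $failed = false;
    set_error_handler(function () use (&$failed) {
        $failed = true;
        return true;
    });
    try {
        $result = $group->get($key);
    } finally {
        restore_error_handler();
    }
    return $result;
}

// Return the first value found (null on a miss); throw only if every
// group failed.
function getFromAnyGroup(array $groups, $key)
{
    $value = null;
    $failures = 0;
    foreach ($groups as $group) {
        $result = quietGet($group, $key, $failed);
        if ($failed) {
            $failures++;
        } elseif ($result !== false && $value === null) {
            $value = $result;
        }
    }
    if ($groups && $failures === count($groups)) {
        throw new RuntimeException('all memcache server groups are unavailable');
    }
    return $value;
}

// Stand-in for one memcache server group, so the sketch runs without
// the extension. A "down" group raises a notice and returns false.
class FakeGroup
{
    private $data;
    private $up;

    public function __construct(array $data, $up = true)
    {
        $this->data = $data;
        $this->up = $up;
    }

    public function get($key)
    {
        if (!$this->up) {
            trigger_error('Server failed with: Connection refused (111)', E_USER_NOTICE);
            return false;
        }
        return isset($this->data[$key]) ? $this->data[$key] : false;
    }
}

$groups = array(
    new FakeGroup(array(), false),            // stopped server
    new FakeGroup(array('session' => 'abc')), // healthy server
);
$value = getFromAnyGroup($groups, 'session');
```

Compared with prepending `@`, a scoped handler like this keeps the failure visible to the calling code, so it can still tell "one group is down" apart from "everything is down".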