SSL_connect and SSL_accept deadlock!

Md Lazreg

Nov 2, 2010, 9:25:39 PM

I have an SSL client that connects to an SSL server. The server is able to process thousands of clients just fine on a variety of platforms [Windows/Linux/HP/Solaris] for long periods of time.

The problem that is driving me nuts is that from time to time, roughly once every 24 hours, some client fails to connect to the server during the handshake phase. This happens only on Linux and HP. Other platforms do not experience this issue.

Here is a sketch of my client and server code. Please note that I am using non-blocking sockets:

common code:

---------------------

int ssl_retry(int ret)
{
    int r;
    fd_set fds;
    struct timeval tv, *ptv = 0;

    tv.tv_sec = 1; /* do a select for 1 second each time */
    tv.tv_usec = 0;
    ptv = &tv;
    FD_ZERO(&fds);

    switch (SSL_get_error(m_ssl, ret))
    {
    case SSL_ERROR_NONE:
        r = 1;
        break;
    case SSL_ERROR_WANT_READ:
        FD_SET(m_sock_fd, &fds);
        r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
        if (r <= 0 && (Errno == EAGAIN || Errno == EINTR)) /* if we timed out with EAGAIN try again */
        {
            r = 1;
        }
        break;
    case SSL_ERROR_WANT_WRITE:
        FD_SET(m_sock_fd, &fds);
        r = select(m_sock_fd + 1, 0, &fds, 0, ptv);
        if (r <= 0 && (Errno == EAGAIN || Errno == EINTR)) /* if we timed out with EAGAIN try again */
        {
            r = 1;
        }
        break;
    case SSL_ERROR_ZERO_RETURN: /* The socket closed */
        r = 0;
        break;
    case SSL_ERROR_SYSCALL:
    case SSL_ERROR_SSL:
        r = -1;
        break;
    default:
        r = -1;
    }
    return r;
}

client code:

-----------------

int time0 = time(NULL);
timeout = 10; /* seconds */
while (t < timeout)
{
    r = SSL_connect(m_ssl);
    if (r > 0)
    {
        break;
    }
    r = ssl_retry(r);
    if (r <= 0)
    {
        break;
    }
    t = time(NULL) - time0;
}
if (t >= timeout)
{
    /* I timed out :( */
}
if (r > 0)
{
    /* We are connected. Do work. */
}
else
{
    /* Some kind of an issue. */
}

server code:

-----------------

int time0 = time(NULL);
timeout = 10; /* seconds */
while (t < timeout)
{
    r = SSL_accept(m_ssl);
    if (r > 0)
    {
        break;
    }
    r = ssl_retry(r);
    if (r <= 0)
    {
        break;
    }
    t = time(NULL) - time0;
}
if (t >= timeout)
{
    /* I timed out :( */
}
if (r > 0)
{
    /* We are connected. Do work. */
}
else
{
    /* Some kind of an issue. */
}

When this problem happens, both the client and the server end up at the red line above, "I timed out".

With some debugging I see that when this problem hits, both the client and the server repeatedly enter the green section above. Each of them seems to want to perform a read, since SSL_get_error reports SSL_ERROR_WANT_READ after both the SSL_connect and the SSL_accept calls.

This looks to me like a deadlock, where both my server and my client want to do a READ until both of them time out!

Can someone please suggest what is wrong with the above code and how this deadlock is possible? I am using openssl-1.0.0a.

mdlazreg

Nov 3, 2010, 8:42:58 AM

Sorry, after I posted my earlier message I realized that not everyone has a
mail reader that supports colors.

The green section I refer to there is this:

case SSL_ERROR_WANT_READ:
    FD_SET(m_sock_fd, &fds);
    r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
    if (r <= 0 && (Errno == EAGAIN || Errno == EINTR)) /* if we timed out with EAGAIN try again */
    {
        r = 1;
    }
    break;

This is the section of code that both the server and the client go
through many times before they finally time out. I do not know why
both of them want a READ during their SSL handshake...
Fortunately this happens very rarely, but it does happen!

Thank you for any suggestions, or for pointing out a logic error in
that code.

David Schwartz

Nov 3, 2010, 9:12:37 AM

On 11/2/2010 6:25 PM, Md Lazreg wrote:

> r=select(m_sock_fd + 1, &fds, 0, 0, ptv);
> if (r <= 0 && (Errno == EAGAIN || Errno == EINTR))/*if we timed
> out with EAGAIN try again*/
> {
> r = 1;
> }

This code is broken. If 'select' returns zero, checking errno is a
mistake. (What is 'Errno' anyway?)

> r = SSL_connect(m_ssl);
> if (r > 0)
> {
> break;
> }
> r = ssl_retry(r);
> if ( r <= 0)
> {
> break;
> }
> t = time(NULL) - time0;
> }

Err, what? Is an ssl_retry return of zero supposed to indicate a fatal
error? The code in ssl_retry doesn't seem to follow this rule. (For
example, consider if 'select' returns zero and errno is zero. That would
indicate a timeout, not a fatal error.)

> int time0 = time(NULL);
> timeout=10 seconds;
> while (t<timeout)
> {
> r = SSL_accept(m_ssl);
> if (r > 0)
> {
> break;
> }
> r = ssl_retry(r);
> if ( r <= 0)
> {
> break;
> }
> t = time(NULL) - time0;
> }
> if (t>=timeout)

There is no code to initially set 't'.

Also, an overall comment: Maybe it's just my taste, but your code seems
to have a 'worst of both worlds' quality to it. It uses non-blocking
sockets, but then finds clever ways to make the non-blocking operations
act like blocking ones.

Is the server multithreaded? If so, I could see this as mere laziness
(or, efficient use of coding resources to be more charitable) rather
than actual poor design.

DS

______________________________________________________________________
OpenSSL Project http://www.openssl.org
User Support Mailing List openss...@openssl.org
Automated List Manager majo...@openssl.org
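
A minimal sketch of the point made in this reply about 'select': a return of zero means a timeout and errno carries no information, while errno is only meaningful when the return value is negative. The helper name wait_fd and the wait_result codes are hypothetical, invented for this illustration; they are not part of OpenSSL or of the posted code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical result codes for this sketch (not part of OpenSSL). */
enum wait_result { WAIT_READY = 1, WAIT_TIMEOUT = 0, WAIT_ERROR = -1 };

/*
 * Wait until fd is readable (want_write == 0) or writable (want_write != 0),
 * for at most timeout_sec seconds. Assumes 0 <= fd < FD_SETSIZE.
 */
static enum wait_result wait_fd(int fd, int want_write, long timeout_sec)
{
    for (;;) {
        fd_set fds;
        struct timeval tv;
        int r;

        /* Rebuild the set and the timeout each time: select() may modify both. */
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        r = select(fd + 1,
                   want_write ? NULL : &fds,
                   want_write ? &fds : NULL,
                   NULL, &tv);
        if (r > 0)
            return WAIT_READY;
        if (r == 0)
            return WAIT_TIMEOUT;  /* timeout: errno is not meaningful here */
        if (errno != EINTR)
            return WAIT_ERROR;    /* r < 0: only now is errno valid */
        /* Interrupted by a signal: retry (this sketch restarts the full timeout). */
    }
}
```

A handshake loop could call wait_fd(m_sock_fd, 0, 1) after SSL_ERROR_WANT_READ and wait_fd(m_sock_fd, 1, 1) after SSL_ERROR_WANT_WRITE, treating WAIT_TIMEOUT as "check the overall deadline and retry" rather than as a fatal error.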

mdlazreg

Nov 3, 2010, 11:40:43 AM

Checking errno for EAGAIN was my attempt to fix this issue... So
you can ignore that check; the problem still persists.

Errno is the usual errno (just a wrapper for platform-porting
purposes).
The code sets 't' to 0 initially (sorry, I left that line out of the
stripped-down code I posted).

If select returns 0 and errno is 0, then you are right: it is
technically a timeout, and that is exactly what was happening. I
tried to "fix" it by checking errno.

Now if I heed your advice and remove the errno check [which was my
original code], then when the problem hits I see that both the client
and the server get 0 back from their select in the SSL_ERROR_WANT_READ
code block.

Even if I increase the select timeout to 10 seconds, both the client
and the server time out on that select line right after they
report SSL_ERROR_WANT_READ...

My question is: under what conditions would both the server and the client
be waiting on SSL_ERROR_WANT_READ, and how do I get out of that deadlock
state?

Yes, my server is multithreaded, and although I am sure my design is
not the best, it has been serving thousands of clients on different
platforms, sometimes for days, without dropping a single connection.
Then, seemingly at random, some of my clients [only on Linux and HP platforms]
report this handshake issue! Debugging shows that when this
happens, both the client and the server are timing out on the select
line right after the SSL_ERROR_WANT_READ.


Jeffrey Walton

Nov 3, 2010, 1:17:14 PM

On Wed, Nov 3, 2010 at 9:12 AM, David Schwartz <dav...@webmaster.com> wrote:
> On 11/2/2010 6:25 PM, Md Lazreg wrote:
>
>> r=select(m_sock_fd + 1, &fds, 0, 0, ptv);
>> if (r <= 0 && (Errno == EAGAIN || Errno == EINTR))/*if we timed
>> out with EAGAIN try again*/
>> {
>> r = 1;
>> }
>
> This code is broken. If 'select' returns zero, checking errno is a mistake.
> (What is 'Errno' anyway?)
>

> [SNIP]

>
> Is the server multithreaded? If so, I could see this as mere laziness (or,
> efficient use of coding resources to be more charitable) rather than actual
> poor design.

lol....


mdlazreg

Nov 5, 2010, 2:13:58 PM

Hi David,

I followed your advice and here is my new client code:

bool ssl_connection()
{
    int r; /* result of SSL_connect / select */
    bool done = false;
    bool closed = false;
    bool err = false;
    bool timeout = false;
    fd_set fds;
    struct timeval tv, *ptv;

    ptv = &tv;
    ptv->tv_sec = 30; /* The maximum seconds I am willing to wait */
    ptv->tv_usec = 0;
    while (!done && !closed && !err && !timeout)
    {
        r = SSL_connect(m_ssl);
        switch (SSL_get_error(m_ssl, r))
        {
        case SSL_ERROR_NONE:
            done = true;
            break;
        case SSL_ERROR_WANT_READ:
            FD_ZERO(&fds);
            FD_SET(m_sock_fd, &fds);
            r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
            if (r < 0)
            {
                err = true;
            }
            else if (r == 0)
            {
                timeout = true;
            }
            break;
        case SSL_ERROR_WANT_WRITE:
            FD_ZERO(&fds);
            FD_SET(m_sock_fd, &fds);
            r = select(m_sock_fd + 1, 0, &fds, 0, ptv);
            if (r < 0)
            {
                err = true;
            }
            else if (r == 0)
            {
                timeout = true;
            }
            break;
        case SSL_ERROR_ZERO_RETURN:
            closed = true;
            break;
        case SSL_ERROR_SYSCALL:
        case SSL_ERROR_SSL:
            err = true;
            break;
        default:
            err = true;
        }
    }
    if (closed)
    {
        cout << "The SSL connection closed out!" << endl << flush;
        return false;
    }
    else if (timeout)
    {
        cout << "The SSL connection timed out!" << endl << flush;
        return false;
    }
    else if (err)
    {
        cout << "The SSL connection errored out!" << endl << flush;
        return false;
    }
    else
    {
        cout << "Congratulations! You are connected securely. Go ahead with your secrets!" << endl << flush;
        return true;
    }
}

My server code is exactly as above. The only difference is the use of
SSL_accept in place of SSL_connect.

Unfortunately I am still seeing the deadlock issue.

When this deadlock happens and with the help of some debug printing, I
see that both the server and the client are timing out (the select
call returns 0) after going through this section of the code:

case SSL_ERROR_WANT_READ:
    FD_ZERO(&fds);
    FD_SET(m_sock_fd, &fds);
    r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
    if (r < 0)
    {
        err = true;
    }
    else if (r == 0)
    {
        timeout = true;
    }
    break;

Both my client and my server seem to want a READ, as the value returned
by SSL_get_error is SSL_ERROR_WANT_READ; then the select call returns
0 for both of them. This means that for 30 seconds neither my client nor
my server was able to make any progress in the SSL
handshake...

I hope you can spot what is wrong with the above code.

Many thanks.
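
One portability detail in the loop above: the same struct timeval is handed to every select call. On Linux, select updates the timeout to reflect the time not slept, so the 30 seconds behave like a total budget; most other implementations leave it untouched, so each call may wait a fresh 30 seconds. A minimal sketch of an explicit deadline that behaves the same everywhere, assuming the caller samples a monotonic clock; remaining_time is a hypothetical helper, not taken from the posted code.

```c
#include <sys/time.h>
#include <time.h>

/*
 * Compute the time left until `deadline`, as seen at `now`, and store it
 * in *out as a struct timeval suitable for select().
 * Returns 1 if some time is left, 0 if the deadline has been reached
 * (in which case *out is set to zero).
 */
static int remaining_time(const struct timespec *deadline,
                          const struct timespec *now,
                          struct timeval *out)
{
    long long sec = (long long)deadline->tv_sec - (long long)now->tv_sec;
    long nsec = deadline->tv_nsec - now->tv_nsec;

    /* Borrow one second if the nanosecond difference went negative. */
    if (nsec < 0) {
        nsec += 1000000000L;
        sec -= 1;
    }
    if (sec < 0 || (sec == 0 && nsec == 0)) {
        out->tv_sec = 0;
        out->tv_usec = 0;
        return 0;
    }
    out->tv_sec = (time_t)sec;
    out->tv_usec = nsec / 1000;  /* nanoseconds to microseconds */
    return 1;
}
```

In a handshake loop, the deadline would be computed once (for example with clock_gettime(CLOCK_MONOTONIC, ...) plus 30 seconds), and remaining_time would be called with a fresh clock reading before each select; a return of 0 means the overall timeout has expired.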

David Schwartz

Nov 7, 2010, 1:56:26 AM

This may be a stretch, but did you confirm the socket is within the
range of sockets your platform allows you to 'select' on? For example,
Linux by default doesn't permit you to 'select' on socket numbers 1,024
and up (FD_SETSIZE is 1,024, so only descriptors 0 through 1,023 fit in
an fd_set), though you can have more than 1,024 file descriptors in use
without a problem.

DS
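
The descriptor-range concern raised here can be checked directly, and 'poll' avoids the limit altogether. The following is a sketch under stated assumptions: fd_fits_select and poll_wait are hypothetical helper names, and nothing in the thread confirms that an out-of-range descriptor is the actual cause of the stall.

```c
#include <errno.h>
#include <poll.h>
#include <sys/select.h>

/* Returns 1 if fd can legally be stored in an fd_set, 0 otherwise. */
static int fd_fits_select(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/*
 * Wait on a single descriptor with poll(), which has no FD_SETSIZE limit.
 * want_write == 0 waits for readability, otherwise for writability.
 * Returns 1 when the descriptor is ready (or has an error/hangup pending,
 * which the next I/O call will report), 0 on timeout, -1 on failure.
 */
static int poll_wait(int fd, int want_write, int timeout_ms)
{
    struct pollfd pfd;
    int r;

    pfd.fd = fd;
    pfd.events = want_write ? POLLOUT : POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the call (restarts the full timeout). */
    do {
        r = poll(&pfd, 1, timeout_ms);
    } while (r < 0 && errno == EINTR);

    if (r < 0)
        return -1;
    if (r == 0)
        return 0;
    if (pfd.revents & POLLNVAL)
        return -1;  /* descriptor is not open */
    return 1;
}
```

A server holding many connections could log whenever fd_fits_select(m_sock_fd) is 0 to confirm or rule out this explanation, and switch the wait to poll_wait if it turns out to apply.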


lang....@gmail.com

Dec 8, 2014, 4:58:19 AM

I am also experiencing the same problem. Did you solve it in the meantime? Maybe you could post your solution?

In my case the handshake is stopped after the server sends its certificate and the ServerHelloDone. The log of my application looks like this:

08:35:52.489Z [DtlsSession.<0-0-0>] <5> Debug: Start accepting with timeout: 1000000 us
08:35:52.489Z [DtlsSession.<0-0-0>] <5> Trace: Start SSL_accept
08:35:52.490Z [DtlsSession.<0-0-0>] <0> Trace: Finished SELECT in SSL_connect
08:35:52.490Z [DtlsSession.<0-0-0>] <0> Trace: Start SSL_connect
08:35:52.490Z [DtlsSession.<0-0-0>] <0> Trace: Finished SSL_connect with -1
08:35:52.490Z [DtlsSession.<0-0-0>] <0> Trace: SSL State: DTLS1 read hello verify request A
08:35:52.490Z [DtlsSession.<0-0-0>] <0> Trace: Start SELECT in SSL_connect because of SSL_ERROR_WANT_READ
08:35:52.490Z [DtlsSession.<0-0-0>] <0> Trace: Finished SELECT in SSL_connect
08:35:52.491Z [DtlsSession.<0-0-0>] <0> Trace: Start SSL_connect
08:35:52.491Z [DtlsSession.<0-0-0>] <0> Trace: Finished SSL_connect with -1
08:35:52.491Z [DtlsSession.<0-0-0>] <0> Trace: SSL State: DTLS1 read hello verify request A
08:35:52.491Z [DtlsSession.<0-0-0>] <0> Trace: Start SELECT in SSL_connect because of SSL_ERROR_WANT_READ
08:35:52.491Z [DtlsSession.<0-0-0>] <0> Trace: Finished SELECT in SSL_connect
08:35:52.491Z [DtlsSession.<0-0-0>] <0> Trace: Start SSL_connect
08:35:52.492Z [DtlsContext] <0> Debug: VerifyPeerCallback finished successfully with <1-2-3>
08:35:52.492Z [DtlsSession.<0-0-0>] <5> Trace: Finished SSL_accept with -1
08:35:52.499Z [DtlsSession.<0-0-0>] <5> Trace: SSL State: SSLv3 read client certificate A
08:35:52.499Z [DtlsSession.<0-0-0>] <5> Trace: Start SELECT in SSL_accept because of SSL_ERROR_WANT_READ
08:35:52.502Z [DtlsSession.<0-0-0>] <5> Trace: Finished SELECT in SSL_accept
08:35:52.502Z [DtlsSession.<0-0-0>] <5> Trace: Start SSL_accept
08:35:52.502Z [DtlsContext] <5> Debug: VerifyPeerCallback finished successfully with <1-2-3>
08:35:52.512Z [DtlsSession.<1-2-3>] <0> Trace: Finished SSL_connect with -1
08:35:52.512Z [DtlsSession.<1-2-3>] <0> Trace: SSL State: SSLv3 read server session ticket A
08:35:52.512Z [DtlsSession.<1-2-3>] <0> Trace: Start SELECT in SSL_connect because of SSL_ERROR_WANT_READ
08:35:52.512Z [DtlsSession.<1-2-3>] <5> Trace: Finished SSL_accept with -1
08:35:52.513Z [DtlsSession.<1-2-3>] <5> Trace: SSL State: SSLv3 read certificate verify A
08:35:52.513Z [DtlsSession.<1-2-3>] <5> Trace: Start SELECT in SSL_accept because of SSL_ERROR_WANT_READ
08:35:53.512Z [DtlsSession.<1-2-3>] <5> Trace: Finished SELECT in SSL_accept
08:35:53.512Z [DtlsSession.<1-2-3>] <5> Error: Failed to perform the DTLS handshake in SSL_accept with 1-2-3-0-0: peer is not answering
08:35:53.512Z [DtlsSession.<1-2-3>] <5> Error: SSL State: SSLv3 read certificate verify A
08:35:53.512Z [DtlsSession.<1-2-3>] <5> Error: Failed to finish the SSL handshake with 127.0.0.1 in Accept(int)

The <5> is the Server and the <0> is the client.

larry.k...@gmail.com

Dec 10, 2014, 10:56:08 AM

> With some debugging efforts I see that when this problem hits, both the client and the server go repeatedly into the green section above, each one of them seems to want to perform a read as the returned code is SSL_ERROR_WANT_READ from both the SSL_connect and the SSL_accept calls.
>
> This looks to me as a deadlock situation where both my server and my client are wanting to do a READ until both of them timeout!
>
>
> Can someone please suggest to me what is wrong with the above code and how is this deadlock possible?? I am using openssl-1.0.0a

I find in my non-blocking, threaded implementation that SSL_accept() mistakenly
thinks the handshake is complete and indicates SSL_ERROR_NONE.
Note the Sid is still NULL.

thr=6 TCP::sslAcceptConn(): TIsockNotReady _sslAcceptWaitOnRead=1, _sslAcceptWaitOnWrite=0
thr=4 TCP::sslConnect(50) Want_Read SSL_connect() rtn=-1, errno=11
thr=6 TCP::sslAcceptConn(52) SUCCESS SSL_accept() rtn=1
Sid: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
thr=6 TCP::sslGetData(52, Buff=0xb2da110c, Size=0) Want_Read SSL_read() rtn=2, errno=11
Deadlocked - both client and server Want_Read