The green section I refer to below is this:
case SSL_ERROR_WANT_READ:
    FD_SET(m_sock_fd, &fds);
    r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
    if (r <= 0 && (Errno == EAGAIN || Errno == EINTR)) /* if we timed out with EAGAIN, try again */
    {
        r = 1;
    }
    break;
This is the section of code that both the server and the client go
through many times before they finally time out. I do not know why
both of them want a READ during their SSL handshake...
Fortunately this happens very rarely, but it does happen!
Thank you for any suggestions, and please let me know if you spot a
logic error in the code below.
> r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
> if (r <= 0 && (Errno == EAGAIN || Errno == EINTR)) /* if we timed out with EAGAIN, try again */
> {
>     r = 1;
> }
This code is broken. If 'select' returns zero, checking errno is a
mistake. (What is 'Errno' anyway?)
> r = SSL_connect(m_ssl);
> if (r > 0)
> {
> break;
> }
> r = ssl_retry(r);
> if ( r <= 0)
> {
> break;
> }
> t = time(NULL) - time0;
> }
Err, what? Is an ssl_retry return of zero supposed to indicate a fatal
error? The code in ssl_retry doesn't seem to follow this rule. (For
example, consider if 'select' returns zero and errno is zero. That would
indicate a timeout, not a fatal error.)
> int time0 = time(NULL);
> timeout=10 seconds;
> while (t<timeout)
> {
> r = SSL_accept(m_ssl);
> if (r > 0)
> {
> break;
> }
> r = ssl_retry(r);
> if ( r <= 0)
> {
> break;
> }
> t = time(NULL) - time0;
> }
> if (t>=timeout)
There is no code to initially set 't'.
Also, an overall comment: Maybe it's just my taste, but your code seems
to have a 'worst of both worlds' quality to it. It uses non-blocking
sockets, but then finds clever ways to make the non-blocking operations
act like blocking ones.
Is the server multithreaded? If so, I could see this as mere laziness
(or, to be more charitable, efficient use of coding resources) rather
than actual poor design.
DS
______________________________________________________________________
OpenSSL Project http://www.openssl.org
User Support Mailing List openss...@openssl.org
Automated List Manager majo...@openssl.org
Errno is the usual errno (just a wrapper for platform-porting
purposes).
The code sets 't' to 0 initially (sorry, I omitted that line from the
stripped-down code I showed below).
If select returns 0 and errno is 0, then you are right: it is
technically a timeout, and that is exactly what was happening, which I
tried to "fix" by checking the errno.
Now if I heed your advice and remove the errno check [which was my
original code], then when the problem hits I see that both the client
and the server return 0 from their select in the SSL_ERROR_WANT_READ
code block.
Even if I increase the select timeout to 10 seconds, both the client
and the server will time out on that select line right after they
report SSL_ERROR_WANT_READ...
My question is: under what conditions do both the server and the
client end up waiting on SSL_ERROR_WANT_READ, and how can they get out
of that deadlocked state?
Yes, my server is multithreaded, and although I am sure my design is
not the best, it has been serving thousands of clients on different
platforms, sometimes for days, without dropping a single connection.
Then, seemingly at random, some of my clients [only on Linux and HP
platforms] report this handshake issue! Debugging shows that when this
happens, both the client and the server are timing out on the select
line right after the SSL_ERROR_WANT_READ.
I followed your advice and here is my new client code:
bool ssl_connection()
{
    bool done = false;
    bool closed = false;
    bool err = false;
    bool timeout = false;
    int r;
    fd_set fds;
    struct timeval tv, *ptv;
    ptv = &tv;
    while (!done && !closed && !err && !timeout)
    {
        r = SSL_connect(m_ssl);
        switch (SSL_get_error(m_ssl, r))
        {
        case SSL_ERROR_NONE:
            done = true;
            break;
        case SSL_ERROR_WANT_READ:
            FD_ZERO(&fds);
            FD_SET(m_sock_fd, &fds);
            ptv->tv_sec = 30; /* the maximum seconds I am willing to wait */
            ptv->tv_usec = 0; /* re-arm both fields every pass: on Linux,
                                 select() modifies the timeval in place */
            r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
            if (r < 0)
            {
                err = true;
            }
            else if (r == 0)
            {
                timeout = true;
            }
            break;
        case SSL_ERROR_WANT_WRITE:
            FD_ZERO(&fds);
            FD_SET(m_sock_fd, &fds);
            ptv->tv_sec = 30;
            ptv->tv_usec = 0;
            r = select(m_sock_fd + 1, 0, &fds, 0, ptv);
            if (r < 0)
            {
                err = true;
            }
            else if (r == 0)
            {
                timeout = true;
            }
            break;
        case SSL_ERROR_ZERO_RETURN:
            closed = true;
            break;
        case SSL_ERROR_SYSCALL:
        case SSL_ERROR_SSL:
            err = true;
            break;
        default:
            err = true;
        }
    }
    if (closed)
    {
        cout << "The SSL connection closed out!" << endl << flush;
        return false;
    }
    else if (timeout)
    {
        cout << "The SSL connection timed out!" << endl << flush;
        return false;
    }
    else if (err)
    {
        cout << "The SSL connection errored out!" << endl << flush;
        return false;
    }
    else
    {
        cout << "Congratulations! You are connected securely. "
                "Go ahead with your secrets!" << endl << flush;
        return true;
    }
}
My server code is exactly as above. The only difference is the use of
SSL_accept in place of SSL_connect.
Unfortunately I am still seeing the deadlock issue.
When this deadlock happens, some debug printing shows that both the
server and the client are timing out (the select call returns 0) after
going through this section of the code:
case SSL_ERROR_WANT_READ:
    FD_ZERO(&fds);
    FD_SET(m_sock_fd, &fds);
    r = select(m_sock_fd + 1, &fds, 0, 0, ptv);
    if (r < 0)
    {
        err = true;
    }
    else if (r == 0)
    {
        timeout = true;
    }
    break;
Both my client and server seem to want a READ, since SSL_get_error
returns SSL_ERROR_WANT_READ, and then the select call returns 0 for
both of them. This means that for 30 seconds neither my client nor my
server was able to make any progress in the SSL handshake...
I hope you can spot what is wrong with the above code.
Many thanks.
DS