Disclaimer: I am by no means an 'expert', but I have been intensely working
with IOCP the past few months. I'll just offer some of what I know in an
attempt to help.
> 1. What do you do when GetQueuedCompletionStatus() extracts a failed I/O which was issued by WSASend/WSARecv?
Log the error if you wish. Most of the time, you just clean up whatever
context needs it and carry on. I assume you have some user object associated
with the OVERLAPPED structure, so if anything needs to be done there, now is
the time to do it.
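One common way to get from the OVERLAPPED pointer back to your user object is to make the OVERLAPPED the first member of a per-operation struct. Here is a minimal sketch of that idea. The names Overlapped, IoContext, IoKind and context_from_overlapped are made up for illustration; Overlapped is only a stand-in so the snippet compiles without the Windows headers, and real code would use the actual OVERLAPPED (or WSAOVERLAPPED) type instead.

```cpp
#include <type_traits>

// Stand-in for the Win32 OVERLAPPED structure (assumption: real code would
// include <winsock2.h> and use OVERLAPPED or WSAOVERLAPPED here).
struct Overlapped {
    void* internal_state;
};

// Hypothetical tag saying which kind of operation this context belongs to.
enum class IoKind { Send, Recv };

// Hypothetical per-operation context. The overlapped member comes first, so
// a pointer to it is also a pointer to the enclosing IoContext.
struct IoContext {
    Overlapped overlapped;
    IoKind kind;
    int connection_id;
};

// The cast below is only valid for standard-layout types.
static_assert(std::is_standard_layout<IoContext>::value,
              "IoContext must be standard-layout");

// Recover the context from the pointer handed back by the completion port.
inline IoContext* context_from_overlapped(Overlapped* ov) {
    return reinterpret_cast<IoContext*>(ov);
}
```

When a failed completion comes back, you would call context_from_overlapped on the returned pointer, log what you need from the context, and then free or recycle it.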
> - Does it mean the socket has broken and should be closed immediately?
No, not necessarily. It just means something went wrong with the operation
(check the 'Remarks' section of each API function). Perhaps the socket closed
on the remote side, or maybe the operating system could not lock a non-paged
pool buffer. It's all a best guess. You kind of just have to program for the
generic "there was an error, I don't care what error it was, log it, and then
handle recovery" case.
Writing test cases that actually reproduce the possible errors is hard, so it
is vital that you build in an appropriate logging mechanism so your program
can report errors back to you when they happen. I've found that it's really
hard to reproduce real-world scenarios in test code.
> - Or are there cases when it's worth to retry sending/receiving?
This depends on your protocol. If you are using UDP (I know you are not, but
just for the sake of argument) it would not matter. If you are using TCP
though, then you will definitely want to signal the error to whatever logic
is responsible for getting the data resent.
> - What are possible error codes returned by GetLastError()?
With Windows, it's amazing how many different error codes are reported for a
function :) You won't know until you hit them. MSDN docs are not always
complete in terms of telling what errors can be returned. Sometimes you get a
generic error code that you can't even decipher, such as WSAEINVAL.
> 2. Furthermore, when GetQueuedCompletionStatus() fails to extract an I/O packet:
> - Does it alway mean the IOCP has broken? Should we close the IOCP
> immediately?
No, it does not always mean that. For example, if you try to send via UDP to
a valid host that does not have the target port open, the function will fail
with error code ERROR_PORT_UNREACHABLE. Likewise, if you look at WinError.h,
there are other error codes around that one that look like possible
candidates.
> - Or can we just ignore the error in some cases?
In some cases you may wish to ignore errors, in others not. It just depends
on your program. The biggest issue is making sure you do not leak any
resources! As a programmer, you know you have to expect errors, so in your
program logic you always need to consider "what if this response packet is
never received?" and so on.
> - What are possible causes/error codes returned by GetLastError()?
As mentioned before, it just varies. All I can recommend is that you log all
your errors and start by handling the general "there was an error" case rather
than trying to find specific errors and handle them all individually.
Eventually, you might be able to build up a comprehensive error set, but that
takes time.
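To make the "log everything and build up the set over time" idea concrete, here is a minimal sketch of a tally you could feed with whatever GetLastError() or WSAGetLastError() returned. The ErrorTally class and its methods are made up for illustration and are not part of any Windows API; the mutex is there on the assumption that several IOCP worker threads will report errors at once.

```cpp
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical helper: counts how often each error code has been seen.
class ErrorTally {
public:
    // Record one occurrence of `code`. Returns true the first time a code
    // is seen, which is a good moment to write a detailed log entry.
    bool record(unsigned long code) {
        std::lock_guard<std::mutex> lock(mutex_);
        return ++counts_[code] == 1;
    }

    // How many times `code` has been recorded so far (0 if never).
    std::size_t count(unsigned long code) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = counts_.find(code);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::map<unsigned long, std::size_t> counts_;
};
```

After the server has run for a while, the codes with the highest counts are the ones worth handling individually; everything else can stay in the generic "there was an error" path.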
> Thanks alot,
I'm not sure how useful this was, but good luck!
Thanks a lot for your advice on logging and defensive programming. I'm just
wondering how to recover from errors in some cases.
Let me clarify my situation: I'm writing a TCP server in which IOCP is used
purely for I/O operations.
When a socket is accepted, it is attached to the IOCP for sending/receiving.
1. When GetQueuedCompletionStatus(INFINITE) returns FALSE with a non-null
overlapped
This means a WSASend/WSARecv has failed. Do you think one of the reasons is
that there is not enough non-paged pool? I read somewhere that this error
only shows up while calling WSASend/WSARecv (which returns WSAENOBUFS).
In any case, what do you mean by "recovery" here? I guess the proper behavior
is to close the socket (and do the necessary clean-up, of course).
2. When GetQueuedCompletionStatus(INFINITE) returns FALSE with a null
overlapped
This means the IOCP has failed to extract an I/O packet. I think this error
does not relate to any specific socket. I read somewhere that if this error
happens, it tends to keep happening, since the IOCP is broken and should be
closed (which means we have to shut down the server). What do you do for
recovery in this case? Can I just close the IOCP, create a new one, and attach
all connected sockets again so that the server will continue to work? (Given
that I know the sending/receiving state of each socket and know where to
issue the next send/receive.)
That is certainly a possibility if things go well. In the book "Network
Programming for Microsoft Windows 2nd Edition" (highly recommended) I'll
quote:
"If the system does run out of non-paged pool, there are two possibilities.
In the best-case scenario, Winsock calls will fail with WSAENOBUFS. The
worst-case scenario is the system crashes with a terminal error." (
http://www.microsoft.com/mspress/books/sampchap/5726a.aspx )
That book has a nice coverage of IOCP and TCP. I actually started my IOCP
work with TCP, but quickly switched to UDP due to resource constraints I hit
with TCP and needing UDP for game related programming.
So, recovery-wise, I think it's better to program in such a way that the
error never happens in the first place. I.e., take a look at your server's
resources and cap the number of connections and locked buffers so that they
can never exceed what those resources allow. How to go about that is not so
clear, but you should not allow an unbounded number of incoming connections
(which is a DoS vulnerability as well).
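As one possible way to cap connections, here is a minimal sketch of a counter that the accept path checks before attaching a new socket to the IOCP. ConnectionBudget and its methods are made-up names, and the limit itself is something you would have to work out from your own server's resources; the same pattern could be repeated for outstanding locked buffers.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical helper: a thread-safe cap on simultaneously open connections.
class ConnectionBudget {
public:
    explicit ConnectionBudget(std::size_t max_connections)
        : max_(max_connections), active_(0) {}

    // Try to reserve a slot for a newly accepted socket.
    // Returns false when the cap is reached; the caller should then refuse
    // or close the new connection instead of attaching it to the IOCP.
    bool try_acquire() {
        std::size_t current = active_.load(std::memory_order_relaxed);
        while (current < max_) {
            // On failure, compare_exchange_weak reloads `current`,
            // so the loop re-checks the cap with the fresh value.
            if (active_.compare_exchange_weak(current, current + 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // Give the slot back once the socket has been closed and cleaned up.
    void release() { active_.fetch_sub(1, std::memory_order_acq_rel); }

    // Number of slots currently in use.
    std::size_t active() const { return active_.load(std::memory_order_acquire); }

private:
    const std::size_t max_;
    std::atomic<std::size_t> active_;
};
```

Each successful try_acquire must be paired with exactly one release in the close path, otherwise the budget leaks and the server eventually refuses everyone.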
> 2. When GetQueuedCompletionStatus(INFINITE) return FALSE with null overlapped. This means IOCP has failed to extract I/O packet. I think this error does not relate to any specific socket.
I'd like to point you to some code I've found and studied when I got
started. It's got a few flaws, but it comes from a "trusted" site,
http://win32.mvps.org/network/sockhim.html
One of his comments in that code pertaining to that question says:
"if ( result == FALSE ) // something gone awry?
{
    // something failed. Was it the call itself, or did we just eat
    // a failed socket I/O?
    if ( pov == 0 ) // the IOCP op failed
    {
        // no key, no byte count, no overlapped info available!
        // how do I handle this?
        // by ignoring it. I don't even have a socket handle to close.
        // You might want to abort instead, as something is seriously
        // wrong if we get here.
    }
    else
    {
        // key, byte count, and overlapped are valid.
        // tear down the connection and requeue the socket
        // (unless the listener socket is closed, in which case
        // this is a failed AcceptEx(), and we stop accepting)
        DoClose( *pov, true, listener == INVALID_SOCKET? false: true );
    }
}
"
So, when the overlapped is NULL, you are correct: something is seriously
broken, and you should probably exit, having logged enough to determine what.
If only the socket I/O failed (the overlapped is non-NULL), then you close
that socket, dispose of it, and create a new one as needed.
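To spell out that decision, here is a minimal sketch that takes the two things GetQueuedCompletionStatus gives you (its return value and the overlapped pointer it wrote back) and maps them to an action. CompletionAction and classify_completion are made-up names, and the function deliberately takes plain values so it compiles without the Windows headers; in real code you would pass `result != FALSE` and the returned OVERLAPPED pointer.

```cpp
// Hypothetical result type for the decision described above.
enum class CompletionAction {
    ProcessIo,        // the call succeeded: a completed I/O was dequeued
    CloseConnection,  // a failed socket I/O was dequeued: tear down that socket
    IocpFailure       // the dequeue itself failed: no key, no overlapped
};

// succeeded:  whether GetQueuedCompletionStatus returned nonzero
// overlapped: the OVERLAPPED pointer it wrote back (null if nothing was dequeued)
inline CompletionAction classify_completion(bool succeeded, const void* overlapped) {
    if (succeeded) {
        return CompletionAction::ProcessIo;
    }
    if (overlapped == nullptr) {
        // Nothing to clean up per socket; log and consider shutting down.
        return CompletionAction::IocpFailure;
    }
    // Key, byte count and overlapped are valid; the operation itself failed.
    return CompletionAction::CloseConnection;
}
```

A worker loop would switch on the returned value, which keeps the "log it and exit" path and the "close this one socket" path clearly separated.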
For some additional references that might help:
http://msdn.microsoft.com/en-us/library/cc500404.aspx
http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx
http://msdn.microsoft.com/en-us/magazine/cc302334.aspx
http://www.amazon.com/exec/obidos/ASIN/0735615799/qid=1063940984/sr=2-1/ref=sr_2_1/002-4725947-6716828
I've consulted all of those so far in my work with IOCP and UDP (for which I
don't think there are any complete working examples, something I hope to
change). I'll get back around to IOCP and TCP after my UDP work is done
though.
Best of luck!