I have a server session that works properly, but if the client crashes I'm having trouble getting the server to re-establish the connection to the restarted client. I see that the server sends a test request after the client crashes, receives nothing back, and so aborts my session. What is the proper way to set up the server so that, if the session is aborted, it resets the session and a client can reconnect?
I have created manual server connect/disconnect logic, similar to what I created for my client-side connections, to allow manual disconnects and reconnects. I thought this might provide a manual way of getting the client and server talking again. However, the disconnection logic that works on the client side fails on the server side. An exception is thrown from the Poco library that I'm unable to catch with FIX8::f8Exception, Poco::Net::NetException, Poco::IOException, or catch (...). The stack trace of the exception is:
Poco::Net::SocketImpl::error()
FIX8::Connection::stop /fix8/runtime/connection.cpp 322
So the call that causes the exception is:
_reader.socket()->shutdownReceive();
in:
void Connection::stop()
{
    scout_debug << "Connection::stop()";
    _writer.stop();
    _writer.join();
    _reader.stop();
    _reader.join();
    _reader.socket()->shutdownReceive();  // connection.cpp line 322: throws via Poco::Net::SocketImpl::error()
}
I haven't been able to figure out why the Poco exception isn't being caught within my try/catch logic. Is there something special I need to do so that I can catch the errors thrown by Poco/Fix8 within my code?
To disconnect manually, I just send a Logout message using:
m_pAcceptorRouterSessionServerInst->session_ptr()->send(new FIX8::ITGC_FIXServerInterface::Logout);
This invokes the SessionServer::state_change method with FIX8::States::SessionStates::st_session_terminated as the new state. From this callback I clear out my server session wrapper using:
if ( m_pAcceptor != nullptr )
{
    delete m_pAcceptor;
    m_pAcceptor = nullptr;  // avoid a dangling pointer and a double delete on the next disconnect
}
if ( m_pAcceptorRouterSessionServerInst != nullptr )
{
    auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
    try
    {
        // unique_ptr assignment: destroys the owned SessionInstance here
        m_pAcceptorRouterSessionServerInst = nullptr;
    }
    catch ( const FIX8::f8Exception& ex )
    {
        Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
    }
    catch ( const Poco::Net::NetException& ex )
    {
        Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
    }
    catch ( const Poco::IOException& ex )
    {
        Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
    }
    catch (...)
    {
        Log( "Exception while clearing out server instance " );
    }
}
if ( m_pSessionServer != nullptr )
{
    m_pSessionServer = nullptr;  // destroys the owned ServerSession
}
To connect manually, I use:
m_pSessionServer = std::unique_ptr<FIX8::ServerSession<SessionServer>>(
    new FIX8::ServerSession<SessionServer>(FIX8::ITGC_FIXServerInterface::ctx(),
                                           m_szInputConfiguration, m_szInputConfigurationSection));
m_pAcceptorRouterSessionServerInst = std::unique_ptr<FIX8::SessionInstance<SessionServer>>(
    new FIX8::SessionInstance<SessionServer>(*m_pSessionServer));
auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
m_pAcceptor = new Acceptor<OrderCallbackFunction>(
    *pAcceptorRouterSessionServerInst, m_pAcceptorRouterSessionServerInst.get(), m_pStatusCallback );
m_pAcceptorRouterSessionServerInst->start(false);
On Thursday, April 9, 2015 at 10:46:43 PM UTC+10, astern.f...@gmail.com wrote:
Assigning nullptr to a std::unique_ptr deletes the object it owns, and the pointer here is wrapped in a std::unique_ptr, as shown in the connect code. The destructor throws an exception when it calls into Poco while the socket is already in a bad state or disconnected.
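if ( m_pAcceptor != nullptr )
{
    delete m_pAcceptor;
}
auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
try
{
    m_pAcceptorRouterSessionServerInst = nullptr;  // the SessionInstance destructor runs here and calls into Poco
}
// ... catch handlers as in the full listing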
I have read that page multiple times, but we don't have a sequence issue. The issue is that the client crashes, which causes Fix8 to send a test request. When no response arrives, because the client is no longer listening, Fix8 deletes the session it was talking to. When the client is brought back up, it never even attempts to reconnect. I believe this is because the socket wrapped inside the deleted session is gone, so the server is no longer really listening. Since the heartbeat code waits only 20% longer than the heartbeat interval after the test request before disconnecting, usually about 36 seconds, it doesn't give us enough time to fix whatever is wrong with the client: our alert system takes longer than that to notify us that there is a problem.
This is one of the reasons I've moved toward manual disconnect/connect. The other is that there are cases where we need to disconnect from a client or server because of operational issues on their side, and we don't want to reconnect until they are ready.
If the connection was dropped for any reason other than a manually requested disconnect, I kick off a short timer, wait a bit, and then start the connection process again. This should resolve many of the sequence number issues, with replays where needed. I've looked at the ReliableClient connection: it retries a few times to get connected, but the heartbeat code will still disconnect the session if the client crashes. There is no equivalent ReliableServer connection that I can find in SessionWrapper.
Our application has two server connections, order entry and drop copy, plus multiple client connections to different exchanges. If the drop copy has an issue, I want to be able to get it going again without stopping trading, so restarting the server isn't an option.
The handling of FIX messages is much nicer than what we had in QuickFix, and my testing has shown fix8 to be much faster. That is why we moved from QuickFix, but I think QuickFix handles connect/disconnect scenarios better: I don't remember ever needing to spend time in this area of the code while we were using QuickFix. Bringing clients up and down and forcing crashes on servers and clients just seemed to work in QuickFix. The changes I'm submitting on GitHub, along with these questions, will hopefully get us to the reliability we need for our systems. I think we are very close to having a fully implemented solution, but everything will need to go through very intensive QA.
I am attempting to get a project using fix8 into production. I'm not flaming the project, but I am having some very real issues with recovery when clients and servers crash. I have reported what I have seen while stepping through the code under GCC 4.8.1. The assignment of nullptr very clearly ends up calling the destructor. I am on this board both to provide and to get support for this open source project.