Recovery of client disconnection from a Fix8 server application.


astern.f...@gmail.com

Apr 7, 2015, 12:30:50 PM
to fix8-s...@googlegroups.com
I have a Server Session that works properly but if the client crashes I'm having trouble getting the server to reestablish the connection to the restarted client. I see that the server sends the test request after the client crashes and doesn't receive anything back so it aborts my session. What is the proper method for setting up the server so that if the session is aborted it resets the session so that a client can reconnect?

I have created manual server connect/disconnect logic similar to what I created for my client-side connections, to allow for manual disconnects and reconnects. I thought this might provide a manual way of getting the client and server talking again. But the disconnection logic that works on the client side fails on the server side. An exception is thrown from the Poco library that I'm unable to catch with any of FIX8::f8Exception, Poco::Net::NetException, Poco::IOException, or catch (...). The stack trace of the exception is:
Poco::Net::SocketImpl::error()
FIX8::Connection::stop /fix8/runtime/connection.cpp 322

So the call that causes the exception is:
_reader.socket()->shutdownReceive();

in:
void Connection::stop()
{
        scout_debug << "Connection::stop()";
        _writer.stop();
        _writer.join();
        _reader.stop();
        _reader.join();
>>      _reader.socket()->shutdownReceive();
}

I haven't been able to figure out why the Poco exception isn't being caught within my try/catch logic. Is there something special I need to do so that I can catch the errors thrown by Poco/Fix8 within my code?
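
One possible explanation, offered as an assumption rather than a diagnosis of the Fix8 source: since C++11 a destructor is implicitly noexcept, so an exception that escapes a destructor calls std::terminate and never reaches a catch block in the calling code. Another possibility is that the exception is thrown on a different thread from the one holding the try/catch. A minimal sketch of containing the failure inside the cleanup routine follows; FakeSocket and GuardedConnection are hypothetical stand-ins, not Poco or Fix8 classes.

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for a socket whose shutdown fails once the peer is gone.
struct FakeSocket
{
    void shutdownReceive() { throw std::runtime_error("socket not connected"); }
};

// Hypothetical connection wrapper; this is not the Fix8 Connection class.
class GuardedConnection
{
public:
    // Safe to call stop() here because stop() never lets an exception escape.
    ~GuardedConnection() { stop(); }

    // Returns true if the shutdown succeeded, false if it failed and was contained.
    bool stop() noexcept
    {
        try
        {
            socket_.shutdownReceive();
            return true;
        }
        catch (const std::exception&)
        {
            shutdown_failed_ = true;   // record the failure instead of propagating it
            return false;
        }
        catch (...)
        {
            shutdown_failed_ = true;
            return false;
        }
    }

    bool shutdown_failed() const { return shutdown_failed_; }

private:
    FakeSocket socket_;
    bool shutdown_failed_ = false;
};

// A destructor is noexcept unless declared otherwise, so a throw that leaves it
// ends in std::terminate rather than in the caller's catch block.
static_assert(std::is_nothrow_destructible<GuardedConnection>::value,
              "destructors are implicitly noexcept since C++11");
```

With this shape, the caller can check the return value of stop() or shutdown_failed() instead of relying on an exception handler around the point of destruction.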

To disconnect manually I just send a Logout message using:
m_pAcceptorRouterSessionServerInst->session_ptr()->send(new FIX8::ITGC_FIXServerInterface::Logout);

This calls the SessionServer::state_change method with the FIX8::States::SessionStates::st_session_terminated as the new state. From this callback I clear out my server session wrapper using:

if ( m_pAcceptor != nullptr )
{
    delete m_pAcceptor;
}

if ( m_pAcceptorRouterSessionServerInst != nullptr )
{
    auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
    try
    {
        m_pAcceptorRouterSessionServerInst = nullptr;
    }
    catch ( FIX8::f8Exception& ex )
    {
        Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
    }
    catch ( Poco::Net::NetException& ex )
    {
        Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
    }
    catch ( Poco::IOException& ex )
    {
        Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
    }
    catch (...)
    {
        Log( "Exception while clearing out server instance " );
    }
}

if ( m_pSessionServer != nullptr )
{
    m_pSessionServer = nullptr;
}


To connect manually I use:
m_pSessionServer = std::unique_ptr<FIX8::ServerSession<SessionServer>>(new FIX8::ServerSession<SessionServer>(FIX8::ITGC_FIXServerInterface::ctx(), m_szInputConfiguration, m_szInputConfigurationSection));
m_pAcceptorRouterSessionServerInst = std::unique_ptr<FIX8::SessionInstance<SessionServer>>(new FIX8::SessionInstance<SessionServer>(*m_pSessionServer));

auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
m_pAcceptor = new Acceptor<OrderCallbackFunction>( *pAcceptorRouterSessionServerInst, m_pAcceptorRouterSessionServerInst.get(), m_pStatusCallback );
m_pAcceptorRouterSessionServerInst->start(false);

Mazz Barker

Apr 8, 2015, 9:58:28 PM
to fix8-s...@googlegroups.com, astern.f...@gmail.com
On Wednesday, April 8, 2015 at 2:30:50 AM UTC+10, astern.f...@gmail.com wrote:
I have a Server Session that works properly but if the client crashes I'm having trouble getting the server to reestablish the connection to the restarted client.  I see that the server sends the test request after the client crashes and doesn't receive anything back so it aborts my session.  What is the proper method for setting up the server so that if the session is aborted it resets the session so that a client can reconnect?

Seems to me that you don't really want the session to be reset. You want the client to reconnect and resume. The procedure is described in some detail here.
 

I have created manual server connect/disconnect logic similar to what I created for my client-side connections, to allow for manual disconnects and reconnects. I thought this might provide a manual way of getting the client and server talking again. But the disconnection logic that works on the client side fails on the server side. An exception is thrown from the Poco library that I'm unable to catch with any of FIX8::f8Exception, Poco::Net::NetException, Poco::IOException, or catch (...). The stack trace of the exception is:
  Poco::Net::SocketImpl::error()    
  FIX8::Connection::stop  /fix8/runtime/connection.cpp  322

So the call that causes the exception is:
        _reader.socket()->shutdownReceive();

in:
void Connection::stop()
{
        scout_debug << "Connection::stop()";
        _writer.stop();
        _writer.join();
        _reader.stop();
        _reader.join();
>>        _reader.socket()->shutdownReceive();
}

I haven't been able to figure out why the Poco exception isn't being caught within my try/catch logic.  Is there something special I need to do so that I can catch the errors thrown by Poco/Fix8 within my code?

To disconnect manually I just send a Logout message using:
  m_pAcceptorRouterSessionServerInst->session_ptr()->send(new FIX8::ITGC_FIXServerInterface::Logout);

This calls the SessionServer::state_change method with the FIX8::States::SessionStates::st_session_terminated as the new state.  From this callback I clear out my server session wrapper using:

  if ( m_pAcceptor != nullptr )
  {
          delete m_pAcceptor;
  }

There is no need to check for nullptrs when using delete in C++.
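
The point about delete can be sketched as follows. Widget and destroy are hypothetical names used only for illustration, and resetting the pointer afterwards is an addition that is not in the posted code; it keeps a second cleanup pass from deleting the same object twice.

```cpp
#include <cassert>

// Hypothetical type that counts live instances so the effect of delete is visible.
struct Widget
{
    static int live;
    Widget() { ++live; }
    ~Widget() { --live; }
};
int Widget::live = 0;

// Deletes the object (if any) and clears the pointer so a repeated call is harmless.
void destroy(Widget*& p)
{
    delete p;      // a no-op when p is null, so no check is needed
    p = nullptr;   // prevents a double delete on the next call
}
```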
 

  if ( m_pAcceptorRouterSessionServerInst != nullptr )
  {
        auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
        try
        {
                m_pAcceptorRouterSessionServerInst = nullptr;
        }

How can assigning a nullptr raise an exception?
 
        catch ( FIX8::f8Exception& ex )
        {
                Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
        }
        catch ( Poco::Net::NetException& ex )
        {
                Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
        }
        catch ( Poco::IOException& ex )
        {
                Log( "FIXInterface::ClearCopySession detail %s", ex.what() );
        }
        catch (...)
        {
                Log( "Exception while clearing out server instance " );
        }
  }

  if ( m_pSessionServer != nullptr )
  {
          m_pSessionServer = nullptr;
  }


ditto
 

To connect manually I use:
  m_pSessionServer = std::unique_ptr<FIX8::ServerSession<SessionServer>>(new FIX8::ServerSession<SessionServer>(FIX8::ITGC_FIXServerInterface::ctx(), m_szInputConfiguration, m_szInputConfigurationSection));
  m_pAcceptorRouterSessionServerInst = std::unique_ptr<FIX8::SessionInstance<SessionServer>>(new FIX8::SessionInstance<SessionServer>(*m_pSessionServer));
                
  auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
m_pAcceptor = new Acceptor<OrderCallbackFunction>( *pAcceptorRouterSessionServerInst, m_pAcceptorRouterSessionServerInst.get(), m_pStatusCallback );
  m_pAcceptorRouterSessionServerInst->start(false);

Don't use the manual method. 
Looking at this code it seems that you probably aren't using the fix8 server idiom correctly. I suggest you study the examples more closely. You might also want to review the tickets in the bug tracker which describe issues around this sort of stuff.

Mazz
 

astern.f...@gmail.com

Apr 9, 2015, 8:46:43 AM
to fix8-s...@googlegroups.com, astern.f...@gmail.com
A nullptr assignment will delete the object if the object pointer is wrapped inside a std::unique_ptr, as is shown in the connect code. The destructor throws an exception when it calls into Poco and the socket is already in a bad state or disconnected.
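
For reference, std::unique_ptr has provided an assignment operator taking std::nullptr_t since C++11, and it has the same effect as calling reset(). A minimal sketch of that behaviour, using a hypothetical Tracked type rather than any Fix8 class:

```cpp
#include <cassert>
#include <memory>

// Hypothetical type that reports its own destruction through a flag.
struct Tracked
{
    explicit Tracked(bool& destroyed) : destroyed_(destroyed) {}
    ~Tracked() { destroyed_ = true; }
    bool& destroyed_;
};

// Returns true if assigning nullptr to the owning unique_ptr ran the destructor
// and left the unique_ptr empty.
bool nullptr_assignment_destroys()
{
    bool destroyed = false;
    std::unique_ptr<Tracked> owner(new Tracked(destroyed));
    owner = nullptr;                 // same effect as owner.reset()
    return destroyed && !owner;
}
```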

I have read that page multiple times but we don't have a sequence issue. The issue is that the client crashes, which causes Fix8 to send a test request. When it doesn't get a response, since the client is no longer listening, it deletes the session that it was talking to previously. When the client is brought back up it never even attempts to reconnect. I believe this is because the socket connection wrapped inside the deleted session is gone, so nothing is really listening any more. Since the disconnect test request heartbeat code waits just 20% longer than the heartbeat time, it doesn't give us enough time to fix whatever is wrong with the client, since it is usually ~36 seconds and our alert system takes longer than that to notify us that there is a problem.
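
The timing above works out as follows, assuming a 30-second heartbeat interval (an inference from the ~36 second figure, not something stated in the post). The function name is hypothetical and is not part of Fix8.

```cpp
#include <cassert>
#include <chrono>

// Grace period before an unresponsive peer is dropped: heartbeat plus 20 %.
// The 20 % margin comes from the post; the name is made up for this sketch.
inline std::chrono::milliseconds disconnect_timeout(std::chrono::seconds heartbeat)
{
    return std::chrono::milliseconds(heartbeat) * 12 / 10;
}
```

With a 30 s heartbeat this gives 36 s, which matches the window described.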

This is one of the reasons that I've moved in a direction of manual disconnect/connect. The other is that there are cases where we need to disconnect from the client/server due to operational issues on their side and we don't want to connect until they are ready.

If the connection isn't killed due to a disconnection that was manually requested I kick off a short timer to wait a bit then start the connection process. This should sync up many of the sequence number issues with replays where needed. I've looked at the ReliableClient connection. It looks like it tries a few times to get connected but still the heartbeat code will disconnect the session if the client crashes. There is no code for a ReliableServer connection that I can find in SessionWrapper.
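
The restart rule described above (reconnect after a short delay unless the disconnect was requested manually) could be kept in one small policy object. This is only a sketch: ReconnectPolicy, RestartDecision and their members are hypothetical names, and wiring it to the session-terminated callback and to a real timer is left out.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Outcome of a session termination: whether to restart, and after how long.
struct RestartDecision
{
    bool restart;
    std::chrono::seconds delay;
};

// Hypothetical policy object; none of these names come from Fix8.
class ReconnectPolicy
{
public:
    explicit ReconnectPolicy(std::chrono::seconds delay) : delay_(delay) {}

    // Call just before sending a Logout on operator request.
    void request_manual_disconnect() { manual_ = true; }

    // Call when the operator asks to bring the connection back.
    void request_manual_connect() { manual_ = false; }

    // Call from the "session terminated" notification, possibly on another thread.
    RestartDecision on_terminated() const
    {
        const bool manual = manual_.load();
        return RestartDecision{ !manual, manual ? std::chrono::seconds(0) : delay_ };
    }

private:
    std::chrono::seconds delay_;
    std::atomic<bool> manual_{false};   // atomic because callbacks may run on another thread
};
```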

Our application has two server connections: Order Entry and a drop copy. It also has multiple client connections to different exchanges. If the drop copy has an issue I want to be able to get it going again without stopping the trading, so restarting the server isn't an option.

The handling of the FIX messages is much nicer than what we had in QuickFix, and my testing has shown fix8 to be much faster. This is why we moved from QuickFix, but I think QuickFix handles more of the connect/disconnect issues better, because I don't remember ever needing to spend time in this area of the code while we were using QuickFix. Bringing clients up and down and forcing a crash on servers and clients just seems to work in QuickFix. The changes I'm submitting (on GitHub), along with the questions I'm asking, will hopefully get us to the reliability that we need for our systems. I think we are very close to having a fully implemented solution, but everything will need to go through very intensive QA.






Ian McKane

Apr 11, 2015, 9:48:07 PM
to fix8-s...@googlegroups.com
On Thursday, April 9, 2015 at 10:46:43 PM UTC+10, astern.f...@gmail.com wrote:

A nullptr assignment will delete the object if the object pointer is wrapped inside a std::unique_ptr, as is shown in the connect code. The destructor throws an exception when it calls into Poco and the socket is already in a bad state or disconnected.


 
You cannot test a unique_ptr this way. This variable is clearly a common pointer. You cannot assign a nullptr to a unique_ptr in C++11/14 (however this will be possible in C++17).



if ( m_pAcceptor != nullptr ) 

    delete m_pAcceptor; 



You don't need to test a ptr for null before deleting in C++. 
The session_ptr() method returns a Session pointer, so pAcceptorRouterSessionServerInst is clearly a regular pointer, not a unique_ptr therefore assigning this nullptr will not call the dtor. The dtor is being called elsewhere.
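
The posted snippet contains two similarly named variables: the local one initialised from session_ptr(), and the member that the connect code creates as a std::unique_ptr. A minimal sketch of how the two behave when set to nullptr, using hypothetical stand-in types rather than the Fix8 classes:

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

struct Session {};                       // hypothetical stand-in

struct Instance                          // hypothetical stand-in for the wrapper
{
    static int destroyed;                // counts destructor calls
    Session session;
    Session* session_ptr() { return &session; }
    ~Instance() { ++destroyed; }
};
int Instance::destroyed = 0;

// Returns the destructor calls seen after (first) clearing the raw alias and
// (second) clearing the owning unique_ptr.
std::pair<int, int> destructor_calls_per_step()
{
    const int start = Instance::destroyed;
    std::unique_ptr<Instance> owner(new Instance);

    auto alias = owner->session_ptr();
    static_assert(std::is_same<decltype(alias), Session*>::value,
                  "auto deduces a raw pointer here");
    alias = nullptr;                     // only the local copy changes
    (void)alias;
    const int after_alias = Instance::destroyed - start;

    owner = nullptr;                     // destroys the Instance
    const int after_owner = Instance::destroyed - start;

    return std::make_pair(after_alias, after_owner);
}
```

Which of the two cases applies depends on which variable appears on the left of the assignment.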
 

auto pAcceptorRouterSessionServerInst = m_pAcceptorRouterSessionServerInst->session_ptr();
try
{
    m_pAcceptorRouterSessionServerInst = nullptr;
}


From these code snippets it looks like you have some confusion as to what you are doing. If this isn't the case, your given examples are erroneous.
 

I have read that page multiple times but we don't have a sequence issue. The issue is that the client crashes, which causes Fix8 to send a test request. When it doesn't get a response, since the client is no longer listening, it deletes the session that it was talking to previously. When the client is brought back up it never even attempts to reconnect. I believe this is because the socket connection wrapped inside the deleted session is gone, so nothing is really listening any more. Since the disconnect test request heartbeat code waits just 20% longer than the heartbeat time, it doesn't give us enough time to fix whatever is wrong with the client, since it is usually ~36 seconds and our alert system takes longer than that to notify us that there is a problem.

This is one of the reasons that I've moved in a direction of manual disconnect/connect.  The other is that there are cases where we need to disconnect from the client/server due to operational issues on their side and we don't want to connect until they are ready. 

If the connection isn't killed due to a disconnection that was manually requested I kick off a short timer to wait a bit then start the connection process.  This should sync up many of the sequence number issues with replays where needed.  I've looked at the ReliableClient connection.  It looks like it tries a few times to get connected but still the heartbeat code will disconnect the session if the client crashes.  There is no code for a ReliableServer connection that I can find in SessionWrapper. 

Our application has two server connections: Order Entry and a drop copy. It also has multiple client connections to different exchanges. If the drop copy has an issue I want to be able to get it going again without stopping the trading, so restarting the server isn't an option.

The handling of the FIX messages is much nicer than what we had in QuickFix, and my testing has shown fix8 to be much faster. This is why we moved from QuickFix, but I think QuickFix handles more of the connect/disconnect issues better, because I don't remember ever needing to spend time in this area of the code while we were using QuickFix. Bringing clients up and down and forcing a crash on servers and clients just seems to work in QuickFix. The changes I'm submitting (on GitHub), along with the questions I'm asking, will hopefully get us to the reliability that we need for our systems. I think we are very close to having a fully implemented solution, but everything will need to go through very intensive QA.


 
I don't think anyone in this group is much interested in your views on quickfix vs Fix8. Most of us have plenty of experience with the former.
Ian

Mazz Barker

Apr 12, 2015, 9:05:03 PM
to fix8-s...@googlegroups.com
Fix8 does have a few issues with session reconnect although we have never seen the problems astern.f carries on about. Most FIX engines have some problems in this area.
Regarding the C++ stuff, I had a look at the same code and commented earlier, along the same lines as Ian just has.

By all means, astern.f, use quickfix. If you want a slow, clunky and (IMHO) badly written library. Go ahead, knock yourself out.

Mazz.B

astern.f...@gmail.com

Apr 14, 2015, 8:35:17 AM
to fix8-s...@googlegroups.com
I am attempting to get a project using fix8 into production. I'm not flaming the project, but I am having some very real issues with recovery when clients and servers crash. I have reported what I have seen as I step through the code under GCC 4.8.1. The assignment of nullptr very clearly ends up calling the destructor. I am on this board to both provide and get support with using this open source project.

Mazz Barker

Apr 14, 2015, 8:58:11 AM
to fix8-s...@googlegroups.com, astern.f...@gmail.com
On Tuesday, April 14, 2015 at 10:35:17 PM UTC+10, astern.f...@gmail.com wrote:
I am attempting to get a project using fix8 into production. I'm not flaming the project, but I am having some very real issues with recovery when clients and servers crash. I have reported what I have seen as I step through the code under GCC 4.8.1. The assignment of nullptr very clearly ends up calling the destructor. I am on this board to both provide and get support with using this open source project.

No way. 
Mazz

astern.f...@gmail.com

Apr 14, 2015, 9:30:53 AM
to fix8-s...@googlegroups.com, astern.f...@gmail.com
Not sure what your comment means. If you would like to reproduce my issues I will gladly help you. That way we can work on it together. I'm using RHEL with the toolkit 2.0 that they produce, which contains GCC 4.8.1. My code uses C++11 so I have that switch turned on. Much of the code is based on the samples using the session wrapper. Start with the sample and use CentOS if you don't have Red Hat. Try Ctrl-C on the client and

andrew stern

Apr 14, 2015, 4:55:56 PM
to fix8-s...@googlegroups.com

 make sure you try it with the pthread option since that is related to the issue.

Ian McKane

Apr 14, 2015, 8:51:31 PM
to fix8-s...@googlegroups.com, astern.f...@gmail.com
What Mazz is saying is that there is no way that assigning a nullptr to that regular pointer is calling the destructor. I concur. Since you refuse to heed sound advice from people in the know and stubbornly maintain that the problem is where you insist it is, I don't think you are going to get anyone here to help you.

Let's see what the project maintainers think...
Ian