
GTM replication monitoring


Attila Csikai

Oct 6, 2016, 7:49:47 PM
Hi,



I am working on monitoring GTM replication and I have two questions:

1. I wonder how one can detect the situation where replication (the data flow) stops but both the source and receiver server processes are still running.
mupip replicate -source/receiver -checkhealth shows no problem, and
mupip replicate -source/receiver -showbacklog looks OK as well.
(Well, after a short while the source side starts to accumulate backlog, but that is all.)

The only place an error shows up is the source server log, which records the hard and soft connection attempts.

Is the log the only place to detect a disconnect?

2. My monitoring agent is a C program, and I have identified the following options for getting information about replication status (short of mining the log):
a) running "mupip replicate -source/receiver ..." through a pipe (popen) and capturing the command's output,
b) calling %PEEKBYNAME(), which may provide some of the information.

Unfortunately, option (a) is resource intensive and carries significant overhead. Option (b) is only available from version 6.3, and I would either have to spin up a MUMPS process every time I read the information (significant overhead) or keep one alive to periodically emit the required data, in which case I also need IPC and have to manage that process.
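For concreteness, option (a) would look roughly like this in C. The command is the documented one; the "backlog" search string and output handling are assumptions to be checked against the actual -showbacklog output of your GT.M version:

    #include <stdio.h>
    #include <string.h>

    /* Option (a) sketch: run mupip through a pipe and scan its output.
     * The "backlog" search string is an assumption; match it against
     * the actual -showbacklog output of your GT.M version. */
    int main(void)
    {
        FILE *fp = popen("mupip replicate -source -showbacklog 2>&1", "r");
        char line[1024];
        int found = 0;

        if (fp == NULL)
            return 2;
        while (fgets(line, sizeof line, fp) != NULL)
            if (strstr(line, "backlog") != NULL) {
                fputs(line, stdout);  /* echo the backlog lines */
                found = 1;
            }
        pclose(fp);
        return found ? 0 : 1;
    }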

The best option would be to invoke mupip functionality as a shared library call, but as far as I know that is not possible.

Are there other possibilities?

Thank you,
Attila

K.S. Bhaskar

Oct 7, 2016, 10:04:42 AM
Attila --

It is not possible to call into mupip as a shared library. However, you don't have to spin up a MUMPS process every time: you can call a MUMPS routine from your C program (see Chapter 11, "Integrating External Routines", of the Programmer's Guide; it even has a downloadable working example of calling M code from C code).
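A minimal sketch of such a call-in, assuming a hypothetical M routine (here called replstat) that wraps, say, %PEEKBYNAME() and writes a status string into its output parameter; the routine name and call-in table entry are illustrative, not taken from the Guide:

    #include <stdio.h>
    #include "gtmxc_types.h"

    /* Call-in sketch: invoke a (hypothetical) M routine from C without
     * spawning a new process per query. Requires $GTMCI to point to a
     * call-in table mapping "replstat" to the M entryref, and linking
     * against libgtmshr. */
    int main(void)
    {
        gtm_char_t status[256];

        if (gtm_init() != 0) {        /* start the GT.M runtime once */
            fprintf(stderr, "gtm_init failed\n");
            return 1;
        }
        status[0] = '\0';
        if (gtm_ci("replstat", status) == 0)  /* call M, get status back */
            printf("replication status: %s\n", status);
        gtm_exit();                   /* shut the runtime down cleanly */
        return 0;
    }

The call-in avoids the per-query process-spawn overhead because gtm_init() starts the runtime once inside the monitoring process.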

Did you consider using InfoHub for your monitoring?

Regards
-- Bhaskar

Attila Csikai

Oct 7, 2016, 2:58:19 PM
Bhaskar,


Thank you for bringing up the CI interface. I actually know it and have used it previously, but I obviously overlooked this possibility.

I had only superficial knowledge of InfoHub, but have now taken a more serious look.
However, as I only have to gather data for an existing monitoring tool, I doubt there is room for InfoHub as a whole.
Nevertheless, the IHRRPCmdXXXXXXXXX.m routine is very interesting.

But even in that routine I cannot seem to find the answer to my first question about how to detect a broken replication data flow. I wonder if you could give me a hint.


Thank you,
Attila

K.S. Bhaskar

Oct 10, 2016, 10:14:45 AM
Attila --

You seem to have a situation where a Source Server and a Receiver Server have a connection (or believe they still have one) over which no data flows, which is not a normal situation. You will need to diagnose what went wrong, and for that you will need to look at the logs.

For monitoring, perhaps you can try a command like:

    mupip replic -source -jnlpool -show |& grep "Processing State"

An output like:

    SRC # 0 : Processing State WAITING_FOR_CONNECTION

indicates the source server has not yet connected with a receiver. Once connected, you can expect to see:

    SRC # 0 : Processing State SENDING_JNLRECS
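If it helps, the same check is easy to automate from a C monitoring agent; a sketch, assuming the output keeps the "Processing State <STATE>" format shown above (which may differ across versions):

    #include <stdio.h>
    #include <string.h>

    /* Sketch: poll the source server state via mupip and report whether
     * journal records are flowing. The "Processing State" line format is
     * assumed from the example above. */
    int main(void)
    {
        FILE *fp = popen("mupip replicate -source -jnlpool -show 2>&1", "r");
        char line[512];
        int sending = 0;

        if (fp == NULL)
            return 2;
        while (fgets(line, sizeof line, fp) != NULL)
            if (strstr(line, "Processing State") != NULL)
                sending = (strstr(line, "SENDING_JNLRECS") != NULL);
        pclose(fp);
        puts(sending ? "replicating" : "not replicating");
        return sending ? 0 : 1;
    }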

Below is the list of possible values:

DUMMY_STATE
START
WAITING_FOR_CONNECTION
WAITING_FOR_RESTART
SEARCHING_FOR_RESTART
SENDING_JNLRECS
WAITING_FOR_XON
CHANGING_MODE

Regards
-- Bhaskar

Attila Csikai

Oct 12, 2016, 7:25:24 PM

Bhaskar,


I believe I have a fairly good handle on why the replication stalls.
If you live "in the cloud", you may have very little (actually close to zero) influence on the network gear between your sites and its operation. What I learned is that if we reboot our network equipment, or the service provider decides to rearrange the network, we end up in a situation where the source realizes that the connection is broken and tries to reconnect. Unfortunately, the receiver does not: it keeps the connection open, as one can see with netstat. And since the receiver thinks it has a valid connection, it refuses the connection attempt from the source. A nice deadlock. It can stay that way for a considerable time, even days. We are in the process of upgrading to GTM version 6, and I hope this behavior is fixed.

Well, if not, then I can at least monitor it with the command you provided; thank you for that. I checked the documentation for something similar on the receiver side, but had no luck.

I was having second thoughts about InfoHub, and I wonder where the InfoHub database should reside. The documentation says: "...one InfoHub can monitor multiple data sources, and a single data source can be monitored by multiple InfoHubs."
But it is not clear to me, from either the text or the diagram, whether the data source and the InfoHub database should reside on the same machine.


Attila

K.S. Bhaskar

Oct 13, 2016, 9:25:19 AM
Yes, disconnects are the bane of TCP connections. One side can't just unilaterally drop a connection; there is a handshake ritual to go through to disconnect, and if one side simply vanishes, it can take a significant amount of time for the other side to realize the connection is gone. It's no different from a phone call: if the parties say goodbye and hang up, it's a clean end to the call. But if the line goes quiet, one tends to say "Hello", "Hello", "Hello, can you hear me?" or some such before deciding that the connection has dropped. It's equally frustrating either way. (I think someone once calculated that in theory, it can take a week to drop a TCP connection if one side just goes away.)
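(As an aside: the standard mitigation at the TCP level is keepalive probes, which put an upper bound on how long a half-open connection can linger. A minimal C sketch of the socket options involved; this is generic TCP, not a GT.M setting, and GT.M itself would have to set these on its replication sockets for it to help here. The three tuning knobs are Linux-specific.)

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Generic TCP keepalive sketch (not a GT.M feature): once enabled,
     * the kernel probes an idle peer and drops the connection after a
     * bounded number of unanswered probes, instead of waiting forever. */
    int enable_keepalive(int sockfd)
    {
        int on = 1;
        int idle = 60;      /* seconds of silence before the first probe */
        int interval = 10;  /* seconds between probes */
        int count = 5;      /* unanswered probes before the kernel gives up */

        if (setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
            return -1;
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
            return -1;
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval) < 0)
            return -1;
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
            return -1;
        return 0;
    }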

Unless you are using GT.CM, an InfoHub database should be on the same node as the database being monitored. But it can be a database separate from that of the application being monitored, or it can be part of it (if the latter, I'd suggest putting it in a different region).

Regards
-- Bhaskar

Attila Csikai

Oct 16, 2016, 6:02:52 PM

Bhaskar,


Thank you for the detailed explanation of the communication error. I believe you implied that I should not expect different behavior from a newer version of GTM, as it relies on the TCP stack, which works the way it works.

I have never used GT.CM before so I had to consult the documentation (http://tinco.pair.com/bhaskar/gtm/doc/books/ao/UNIX_manual/webhelp/content/ch13.html) and I also downloaded and read through the functional specification (TB5-021B.doc).

Two questions came up after the reading that you may know the answer to:
1. What is this (proprietary?) GNP protocol?
2. What is the performance of this client-server connection?


Thank you,
Attila

K.S. Bhaskar

Oct 17, 2016, 10:10:43 AM
Attila, GT.CM uses GNP (GT.M Network Protocol) layered on TCP. It's not proprietary; it's open source, and there are non-GT.M clients (e.g., the node.js client in the latest NodeM that David Wicksell just announced, and also Dave Heller's PHP client at https://github.com/dmheller/gtcm_gnp_client).

I can't quantify GT.CM performance for you - since it is layered on TCP, it will be slower than in-process calls. But it's likely to be adequate for InfoHub because I know that production systems used to run on earlier versions of GT.CM.

Regards
-- Bhaskar

Attila Csikai

Oct 18, 2016, 6:57:57 PM

Thank you, Bhaskar, for the information; I will take a look at those implementations.


Attila

Jan Barinka

Nov 29, 2021, 3:53:37 AM
Hello all,

We've run into the same problem of a broken connection between master and slave not being detected by the slave. We are using an older version of GTM, so my question is whether there is anything new regarding this issue in newer versions, for example optional keepalive support or some other solution?

Jan B.

On Thursday, October 13, 2016 at 1:25:24 UTC+2, attila....@gmail.com wrote: