modcluster with tcp


Tom Eicher

Feb 9, 2023, 12:09:06 PM
to WildFly

Hello, I am having a hard time trying to set up a cluster of standalone WildFly (26) instances.

I somewhat followed the "ha-singleton-service" quickstart and all is fine.
Added the distributable tag to the web.xml and the session (including
attributes) is available on all nodes, fails over etc.

Now, our datacenter does not support multicast on the internal (virtual) network,
so I need to switch everything to tcp. (here: virtual network 10.10.10.x)

Singleton-deployment wise (as in "ha-singleton-service"), I changed MPING to
TCPPING, made a jgroups-tcp socket binding, added the property "initial_hosts",
and this works and fails over correctly.

Now for modcluster, I finally figured out the config (it seems to change a lot
across versions); this is how it looks in XML:

<subsystem xmlns="urn:jboss:domain:modcluster:5.0">
    <proxy name="default" proxies="clusterNode1 clusterNode2 clusterNode3" advertise="false" listener="ajp">
...

<socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
    <outbound-socket-binding name="clusterNode1">
        <remote-destination host="10.10.10.2" port="9990"/>
    </outbound-socket-binding>
    <outbound-socket-binding name="clusterNode2">
        <remote-destination host="10.10.10.3" port="9990"/>
    </outbound-socket-binding>
    <outbound-socket-binding name="clusterNode3">
        <remote-destination host="10.10.10.4" port="9990"/>
    </outbound-socket-binding>

Now what port is this supposed to be? The WildFly High Availability Guide uses 9999, which I have not seen elsewhere.
I thought maybe I need to connect to the management port here? (default http 9990)
Or maybe have a "private" binding group. (But I may have only one...)

With config as above, I get
ERROR [org.jboss.modcluster] (UndertowEventHandlerAdapterService - 1) MODCLUSTER000042: Error null sending INFO command to /10.10.10.2:9990, configuration will be reset: null

I probably don't know enough about modcluster.
To what address, port, service in what protocol do the "proxies" need to talk to each other to replicate the session?
Do I need to change infinispan config? (I'm on plain standalone-ha.xml here)

Thanks for any pointers.
Cheers Tom.

Paul Ferraro

Feb 9, 2023, 4:29:55 PM
to WildFly
The "proxies" property should reference an outbound-socket-binding for the MCMP endpoint of the mod_cluster load balancer itself.
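
As a rough illustration of that statement, a modcluster configuration pointing at a load balancer might look like the fragment below. The binding name "mcmp-proxy", the host 10.10.10.100 and the port 6666 are placeholders that do not come from this thread; the real values are whatever address and port the load balancer's MCMP endpoint listens on.

```xml
<!-- Sketch only: "mcmp-proxy", 10.10.10.100 and 6666 are hypothetical placeholders. -->
<subsystem xmlns="urn:jboss:domain:modcluster:5.0">
    <!-- "proxies" names outbound socket bindings that point at the load balancer,
         not at the other WildFly nodes; child elements of <proxy> are omitted here -->
    <proxy name="default" proxies="mcmp-proxy" advertise="false" listener="ajp"/>
</subsystem>

<socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
    <!-- other bindings omitted -->
    <outbound-socket-binding name="mcmp-proxy">
        <remote-destination host="10.10.10.100" port="6666"/>
    </outbound-socket-binding>
</socket-binding-group>
```

The point of the sketch is only the direction of the connection: each WildFly node connects out to the balancer, so pointing "proxies" at the management port of sibling nodes (9990) cannot work.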

Tom Eicher

Feb 10, 2023, 3:17:41 PM
to WildFly
Apologies, if I get this wrong, but,
when you say mod_cluster with an underscore,
I assume you are talking about an external apache httpd server with mod_cluster module or such?

All I am trying to set up, is for 3 wildfly instances on different
machines to replicate/sync the user sessions among themselves.
The way they do automatically out-of-the-box (for a "distributable" war)
when multicast is available; just using tcp.
Am I on the wrong way here with modcluster?

So do I need to tell modcluster or undertow to listen to MCMP messages?
How would I do that?
I searched here
https://docs.wildfly.org/26/wildscribe/subsystem/modcluster/index.html
(and in undertow) but to no avail.

Thanks&Cheers Tom.

Paul Ferraro

Feb 11, 2023, 2:02:18 PM
to WildFly
The modcluster subsystem provides server-side support for using the httpd+mod_cluster load balancer.
Is this what you are trying to configure?


On Friday, February 10, 2023 at 3:17:41 PM UTC-5 Tom Eicher wrote:
> Apologies, if I get this wrong, but,
> when you say mod_cluster with an underscore,
> I assume you are talking about an external apache httpd server with mod_cluster module or such?

Yes. Since you seem to be trying to configure the modcluster subsystem, I assumed you must be using httpd+mod_cluster or Undertow+mod_cluster for load balancing.
Is that correct?

> All I am trying to set up, is for 3 wildfly instances on different
> machines to replicate/sync the user sessions among themselves.

The modcluster subsystem has nothing to do with HttpSession replication, but rather is the server-side support for the httpd+mod_cluster (or Undertow+mod_cluster) load balancer.

> The way they do automatically out-of-the-box (for a "distributable" war)
> when multicast is available; just using tcp.

In your initial post, you said:
"I changed MPING to TCPPING, made a jgroups-tcp socket binding, added property "initial_hosts", and this works and fails-over correctly."

This is all you should need to do to get HttpSession replication working on a TCP-only network.

> Am I on the wrong way here with modcluster?

Are you actually using mod_cluster?  If not, you don't need to touch the "modcluster" subsystem (in fact, you can remove it altogether).

> So do I need to tell modcluster or undertow to listen to MCMP messages?

MCMP messages are *sent* by the WildFly server *to* the httpd or Undertow process.
If you are actually using mod_cluster, it would normally use UDP multicast to advertise its endpoint to any listening WildFly instance (which listens via the advertise socket-binding, as configured in the modcluster subsystem).
The alternative is to configure the modcluster subsystem with the actual mod_cluster MCMP endpoint, which you would do via the "proxies" attribute, as I explained in my initial message.

Assuming this was working already, your mod_cluster process (whether httpd or Undertow) should already be set up to listen to MCMP messages, which are always TCP-based.

Tom Eicher

Feb 11, 2023, 5:28:51 PM
to WildFly

Hi Paul, thanks for your patience.

So it seems modcluster was a red herring for me.
I just want ha-singleton and session replication.

Ok, let me try again please:

I have a working ha-singleton failover between hosts node1 10.10.10.2 and node2 10.10.10.3:
21:20:59,325 INFO  [org.infinispan.CLUSTER] (thread-11,ejb,node1) [Context=default] ISPN100010: Finished rebalance with members [node1, node2], topology id 5

Now I log into a new web session of node1.
Then I access the http port of node2. Often the node2 has no session and acts completely independently.
Sometimes node2 stalls for 15s then says:

21:21:45,163 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (non-blocking-thread--p8-t15) ISPN000136: Error executing command GetKeyValueCommand on Cache 'xxxportal.ear.xxxportal-war.war', writing keys []: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 15 seconds for key SessionCreationMetaDataKey(ZrZPPP5MOuDVxXokBfAO8bZ-MmAHdlapH9GpG9ud) and requestor GlobalTransaction{id=13, addr=node2, remote=false, xid=null, internalId=-1}. Lock is held by GlobalTransaction{id=12, addr=node2, remote=false, xid=null, internalId=-1}

What could be the cause?

I also saw your post of 2021 on such a message and changed isolation REPEATABLE_READ to READ_COMMITTED.
Now it seems I don't get this exception anymore, but still the nodes share no session.
(using udp setup, I was able to change e.g. a filter in a list of my webapp, and the other node would reflect this upon reload, so I know the app could do it)

node1 can connect to 10.10.10.3 port 7600 alright, other direction also, network-wise...
(shutting down node2 I just saw
22:22:28,408 DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-10,ejb,node2) node2: socket to node1 was closed gracefully
so I guess the network is really ok)

config: (pls let me know if you need to see more)

        <subsystem xmlns="urn:jboss:domain:jgroups:8.0">
            <channels default="ee">
                <channel name="ee" stack="tcp" cluster="ejb"/>
            </channels>
            <stacks>
                <stack name="tcp">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <socket-protocol type="TCPPING" socket-binding="jgroups-tcp">
                        <property name="initial_hosts">10.10.10.2[7600],10.10.10.3[7600],10.10.10.4[7600]</property>
                    </socket-protocol>
                    <protocol type="MERGE3"/>
                    <socket-protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                    <protocol type="FD_ALL"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS"/>
                    <protocol type="MFC"/>
                    <protocol type="FRAG3"/>
                </stack>
            </stacks>
        </subsystem>


    <socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
        <socket-binding name="ajp" port="${jboss.ajp.port:8009}"/>
        <socket-binding name="http" port="${jboss.http.port:8080}"/>
        <socket-binding name="https" port="${jboss.https.port:8443}"/>
        <socket-binding name="jgroups-tcp" interface="private" port="7600"/>
        <socket-binding name="jgroups-tcp-fd" interface="private" port="57600"/>
...

As I was unable to find a current working example for wf26, I had to patch together stuff I found for older
versions and adapt the syntax/schema. Maybe my tcp stack is incomplete?
I also initially had a different socket-binding="jgroups-tcpping" on the socket-protocol,
but such a binding is defined nowhere, so I changed it to jgroups-tcp. Do I need a dedicated one?

BTW I start using
./bin/standalone.sh --debug 18787 -Djboss.server.base.dir=standalone --server-config=standalone.xml -Djboss.socket.binding.port-offset=0 -Djboss.node.name=node1 -Djboss.bind.address.private=10.10.10.2 -Djboss.bind.address.management=10.10.10.2
(my standalone.xml is really a standalone-ha.xml)

Does this make any sense to you?

Thanks & Cheers Tom.



Paul Ferraro

Feb 14, 2023, 7:27:50 AM
to WildFly
Questions inline:
On Saturday, February 11, 2023 at 5:28:51 PM UTC-5 Tom Eicher wrote:

> Hi Paul, thanks for your patience.
>
> So it seems modcluster was a red herring for me.
> I just want ha-singleton and session replication.
>
> Ok, let me try again please:
>
> I have a working ha-singleton failover between hosts node1 10.10.10.2 and node2 10.10.10.3:
> 21:20:59,325 INFO  [org.infinispan.CLUSTER] (thread-11,ejb,node1) [Context=default] ISPN100010: Finished rebalance with members [node1, node2], topology id 5
>
> Now I log into a new web session of node1.
> Then I access the http port of node2. Often the node2 has no session and acts completely independently.

When you say, "often the node2 has no session", do you mean to say that sometimes it does?

> Sometimes node2 stalls for 15s then says:
>
> 21:21:45,163 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (non-blocking-thread--p8-t15) ISPN000136: Error executing command GetKeyValueCommand on Cache 'xxxportal.ear.xxxportal-war.war', writing keys []: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 15 seconds for key SessionCreationMetaDataKey(ZrZPPP5MOuDVxXokBfAO8bZ-MmAHdlapH9GpG9ud) and requestor GlobalTransaction{id=13, addr=node2, remote=false, xid=null, internalId=-1}. Lock is held by GlobalTransaction{id=12, addr=node2, remote=false, xid=null, internalId=-1}
> What could be the cause?

This suggests that replication *is* working, but that your request lifecycle is not completing normally.
In general, session replication is triggered after the response is committed.  The transactional REPEATABLE_READ semantics of the cache are meant to ensure that, when a subsequent request arrives for the same session (potentially on a different cluster member), the replication from the previous request has completed, and in this way prevent any stale reads.
I suspect that your application relies on asynchronous servlet API behavior - is this the case?
When using the async servlet API, the transactional lock will be held on the Infinispan cache until the async context is completed.  This would explain why you are hitting lock acquisition timeouts.

> I also saw your post of 2021 on such a message and changed isolation REPEATABLE_READ to READ_COMMITTED.

For applications that rely on async behavior, it is generally best to disable transactions altogether, and rely on your load balancer's session affinity behavior to avoid concurrent access to the same session by multiple cluster members.
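
A minimal sketch of what disabling transactions could look like, assuming the session cache is the stock distributed cache named "dist" inside the "web" cache container of standalone-ha.xml (check the names in your own file, they are an assumption here). Only the relevant element is shown.

```xml
<!-- Sketch only: the cache name "dist" assumes the stock standalone-ha.xml;
     the stock file typically has <transaction mode="BATCH"/> here. -->
<distributed-cache name="dist">
    <!-- mode="NONE" turns off transactions for the session cache -->
    <transaction mode="NONE"/>
    <!-- remaining child elements unchanged -->
</distributed-cache>
```

Without transactional locking, nothing prevents two members from working on the same session at once, which is why this setup relies on the load balancer's session affinity.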
 
> Now it seems I don't get this exception anymore, but still the nodes share no session.

READ_COMMITTED isolation allows for lock-free reads.  That is why you don't get this exception.

Your JGroups configuration generally looks fine.  TCPPING is not a "socket-protocol", i.e. it does not open a server socket, and thus does not need a socket-binding.
It is, however, not terribly convenient, as it lacks any dynamism.
I might suggest something more dynamic, such as DNS_PING, which leverages your LAN's DNS server as a cluster membership registry.
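
The two options could be sketched as follows inside the "tcp" stack. Variant A reuses the initial_hosts value posted earlier in the thread; in variant B, the DNS name "cluster.mydomain.example" is a hypothetical placeholder for a record that would have to resolve to the addresses of all cluster members.

```xml
<!-- Variant A: TCPPING declared as a plain protocol, without a socket-binding -->
<protocol type="TCPPING">
    <property name="initial_hosts">10.10.10.2[7600],10.10.10.3[7600],10.10.10.4[7600]</property>
</protocol>

<!-- Variant B (instead of A): DNS-based discovery.
     "cluster.mydomain.example" is a hypothetical name, not taken from this thread. -->
<protocol type="dns.DNS_PING">
    <property name="dns_query">cluster.mydomain.example</property>
</protocol>
```

Only one discovery protocol should be present in the stack, so these are alternatives rather than a combined configuration.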
 
> as I was unable to find a current working example for wf26, had to patch together stuff I found for older
> versions and adapt the syntax/schema. maybe my tcpstack is incomplete?

The initial_hosts property of the TCPPING protocol only needs to locate the cluster coordinator (i.e. the first node to join the cluster), so it generally does not need to be complete.
If you must use TCPPING and you are not running in a virtualized environment, I would suggest using it with persistence enabled, e.g. http://jgroups.org/manual5/index.html#_pdc_persistent_discovery_cache
That way, the protocol will record all known members locally, and they will become available again after a restart, which can save you some hassle trying to keep initial_hosts up to date if your coordinator changes over time.
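
A hedged sketch of how PDC might be added to the stack from this thread, based on the linked manual section, which places PDC between the transport and the discovery protocol. The cache_dir path is a hypothetical example, and the discovery protocol may need further settings to consult the cache, so treat this as a starting point and check the manual for the JGroups version your WildFly ships.

```xml
<stack name="tcp">
    <transport type="TCP" socket-binding="jgroups-tcp"/>
    <!-- PDC goes between the transport and the discovery protocol;
         the cache_dir path is a hypothetical example -->
    <protocol type="PDC">
        <property name="cache_dir">/var/lib/wildfly/jgroups-pdc</property>
    </protocol>
    <protocol type="TCPPING">
        <property name="initial_hosts">10.10.10.2[7600],10.10.10.3[7600],10.10.10.4[7600]</property>
    </protocol>
    <!-- remaining protocols unchanged -->
</stack>
```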

Tom Eicher

Feb 24, 2023, 10:31:36 AM
to WildFly
Hello Paul and all,

after going into this more deeply, I can confirm your diagnosis: everything, including session distribution, works as expected once the cluster is logged as correctly formed.

What was missing in my test is the fact that of course node2 needs the session cookie of node1, so it can find out which session to fail over to.
So calling the app with an IP address in the URL is a bad idea; rather, we need a common mydomain.org suffix on each node1/node2 hostname,
plus, in the undertow config's servlet-container, we need a <session-cookie name="MYCOOKIE" domain=".mydomain.org"/>
And TADA the failover works.
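
For future readers, the placement of that element could look roughly like this. The surrounding child elements are what a stock servlet-container usually contains and are an assumption here; only the session-cookie line comes from the description above.

```xml
<servlet-container name="default">
    <jsp-config/>
    <!-- cookie name and domain as described above; the domain makes the browser
         send the same session cookie to node1.mydomain.org and node2.mydomain.org -->
    <session-cookie name="MYCOOKIE" domain=".mydomain.org"/>
    <websockets/>
</servlet-container>
```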

The diagnosis of the cache isolation is also correct; it seems we need READ_COMMITTED when Apache Wicket is used as the UI rendering technology.

Did describe this in detail for future readers, so thanks & Cheers
Tom.