Corus Auto Discovery

Laurence Toenjes

Sep 14, 2011, 10:41:17 AM
to sapia-support
Hi,

Special notes:

On all my Windows machines I have turned off the firewall software to make troubleshooting easier.

I'm using the same Corus domain name, "home", on every machine instead of the default (e.g. corus -d home).

Problem 1

I'm not sure but I think I may have been seeing some Corus auto discovery issues.

It's also possible that I'm not seeing an auto discovery issue because I have not been waiting long enough.

Anyway, I've been using a MacBook Air with 64-bit Java and 2 Windows 7 notebooks with 32-bit Java, and I sometimes get different results when I issue the hosts command.

Since I have 3 computers I should see three hosts. Sometimes the hosts command lists only one, even though all 3 computers are running the Corus server.

I've played around with the order in which I start the Corus server on each of the three machines, and the startup order affects what I see from the hosts command.

Problem 2

A Vista machine (32-bit Java, 2 NICs) is living in its own Corus world: it can only see Corus servers running on that same machine.

I was also testing a 4th machine (Vista running 32-bit Java), and the first problem I ran into was that Corus defaulted to a virtual NIC installed by VMware (192.168.x.x). I fixed that with a config file change to specify a preferred address of something like 10.0.1.x. For whatever reason this machine is still in its own world: it cannot see the other Corus servers on my network, and they cannot see it (all machines can ping each other and have a 10.0.1.x address).

Just to recap, this machine has at least 2 NICs:
1) the VMware one at 192.168.x.x
2) the standard one on my home network at 10.0.1.x
and I modified my Corus config file to make 10.0.1.x the preferred address for Corus.

Any ideas about my 2 problems?

Thanks,

Laurence

Yanick Duchesne

Sep 14, 2011, 12:36:27 PM
to sapia-...@googlegroups.com
Hi Laurence,

What version of Corus are you using? The 1.4.x version may require some tweaks in the context that you're describing. Version 2.x fixes that. I recommend you switch to 2.x if it's not already the one you're using.

Please let us know.
--
Yanick Duchesne

Laurence

Sep 14, 2011, 2:25:26 PM
to sapia-...@googlegroups.com
Hi Yanick,

I'm using the 2.x version.

I fully admit I have not read all the Corus documentation yet, but I felt I had reached the point where I understood the basics, except for the somewhat random auto discovery issue I think I'm having. Or maybe it's not a problem at all and what I'm seeing is how Corus is supposed to work: the last Corus instance started on a local network is visible to the instances already running, but may not be able to see those already running instances itself.

I've discovered a little bit more since I sent the last email.

To get my Vista machine to actually work with Corus I had to disable 2 extra NICs. Before disabling them I had modified a Corus config file to use my preferred network address of 10.0.1.x. When I started the Corus server after the config change, and started coruscli with the command-line switches pointing at the address of the Corus server I had just started on the same Vista box, everything looked like it should work (everything was using 10.0.1.x addresses), but it did not.

Once I disabled the extra NICs, Corus worked (apart from the frequent auto discovery problems I think I'm having, which I resolve by killing the Corus server that the other Corus servers can't see and starting it again).

What I don't know is if I'm seeing some apparently random issues because I'm not waiting long enough for the auto discovery process to work.

The one trend I think I'm starting to notice is that my Windows machines have more trouble with auto discovery than my MacBook Air (OS X Leopard, 64-bit Java). So far, a new Corus instance started on a Windows machine seems to discover Corus running on OS X but not the existing Corus server instances running on Windows machines.

I just got done waiting for an XP machine I have to finish rebooting and just started Corus on it and it auto recognized existing Corus servers on my network.

After I started Corus on my XP machine, I went over to my Vista machine and it did discover the XP instance of Corus.  When I restarted the Corus server on my Vista machine and then issued the hosts command it only saw itself.

I then restarted Corus on my XP machine and then it only saw the Windows machines on my network meaning it did not see Corus on my MacBook Air until I restarted Corus on my MacBook Air.

I then restarted Corus on my XP machine and then when I used the hosts command it only saw itself.

I have tried a variety of other scenarios, and auto discovery of existing Corus instances on other machines on the same network seems to work best in this order:
1) OS X
2) XP
3) Vista
4) Windows 7

Ideally, for better Corus auto discovery testing, I need 2 machines with the same OS.

I was thinking of using Corus as an automation tool for testing, for monitoring, or for routine sysadmin tasks, where each Corus server instance would run on a different machine on a local network (even for Windows machines). Since machines get rebooted periodically for one reason or another, I was hoping Corus auto discovery would always recognize the instances already running on the network. It seems that a Corus server will see any other Corus server started after it, but a newly started Corus server will not always see the instances already running on other machines. For general sysadmin tasks I was hoping I could start a Corus instance, start coruscli, and then interact with all the other machines on my local network running Corus.

Another use I could see for Corus, if I can get auto discovery to recognize existing Corus server instances more reliably, is an environment where a group of developers all work on the same local network with shared test servers, and every machine runs Corus automatically. In that kind of environment people come to work at different times, reboot their machines at different times, and sometimes reboot the shared test servers, and it would be nice if you could walk up to any machine on the local network and use coruscli to access every other Corus server instance.

It seems my workaround is to come up with a way to find the longest-running Corus server instance on my local network and use it as the master controller, and if something happens to that machine or Corus server instance, fall back to the next-oldest instance (perhaps by creating my own Corus launcher that records its launch time in a database accessible to all machines on my network).
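
The selection step of that workaround could look roughly like the sketch below. It assumes each launcher has already written its host and launch time to some shared store; the Launch type, the method names, and the sample host addresses are all hypothetical, and none of this is part of Corus.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the workaround; none of these types exist in Corus.
class OldestInstanceSketch {

    // One row of the (assumed) shared launch-time table.
    static final class Launch {
        final String host;
        final long startedAtMillis;

        Launch(String host, long startedAtMillis) {
            this.host = host;
            this.startedAtMillis = startedAtMillis;
        }
    }

    // Returns the host whose instance has been running longest, or null if there are no rows.
    static String oldest(List<Launch> launches) {
        Launch best = null;
        for (Launch launch : launches) {
            if (best == null || launch.startedAtMillis < best.startedAtMillis) {
                best = launch;
            }
        }
        return best == null ? null : best.host;
    }

    public static void main(String[] args) {
        // Made-up sample data: three hosts with different launch times.
        List<Launch> launches = Arrays.asList(
            new Launch("10.0.1.13", 3000L),
            new Launch("10.0.1.11", 1000L),
            new Launch("10.0.1.12", 2000L));
        System.out.println("Use as controller: " + oldest(launches));
    }
}
```

If the chosen machine goes away, removing its row and calling oldest again yields the next-oldest instance.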

I may be misunderstanding what Corus is supposed to be capable of. Before discovering Corus I had thought of just creating my own xmlrpc servers to do pretty much whatever I wanted (I think xmlrpc is a nice, simple way to get things done across a variety of programming languages and operating systems). The auto discovery feature of Corus is what made me think twice about reinventing the wheel when a better one was already built.

I appreciate any tips you could give me about Corus auto discovery or limitations of it.

Thanks,

Laurence

p.s. It sure is strange: the very first time I installed Corus on an XP machine, rebooted, and started Corus, it saw all running instances of Corus on my local network. After that it only sees itself until I start a new Corus instance on another machine.

Yanick Duchesne

Sep 14, 2011, 8:55:09 PM
to sapia-...@googlegroups.com
Corus discovery should work regardless of startup order. You're stating in your previous post: "I modified my corus config file to make the 10.0.1.x the preferred address for Corus". Can you just confirm which configuration parameter you've modified - and what value you've set it to?

Also, do you have problems when starting multiple Corus servers on the same machine? You can start in non-daemon mode and type CTRL-C to exit. The only thing is that you must make sure each Corus instance has its own port. Use the -p option to specify it. For example:

corus -p 33000
corus -p 33001
etc.
--
Yanick Duchesne

Yanick Duchesne

Sep 14, 2011, 10:28:21 PM
to sapia-...@googlegroups.com
OK, I could reproduce it and have fixed it in the trunk. I recommend you check out from trunk and compile (http://code.google.com/p/sapia/source/checkout). The project uses Maven, so you just type mvn install at the checkout root (of course you must install Maven first if you haven't already). The distribution .tar or .zip is generated by Maven as part of the build, under modules/server/target. You then just follow the normal procedure to install it.

Make sure you have the repositories below configured in your Maven settings (at $USER_HOME/.m2/settings.xml) prior to building - the snippet below is an excerpt of my own settings:

    <repositories>
      <repository>
        <id>java.net-m2-repository</id>
        <name>Java.net Repository for Maven</name>
        <url>http://download.java.net/maven/2</url>
      </repository>   
      <repository>
        <id>sapia-m2-repository</id>
        <name>Sapia Repository for Maven</name>
        <url>http://www.sapia-oss.org/maven2</url>
      </repository>
    </repositories>
    <pluginRepositories>
      <pluginRepository>
        <id>java.net-m2-repository</id>
        <name>Java.net Repository for Maven</name>
        <url>http://download.java.net/maven/2</url>
      </pluginRepository>
      <pluginRepository>
        <id>sapia-m2-repository</id>
        <name>Sapia Repository for Maven</name>
        <url>http://www.sapia-oss.org/maven2</url>
      </pluginRepository>
    </pluginRepositories>

The trunk also has other improvements that we were planning to release soon, namely some command shortcuts (such as "kill all", "undeploy all"...), lock file creation (to prevent starting 2 Corus servers with the same port on the same machine), etc.

Keep us posted.
--
Yanick Duchesne

Laurence

Sep 15, 2011, 12:42:51 AM
to sapia-...@googlegroups.com
Yanick,

Just to make sure, am I starting Corus correctly when I type corus -d home from the command line on OS X or Windows?

Just to recap my testing: I'm only running 1 Corus instance per machine (not multiple instances on different ports on the same machine), using the default port of 33000 and the domain name home instead of the default one named default. All my Corus instances are on a network using 10.0.1.x addresses (the 10.0.1.x addresses are the defaults handed out by an Apple wireless router).

I did double check my config setting. It was in the corus.properties file,
and it was this line that I uncommented and set to use a 10.0.1.x address:
#corus.server.address.pattern=192\\.168\\.\\d+\\.\\d+
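
For readers unfamiliar with this kind of property: the value is a regular expression, and the backslashes are doubled because a backslash is an escape character in Java .properties files. The sketch below only illustrates how such a pattern could be matched against the addresses of the local network interfaces. It is not Corus's actual selection code, and the class and method names are made up for the example.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not taken from the Corus source.
class AddressPatternSketch {

    // Returns the addresses that fully match the given regular expression.
    static List<String> matching(String regex, List<String> addresses) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<String>();
        for (String address : addresses) {
            if (pattern.matcher(address).matches()) {
                result.add(address);
            }
        }
        return result;
    }

    // Collects the textual addresses of every local network interface.
    static List<String> localAddresses() throws SocketException {
        List<String> result = new ArrayList<String>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                result.add(address.getHostAddress());
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        // A Java string literal needs doubled backslashes too, so this is
        // the regular expression 10\.0\.1\.\d+
        String regex = "10\\.0\\.1\\.\\d+";
        System.out.println(matching(regex, localAddresses()));
    }
}
```

On a machine with both a 192.168.x.x virtual NIC and a 10.0.1.x NIC, only the 10.0.1.x address would pass this filter.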

Since I disabled the 2 other NICs on my Vista machine, I have switched back to the corus.properties file that came with the 2.x Corus download for Windows, and Corus basically works apart from the auto discovery problem I'm having.

For my own evaluation of Corus I'm running 1 Corus instance per machine, using the default port of 33000 and the -d home option, and I also start one instance of coruscli with the -d home option on each machine (I'm testing this on a number of extra machines I have at home). My understanding is that when I start coruscli (with the -d home option) I should see all the Corus servers running in my "home" domain with the hosts command. What I'm actually seeing is that the oldest running Corus instance sees all Corus instances started after it, but a newly started Corus instance might not see any of the instances already running. Whenever I know for a fact that a running Corus instance cannot be seen by another machine, I kill that instance and start it again, and it is then seen by all the other machines already running Corus (but the problem is that the instance that was just restarted is not guaranteed to see all the Corus instances already running).

I can still get some use out of Corus for my own testing even if auto discovery doesn't always work, by making sure that whatever machine I happen to be using the most for testing at the moment has its Corus instance started before the Corus instances on all the other machines on my local network.

Maybe another thing I could try is to use a different router at home (I really don't know anything about multicast networking).

Laurence

Yanick Duchesne

Sep 15, 2011, 7:06:31 AM
to sapia-...@googlegroups.com
Hi Laurence,

Yes, to specify the domain you use the -d switch. Your testing procedure is right on.

The trunk version of Corus has the fix for the misbehavior you mention (see my previous post for instructions on how to build from source; you need to check out from trunk/corus in the SVN repo on Google Code). We will do a release shortly, but in the meantime use the trunk.

Also, with that version, using corus.server.address.pattern is not required: Corus will bind to all network interfaces by default. You use the property only to enforce binding to a specific network interface.
--
Yanick Duchesne

Yanick Duchesne

Sep 15, 2011, 7:41:48 AM
to sapia-...@googlegroups.com

Hi Laurence,

One more detail about the build: you need Grails 1.3.3 installed and Maven 2.x. The Grails 1.3.3 Maven plugin fails with Maven 3.

Since it's getting complicated, we will do a release later today with the required fix. This will ease your pains.

BTW, the bug has nothing to do with multicast. It's rather an issue with discovery logic.

--
Yanick Duchesne

Laurence

Sep 15, 2011, 10:50:22 AM
to sapia-...@googlegroups.com
Hi Yanick,

Great news.  I'm really excited about using Corus now.  This auto discovery feature has some really neat uses.  

Another scenario I can think of is when you need to test some kind of process on a completely clean machine (e.g. a VMware virtual machine with a fresh OS install and the basic Java tools like the JDK, Ant, Maven, and of course a Corus instance that starts automatically when the machine finishes booting).

Obviously booting an OS under VMware takes some time, but with a pool of VMware machines running Corus you could always identify a fresh, ready-to-use Corus instance on a newly booted VMware machine (this could be a huge time saver when you really need an accurate, repeatable test of something running for the first time on a clean machine, even for non-Java stuff like testing a Windows application installer).

Laurence

Laurence

Sep 15, 2011, 11:06:35 AM
to sapia-...@googlegroups.com
Yanick,

Good to know about that extra build info, especially about Maven 3 and Grails for my own non Corus related reasons (I'm a Groovy fan).

I'll probably just wait for your next Corus release with the auto discovery fix.

Thanks again for your responsiveness with this auto discovery issue.

Laurence

Laurence Toenjes

Sep 15, 2011, 5:36:13 PM
to sapia-...@googlegroups.com
Yanick,

I tried the Corus trunk build on a couple of Windows machines and they
did not see each other.

...

2011.09.15@17:45:17:812 DEBUG[org.sapia.corus.http.HttpModuleImpl]:
Adding HTTP extension under /deployer:
org.sapia.corus.deployer.DeployerExtension@21e554
Exception in thread "-1314816922086_-7491878183500006170Unicast@home"
java.util.MissingFormatArgumentException: Format specifier 's'
at java.util.Formatter.format(Formatter.java:2432)
at java.util.Formatter.format(Formatter.java:2367)
at java.lang.String.format(String.java:2769)
at org.sapia.corus.cluster.ClusterManagerImpl.onAsyncEvent(ClusterManagerImpl.java:135)
at org.sapia.ubik.mcast.EventConsumer.notifyAsyncListeners(EventConsumer.java:318)
at org.sapia.ubik.mcast.EventConsumer.onAsyncEvent(EventConsumer.java:230)
at org.sapia.ubik.mcast.UDPUnicastDispatcher.handle(UDPUnicastDispatcher.java:274)
at org.sapia.ubik.mcast.server.UDPServer.run(UDPServer.java:57)

...

Laurence


C:\utils\corus\corus\modules\server
>echo off
log4j:WARN No appenders could be found for logger
(org.springframework.core.CollectionFactory).
log4j:WARN Please initialize the log4j system properly.
2011.09.15@17:45:17:125 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.db.DbModuleImpl as module
org.sapia.corus.client.services.db.DbModule
2011.09.15@17:45:17:140 DEBUG[org.sapia.corus.db.DbModuleImpl]: DB
module directory C:\utils\corus\corus\modules\server\db\home_33000
2011.09.15@17:45:17:203 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.os.OsModuleImpl as module
org.sapia.corus.client.services.os.OsModule
2011.09.15@17:45:17:203 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.file.FileSystemModuleImpl as module
org.sapia.corus.client.services.file.FileSystemModule
2011.09.15@17:45:17:218 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.configurator.ConfiguratorImpl as module
org.sapia.corus.client.services.configurator.Configurator
2011.09.15@17:45:17:265 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.taskmanager.CorusTaskManagerImpl as module
org.sapia.corus.taskmanager.core.TaskManager
2011.09.15@17:45:17:265 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.taskmanager.CorusTaskManagerImpl as module
org.sapia.corus.taskmanager.CorusTaskManager
2011.09.15@17:45:17:281 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.event.EventDispatcherImpl as module
org.sapia.corus.client.services.event.EventDispatcher
2011.09.15@17:45:17:296 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.http.HttpModuleImpl as module
org.sapia.corus.client.services.http.HttpModule
2011.09.15@17:45:17:312 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.port.PortManagerImpl as module
org.sapia.corus.client.services.port.PortManager
2011.09.15@17:45:17:406 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.cluster.ClusterManagerImpl as module
org.sapia.corus.client.services.cluster.ClusterManager
2011.09.15@17:45:17:453
INFO[org.sapia.corus.cluster.ClusterManagerImpl]: Signaling presence
to cluster on: 231.173.5.7:5454
2011.09.15@17:45:17:453 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.naming.JndiModuleImpl as module
org.sapia.corus.client.services.naming.JndiModule
2011.09.15@17:45:17:453 DEBUG[EventDispatcher]: Adding interceptor:
org.sapia.corus.naming.JndiModuleImpl@5a67c9 for event type: class
org.sapia.corus.core.ServerStartedEvent
2011.09.15@17:45:17:500 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.deployer.DeployerImpl as module
org.sapia.corus.client.services.deployer.Deployer
2011.09.15@17:45:17:500 DEBUG[org.sapia.corus.deployer.DeployerImpl]:
Deploy dir: C:\utils\corus\corus\modules\server\deploy\home_33000
2011.09.15@17:45:17:500 DEBUG[org.sapia.corus.deployer.DeployerImpl]:
Temporary dir: C:\utils\corus\corus\modules\server\tmp\home_33000
2011.09.15@17:45:17:500 INFO[org.sapia.corus.deployer.DeployerImpl]:
Initializing: rebuilding distribution objects
2011.09.15@17:45:17:515 INFO[org.sapia.corus.deployer.DeployerImpl]:
Distribution objects succesfully rebuilt
2011.09.15@17:45:17:515 DEBUG[EventDispatcher]: Adding interceptor:
org.sapia.corus.deployer.DeployerImpl@16546ef for event type: class
org.sapia.corus.core.ServerStartedEvent
2011.09.15@17:45:17:515
INFO[org.sapia.corus.taskmanager.CorusTaskManagerImpl]: BuildDistTask
>> No distributions
2011.09.15@17:45:17:546 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.processor.ProcessorImpl as module
org.sapia.corus.client.services.processor.Processor
2011.09.15@17:45:17:593 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.cron.CronModuleImpl as module
org.sapia.corus.client.services.cron.CronModule
2011.09.15@17:45:17:609 DEBUG[ModulePostProcessor]: Binding
org.sapia.corus.security.SecurityModuleImpl as module
org.sapia.corus.client.services.security.SecurityModule
2011.09.15@17:45:17:609
INFO[org.sapia.corus.security.SecurityModuleImpl]: Initializing the
security module
2011.09.15@17:45:17:687 DEBUG[org.sapia.corus.http.HttpModuleImpl]:
Adding HTTP extension under /files:
org.sapia.corus.http.filesystem.FileSystemExtension@c623af
2011.09.15@17:45:17:687 DEBUG[org.sapia.corus.http.HttpModuleImpl]:
Adding HTTP extension under /jmx:
org.sapia.corus.http.jmx.JmxExtension@e753
2011.09.15@17:45:17:781 DEBUG[org.sapia.corus.http.HttpModuleImpl]:
Adding HTTP extension under /interop/soap:
org.sapia.corus.http.interop.SoapExtension@17cec96
2011.09.15@17:45:17:781 INFO[org.sapia.corus.http.HttpModuleImpl]:
Starting http extension manager
2011.09.15@17:45:17:796 DEBUG[org.sapia.corus.http.HttpModuleImpl]:
Adding HTTP extension under /processor:
org.sapia.corus.processor.ProcessorExtension@1e13e07
2011.09.15@17:45:17:796
DEBUG[org.sapia.corus.taskmanager.CorusTaskManagerImpl]:
ProcessCheckTask >> Checking for stale processes...
2011.09.15@17:45:17:796
DEBUG[org.sapia.corus.taskmanager.CorusTaskManagerImpl]:
ProcessCheckTask >> Stale process check finished
2011.09.15@17:45:17:796 DEBUG[EventDispatcher]: Adding interceptor:
org.sapia.corus.processor.ProcessorImpl$ProcessorInterceptor@1b34126
for event type: class
org.sapia.corus.client.services.deployer.event.UndeploymentEvent
2011.09.15@17:45:17:796
INFO[org.sapia.corus.security.SecurityModuleImpl]: Starting the
security module
Corus server (2.1-SNAPSHOT) started on: [ host=10.0.1.13, port=33000,
type=tcp/socket ]:33000, domain: home
2011.09.15@17:45:17:812 DEBUG[org.sapia.corus.deployer.DeployerImpl]:
Creating mplex socket connector to accept deployment connections
2011.09.15@17:45:17:812 DEBUG[org.sapia.corus.deployer.DeployerImpl]:
Starting mplex deployment acceptor thread
2011.09.15@17:45:17:812 DEBUG[org.sapia.corus.deployer.DeployerImpl]:
Deployment acceptor thread started
2011.09.15@17:45:17:812 DEBUG[org.sapia.corus.http.HttpModuleImpl]:
Adding HTTP extension under /deployer:
org.sapia.corus.deployer.DeployerExtension@21e554
Exception in thread "-1314816922086_-7491878183500006170Unicast@home"
java.util.MissingFormatArgumentException: Format specifier 's'
at java.util.Formatter.format(Formatter.java:2432)
at java.util.Formatter.format(Formatter.java:2367)
at java.lang.String.format(String.java:2769)
at org.sapia.corus.cluster.ClusterManagerImpl.onAsyncEvent(ClusterManagerImpl.java:135)
at org.sapia.ubik.mcast.EventConsumer.notifyAsyncListeners(EventConsumer.java:318)
at org.sapia.ubik.mcast.EventConsumer.onAsyncEvent(EventConsumer.java:230)
at org.sapia.ubik.mcast.UDPUnicastDispatcher.handle(UDPUnicastDispatcher.java:274)
at org.sapia.ubik.mcast.server.UDPServer.run(UDPServer.java:57)
2011.09.15@17:45:27:562
INFO[org.sapia.corus.security.SecurityModuleImpl]: Stopping the
security module
2011.09.15@17:45:27:562 INFO[org.sapia.corus.deployer.DeployerImpl]:
Could not accept client connection; server probably shutting down
java.net.SocketException: No socket available - the socket queue is
closed
at org.sapia.ubik.net.mplex.SocketQueue.getSocket(SocketQueue.java:94)
at org.sapia.ubik.net.mplex.SocketConnectorImpl.accept(SocketConnectorImpl.java:154)
at org.sapia.corus.deployer.transport.mplex.AcceptorThread.run(AcceptorThread.java:62)
at java.lang.Thread.run(Thread.java:662)

Terminate batch job (Y/N)?

On 9/15/11, Yanick Duchesne <yanickd...@gmail.com> wrote:
> Hi Laurence,
>
> One more detail about the build: you need Grails 1.3.3 installed and
> Maven 2.x. The Grails 1.3.3 Maven plugin fails with Maven 3.
>
> Since it's getting complicated, we will do a release later today with the
> required fix. This will ease your pains.
>
> BTW, the bug has nothing to do with multicast. It's rather an issue with
> discovery logic.
>
> On Sep 15, 2011 7:06 AM, "Yanick Duchesne" <yanickd...@gmail.com> wrote:
>
> Hi Laurence,
>
> yes for specifying the domain you use the -d switch. Your procedure for
> testing is right on.
>
> The trunk version of Corus has the fix for the misbehavior you are
> mentioning (see my previous post for instructions on how to build from
> source - you need to check out trunk/corus from the SVN repo on Google
> Code). We will do a release shortly, but in the meantime use the trunk.
>
> Also, with that version, using corus.server.address.pattern is not required:
> Corus will bind to all network interfaces by default. You use the property
> only to enforce binding to a specific network interface.
>
>
>
> On Thu, Sep 15, 2011 at 12:42 AM, Laurence <lauren...@gmail.com> wrote:
>>
>> Yanick,
>>
>> Just to...
> --
> Yanick Duchesne

> <http://www.imetrik.com>
>

Yanick Duchesne

Sep 15, 2011, 5:58:01 PM
to sapia-...@googlegroups.com

Thanks. I see the issue. Not network related. Will release a clean build this evening.
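
For anyone reading the stack trace above: java.util.MissingFormatArgumentException is what String.format throws when the format string contains more specifiers than there are arguments, which is why it is unrelated to the network. Below is a minimal sketch of that class of bug. The message text and method names are hypothetical; the actual code at ClusterManagerImpl.java:135 is not shown in this thread.

```java
import java.util.MissingFormatArgumentException;

public class FormatBugDemo {

    // Hypothetical log message: two %s specifiers, but only one argument is supplied.
    static String describeBroken(String host) {
        return String.format("Discovered peer %s in domain %s", host);
    }

    // Same message with both arguments supplied.
    static String describeFixed(String host, String domain) {
        return String.format("Discovered peer %s in domain %s", host, domain);
    }

    public static void main(String[] args) {
        try {
            describeBroken("10.0.1.13");
        } catch (MissingFormatArgumentException e) {
            // If nothing catches this, the thread that called String.format dies.
            System.out.println("Caught: " + e);
        }
        System.out.println(describeFixed("10.0.1.13", "home"));
    }
}
```

In the log above the exception escaped from an event-handling thread, so a sketch like this only shows the failure mode, not where the missing argument was in Corus.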


Yanick Duchesne

Sep 16, 2011, 12:10:56 AM
to sapia-...@googlegroups.com
Hi Laurence,

the 2.1 release is available on Sapia's Google Code page. Discovery has been tested and behaved as it should. Good luck and keep us posted.

p.s.: thanks for your efforts and patience on this, they've been invaluable.
--
Yanick Duchesne

Laurence

Sep 16, 2011, 5:51:19 PM
to sapia-...@googlegroups.com
Hi Yanick,

I played around with the Corus 2.1 version today on five different machines (2 Win7, 1 Vista, 1 XP, 1 OS X).

What I noticed is that it sometimes takes a number of Corus restarts before a server sees all the other Corus servers on the local network. Over time, as I issue periodic hosts commands to list the other Corus server peers, I can see other Corus servers start to disappear even though they are still running.

There seems to be a randomness factor when you start a Corus server and then use coruscli to issue the hosts command: with nothing else changed on the network, sometimes you see only some of the other Corus servers and other times you see all of them. Eventually, as you keep issuing hosts commands, the host list gets smaller even though the other Corus servers have not been stopped.

Here are some of the things I have been doing on my Windows machines to make Corus discovery testing easier:

1) Make sure Windows machine is running on AC power instead of battery.
2) Make sure Windows machine will not sleep when running on AC power.
3) Make sure network card is left alone by Windows power management options.

Laurence

Yanick Duchesne

Sep 16, 2011, 7:03:39 PM
to sapia-...@googlegroups.com
Hi Laurence,

first of all, let me thank you again for your invaluable feedback.

Now, about discovery, here's another thing: by default, a Corus server sends a heartbeat to its peers every 20 seconds. Also by default, the timeout after which peers remove a non-heartbeating peer from their internal list is 30 seconds. This interval may be too tight under certain network conditions (I was able to reproduce your issue in a coffee shop with WIFI access on my laptop, with 2 Corus servers running).

There is currently one limitation: this heartbeat is controlled at a lower level than the actual Corus discovery logic, and currently a peer that is removed from the list of another peer following a missed heartbeat will not be re-added to the list, even if it sends its subsequent heartbeats in a timely fashion.
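
To make the timing concrete, here is a minimal sketch of timeout-based peer eviction. It is not Ubik's implementation, and the class and method names are hypothetical. It only illustrates the arithmetic: with heartbeats every 20 seconds, a single lost heartbeat leaves a 40-second gap, which is longer than a 30-second timeout but well within a 120-second one.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HeartbeatTable {
    private final long timeoutMillis;
    // Peer address -> time (ms) of the last heartbeat received from it.
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();

    public HeartbeatTable(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onHeartbeat(String peer, long nowMillis) {
        // Note: this sketch re-adds a peer on its next heartbeat; per the
        // limitation described above, Corus 2.1 did not.
        lastSeen.put(peer, nowMillis);
    }

    // Drops every peer whose last heartbeat is older than the timeout.
    public void evictStale(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > timeoutMillis) {
                it.remove();
            }
        }
    }

    public boolean knows(String peer) {
        return lastSeen.containsKey(peer);
    }

    // Simulates 60 s of heartbeats sent every 20 s, with the one at t=20 s lost,
    // and a staleness check once per second. Returns true if the peer was
    // dropped from the table at any point.
    static boolean evictedAfterOneLostHeartbeat(long timeoutMillis) {
        HeartbeatTable table = new HeartbeatTable(timeoutMillis);
        String peer = "10.0.1.13";
        boolean evicted = false;
        for (long t = 0; t <= 60000; t += 1000) {
            boolean heartbeatDue = t % 20000 == 0;
            boolean lost = t == 20000;
            if (heartbeatDue && !lost) {
                table.onHeartbeat(peer, t);
            }
            table.evictStale(t);
            if (!table.knows(peer)) {
                evicted = true;
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        System.out.println("30 s timeout, evicted: " + evictedAfterOneLostHeartbeat(30000));
        System.out.println("120 s timeout, evicted: " + evictedAfterOneLostHeartbeat(120000));
    }
}
```

Since heartbeats travel over UDP, an occasional lost datagram on a busy wireless network is plausible, which is the case this simulation models.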

There is one workaround though: the heartbeat timeout can be configured through a system property, which must be passed at JVM startup - at the actual Java command line. The property is passed as a key/value pair, and the timeout is expected to be in milliseconds. For example, to set the timeout to 2 minutes:

-Dubik.rmi.naming.mcast.heartbeat.timeout=120000

Note the -D prefix: it's the standard way to pass Java system properties at startup.
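
As a small illustration of what the -D switch does on the Java side: a value passed as -Dname=value becomes visible to the program as a system property. How Ubik reads this particular property internally is not shown in this thread, so the lookup below (and the 30000 ms fallback, taken from the default mentioned above) is only an assumption for illustration.

```java
public class HeartbeatTimeoutProperty {
    static final String KEY = "ubik.rmi.naming.mcast.heartbeat.timeout";

    // Returns the timeout in milliseconds, or the fallback when the
    // property is unset or is not a valid number.
    static long timeoutMillis(long fallback) {
        return Long.getLong(KEY, fallback);
    }

    public static void main(String[] args) {
        System.out.println(KEY + " = " + timeoutMillis(30000L) + " ms");
    }
}
```

Running it as `java -Dubik.rmi.naming.mcast.heartbeat.timeout=120000 HeartbeatTimeoutProperty` reports 120000 ms, while running it without the switch reports the fallback. The same check is a quick way to confirm that a modified startup script really passes the property to the JVM.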

If you are starting Corus directly (as a non-daemon), you can update the corus or corus.bat scripts (under the bin directory). Example for the corus script:

java_exec_command="exec \"${JAVACMD}\" -cp \"${CORUS_CLASSPATH}\" -Dubik.rmi.naming.mcast.heartbeat.timeout=120000 -Dcorus.home=\"${CORUS_HOME}\" $java_sys_opts ${MAINCLASS} \"$@\""

And for the corus.bat script:

"%JAVA_HOME%/bin/java"  -Dubik.rmi.naming.mcast.heartbeat.timeout=120000 -Dcorus.home="%CORUS_HOME%" org.sapia.corus.core.CorusServer %1 %2 %3 %4 %5 %6 %7 %8 %9


If you are starting Corus as a daemon through Wrapper, Launch, or Upstart, modify the corus_service_<CORUS_PORT>.wrapper.properties file, adding the property as its own numbered entry, like so:

# Java Additional Parameters
wrapper.java.additional.1=-Djava.net.preferIPv4Stack=true
wrapper.java.additional.2=-Duser.dir="%CORUS_HOME%"
wrapper.java.additional.3=-Dcorus.home="%CORUS_HOME%"
wrapper.java.additional.4=-Dubik.rmi.naming.mcast.heartbeat.timeout=120000

I have set that interval where I am now, and it still works fine after 30 minutes.

We will consider setting 120000 milliseconds as the default timeout in later releases (and certainly we will fix the "heartbeat miss" issue).

In the meantime, the above should do the trick.
--
Yanick Duchesne

Laurence

Sep 16, 2011, 8:12:48 PM
to sapia-...@googlegroups.com
Yanick,

I tried your -D switch of

-Dubik.rmi.naming.mcast.heartbeat.timeout=120000

on all 5 of my test machines, and I'm seeing other hosts drop out over time when I issue hosts commands. In the last couple of runs on my OS X machine, the first hosts command showed all 5 hosts (one time it took a couple of starts before it saw all 5 test machines) and then over time it dropped all the hosts except itself.

Other info in case it's relevant.

I'm just starting the Corus servers from the command line on Windows and OS X (I'm using the latest Corus V2.1).

All test machines are WIFI connected to an Apple AirportExtreme using 10.0.1.x addresses.

The OS X machine for these tests is a MacBook Air with OS X Snow Leopard running 64 bit Java 6.

The 4 other Windows machines (2 Win7, 1 Xp, 1 Vista) are running 32 bit Java 6.

Laurence

Yanick Duchesne

Sep 16, 2011, 9:55:48 PM
to sapia-...@googlegroups.com
Hi Laurence,

sorry about this pernicious issue... First off, are you still using this config?

corus.server.address.pattern=10\\.10\\.1\\.\\d+
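
A note on the doubled backslashes in such a pattern: if the file is read with java.util.Properties (an assumption; the thread does not show how Corus loads it), `\\.` in the file becomes the regex `\.`, meaning a literal dot. The sketch below parses one properties line and checks which addresses the resulting pattern accepts. The 10.0.1.x pattern is only an illustration based on the addresses mentioned in this thread, and full-string matching is also an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Pattern;

public class AddressPatternCheck {

    // Parses one properties-file line and compiles the value as a regex.
    static Pattern patternFromPropertiesLine(String line) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return Pattern.compile(props.getProperty("corus.server.address.pattern"));
    }

    public static void main(String[] args) {
        // Java source needs its own escaping: "\\\\" here is "\\" in the file.
        String line = "corus.server.address.pattern=10\\\\.0\\\\.1\\\\.\\\\d+";
        Pattern p = patternFromPropertiesLine(line);
        System.out.println("regex: " + p.pattern());
        for (String addr : new String[] {"10.0.1.21", "192.168.1.5", "10.0.10.21"}) {
            System.out.println(addr + " -> " + p.matcher(addr).matches());
        }
    }
}
```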

Second: are you observing the same thing on non-OS X hosts as well?

From the Corus CLI, you can change the host to which you are connected using the connect command. For example, to connect to host 10.10.1.12 that is listening on the default Corus port:

connect -h 10.10.1.12

If not listening on the default port, use:

connect -h 10.10.1.12 -p <port>

You can then try the 'hosts' command to inquire about the cluster view on that host.

I'd like to know if this is an OSX issue, or if this is widespread on your setup.

Thanks for your info on this.
--
Yanick Duchesne

Laurence

Sep 16, 2011, 10:29:33 PM
to sapia-...@googlegroups.com
Yanick,

I'm not using that corus server address pattern any more.

Yes, I'm observing the same thing on the non OS X machines too (I just happened to have 2 Win7 machines nearby when I just got your email and the last time I issued the hosts commands it was showing 5 hosts and now the hosts command is showing only one host - itself).

I've never been able to connect to any machine other than the one I'm running coruscli on (I get a "no object reference" error whether I try from OS X or from Windows).

laurencetoenjes ~ $ coruscli -d home
********************************************************************************
*                                                                              *
*                      Corus Command Line Interface (2.1)                      *
*                                                                              *
*                      R E S T R I C T E D   A C C E S S                       *
*                                                                              *
*                            Authorized Users Only                             *
*                                                                              *
*                          (c)2002-2011 sapia-oss.org                          *
*                                                                              *
********************************************************************************

[10.0.1.21:33000@home]>>  hosts
 Host            Port      Operating System      Java VM                        
================================================================================
 10.0.1.21       33000     Mac OS X 10.6.8       1.6.0_26 Java HotSpot(TM) 64-B 
                                                 it Server VM                   
 10.0.1.5        33000     Windows Vista 6.0     1.6.0_13 Java HotSpot(TM) Clie 
                                                 nt VM                          
 10.0.1.13       33000     Windows XP 5.1        1.6.0_22 Java HotSpot(TM) Clie 
                                                 nt VM                          
 10.0.1.7        33000     Windows 7 6.1         1.6.0_27 Java HotSpot(TM) Clie 
                                                 nt VM                          
 10.0.1.6        33000     Windows 7 6.1         1.6.0_27 Java HotSpot(TM) Clie 
                                                 nt VM                          
[10.0.1.21:33000@home]>>  connect -h 10.0.1.6
System error executing command 'connect' ==> no object reference for: [id=1316302956189, hashCode=2041920904]
[10.0.1.21:33000@home]>>  connect -h 10.0.1.6 -p 33000
System error executing command 'connect' ==> no object reference for: [id=1316302956189, hashCode=2041920904]
[10.0.1.21:33000@home]>>  connect -h 10.0.1.21
Connecting on 10.0.1.21:33000
[10.0.1.21:33000@home]>>  hosts
 Host            Port      Operating System      Java VM                        
================================================================================
 10.0.1.21       33000     Mac OS X 10.6.8       1.6.0_26 Java HotSpot(TM) 64-B 
                                                 it Server VM                   
 10.0.1.13       33000     Windows XP 5.1        1.6.0_22 Java HotSpot(TM) Clie 
                                                 nt VM                          
[10.0.1.21:33000@home]>>  



laurencetoenjes ~ $ coruscli -d home -h 10.0.1.13
org.sapia.ubik.rmi.NoSuchObjectException: no object reference for: [id=1316217762640, hashCode=1977175070]
at org.sapia.ubik.rmi.server.RemoteRefEx.invoke(RemoteRefEx.java:81)
at $Proxy0.getDomain(Unknown Source)
at org.sapia.corus.client.facade.CorusConnectionContext.reconnect(CorusConnectionContext.java:151)
at org.sapia.corus.client.facade.CorusConnectionContext.reconnect(CorusConnectionContext.java:140)
at org.sapia.corus.client.facade.CorusConnectionContext.<init>(CorusConnectionContext.java:64)
at org.sapia.corus.client.facade.CorusConnectionContext.<init>(CorusConnectionContext.java:71)
at org.sapia.corus.client.cli.CorusCli.main(CorusCli.java:86)
org.sapia.ubik.rmi.NoSuchObjectException: no object reference for: [id=1316217762640, hashCode=1977175070]
at org.sapia.ubik.rmi.server.RemoteRefEx.invoke(RemoteRefEx.java:81)
at $Proxy0.getDomain(Unknown Source)
at org.sapia.corus.client.facade.CorusConnectionContext.reconnect(CorusConnectionContext.java:151)
at org.sapia.corus.client.facade.CorusConnectionContext.reconnect(CorusConnectionContext.java:140)
at org.sapia.corus.client.facade.CorusConnectionContext.<init>(CorusConnectionContext.java:64)
at org.sapia.corus.client.facade.CorusConnectionContext.<init>(CorusConnectionContext.java:71)
at org.sapia.corus.client.cli.CorusCli.main(CorusCli.java:86)
laurencetoenjes ~ $ 



Laurence

Yanick Duchesne

Sep 16, 2011, 10:38:55 PM
to sapia-...@googlegroups.com

The fact that you can't connect explicitly from the CLI with the connect command clearly indicates a network issue: in this case no multicast is involved at all. Rather, a plain point-to-point TCP connection is attempted.
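
One way to check the plain TCP path independently of Corus is a bare socket connection to the server's port. This is a generic sketch, not part of Corus; the default host and port in main are simply the ones that appear in the session earlier in this thread.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpReachability {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if close fails.
            }
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "10.0.1.6";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 33000;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

A successful connect only shows that the port accepts TCP connections from this machine; it says nothing about the remoting layer that runs on top of it.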

Could you re-enable the address pattern on 2 corus servers and restart them? Then, reattempt the connect command on each,  using the addresses that are matching the pattern.

Keep me posted.


Laurence

Sep 16, 2011, 11:21:56 PM
to sapia-...@googlegroups.com

Yanick,

I went ahead and set the address pattern on 2 corus servers to an ip value of the machine each property file was on and I still had the same connection problems as before (OS X machine and Win7 machine).

The machine on which I actually needed to set the address pattern was a Vista machine that had 2 virtual NICs that VMware created on my host machine (I've disabled those NICs to help avoid extra headaches).

I remember seeing some settings in the Corus properties for multicast. Is this something that is universal and will work on any network as is?

corus.server.multicast.address=231.173.5.7
corus.server.multicast.port=5454

I don't know if this helps, but here are the results of an ifconfig command on my OS X machine (I have no idea what the MULTICAST blurbs mean or whether they could be useful).

Last login: Fri Sep 16 20:01:55 on ttys002
laurencetoenjes ~ $ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
inet 127.0.0.1 netmask 0xff000000 
inet6 ::1 prefixlen 128 
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
inet6 fdd6:356f:d480:80a3:1293:e9ff:fe0a:c51a prefixlen 128 
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 10:93:e9:0a:c5:1a 
inet6 fe80::1293:e9ff:fe0a:c51a%en0 prefixlen 64 scopeid 0x4 
inet 10.0.1.21 netmask 0xffffff00 broadcast 10.0.1.255
media: autoselect
status: active
ham0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1200
ether a2:56:98:d7:e5:b3 
inet 5.228.86.243 netmask 0xff000000 broadcast 5.255.255.255
open (pid 58)
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500
inet6 fe80::1293:e9ff:fe0a:c51a%utun0 prefixlen 64 scopeid 0x6 
inet6 fd00:6587:52d7:f8f6:1293:e9ff:fe0a:c51a prefixlen 64 
laurencetoenjes ~ $ 
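
On the MULTICAST flag in the output above: it indicates that the interface is capable of multicast. The same information is available from Java through java.net.NetworkInterface. The sketch below is generic and not part of Corus: it lists the interfaces that are up and report multicast support, and checks that 231.173.5.7 (the value from the properties above) falls in the IPv4 multicast range. It cannot tell whether a given wireless router actually forwards multicast traffic between clients.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

public class MulticastInterfaces {

    // True if the literal IP address is a multicast address (224.0.0.0/4 for IPv4).
    static boolean isMulticast(String literalIp) {
        try {
            return InetAddress.getByName(literalIp).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Names of the interfaces that are up and report multicast support.
    static List<String> multicastCapable() {
        List<String> names = new ArrayList<String>();
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) {
                return names;
            }
            for (NetworkInterface nic : Collections.list(all)) {
                if (nic.isUp() && nic.supportsMulticast()) {
                    names.add(nic.getName());
                }
            }
        } catch (SocketException e) {
            // Return whatever was collected before the failure.
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("231.173.5.7 is multicast: " + isMulticast("231.173.5.7"));
        System.out.println("Multicast-capable interfaces: " + multicastCapable());
    }
}
```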


Laurence

Yanick Duchesne

Sep 17, 2011, 9:34:20 PM
to sapia-...@googlegroups.com
Hi Laurence,

we've made another release (2.1.1) to add small improvements in discovery (fixing the heartbeat-related limitation I've mentioned in a previous post). I recommend you try this one as a last resort... We're going to conduct tests of our own replicating your type of environment. From your description your environment seems mostly wireless, and multicast can be tricky with certain routers for example (see this thread). We've never had problems with discovery in a non-wireless context, and we've never tested in a fully wireless environment.

We will update you with the results of our tests.
--
Yanick Duchesne

Jean-Cédric Desrochers

Sep 18, 2011, 10:30:53 AM
to sapia-...@googlegroups.com, Laurence Toenjes
Hi Laurence,

I'm jumping in a bit late on this issue. Last week has been a hell of a week at my "normal day job"!!!!

I've been reading the various emails and I will try to reproduce and fix any issue with the discovery process. Here's the setup I have for testing:
  • MacBook Pro running OS X 10.6
  • Desktop running Windows XP
  • Laptop running Windows 7
  • Desktop running Ubuntu 10.04
  • I also have VMWare running on my Mac with the following images:
    • openSolaris
    • Windows Vista
    • Fedora, Mandriva, openSUSE and many other Linux distributions
I will try to run some tests with wired and wireless scenarios, to see if it has any effect on the test results. Also, I will try to document a network setup procedure to manage and control the network layer on which Corus runs. The discovery logic uses UDP and multicast, which can require some custom configuration on some operating systems.

Like Yanick mentioned in an earlier reply, we've been using Corus for almost 10 years in production environments and we never had an issue with the discovery logic. However, we always made sure with the system administrators that the machines were in the same subnet and that UDP/multicast was open between the servers. Let's say that we were in ideal conditions!

Whatever causes your issue, we'll figure out how to reproduce it and fix it.

Regards,

J-C

Laurence

Sep 18, 2011, 11:47:36 PM
to Jean-Cédric Desrochers, sapia-...@googlegroups.com
Hi J-C,

I tried the latest version Corus 2.1.1 and am just running corus and coruscli from the command line on all my machines.

I can say that the Corus 2.1.1 is an improvement, but the lost hosts problem still occurs on my Wi-Fi network.

I forgot I had another wireless router-like device on my network (an Airport Express hooked up to my sound system for iTunes), so I powered it off just in case it was interfering with Corus host self-discovery (as far as I could tell, it had no effect).

I also noticed my Apple wireless router has a multicast setting named Multicast Rate with the possible values Low, Medium and High (I tried all the settings and none of them seemed to stop the Corus dropped hosts).

I went ahead and tried Corus 2.1.1 with several machines without using Wi-Fi and it works very well ( but at home it's not practical to not use Wi-Fi ).

It's good to know that the Corus discovery feature works in non-Wi-Fi mode. Realistically, that's the way I would use it when not at home (getting the Corus discovery feature to work better in Wi-Fi mode would be a definite plus).

Just to recap, I made sure when my Windows laptops were plugged into AC power that they would not sleep and I also made sure the network cards would not allow Windows power management options to put them to sleep.

Laurence

Jean-Cédric Desrochers

Sep 21, 2011, 1:12:06 AM
to Laurence, sapia-...@googlegroups.com
A quick update on this issue.

I've been testing with 3 laptops (OS X 10.6, WinXP and Win7) using only a wireless connection. I was monitoring TCP and UDP packets on my Mac using Wireshark and I think that I'm on to something. I can see that there are some tweaks to make in the discovery logic, but I think the main problem lies in the multicast setup (when the host has more than one NIC).

It's getting late... so I'm taking a break. Will resume testing tomorrow.


J-C

Jean-Cédric Desrochers

Sep 21, 2011, 11:02:02 PM
to Laurence, sapia-...@googlegroups.com
Latest update: I resolved an issue regarding multiple NICs. I had to figure out why it was happening since i) you faced that issue and ii) I have VMware on my Mac and VirtualBox on one of the Windows machines I use for testing.

However I still have some cases where a host gets discarded over time. Tomorrow I will focus on the message protocol used between the corus servers to manage the discovery.

I'll keep you guys posted.


J-C

Jean-Cédric Desrochers

Oct 20, 2011, 1:57:04 PM
to Laurence, sapia-...@googlegroups.com
Hi Laurence,

earlier this week we released a new Corus version (2.1.3) which should solve the issues you faced with multiple NICs. I ran a lot of tests using three laptops (on Wi-Fi only), monitoring the IP data exchanged between the three Corus servers with Wireshark. Every node on the network was talking to its peers, and I did not see any running server get dropped.

In order for the auto-discovery feature to work in a multi-NIC environment, you must define the 'corus.server.address.pattern' variable in the corus.properties file. This will tell the network sub-layer which NIC to use for broadcasting. Without this configuration the OS will choose one at random and use it: which is not what we want.
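
To illustrate the idea, here is a sketch of picking the interface whose address matches a pattern. This is not the Corus implementation; the class and helper names are hypothetical, and the 10.0.1.x pattern is only an example based on the addresses mentioned earlier in this thread.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.Enumeration;
import java.util.regex.Pattern;

public class NicSelector {

    // True if the textual address (e.g. "10.0.1.21") fully matches the pattern.
    static boolean accepts(Pattern pattern, String hostAddress) {
        return pattern.matcher(hostAddress).matches();
    }

    // Returns the first interface that is up and has an address accepted
    // by the pattern, or null if there is none.
    static NetworkInterface select(Pattern pattern) {
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) {
                return null;
            }
            for (NetworkInterface nic : Collections.list(all)) {
                if (!nic.isUp()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (accepts(pattern, addr.getHostAddress())) {
                        return nic;
                    }
                }
            }
        } catch (SocketException e) {
            // Treat an enumeration failure as "no match".
        }
        return null;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("10\\.0\\.1\\.\\d+");
        NetworkInterface nic = select(pattern);
        System.out.println(nic == null
                ? "No interface matches " + pattern
                : "Would use " + nic.getName());
    }
}
```

In plain Java, a java.net.MulticastSocket can then be pinned to the chosen interface with setNetworkInterface, so that a virtual NIC such as one created by VMware is not picked by accident.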

Defining this variable will also fix an issue you mentioned where you were not able to connect to a Corus server using 'coruscli -h hostname -p port#' (you once posted a stack trace with a connection refused exception).

Hope that this new release will help you and that you will find Corus helpful.

Regards,
     J-C