Thank you both for the very informative replies.
I did a good deal more investigation today. To preface things: yes, we are
running Windows on the rendering cluster. I did try to run a heterogeneous
cluster (Mac app node with Windows render nodes), but I couldn't get it to
work.
Carsten, I did actually think of an underlying hardware or "Windows" issue.
Our render node machines are not running an AV and don't have local
firewalls. The switch connecting them is a Dell Poweredge, and I fiddled with
it as well: I disabled QoS on it and also a weird "Green Ethernet"
power-saving mode, but neither of those tweaks made a difference. I took a
look at the TCP connections on the server with TCPView and didn't notice
anything unusual (such as connections dropping). It is also worth noting
that we have a secondary application that uses a home-brew clustered
rendering framework (albeit much simpler than Equalizer), and it does not
exhibit this problem, so I am inclined to say that the root cause is not our
network infrastructure. The only piece of evidence from this part of my
investigation was a noticeable drop in network traffic going in/out of the
AppNode when the "spikes" or "jitters" take place.
Additionally, I tried various configuration "styles" for our cluster and
noticed that the underlying canvas complexity doesn't matter. I tried
"scaling" the cluster size and found out that the jittering seems to appear
after adding approximately 6 nodes and then gets progressively worse as the
cluster gets bigger.
Stefan, the profiling numbers that I reported in the first post were
actually from one of the moments during which the application "stalled". It
is actually pretty interesting: the CPU utilization graph from the profiling
showed spikes at similar intervals to the stalls, so I just grabbed some
statistics from one of those "spike" time intervals and reported them :).
Also, as I mentioned, I am experiencing similar behavior (and profiles) with
the built-in Equalizer samples (e.g. eqPly).
I'll take a shot at compiling the version of Collage that Carsten just
pointed to and let you guys know how it goes. Stefan, I'm also noticing
that in the Issue #38 link you mention something about paying attention to
more than 64 connection threads on Windows. What does that pertain to?
Thank you both again for the comments and please let me know if you have any
more insights. I'll report our progress :).
--
View this message in context:
http://software.1713.n2.nabble.com/Jittery-performance-with-large-cluster-tp7585928p7585948.html