I have a performance issue with Drools 7.x. I tested 7.52.Final and above. I concentrated on 7.52.Final because of
the Red Hat builds, but I see the same issue with the most recent versions too.
I ran some benchmarks and posted a question on Stack Overflow:
"Drools 4.X vs Drools 7.x performance". I had great performance with Drools 4.x (we use version 4.0.3) on Java 1.6. We use JRockit Real Time with deterministic GC, but standard Java 6 with Parallel GC performs better, because it does not have the overhead of GC tuning. I shared the
JMH benchmark code. With Drools 7.x, however, I get a performance degradation: in my tests 4.x on Java 6 (ParallelGC) is nearly 2 times faster than 7.x on Java 1.8, 11, or 17 (I compared builds from different vendors; JDK 11 was a "little" better). I ran throughput and latency benchmarks, and here are the results. These numbers are from my desktop PC:
Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz 12 Threads (hyperthreading enabled ), 64 GB Ram, "Ubuntu" VERSION="20.04.2 LTS (Focal Fossa)"
Linux homepc 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Drools 4.0.3:
Benchmark             (shadowProxy)   Mode  Cnt       Score      Error  Units
DroolsBenchmark.send          false  thrpt   30  138888.339 ± 6603.057  ops/s
DroolsBenchmark.send           true  thrpt   30    1704.062 ±  178.104  ops/s
DroolsBenchmark.send                  avgt   30      75.845 ±    2.668  us/op
Drools 7.52.Final:
Benchmark              Mode  Cnt      Score     Error  Units
DroolsBenchmark.send  thrpt   30  67881.788 ± 941.384  ops/s
DroolsBenchmark.send   avgt   30    147.362 ±   2.510  us/op
We started comparing performance because we want to migrate our realtime system from Java 6 to Java 11 (maybe Java 17) and also update the Drools version. The performance of Drools 7.x degrades further on our Xeon server:
Linux 5.4.17-2102.201.3.el8uek.x86_64 #2 SMP Fri Apr 23 09:05:57 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux, NAME="Oracle Linux Server" VERSION="8.4"
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 56 threads (hyperthreading disabled; the result is the same with it enabled and 112 CPU threads) and 1 TB RAM
There I get half of the performance, even when increasing the number of threads: I couldn't achieve more than 30K, and adding threads degrades it more and more. Making the KieBase ThreadLocal doesn't help either. We use only stateless sessions.
On Stack Overflow,
Roddy of the Frozen Peas suggested that "your Drools 7 version isn't actually threadsafe", but didn't explain what he meant by it. Is there another library?
I do not think that it isn't thread-safe. I made flame graphs with AsyncProfiler, and they show nearly 17% of the time spent in ContextImpl, which uses UUID.randomUUID() for StatelessKnowledgeSessionImpl. There are also a lot of synchronized blocks in DefaultAgenda and a lot of Collections.synchronizedMap usages. I understand that this is very important for stateful sessions, but a stateless session executes once and is then released. As far as I can see, the stateless session is really a wrapper around a stateful session in a "try fireAllRules, finally release session" block. As a test, I built the Drools core source (tag 7.52.Final) with UUID.randomUUID() replaced by nameInc.incrementAndGet()+"_Cnt" and with the synchronized blocks in DefaultAgenda removed (for testing purposes only), and I used the KieBase as a ThreadLocal. Performance increased by about 20%, which is still not as good as version 4.0.3.
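To make the UUID replacement concrete, here is a minimal sketch of the two naming strategies side by side. It is not the Drools source: the class ContextNames and its methods are hypothetical, and only UUID.randomUUID(), AtomicLong and the nameInc/"_Cnt" naming come from the description above. UUID.randomUUID() draws from a SecureRandom, which can become a contention point with many threads, while the counter is a single atomic increment (unique only within one JVM).

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not Drools code: two ways of generating a context name.
final class ContextNames {
    // Plays the role of the nameInc counter from the patched build.
    private static final AtomicLong nameInc = new AtomicLong();

    // Original approach: a random UUID per session (backed by SecureRandom).
    static String uuidName() {
        return UUID.randomUUID().toString();
    }

    // Patched approach: a process-wide counter, unique only within this JVM.
    static String counterName() {
        return nameInc.incrementAndGet() + "_Cnt";
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long sink = 0; // consume the results so the loops are not optimized away

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sink += uuidName().length();
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sink += counterName().length();
        long t2 = System.nanoTime();

        System.out.printf("UUID: %d ms, counter: %d ms (sink=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}
```

The timing loop in main is only a rough single-threaded indication; a JMH benchmark with several threads would be needed for trustworthy numbers.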
I also managed to run these tests with Drools 4.0.3 on Java 11 and 17 (compilation of DRL doesn't work on Java 17, but I could use KiePackages precompiled and saved with Java 11), which was a really great surprise. I got better performance than with Drools 7.x, but not as good as with Java 6 and Drools 4.0.3, and it also degrades on the Xeon server. So I did some profiling and found that a lot of time is spent in
the put method of PrimitiveLongMap.Page.
With AsyncProfiler and flame graphs I found a lot of native calls after OptoRuntime::multianewarray2_C, and I found a very good article:
"Why does allocating a single 2D array take longer than a loop allocating multiple 1D arrays of the same total size and shape?". I replaced the code as suggested, and I also replaced the synchronized block of the
nextWorkingMemoryCounter method in AbstractRuleBase with an AtomicInteger. With that I got better results on Java 11/17 (>140K msg/sec) than on Java 6 (<140K msg/sec). I tested this on the server machine and the throughput was perfect (~300K msg/sec with 10 threads). Increasing the number of threads also increased throughput (not linearly, of course; for example, with 50 threads I got ~560K msg/sec).
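For readers who have not seen the 2D array issue, here is a minimal sketch of the kind of change the article suggests. It is an illustration only, not the PrimitiveLongMap.Page code: the class ArrayAlloc, its method names, and the Object element type are assumptions. The first method compiles to a single multianewarray bytecode, the second allocates the outer array and then each row as an ordinary 1D array.

```java
// Hypothetical illustration, not the Drools source:
// two ways to build the same rows x cols table.
final class ArrayAlloc {
    // One multi-dimensional allocation (compiles to the multianewarray bytecode).
    static Object[][] allocateAtOnce(int rows, int cols) {
        return new Object[rows][cols];
    }

    // Allocate the outer array first, then each row as a separate 1D array.
    static Object[][] allocateRowByRow(int rows, int cols) {
        Object[][] table = new Object[rows][];
        for (int i = 0; i < rows; i++) {
            table[i] = new Object[cols];
        }
        return table;
    }

    public static void main(String[] args) {
        Object[][] a = allocateAtOnce(4, 8);
        Object[][] b = allocateRowByRow(4, 8);
        System.out.println(a.length + "x" + a[0].length + " and " + b.length + "x" + b[0].length);
    }
}
```

Both methods return arrays of the same shape, so the second can replace the first without changing callers; whether it is faster depends on the JVM version and should be measured.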
Here is the result of running the benchmark with JDK 17 and Drools 4.0.3 (patched) on the Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz:
# JMH version: 1.32
# VM version: JDK 17, OpenJDK 64-Bit Server VM, 17+35
# VM invoker: /opt/jdk/microsoft/jdk-17+35/bin/java
# VM options: -server -Xms10G -Xmx10G -XX:+UseShenandoahGC -XX:+UseNUMA -XX:+UseLargePages -XX:+UseTransparentHugePages -Xlog:gc*,gc+ref*,gc+ergo*,gc+heap*,gc+stats*,gc+compaction*,gc+age*:logs/gc.log:time,tags:filecount=25,filesize=30m
# Blackhole mode: full + dont-inline hint
# Warmup: 50 iterations, 10 s each
# Measurement: 100 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 10 threads, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.drools.perf.test.DroolsBenchmarkTest.send
DroolsBenchmarkTest.send thrpt 100 272833.101 ± 435.835 ops/s
As I said, on my desktop PC the same test (JDK 17 and patched Drools 4.0.3) gives about 150K msg/sec.
Then I made benchmarks on real code (realtime system) and I'm very happy with these results (~55K msg/sec).
# JMH version: 1.32
# VM version: JDK 17, OpenJDK 64-Bit Server VM, 17+35
# VM invoker: /opt/jdk/microsoft/jdk-17+35/bin/java
# VM options: -server -Xms50G -Xmx50G -XX:+UseNUMA -XX:+UseShenandoahGC -XX:+UseLargePages -XX:+UseTransparentHugePages -XX:MaxMetaspaceSize=1G -XX:MetaspaceSize=256M --add-opens=java.base/java.lang=ALL-UNNAMED -Xlog:gc*,gc+ref*,gc+ergo*,gc+heap*,gc+stats*,gc+compaction*,gc+age*:logs/gc.log:time,pid,tags:filecount=25,filesize=30m
# Blackhole mode: full + dont-inline hint
# Warmup: 50 iterations, 10 s each
# Measurement: 250 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 45 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: ge.magticom.ocs.rulecompiler.benchmark.DroolsBenchmarkTest.send
# Parameters: (threadLocal = true, useNewRules = false)
Benchmark (threadLocal) (useNewRules) Mode Cnt Score Error Units
DroolsBenchmarkTest.send true false avgt 250 829.179 ± 1.398 us/op
Running 3 forks with the same configuration in throughput mode (msg/sec):
# Benchmark mode: Throughput, ops/time
DroolsBenchmarkTest.send true false thrpt 750 55015.309 ± 51.340 ops/s
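As a rough consistency check between the two modes, the average-time result can be converted into an expected throughput. This is only a sketch under the assumption that all 45 benchmark threads are busy for the whole run, so throughput is approximately threads divided by average latency. The class and method names are hypothetical.

```java
// Rough consistency check; assumes every benchmark thread is busy all the time.
final class ThroughputCheck {
    // Expected ops/s = threads / (average latency converted to seconds).
    static double expectedOpsPerSecond(int threads, double avgLatencyMicros) {
        return threads / (avgLatencyMicros / 1_000_000.0);
    }

    public static void main(String[] args) {
        // 45 threads and 829.179 us/op are taken from the avgt run above.
        double estimate = expectedOpsPerSecond(45, 829.179);
        System.out.printf("estimated throughput: %.0f ops/s%n", estimate);
    }
}
```

The estimate comes out at roughly 54K ops/s, which is close to the measured 55015 ops/s, so the latency and throughput runs agree with each other.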
ShenandoahGC (-XX:+UseShenandoahGC) on JDK 17 also worked perfectly: average latency is 0.3 ms, max pause time is 2 ms, and the allocation rate is 7-7.5 GB/sec.
But I worry that I can't achieve this with Drools 7.x: on the real code I get ~10-11K msg/sec on my desktop PC and <5K msg/sec on the server (same JVM, different kernel, different glibc).
What can you suggest?