Scaling up model checking

dun...@gmail.com

Aug 27, 2019, 12:38:31 PM
to tlaplus
So I got my model to a point where it catches a lot of the things I want to catch.  Modeling a single "slot" and two processes (never mind what that means) takes about 1 minute 20 seconds.  Making it 2 slots and 3 processes has so far taken my 8-thread workstation about 12 hours; and to be sure the model is race-free, I really want to run 3 slots and 3 processes.

So I started looking into how to increase throughput.  Among the bits of kit we have lying around for testing are a slightly older 48-thread box and a more recent 400+ thread box (but Linux only knows how to use 64 of them).

Unfortunately, I'm having trouble getting TLC to scale up on these systems.  Basically, if I go past 8 worker threads, the number of states generated actually goes down.  And since the maximum clock rate of my desktop is higher than that of the two massive-CPU boxen, the total throughput on the big boxes is lower than on my workstation.

I've also tried running in -simulate mode.  Allegedly multi-threaded simulation has been implemented (https://github.com/tlaplus/tlaplus/issues/147); but running with 24 worker threads, the java process never takes more than about 120% CPU (i.e., no more than one worker is ever really active).

It's more or less the same for two different JVMs: CentOS's OpenJDK 11.0.4, and Debian's OpenJDK 1.8.0_171 (from jessie-backports).

Any suggestions? Am I missing something obvious?

Thanks,
 -George

Markus Kuppe

Aug 27, 2019, 12:40:53 PM
to tla...@googlegroups.com
Hi George,

please share the command-line parameters you run TLC with.

Thanks
Markus

Hillel Wayne

Aug 27, 2019, 12:41:00 PM
to tla...@googlegroups.com

Hi George,

Have you looked into the Toolbox's distributed model checking? You can use it to distribute the model checking across multiple AWS or Azure compute instances.

H

dun...@gmail.com

Aug 27, 2019, 1:15:26 PM
to tlaplus
Thanks for the quick response.  The default command looks something like this:

    java -cp $(TLATOOLS_JAR) -XX:+UseParallelGC -Xmx24G tlc2.TLC -config FOO.cfg -workers 8 -difftrace FOO.tla | tee foo.out

On the bigger machines I started with `workers` at 48 and 64 (on the 48-thread and the 64-thread machine respectively), then tried 24, 16, and 8 for both.  Ultimately, 8 got the highest throughput.

For 'simulation', the command looks similar:

    java -cp tla2tools.jar -XX:+UseParallelGC tlc2.TLC -config FOO.cfg -workers 48 -simulate -depth 200 -difftrace FOO.tla

I tried 'simulation' mode with 2 workers, and with htop saw two threads each taking about 60% of one CPU; I tried with 4, and two threads got about 40% each while the other two got about 20% each (still totaling ~120%).  So there's probably a severe bottleneck somewhere.

ATM on the 64-core box I have 60 different simulators with 2 workers each going, and that seems to be lighting things up; but it's a lot harder to monitor the results.
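(For reference, the launcher is nothing fancier than a shell loop along these lines; this is just a sketch, with tla2tools.jar/FOO standing in for the real paths and spec name:)

    # 60 independent -simulate runs, 2 workers each; TLC picks a fresh random
    # seed for each run, and each run writes to its own log file for later grepping
    for i in $(seq 1 60); do
        java -cp tla2tools.jar -XX:+UseParallelGC tlc2.TLC \
            -config FOO.cfg -workers 2 -simulate -depth 200 \
            -difftrace FOO.tla > sim-$i.out 2>&1 &
    done
    wait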

 -George
 

dun...@gmail.com

Aug 27, 2019, 1:29:31 PM
to tlaplus


On Tuesday, August 27, 2019 at 5:41:00 PM UTC+1, Hillel Wayne wrote:

Hi George,

Have you looked into the toolbox's distributed model checking? You can use it to distribute the model checking to multiple AWS or Azure compute instances.


When first learning TLA+ I used the Toolbox, and I did notice the "deploy to AWS / Azure" option.  But for lots of reasons I prefer using the CLI versions.  Anything you do through the Toolbox you should be able to do through the CLI, right?

If absolutely necessary, I would consider going that route; but apart from the disadvantage of not being able to use the CLI tools, there's the issue of having to get budget approved for using AWS or Azure.  These two test boxes we have are mostly just sitting around at the moment, so it makes sense to see if we can light them up before looking at cloud-based solutions.

Thanks,
 -George

Markus Kuppe

Aug 27, 2019, 1:49:48 PM
to tla...@googlegroups.com
On 27.08.19 10:13, dun...@gmail.com wrote:
> Thanks for the quick response.  The default command looks something
> like this:
>
>     java -cp $(TLATOOLS_JAR) -XX:+UseParallelGC -Xmx24G tlc2.TLC
> -config FOO.cfg -workers 8 -difftrace FOO.tla | tee foo.out
>
> On the bigger machines I started with `workers` at 48 and 64 (on the
> 48-thread and the 64-thread machine respectively), then tried 24, 16,
> and 8 for both.  Ultimately, 8 got the highest throughput.
>

Most importantly, you should switch to the concurrency-optimized
fingerprint set implementation (OffHeapDiskFPSet).  Check the Toolbox's
.log file located in ~/.tlaplus/.metadata/ for "TLC COMMAND-LINE", which
shows the optimizations (you have to run TLC from the Toolbox first).
Additionally, you can extract OS-level optimizations from [3].
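Concretely, that amounts to something like the following (a sketch adapted from your earlier command; the -XX:MaxDirectMemorySize value controls how much off-heap memory the fingerprint set gets and has to fit alongside the -Xmx heap):

    java -XX:+UseParallelGC -Xmx24G -XX:MaxDirectMemorySize=24g \
         -Dtlc2.tool.fp.FPSet.impl=tlc2.tool.fp.OffHeapDiskFPSet \
         -cp tla2tools.jar tlc2.TLC -config FOO.cfg -workers 8 -difftrace FOO.tla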

Generally, you will probably want to watch "Large Scale Model Checking
101" [1].  However, it pre-dates the TLC profiler, which you should use
to inspect a smaller instance of your model for performance bottlenecks
such as inefficient expressions.
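On the command line, the profiler statistics should be available via coverage reporting, roughly like this (a sketch; FOO_small.cfg is a hypothetical config for a scaled-down model, and -coverage 1 asks for statistics every minute):

    java -cp tla2tools.jar tlc2.TLC -coverage 1 -config FOO_small.cfg -workers 1 FOO.tla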

Lastly, a concurrency-optimized implementation for TLC's state queue is
currently work-in-progress [2].

Hope this helps,

[1] https://vimeo.com/264959035

[2] https://github.com/lemmy/PageQueue/

[3]
https://github.com/tlaplus/tlaplus/blob/525f664f82afa67de9b0b2af0419bae02f35e738/org.lamport.tla.toolbox.jclouds/src/org/lamport/tla/toolbox/jcloud/CloudDistributedTLCJob.java#L559-L562

Markus Kuppe

Aug 27, 2019, 2:07:49 PM
to tla...@googlegroups.com
On 27.08.19 10:20, dun...@gmail.com wrote:
> When first learning TLA I used the toolbox, and I did notice the
> "deploy to AWS / Azure" option.  But for lots of reasons I prefer
> using the CLI versions.  Anything you do through the toolbox you
> should be able to do through the CLI, right?
>
> If absolutely necessary, I would consider going that route; but apart
> from the disadvantage of not being able to use the CLI tools, there's
> the issue of having to get budget approved for using AWS or Azure. 
> These two test boxes we have are mostly just sitting around at the
> moment, so it makes sense to see if we can light them up before
> looking at cloud-based solutions.


The cloud feature - dubbed "Cloud TLC" - can be used from the
command-line too.  [1] shows how the TLC performance tests make use of
it.  The initial setup of cloud credentials is discussed in the Toolbox
help.

Best

Markus

[1] https://gist.github.com/lemmy/62c4b6b158187180a5b7c873d08a0f51

william...@10gen.com

Nov 13, 2019, 8:49:12 PM
to tlaplus
To follow up on this, I am observing some apparent scalability issues with TLC when running on a single machine. I am using an EC2 c5d.24xlarge instance type, which has 96 cores and 192GB of memory. When I run a particular model with 64 workers, I see the following:

ubuntu@ip-172-31-1-228:/data/tla$ java -XX:MaxDirectMemorySize=120000m -Xmx120185m -Dtlc2.tool.fp.FPSet.impl=tlc2.tool.fp.OffHeapDiskFPSet -XX:+UseParallelGC -cp tla2tools.jar tlc2.TLC -workers 64 -config mc.cfg mc.tla
TLC2 Version 2.14 of 10 July 2019 (rev: 0cae24f)
Running breadth-first search Model-Checking with fp 117 and seed 1389448420162930971 with 64 workers on 96 cores with 106832MB heap and 120000MB offheap memory [pid: 7730] (Linux 4.15.0-1051-aws amd64, Ubuntu 11.0.4 x86_64, OffHeapDiskFPSet, DiskStateQueue).
Parsing file /data/tla/mc.tla
Parsing file /tmp/Naturals.tla
Parsing file /tmp/Integers.tla
Parsing file /tmp/FiniteSets.tla
Parsing file /tmp/Sequences.tla
Parsing file /tmp/TLC.tla
Semantic processing of module Naturals
Semantic processing of module Integers
Semantic processing of module Sequences
Semantic processing of module FiniteSets
Semantic processing of module TLC
Semantic processing of module mc
Starting... (2019-11-14 01:05:32)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2019-11-14 01:05:43.
Progress(6) at 2019-11-14 01:05:46: 147,528 states generated (147,528 s/min), 9,917 distinct states found (9,917 ds/min), 8,452 states left on queue.
Progress(9) at 2019-11-14 01:06:46: 4,363,650 states generated (4,216,122 s/min), 382,868 distinct states found (372,951 ds/min), 290,240 states left on queue.
Progress(9) at 2019-11-14 01:07:46: 8,448,010 states generated (4,084,360 s/min), 731,818 distinct states found (348,950 ds/min), 533,823 states left on queue.
Progress(10) at 2019-11-14 01:08:46: 12,452,538 states generated (4,004,528 s/min), 1,093,250 distinct states found (361,432 ds/min), 790,535 states left on queue.
Progress(10) at 2019-11-14 01:09:46: 16,336,391 states generated (3,883,853 s/min), 1,436,130 distinct states found (342,880 ds/min), 1,044,868 states left on queue.
Progress(10) at 2019-11-14 01:10:46: 20,274,326 states generated (3,937,935 s/min), 1,787,099 distinct states found (350,969 ds/min), 1,295,292 states left on queue.
Progress(10) at 2019-11-14 01:11:46: 24,372,652 states generated (4,098,326 s/min), 2,177,834 distinct states found (390,735 ds/min), 1,556,071 states left on queue.
Progress(10) at 2019-11-14 01:12:46: 28,343,121 states generated (3,970,469 s/min), 2,552,098 distinct states found (374,264 ds/min), 1,815,312 states left on queue.
Progress(10) at 2019-11-14 01:13:46: 32,267,652 states generated (3,924,531 s/min), 2,902,935 distinct states found (350,837 ds/min), 2,057,542 states left on queue.
Progress(10) at 2019-11-14 01:14:46: 36,234,857 states generated (3,967,205 s/min), 3,282,234 distinct states found (379,299 ds/min), 2,314,932 states left on queue.
Progress(11) at 2019-11-14 01:15:46: 40,186,365 states generated (3,951,508 s/min), 3,679,592 distinct states found (397,358 ds/min), 2,582,686 states left on queue.
Progress(11) at 2019-11-14 01:16:46: 44,118,806 states generated (3,932,441 s/min), 4,034,988 distinct states found (355,396 ds/min), 2,825,454 states left on queue.

The throughput (s/min) appears to top out at around 3.9 million states/minute. These are the reported numbers when I run with just 8 workers, on the same machine:

ubuntu@ip-172-31-1-228:/data/tla$ java -XX:MaxDirectMemorySize=120000m -Xmx120185m -Dtlc2.tool.fp.FPSet.impl=tlc2.tool.fp.OffHeapDiskFPSet -XX:+UseParallelGC -cp tla2tools.jar tlc2.TLC -workers 8 -config mc.cfg mc.tla
TLC2 Version 2.14 of 10 July 2019 (rev: 0cae24f)
Running breadth-first search Model-Checking with fp 43 and seed 6871599717738624516 with 8 workers on 96 cores with 106832MB heap and 120000MB offheap memory [pid: 8128] (Linux 4.15.0-1051-aws amd64, Ubuntu 11.0.4 x86_64, OffHeapDiskFPSet, DiskStateQueue).
Parsing file /data/tla/mc.tla
Parsing file /tmp/Naturals.tla
Parsing file /tmp/Integers.tla
Parsing file /tmp/FiniteSets.tla
Parsing file /tmp/Sequences.tla
Parsing file /tmp/TLC.tla
Semantic processing of module Naturals
Semantic processing of module Integers
Semantic processing of module Sequences
Semantic processing of module FiniteSets
Semantic processing of module TLC
Semantic processing of module mc
Starting... (2019-11-14 01:17:58)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2019-11-14 01:18:07.
Progress(6) at 2019-11-14 01:18:10: 136,991 states generated (136,991 s/min), 8,542 distinct states found (8,542 ds/min), 7,332 states left on queue.
Progress(8) at 2019-11-14 01:19:10: 3,344,141 states generated (3,207,150 s/min), 295,218 distinct states found (286,676 ds/min), 229,350 states left on queue.
Progress(9) at 2019-11-14 01:20:10: 6,591,286 states generated (3,247,145 s/min), 572,490 distinct states found (277,272 ds/min), 430,834 states left on queue.
Progress(9) at 2019-11-14 01:21:10: 9,852,114 states generated (3,260,828 s/min), 846,787 distinct states found (274,297 ds/min), 621,471 states left on queue.
Progress(9) at 2019-11-14 01:22:10: 12,984,599 states generated (3,132,485 s/min), 1,124,563 distinct states found (277,776 ds/min), 827,090 states left on queue.
Progress(9) at 2019-11-14 01:23:10: 16,140,694 states generated (3,156,095 s/min), 1,398,002 distinct states found (273,439 ds/min), 1,030,079 states left on queue.
Progress(10) at 2019-11-14 01:24:10: 19,356,420 states generated (3,215,726 s/min), 1,688,862 distinct states found (290,860 ds/min), 1,228,268 states left on queue.
Progress(10) at 2019-11-14 01:25:10: 22,544,906 states generated (3,188,486 s/min), 1,981,062 distinct states found (292,200 ds/min), 1,428,619 states left on queue.
Progress(10) at 2019-11-14 01:26:10: 25,777,526 states generated (3,232,620 s/min), 2,284,525 distinct states found (303,463 ds/min), 1,633,791 states left on queue.
Progress(10) at 2019-11-14 01:27:10: 28,980,735 states generated (3,203,209 s/min), 2,577,732 distinct states found (293,207 ds/min), 1,831,318 states left on queue.
Progress(10) at 2019-11-14 01:28:10: 32,199,852 states generated (3,219,117 s/min), 2,872,788 distinct states found (295,056 ds/min), 2,033,065 states left on queue.

which shows a max throughput of around 3.2 million states/minute. This would imply that TLC is scaling very badly, i.e. a speedup of only (3.9/3.2) ≈ 1.2x with 8 times as many workers. I don't think TLC should be disk-bound yet, since it has a ton of memory (120GB) allocated to it, and I can observe nearly maximal core utilization while executing these runs. Is there something obvious I may be overlooking here?

Markus Kuppe

Nov 13, 2019, 8:53:09 PM
to tla...@googlegroups.com
Hi Will,

the current implementation of the queue of unexplored states
(StateQueue) does not scale well. PageQueue [1] is supposed to become
its replacement, but it is still in its early stages.

Best
Markus

[1] https://github.com/lemmy/PageQueue


william...@10gen.com

Nov 13, 2019, 9:02:14 PM
to tlaplus
Ok, thanks for the information. From your experience, do you have any intuition on the scalability limit (CPU core limit) I should expect for TLC if it has plenty of memory?



Markus Kuppe

Nov 14, 2019, 1:11:44 AM
to tla...@googlegroups.com
On 13.11.19 18:01, william.schultz via tlaplus wrote:
> From your experience, do you have any intuition on the scalability limit
> (CPU core limit) I should expect for TLC if it has plenty of memory?

This is difficult to answer because it depends on the cost of evaluating
the next-state relation. For a larger spec where the next-state
relation is expensive to evaluate, TLC will benefit from more CPUs
(contention on the StateQueue won't be the bottleneck). For a toy spec
that models a simple counter, there is no point in using more than a
single CPU, though.

M.

william...@10gen.com

Nov 16, 2019, 9:02:53 AM
to tlaplus
I see. Should I expect better scalability with distributed TLC? Your presentation here (https://www.youtube.com/watch?v=zGIK2p6csAo) indicates that distributed mode may be able to scale linearly in some cases. My goal is to try running multiple TLC worker processes on the same EC2 machine. Do you have reason to believe that will improve performance versus running all threads inside the same JVM process? I am also curious whether using multiple fingerprint set servers is recommended for scaling to this level (i.e. 96 machine cores). I am just curious if it's worth throwing many cores/machines at the problem. If the improvements are only marginal, it may just be simpler for me to run non-distributed TLC with suitably many workers.

Will



Markus Kuppe

Nov 16, 2019, 2:54:30 PM
to tla...@googlegroups.com
On 16.11.19 06:02, william.schultz via tlaplus wrote:
> I see. Should I expect better scalability with distributed TLC? Your
> presentation here
> <https://www.youtube.com/watch?v=zGIK2p6csAo> indicates that distributed
> mode may be able to scale linearly in some cases. My goal is to try
> running multiple TLC worker processes on the same EC2 machine. Do you
> have reason to believe that will improve performance versus running all
> threads inside the same JVM process? I am also curious if using multiple
> fingerprint set servers is recommended for scaling to this level (i.e.
> 96 machine cores). I am just curious if it's worth throwing many
> cores/machines at the problem. If the improvements are only marginal, it
> may just be simpler for me to run non-distributed TLC with suitably many
> workers.
>

Hi Will,

let's move this discussion to GitHub [1] because it is too specific for
the group's general scope.

Best
Markus

[1] https://github.com/tlaplus/tlaplus/issues/396