TLC: NoSuchElementException on checkpoint restore

122 views
Skip to first unread message

Jaak Ristioja

unread,
Dec 9, 2015, 7:13:54 AM12/9/15
to tlaplus
Hello!

After switching from distributed ad-hoc mode to non-distributed mode and
recovering from checkpoint, I get the following exception:

@!@!@STARTMSG 2197:0 @!@!@
Starting recovery from checkpoint 15-12-02-13-53-09/
@!@!@ENDMSG 2197 @!@!@
@!@!@STARTMSG 1000:1 @!@!@
TLC threw an unexpected exception.
This was probably caused by an error in the spec or model.
See the User Output or TLC Console for clues to what happened.
The exception was a java.util.NoSuchElementException
java.util.NoSuchElementException
at
tlc2.tool.fp.MSBDiskFPSet$TLCIterator.getLast(MSBDiskFPSet.java:321)
at
tlc2.tool.fp.MSBDiskFPSet$MSBFlusher.mergeNewEntries(MSBDiskFPSet.java:108)
at
tlc2.tool.fp.DiskFPSet$Flusher.mergeNewEntries(DiskFPSet.java:1298)
at tlc2.tool.fp.DiskFPSet.recoverFP(DiskFPSet.java:896)
at tlc2.tool.fp.DiskFPSet.recover(DiskFPSet.java:939)
at tlc2.tool.ModelChecker.recover(ModelChecker.java:778)
at tlc2.tool.ModelChecker.modelCheck(ModelChecker.java:114)
at tlc2.TLC.process(TLC.java:773)
at tlc2.TLC.main(TLC.java:195)

@!@!@ENDMSG 1000 @!@!@
@!@!@STARTMSG 2186:0 @!@!@
Finished in 04s at (2015-12-09 14:06:40)
@!@!@ENDMSG 2186 @!@!@

Recovering from the same checkpoint when switching back to distributed
ad-hoc mode works. Looks like this might be caused by the FPSet not
being present? Is there a workaround or an easy fix for this?


Best regards,
Jaak

Markus Alexander Kuppe

unread,
Dec 9, 2015, 7:35:59 AM12/9/15
to tla...@googlegroups.com
Hi Jaak,

what number of distributed fingerprint sets have you been using? Zero
for the built-in set or >0 distributed sets.

You are leaving the beaten path when you try to recovery from
checkpoints created by a different TLC mode. Why do want to do this anyway?

Markus

Jaak Ristioja

unread,
Dec 9, 2015, 8:09:49 AM12/9/15
to tla...@googlegroups.com
3 sets, one for each slave.

> You are leaving the beaten path when you try to recovery from
> checkpoints created by a different TLC mode. Why do want to do this anyway?

In my distributed setup, the slaves seems to be rather idle, If I get
access to a more powerful machine I want to try whether doing the
computation locally on that machine would be any faster. I suspect that
besides the master CPU usage, networking between the master and slaves
might also be a bottleneck.

I recorded some of data on the network interface of the master and took
a look at it using Wireshark. I saw uncompressed RMI data containing
quite a lot of TLA model identifier strings being passed. I guess that
networking could be well optimized.

Jaak

Markus Alexander Kuppe

unread,
Dec 9, 2015, 8:21:17 AM12/9/15
to tla...@googlegroups.com
On 09.12.2015 14:09, Jaak Ristioja wrote:
> 3 sets, one for each slave.

Hi Jaak,

combining N fingerprint sets of a distributed TLC run into a single one
is logically possible but technically not supported.

> In my distributed setup, the slaves seems to be rather idle, If I get
> access to a more powerful machine I want to try whether doing the
> computation locally on that machine would be any faster. I suspect that
> besides the master CPU usage, networking between the master and slaves
> might also be a bottleneck.
>
> I recorded some of data on the network interface of the master and took
> a look at it using Wireshark. I saw uncompressed RMI data containing
> quite a lot of TLA model identifier strings being passed. I guess that
> networking could be well optimized.

What kind of network are your nodes connected with?

Markus

Jaak Ristioja

unread,
Dec 9, 2015, 8:36:48 AM12/9/15
to tla...@googlegroups.com
On 09.12.2015 15:21, Markus Alexander Kuppe wrote:
>> In my distributed setup, the slaves seems to be rather idle, If I get
>> access to a more powerful machine I want to try whether doing the
>> computation locally on that machine would be any faster. I suspect that
>> besides the master CPU usage, networking between the master and slaves
>> might also be a bottleneck.
>>
>> I recorded some of data on the network interface of the master and took
>> a look at it using Wireshark. I saw uncompressed RMI data containing
>> quite a lot of TLA model identifier strings being passed. I guess that
>> networking could be well optimized.
>
> What kind of network are your nodes connected with?

Afaik it should be a gigabit network between the slaves and the machine
running the master VM. I'm attaching the output of the distributed TLC
run using -agentlib:hprof=cpu=samples,depth=10.

Jaak
java.hprof.txt

Jaak Ristioja

unread,
Dec 9, 2015, 8:48:23 AM12/9/15
to tla...@googlegroups.com
Here's -agentlib:hprof=cpu=samples,depth=100 output from a longer run.

Jaak

java.hprof.txt

Giuliano

unread,
Apr 10, 2016, 10:04:17 AM4/10/16
to tlaplus
Hello, I am encountering the same error, but I did not run TLC in distributed mode.
I started TLC on the command line as follows.
java -cp ../tla/ -Xmx70g -Xms10g tlc2.TLC -workers 16 -maxSetSize 10000000 -deadlock -coverage 15 MC.tla
After stopping TLC accidentally (after several days unfortunately), I tried recovering the last checkpoint as follows.
java -cp ../tla/ -Xmx70g -Xms10g tlc2.TLC -workers 16 -maxSetSize 10000000 -deadlock -recover states/16-04-07-07-52-44 -coverage 15 MC.tla
This resulted in :
java.util.NoSuchElementException
        at tlc2.tool.fp.MSBDiskFPSet$TLCIterator.getLast(MSBDiskFPSet.java:321)
        at tlc2.tool.fp.MSBDiskFPSet$MSBFlusher.mergeNewEntries(MSBDiskFPSet.java:108)
        at tlc2.tool.fp.DiskFPSet$Flusher.mergeNewEntries(DiskFPSet.java:1298)
        at tlc2.tool.fp.DiskFPSet.recoverFP(DiskFPSet.java:896)
        at tlc2.tool.fp.DiskFPSet.recover(DiskFPSet.java:939)
        at tlc2.tool.ModelChecker.recover(ModelChecker.java:782)
        at tlc2.tool.ModelChecker.modelCheck(ModelChecker.java:114)
        at tlc2.TLC.process(TLC.java:773)
        at tlc2.TLC.main(TLC.java:195)

Any idea that would help me recover the last checkpoint would be appreciated.
Thanks,
Giuliano

Giuliano

unread,
Apr 10, 2016, 10:30:50 AM4/10/16
to tlaplus
I forgot to mention that I am using the TLA+ command-line tools.
TLC reports:
TLC2 Version 2.08 of 21 December 2015

Giuliano

unread,
Apr 11, 2016, 11:58:05 AM4/11/16
to tlaplus
Hello, I was able to recover my checkpoint using the latest integration build.

evgeniy....@gmail.com

unread,
Apr 4, 2017, 11:09:39 AM4/4/17
to tlaplus
After recovering from checkpoint I have a very weird messages displayed periodically:
tlc RAPlus.tla -config RAPlus.cfg -deadlock -metadir /run/media/MC -checkpoint 30 -recover /run/media/MC/17-04-03-13-54-53
TLC2 Version 2.09 of 28 January 2016
Running in Model-Checking mode with 3 workers.
Parsing file RAPlus.tla
Parsing file /home/sh/tla2tools/tla/tla2sany/StandardModules/Naturals.tla
Parsing file /home/sh/tla2tools/tla/tla2sany/StandardModules/Sequences.tla
Parsing file /home/sh/tla2tools/tla/tla2sany/StandardModules/TLC.tla
Parsing file /home/sh/tla2tools/tla/tla2sany/StandardModules/FiniteSets.tla
Semantic processing of module Naturals
Semantic processing of module Sequences
Semantic processing of module TLC
Semantic processing of module FiniteSets
Semantic processing of module RAPlus
Starting... (2017-04-04 18:04:14)
Implied-temporal checking--satisfiability problem has 21 branches.
Starting recovery from checkpoint /run/media/sh/MC/17-04-03-13-54-53/
AAAAAA
AAAAAA
AAAAAA
AAAAAA

and so on.

My model has nothing to do with that "AAAAAA" messages. I have not saw such messages before.

evgeniy....@gmail.com

unread,
Apr 4, 2017, 11:13:57 AM4/4/17
to tlaplus
Some time later I got this message:

Error: Java ran out of memory. Running Java with a larger memory allocation
pool (heap) may fix this. But it won't help if some state has an enormous
number of successor states, or if TLC must compute the value of a huge set.
Finished in 04min 10s at (2017-04-04 18:08:25)

This is strange because I ran TLC with the same parameters for virtual machine memory
as before.

Markus Alexander Kuppe

unread,
Apr 4, 2017, 2:14:27 PM4/4/17
to tla...@googlegroups.com
> [...]
Hi Evgeniy,

the "AAAAAA" messages are obviously bogus but anyway harmless. My guess
is, that the corresponding print statement in the code is a leftover
from a debugging session.
The reason why TLC shows several messages is, because your liveness
properties have been split into 21 branches and TLC recovers a graph for
each one them.

With regards to the out of memory error, are you saying that you only
get the error if you recover from a checkpoint but don't get it, if you
run the exact same model (same parameters, safety & liveness properties,
...) without recovery? If yes, can you make your checkpoint data and
spec & model (privately) available to me?

By the way, when did you download the Toolbox? The version TLC reports
indicates that you are running a nightly/CI build.

Cheers
Markus

Evgeniy Shishkin

unread,
Apr 5, 2017, 3:40:13 AM4/5/17
to tlaplus
вторник, 4 апреля 2017 г., 21:14:27 UTC+3 пользователь Markus Alexander Kuppe написал:
The whole story is as follows:
I downloaded TLA+ tools from some official web site (maybe Lamports website, I dont remember exactly)
I ran TLC on my model which produced aforementioned checkpoint.

Some time later I decided to stop TLC (ctrl+c). I did that after checkpoint was finished so there is no reason to suspect that checkpoint is broken.

I ran TLC again with that checkpoint (- but got an error which is a subject of this topic (NoSuchElementException)

Then I saw that message from mr.Giuliano, where he suggest to try latest build.
I downloaded that latest build and ran it again. I ran it with the same parameters as before except checkpoint interval parameter (It was increased this time).
I used this link for download: http://tla.msr-inria.inria.fr/tlatoolbox/ci/tladist/

My original run command was as follows:
tlc RAPlus.tla -config RAPlus.cfg -deadlock -metadir /run/media/MC -checkpoint 5

My recover attempt was:
tlc RAPlus.tla -config RAPlus.cfg -deadlock -metadir /run/media/MC -checkpoint 30 -recover /run/media/MC/17-04-03-13-54-53

tlc is a script which is equal to:
#!/bin/bash
java -Djava.net.preferIPv4Stack=true \
-Dcom.ibm.cacheLocalHost=true \
-Djava.net.preferIPv6Addresses=false \
-Xmx16g \
tlc2.TLC -cleanup -workers 3 $*

> With regards to the out of memory error, are you saying that you only
> get the error if you recover from a checkpoint but don't get it, if you
> run the exact same model (same parameters, safety & liveness properties,
> ...) without recovery?

I get this error on recovery using that latest version of TLC. I actually get memory error also when running latest version of TLC from scratch (after several hours of checking).
But I do not get that error when using previous version of TLC from scratch on that model.
Reply all
Reply to author
Forward
0 new messages