Problems with ECL on MacOSX Maverick?

50 views
Skip to first unread message

Patrick Stein

unread,
Oct 31, 2013, 10:10:14 AM10/31/13
to cl-tes...@googlegroups.com
I think I'm seeing problems with ECL on MacOSX since I upgraded to Maverick which uses a different compiler.  Maybe I have to re-install ECL?

Some stuff seems to load, but other things run and run and run and run:

  502 89579 ??       1176:34.84 /usr/local/bin/ecl -norc -eval (require :cmp) -eval (PROGN (LOAD #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/setup.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common-asdf.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-run-libtest.lisp") (SET-RESPONSE #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/response4830250852919236339.lisp" (RUN-LIBTEST-MAIN :CL-6502 #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/logs/cl-6502" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/fasl/"))) -eval (ext:quit)

I killed the above cl-6502 test after 19hrs of CPU time.  I'm about to kill two other clang processes that have eaten up more than an hour of CPU time trying to compile:
work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/fasl/ql/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.c

Anyone else seeing anything similar?  I can still run ECL from the command-line and compile and load a large number of packages, but a few are hanging like this.  I don't know if it's something to do with the actual code in these packages or if it's bugs with the compiler when my computer sleeps/wakes or some such thing.

ttyl,
Patrick

Anton Vodonosov

unread,
Oct 31, 2013, 6:07:59 PM10/31/13
to cl-tes...@googlegroups.com, Patrick Stein
31.10.2013, 18:10, "Patrick Stein" <patrick...@gmail.com>:
> I think I'm seeing problems with ECL on MacOSX since I upgraded to Maverick which uses a different compiler. О©╫Maybe I have to re-install ECL?
>
> Some stuff seems to load, but other things run and run and run and run:
>
> О©╫ 502 89579 ??О©╫О©╫ О©╫ О©╫О©╫1176:34.84 /usr/local/bin/ecl -norc -eval (require :cmp) -eval (PROGN (LOAD #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/setup.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common.lisp") (LOADО©╫#P"/Users/cltestgrid/cl-test-grid/agent/proc-common-asdf.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-run-libtest.lisp") (SET-RESPONSE #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/response4830250852919236339.lisp" (RUN-LIBTEST-MAIN :CL-6502 #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/logs/cl-6502" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/fasl/"))) -eval (ext:quit)
>
> I killed the above cl-6502 test after 19hrs of CPU time. О©╫I'm about to kill two other clang processes that have eaten up more than an hour of CPU time trying to compile:
> work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/fasl/ql/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.c
> Anyone else seeing anything similar? О©╫I can still run ECL from the command-line and compile and load a large number of packages, but a few are hanging like this. О©╫I don't know if it's something to do with the actual code in these packages or if it's bugs with the compiler when my computer sleeps/wakes or some such thing.
>
> ttyl,
> Patrick

Hello Patrick.

I haven't seen this, and I don't have access to MacOSX, and clang.
Maybe it is a bug in ECL or clang.

I have a question. The test taken 19 hours, was it the ECL process running 19 hours,
or it was the only clang started by ECL?

Because cl-test-gird-agent should kill the child ECL process if it exceeds 15 minutes
timeout. If it was only clang - I can explain how it happened.

Another question.
Can you try the hanging operations manually, to find out whether they are reproducible.
For example, to recompile jwacs:

rm -rf ~/.cache/common-lisp/
time ~/lisps/ecl-bin-git/bin/ecl -norc -eval '(require :cmp)' -load ~/quicklisp/setup.lisp -eval '(ql:quickload :jwacs)' -eval '(quit)'

My results on linux + gcc:

real 7m26.532s
user 7m22.304s
sys 0m2.880s

Long, but far from an hour. BTW, why it takes so long. The parse-javascript.c
file generated by ECL from parse-javascript.lisp contains 72000 lines.

Running CL-6502 testsuite:

rm -rf ~/.cache/common-lisp/
time ~/lisps/ecl-bin-git/bin/ecl -norc -load ~/quicklisp/setup.lisp -eval '(push "~/cl-test-grid/" asdf:*central-registry*)' -eval '(ql:quickload :test-grid-testsuites)' -eval '(tg-testsuites:libtest :CL-6502)' -eval '(quit)'

My results:
real 3m37.469s
user 2m39.266s
sys 0m34.014s

Best regards,
- Anton

Patrick Stein

unread,
Nov 1, 2013, 1:19:20 PM11/1/13
to cl-tes...@googlegroups.com, Patrick Stein

It was the ECL process that was chewing up all of the time. At the time I looked it had no child processes. It took a kill -9 to kill it though.

I will try the selective re-runs later today.

ttyl,
Patrick


On 2013-10-31, at 5:07 PM, Anton Vodonosov <avodo...@yandex.ru> wrote:
> Another question.
> Can you try the hanging operations manually, to find out whether they are reproducible.
> For example, to recompile jwacs:
>
> rm -rf ~/.cache/common-lisp/
> time ~/lisps/ecl-bin-git/bin/ecl -norc -eval '(require :cmp)' -load ~/quicklisp/setup.lisp -eval '(ql:quickload :jwacs)' -eval '(quit)'
>
> My results on linux + gcc:
>
> real 7m26.532s
> user 7m22.304s
> sys 0m2.880s
>
> Long, but far from an hour. BTW, why it takes so long. The parse-javascript.c
> file generated by ECL from parse-javascript.lisp contains 72000 lines.
>
> Running CL-6502 testsuite:
>
> rm -rf ~/.cache/common-lisp/
> time ~/lisps/ecl-bin-git/bin/ecl -norc -load ~/quicklisp/setup.lisp -eval '(push "~/cl-test-grid/" asdf:*central-registry*)' -eval '(ql:quickload :test-grid-testsuites)' -eval '(tg-testsuites:libtest :CL-6502)' -eval '(quit)'
>
> My results:
> real 3m37.469s
> user 2m39.266s
> sys 0m34.014s
>
> Best regards,
> - Anton
>
> --
> You received this message because you are subscribed to the Google Groups "cl-test-grid" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to cl-test-grid...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

signature.asc

Patrick Stein

unread,
Nov 1, 2013, 2:02:38 PM11/1/13
to cl-tes...@googlegroups.com, Patrick Stein

I'm recompiling JWACS now. I've currently got:

UID PID TTY TIME CMD

502 4847 ttys001 5:31.30 ecl -norc -eval (require :cmp) -load /Users/cltestgrid/.quicklisp/setup.lisp -eval (ql:quickload :jwacs) -eval (quit)

502 5410 ttys001 0:00.01 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -I. -I/usr/local/include/ -g -O2 -fPIC -fno-common -g -O2 -fPIC -fno-common -D_THREAD_SAFE -Ddarwin -O2 -c /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.c -o /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.o

502 5413 ttys001 20:54.17 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -cc1 -triple x86_64-apple-macosx10.9.0 -emit-obj -disable-free -disable-llvm-verifier -main-file-name parse-javascript.c -mrelocation-model pic -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu core2 -target-linker-version 224.1 -g -coverage-file /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.o -resource-dir /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/5.0 -D _THREAD_SAFE -D darwin -I . -I /usr/local/include/ -O2 -fdebug-compilation-dir /Users/cltestgrid/cl-test-grid -ferror-limit 19 -fmessage-length 80 -stack-protector 1 -mstackrealign -fblocks -fobjc-runtime=macosx-10.9.0 -fobjc-dispatch-method=mixed -fobjc-default-synthesize-properties -fencode-extended-block-signature -fno-common -fdiagnostics-show-option -fcolor-diagnostics -o /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.o -x c /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.c


So, the ECL process has taken up 5 1/2 minutes and there is a clang process that has taken up 21 minutes. I suspect that what might have happened was the cltestgrid tried to kill ecl and managed to kill the clang process and didn't manage to the ecl process which then started eating CPU time.


I'll let it run for a bit longer and try to kill it.

ttyl,
Patrick

signature.asc

Anton Vodonosov

unread,
Nov 2, 2013, 11:09:46 AM11/2/13
to cl-tes...@googlegroups.com, Patrick Stein

01.11.2013, 22:02, "Patrick Stein" <p...@nklein.com>:


> I'm recompiling JWACS now.  I've currently got:
>
>   UID   PID TTY           TIME CMD
>
>   502  4847 ttys001    5:31.30 ecl -norc -eval (require :cmp) -load /Users/cltestgrid/.quicklisp/setup.lisp -eval (ql:quickload :jwacs) -eval (quit)
>
>   502  5410 ttys001    0:00.01 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -I. -I/usr/local/include/ -g -O2 -fPIC -fno-common -g -O2 -fPIC -fno-common -D_THREAD_SAFE -Ddarwin -O2 -c /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.c -o /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.o
>
>   502  5413 ttys001   20:54.17 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -cc1 -triple x86_64-apple-macosx10.9.0 -emit-obj -disable-free -disable-llvm-verifier -main-file-name parse-javascript.c -mrelocation-model pic -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu core2 -target-linker-version 224.1 -g -coverage-file /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.o -resource-dir /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/5.0 -D _THREAD_SAFE -D darwin -I . -I /usr/local/include/ -O2 -fdebug-compilation-dir /Users/cltestgrid/cl-test-grid -ferror-limit 19 -fmessage-length 80 -stack-protector 1 -mstackrealign -fblocks -fobjc-runtime=macosx-10.9.0 -fobjc-dispatch-method=mixed -fobjc-default-synthesize-properties -fencode-extended-block-signature -fno-common -fdiagnostics-show-option -fcolor-diagnostics -o /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.o -x c /Users/cltestgrid/.cache/common-lisp/ecl-13.5.1-44c86e08-macosx-x86/Users/cltestgrid/.quicklisp/dists/quicklisp/software/jwacs-20130312-darcs/parse-javascript.c
>
> So, the ECL process has taken up 5 1/2 minutes and there is a clang process that has taken up 21 minutes.

So, it total, the jwacs recompilation by

ecl -norc -eval (require :cmp) -load ~/.quicklisp/setup.lisp -eval (ql:quickload :jwacs) -eval (quit)

takes more than 25 minutes?
Could you try leaving the jwacs recompilation with clang until it finishes completely?

> I suspect that what might have happened was the cltestgrid tried to kill ecl and managed
> to kill the clang process and didn't manage to the ecl process which then started eating CPU time.

Here you mean the situation from your initial message,
where there is only ECL process, rith?
cl-test-grid kills only it's direct child - ECL.
Killing process tree (including clang spawned by ECL)
is only implemented for windows. That's why feel confused,
that ECL process remained in the system.

Best regards,
- Anton

Anton Vodonosov

unread,
Nov 3, 2013, 1:51:38 AM11/3/13
to cl-tes...@googlegroups.com, Patrick Stein
Also, Patrick, could you send me the agent logs
from cl-test-grid/work-dir/agent/log/ directory?

Best regards,
- Anton

Patrick Stein

unread,
Nov 3, 2013, 9:28:49 AM11/3/13
to Anton Vodonosov, cl-tes...@googlegroups.com
I will send you the logs.

The parse-javascript.c file that JWACS is trying to compile is a 4M file. I can't really believe that it should take more than an hour to compile something even that big, but I will let it run for a day.

The cl-6502 tests, however, end up unkillable. Even after a "kill -9" they still stick around as a zombie process until I reboot my machine. I don't believe the cl-6502 tests will ever complete.

... Patrick

signature.asc

Patrick Stein

unread,
Nov 4, 2013, 5:50:22 PM11/4/13
to cl-tes...@googlegroups.com, Anton Vodonosov
Compiling jwacs took 68 minutes for me:

real    68m9.271s
user    58m58.793s
sys     2m40.499s

I've attached my agent logs.  I am about to try running the cl-6502 tests again, but I am skeptical.  It has resulted in a zombie ECL process each time I've tried to kill it before and not long after I tried to kill it, other apps started to freeze up until I did a hard reboot.

ttyl,
Patrick
agent-logs.tgz

Anton Vodonosov

unread,
Nov 5, 2013, 8:43:22 PM11/5/13
to Patrick Stein, cl-tes...@googlegroups.com
Patrick, I would like to fix this problem, but without
consuming your time and attention.

Could you arrange for ssh access for me to that machine?

I do not understand why the situation you observe mayО©╫
occur - ECL process should not stay. cl-test-grid-agent
kills lisp process with signal 9 if the process exceeds
timeout of 15 minutes.

Your cl-test-grid agent logs confirm the agent successfully
detects the ECL timeouts:

<INFO> [07:21:04] tg-agent complete-test-run.lisp (proc-test-loading) - Testing load of system jwacs...
<INFO> [07:21:04] lisp-exe lisp-exe.lisp (start) - starting command: /usr/local/bin/ecl -norc -eval (require :cmp) -eval (PROGN (LOAD #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/setup.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common-asdf.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-test-loading.lisp") (SET-RESPONSE #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/response7868847252545089198.lisp" (TEST-LOADING-MAIN #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/logs/jwacs-load" "jwacs" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/fasl/"))) -eval (ext:quit)
<WARN> [07:36:05] lisp-exe lisp-exe.lisp (run-with-timeout) - Lisp process #<ECL #x30200214505D> ((PROGN (LOAD #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/setup.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-common-asdf.lisp") (LOAD #P"/Users/cltestgrid/cl-test-grid/agent/proc-test-loading.lisp") (SET-RESPONSE #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/response7868847252545089198.lisp" (TEST-LOADING-MAIN #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/logs/jwacs-load" "jwacs" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/quicklisp/" #P"/Users/cltestgrid/cl-test-grid/work-dir/agent/test-runs/20131030171642-ecl-13.5.1-44c86e08-macosx-x86-lisp-to-c/fasl/")))) exceeded the timeout of 900 seconds. Trying to kill the process and it's possible child processes
<INFO> [07:36:05] lisp-exe lisp-exe.lisp (try-to-kill-process-tree) - Trying to kill the process tree of #<EXTERNAL-PROCESS (/usr/local/bin/ecl -norc -eval ...)[62629] (RUNNING) #x30200410600D>
<WARN> [07:36:05] lisp-exe lisp-exe.lisp (unix-kill-process-tree) - Killing the process tree for non-windows platforms is not implemented yet. Just killing the process.
<INFO> [07:36:05] tg-agent complete-test-run.lisp (proc-test-loading) - Child lisp testing loading the jwacs system has exceeded the timeout of 900 seconds and killed.
<INFO> [07:36:05] tg-agent complete-test-run.lisp (proc-test-loading) - The jwacs system load result: (:STATUS :TIMEOUT)



Still, the ECL process is alive. Probably the kill operation was not successful...

Are you using CCL to run the main agent program, as it is recommended in the docs?
Or you use some other lisp, like SBCL?

Best regards,
- Anton


05.11.2013, 02:50, "Patrick Stein" <patrick...@gmail.com>:
>> Compiling jwacs took 68 minutes for me:
>>
>> real О©╫ О©╫68m9.271s
>> user О©╫ О©╫58m58.793s
>> sys О©╫ О©╫ 2m40.499s
>>
>> I've attached my agent logs. О©╫I am about to try running the cl-6502 tests again, but I amО©╫skeptical. О©╫It has resulted in a zombie ECL process each time I've tried to kill itО©╫before and not long after I tried to kill it, other apps started to freeze up until I didО©╫a hard reboot.
>>
>> ttyl,
>> Patrick
>
> On Sunday, November 3, 2013 8:28:49 AM UTC-6, Patrick Stein wrote:> On 2013-11-03, at 1:51 AM, Anton Vodonosov <avodo...@yandex.ru> wrote:
>>
>>> Also, Patrick, could you send me the agent logs
>>> from cl-test-grid/work-dir/agent/log/ directory?
>>>
>>> Best regards,
>>> - Anton
>>
>> I will send you the logs.
>>
>> The parse-javascript.c file that JWACS is trying to compile is a 4M file. О©╫I can't really believe that it should take more than an hour to compile something even that big, but I will let it run for a day.
>>
>> The cl-6502 tests, however, end up unkillable. О©╫Even after a "kill -9" they still stick around as a zombie process until I reboot my machine. О©╫I don't believe the cl-6502 tests will ever complete.
>>
>> ... Patrick
>

Anton Vodonosov

unread,
Nov 7, 2013, 1:42:37 PM11/7/13
to Patrick Stein, cl-tes...@googlegroups.com
Patrick, I am experimenting on your machine
with ECL tests.

I have created an isolated test case which runs
ECL tests of jwacs and cl-6502 under supervision
of main test-grid agent process (the same way
it is usually work).

You can find my code in /Users/avodonosov/testgrid-investigation.lisp

So far in my experiments the main process correctly
detects timeout and kills the child ECL process.

The clang subprocesses spawned by ECL remain running,
because test-grid-agent does not know how to kill unix.
I am working to fix this, so that test-grid-agent will
kill ECL and any subprocesses started by ECL.

Still, I can not reproduce the situation you describe,
when ECL remains running. I carefully use the same
ECL as you use in your cl-test-grid configuration,
and the same CCL for the main process.

Can it be that when you observed this situation you had
the main agent process already killed somehow?

For example, in you first email you show the hanging ECL process
as this:

502 89579 ?? 1176:34.84 /usr/local/bin/ecl [...]

I do not understand the "??". It looks like unknown parent process ID.
But it strange, if the parent process (main agent process) was killed,
the ECL is usually adopted by init, so I would expect the parent process
ID to be 1.

What program do you use to monitor processes and find out
the CMU time consumed by the process?

Best regards,
- Anton

Patrick Stein

unread,
Nov 7, 2013, 2:01:54 PM11/7/13
to cl-tes...@googlegroups.com

On 2013-11-07, at 12:42 PM, Anton Vodonosov <avodo...@yandex.ru> wrote:

> Still, I can not reproduce the situation you describe,
> when ECL remains running. I carefully use the same
> ECL as you use in your cl-test-grid configuration,
> and the same CCL for the main process.

I'm running the cl-6502 test manually still as user cltestgrid. I believe that if I kill it, it will become a zombie process though I'm not sure as this time I ran it from a shell instead of from cron(8).

> Can it be that when you observed this situation you had
> the main agent process already killed somehow?

It is possible. I don't know. I know that it always catches me off guard when my CPU fan starts churning away on cltestgrid stuff. I have it on a cron(8) job and always forget. So, I often kill off a few java processes from ABCL tests before I realize that it just that cron job. I may have killed off the top-level test grid, too, but I don't think I'd have killed a ccl process without figuring out exactly what it was trying to run.

> For example, in you first email you show the hanging ECL process
> as this:
>
> 502 89579 ?? 1176:34.84 /usr/local/bin/ecl [...]

That's the output from 'ps -u cltestgrid'. The '??' is not the parent pid, it is the tty. As this was started from cron(8), there wouldn't have needed to be a tty and even if there were, it wouldn't need it once the parent shell is dead. I don't recall what 'ps -lu cltestgrid' said about who the parent was.

> What program do you use to monitor processes and find out
> the CMU time consumed by the process?

I look mostly at 'ps -u cltestgrid' or 'ps -u avodonosov' today. Sometimes I run 'top -o cpu' to see what's currently causing my fan to spin.

ttyl,
Patrick
signature.asc

Anton Vodonosov

unread,
Nov 7, 2013, 2:05:42 PM11/7/13
to cl-tes...@googlegroups.com, Patrick Stein
OK, I continue to investigate

Anton Vodonosov

unread,
Nov 7, 2013, 10:01:33 PM11/7/13
to cl-tes...@googlegroups.com, Patrick Stein
Patrick,

For the two problems you reported I have the following results:

1. jwacs compilation leaves clang processes compiling parse-javascript.c
О©╫О©╫
О©╫This problem reproduces very reliably and is caused by the fact
О©╫О©╫О©╫that on unix test-grid-agent only kills the hanging ECL, but not
О©╫О©╫О©╫ECL's child processes.

О©╫О©╫О©╫I have implemented a fix for this - new cl-test-grid-agent
О©╫О©╫О©╫kills the whole process tree on unix too.

О©╫О©╫О©╫Tested the jwacs use case with new cl-test-grid
О©╫О©╫О©╫on your machine. The clang processes do not remain anymore.

2. cl-6502 testsuite leaves ECL process in the system.

О©╫О©╫О©╫It is difficult to reproduce. I have tried many times,
О©╫О©╫О©╫both manually and from test-grid-agent.
О©╫О©╫О©╫But only once I saw an orphaned ECL process remained
О©╫О©╫О©╫in the system. Also, another time I saw two ECL processes
О©╫О©╫О©╫during cl-6502 testing; and one of the processes was
О©╫О©╫О©╫a child of another.

О©╫О©╫О©╫I think it is possible that ECL forks itself (for example
О©╫О©╫О©╫to invoke C compiler), but the forked child process hangs.
О©╫О©╫О©╫If this is true, then test-grid-agent has killed parent
О©╫О©╫О©╫ECL, and the child ECL remained in the system.

О©╫О©╫О©╫If this is true, then the fix I have implemented solves this
О©╫О©╫О©╫problem too.

To summarize, I have fixed the problem causing the most
of the trouble, because it reproduces all the time.
And most likely the another use case was caused by the same
reason, so it should be fixed too.

I want to test these two use-cases on your machine some more times,
maybe tomorrow.

If you encounter the hanging ECL again, please let me know.

Best regards,
- Anton

Patrick Stein

unread,
Nov 7, 2013, 10:04:13 PM11/7/13
to cl-tes...@googlegroups.com
Thank you!

-- Patrick <p...@nklein.com>

> On Nov 7, 2013, at 9:01 PM, Anton Vodonosov <avodo...@yandex.ru> wrote:
>
> Patrick,
>
> For the two problems you reported I have the following results:
>
> 1. jwacs compilation leaves clang processes compiling parse-javascript.c
>
> This problem reproduces very reliably and is caused by the fact
> that on unix test-grid-agent only kills the hanging ECL, but not
> ECL's child processes.
>
> I have implemented a fix for this - new cl-test-grid-agent
> kills the whole process tree on unix too.
>
> Tested the jwacs use case with new cl-test-grid
> on your machine. The clang processes do not remain anymore.
>
> 2. cl-6502 testsuite leaves ECL process in the system.
>
> It is difficult to reproduce. I have tried many times,
> both manually and from test-grid-agent.
> But only once I saw an orphaned ECL process remained
> in the system. Also, another time I saw two ECL processes
> during cl-6502 testing; and one of the processes was
> a child of another.
>
> I think it is possible that ECL forks itself (for example
> to invoke C compiler), but the forked child process hangs.
> If this is true, then test-grid-agent has killed parent
> ECL, and the child ECL remained in the system.
>
> If this is true, then the fix I have implemented solves this
> problem too.
>
> To summarize, I have fixed the problem causing the most
> of the trouble, because it reproduces all the time.
> And most likely the another use case was caused by the same
> reason, so it should be fixed too.
>
> I want to test these two use-cases on your machine some more times,
> maybe tomorrow.
>
> If you encounter the hanging ECL again, please let me know.
>
> Best regards,
> - Anton
>

Anton Vodonosov

unread,
Nov 10, 2013, 7:09:30 PM11/10/13
to cl-tes...@googlegroups.com, Patrick Stein
Patrick, unfortunately, yesterday I have reproduced the hanging ECL.

I have started a loop of testing jwacs and cl-6502 5 times.
One of the cl-6502 tests hung, here are tg-agent logs:

<INFO> [12:31:15] tg-agent complete-test-run.lisp (proc-run-libtest) - Starting CL-6502 test suite...
<INFO> [12:31:15] lisp-exe lisp-exe.lisp (start) - starting command: /usr/local/bin/ecl -norc -eval (require :cmp) -eval (progn (load #P"/Users/avodonosov/quicklisp/setup.lisp") (load #P"/Users/avodonosov/cl-test-grid/agent/proc-common.lisp") (load #P"/Users/avodonosov/cl-test-grid/agent/proc-common-asdf.lisp") (load #P"/Users/avodonosov/cl-test-grid/agent/proc-run-libtest.lisp") (set-response #P"/Users/avodonosov/test-run/response8409641804721035492.lisp" (run-libtest-main :cl-6502 #P"/Users/avodonosov/test-run/logs/cl-6502" #P"/Users/avodonosov/quicklisp/" #P"/Users/avodonosov/test-run/fasl/"))) -eval (ext:quit)
<WARN> [12:46:16] lisp-exe lisp-exe.lisp (run-with-timeout) - Lisp process #<ECL #x302002597F2D> ((PROGN (LOAD #P"/Users/avodonosov/quicklisp/setup.lisp") (LOAD #P"/Users/avodonosov/cl-test-grid/agent/proc-common.lisp") (LOAD #P"/Users/avodonosov/cl-test-grid/agent/proc-common-asdf.lisp") (LOAD #P"/Users/avodonosov/cl-test-grid/agent/proc-run-libtest.lisp") (SET-RESPONSE #P"/Users/avodonosov/test-run/response8409641804721035492.lisp" (RUN-LIBTEST-MAIN :CL-6502 #P"/Users/avodonosov/test-run/logs/cl-6502" #P"/Users/avodonosov/quicklisp/" #P"/Users/avodonosov/test-run/fasl/")))) exceeded the timeout of 900 seconds. Trying to kill the process and its possible child processes
<INFO> [12:46:16] lisp-exe lisp-exe.lisp (try-to-kill-process-tree) - Trying to kill the process tree of #<EXTERNAL-PROCESS (/usr/local/bin/ecl -norc -eval ...)[5188] (RUNNING) #x30200271D2BD>
<INFO> [12:46:16] lisp-exe lisp-exe.lisp (exec) - running command: /bin/sh /Users/avodonosov/cl-test-grid/agent/killproctree.sh 5188 9
<INFO> [12:46:39] tg-agent complete-test-run.lisp (proc-run-libtest) - Child lisp process running CL-6502 test suite has exceeded the timeout of 900 seconds and killed.
<INFO> [12:46:39] tg-agent complete-test-run.lisp (proc-run-libtest) - The CL-6502 test suite result: (:STATUS :TIMEOUT)

According to these logs, tg-agent successfully detected the ECL timeout
and tried to kill it. Yet, hours later, I did

patrick:~ avodonosov$ ps -lu avodonosov | grep ecl
504 5188 4637 4004 0 31 0 2520516 66020 - R 0 ?? 304:02.56 /usr/local/bin/ecl -norc -eval (require :cmp) -eval (progn (load #P"/Users/avodonosov/quicklisp/setup.lisp") (load #P"/Users/avodonosov/cl-test-grid/agent/proc-common.lisp") (load #P"/Users/avodonosov/cl-test-grid/agent/proc-common-asdf.lisp") (load #P"/Users/avodonosov/cl-test-grid/agent/proc-run-libtest.lisp") (set-response #P"/Users/avodonosov/test-run/response8409641804721035492.lisp" (run-libtest-main :cl-6502 #P"/Users/avodonosov/test-run/logs/cl-6502" #P"/Users/avodonosov/quicklisp/" #P"/Users/avodonosov/test-run/fasl/"))) -eval (ext:quit)

As we can see from process ID and the response8409641804721035492.lisp
file name in the process arguments, it is exactly the same process
tg-agent tried to kill. It is not a child of ECL, it is the root ECL
process running cl-6502 test suite. The process state is R - means running.

I know for sure that the
/bin/sh /Users/avodonosov/cl-test-grid/agent/killproctree.sh 5188 9
command invoked by agent usually kills processes. I tested it manually,
and also tg-agent killed many ECL processes compiling jwacs, together
with child clang processes.

I have tried this command again, from shell:

patrick:~ avodonosov$ /bin/sh /Users/avodonosov/cl-test-grid/agent/killproctree.sh 5188 9
patrick:~ avodonosov$ ps -lu avodonosov | grep ecl
504 5188 4637 6004 0 0 0 0 0 - ?E 0 ?? 0:00.00 (ecl)

The process remain in the system, its state is now ?E, with zero memory sizes.
THe main character is ? - ps doesn't know how to this status.
The E flag means "the process is trying to exit", according to 'man ps'.

So, we again observe the ECL process, which survives 'kill -9'.

I tried to attache to the process with debugger, but it was
too late after the second kill attempt - debugger was unable
to attach.

I have been refreshing my knowledge or unix signals.
The docs say that if the process is performing a system call,
like IO, the SIGKILL signal does not immediately kill it.
But when the system call finishes, the signal is delivered
and the process is killed.

But in our case ECL hangs for hours and hours.

Gnu C library docs even go as far as saying
"In fact, if SIGKILL fails to terminate a process, that by itself
constitutes an operating system bug which you should report."
http://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

I want to reproduce this problem again on your machine,
,attach to ECL with debugger, and try to understand what
it is doing (waiting for).

Also I will try to reproduce the same problem on linux.

On Monday and Tuesday I most likely will be unable to work
on this problem, so probably I will return to it in the
middle of the week.

Best regards,
- Anton

Patrick Stein

unread,
Nov 10, 2013, 7:47:35 PM11/10/13
to cl-tes...@googlegroups.com
Unfortunately, I will be travelling for work this week and next. My computer will be offline from Monday evening until Friday afternoon this week and Tuesday through Friday the following week.

-- Patrick <p...@nklein.com>

Anton Vodonosov

unread,
Nov 10, 2013, 7:53:07 PM11/10/13
to cl-tes...@googlegroups.com, Patrick Stein
OK. I can try with linux meantime.

11.11.2013, 04:47, "Patrick Stein" <p...@nklein.com>:
> Unfortunately, I will be travelling for work this week and next. О©╫My computer will be offline from Monday evening until Friday afternoon this week and Tuesday through Friday the following week.
>
> -- Patrick <p...@nklein.com>

Anton Vodonosov

unread,
Dec 23, 2013, 1:16:33 AM12/23/13
to Patrick Stein, cl-tes...@googlegroups.com
Hi, Patrick.

I have been trying to reproduce the problem on a multicore linux system.
Tested around 100-200 times, ECL never hangs on the cl-6502 testsuite.

Sometimes the testsuite takes more than 15 minutes timeout,
and tg-agent kills ECL successfully. If I remove the timeout
cl-6502 testsuite just finishes in about 20 minutes.

I would like to ensure I am testing exactly the same version
of ECL. What (asdf:implementation-identifier) returns on your ECL?

Best regards,
- Anton

11.11.2013, 04:53, "Anton Vodonosov" <avodo...@yandex.ru>:
> OK. I can try with linux meantime.
>
> 11.11.2013, 04:47, "Patrick Stein" <p...@nklein.com>:
>
>> О©╫Unfortunately, I will be travelling for work this week and next. О©╫My computer will be offline from Monday evening until Friday afternoon this week and Tuesday through Friday the following week.
>>
>> О©╫-- Patrick <p...@nklein.com>
Reply all
Reply to author
Forward
0 new messages