[AOLSERVER] Problem with ns_shutdown


Porter, Caroline

Mar 1, 2012, 9:08:12 AM
to aolserv...@lists.sourceforge.net

We are shutting down aolserver via the control port using the ns_shutdown command.  We are getting intermittent coredumps during the shutdown process.  Does anyone have any ideas as to how to resolve this?

 

Here’s some more info…

webserver log:

[29/Feb/2012:08:20:02][30350.82082672][-nscp:1-] Notice: nscp: 127.0.0.1 connected
[29/Feb/2012:08:20:03][30350.82082672][-nscp:1-] Notice: nscp: nsadmin logged in
[29/Feb/2012:08:20:04][30350.4151592640][-main-] Notice: nsmain: AOLserver/4.5.1 stopping
[29/Feb/2012:08:20:04][30350.4151592640][-main-] Notice: driver: stopping: nssock
[29/Feb/2012:08:20:04][30350.4151592640][-main-] Notice: sched: shutdown pending
[29/Feb/2012:08:20:04][30350.131660656][-socks-] Notice: socks: shutdown pending
[29/Feb/2012:08:20:04][30350.4141099888][-sched-] Notice: sched: shutdown started
[29/Feb/2012:08:20:04][30350.4141099888][-sched-] Notice: sched: waiting for event threads...
[29/Feb/2012:08:20:04][30350.131660656][-socks-] Notice: nscp: shutdown
[29/Feb/2012:08:20:04][30350.66386800][-sched:idle1-] Notice: exiting
[29/Feb/2012:08:20:04][30350.148007792][-sched:idle0-] Notice: exiting
[29/Feb/2012:08:20:04][30350.131660656][-socks-] Notice: socks: shutdown complete
[29/Feb/2012:08:20:04][30350.56376176][-nssock:driver-] Notice: exiting
[29/Feb/2012:08:20:04][30350.4141099888][-sched-] Notice: sched: shutdown complete
[29/Feb/2012:08:20:04][30350.4151592640][-main-] Notice: driver: stopped: nssock
[29/Feb/2012:08:20:05][30350.82082672][-nscp:1-] Notice: nscp: 127.0.0.1 disconnected
[29/Feb/2012:08:20:05][30350.56376176][-shutdown-] Notice: Shutdown called for server bwd
[29/Feb/2012:08:20:05][30350.56376176][-shutdown-] Notice: nslog: closing '/data/bwd/logs/httpd_access_stg_delray.bna.com_5000.log'
[29/Feb/2012:08:20:05][30350.4151592640][-main-] Notice: nsmain: AOLserver/4.5.1 exiting
called Tcl_FindHashEntry on deleted table

 

Here’s what is in the coredump…

Program terminated with signal 6, Aborted.
#0  0x0071d430 in __kernel_vsyscall ()
#1  0x0036ab71 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x0036c44a in abort () at abort.c:92
#3  0x002e8ddf in Tcl_PanicVA () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#4  0x002e8e04 in Tcl_Panic () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#5  0x002bccea in BogusFind () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#6  0x00304de1 in ThreadStorageGetHashTable () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#7  0x00304f0c in TclpThreadDataKeyGet () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#8  0x00303d28 in Tcl_GetThreadData () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#9  0x002e8545 in TclFreeObj () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#10 0x0030f8b0 in FreeVarEntry () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#11 0x002bc845 in Tcl_DeleteHashTable () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#12 0x0031052e in UnsetVarStruct () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#13 0x0031080f in TclDeleteNamespaceVars () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#14 0x002dfda8 in TclTeardownNamespace () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#15 0x002e0045 in Tcl_DeleteNamespace () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#16 0x002dfeab in TclTeardownNamespace () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#17 0x002e0045 in Tcl_DeleteNamespace () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#18 0x002dfeab in TclTeardownNamespace () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#19 0x002647a7 in DeleteInterpProc () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#20 0x002f47a4 in Tcl_EventuallyFree () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#21 0x00264702 in Tcl_DeleteInterp () from /apps/bos-dev/bwd/lib/libtcl8.5.so
#22 0x0014dd2f in Ns_TclDestroyInterp () from /apps/bos-dev/bwd/lib/libnsd.so
#23 0x0014e508 in DeleteData () from /apps/bos-dev/bwd/lib/libnsd.so
#24 0x00ca6479 in NsCleanupTls () from /apps/bos-dev/bwd/lib/libnsthread.so
#25 0x00ca81e2 in FreeThread () from /apps/bos-dev/bwd/lib/libnsthread.so
#26 0x00174a8a in __nptl_deallocate_tsd (arg=0x4e47b70) at pthread_create.c:154
#27 start_thread (arg=0x4e47b70) at pthread_create.c:308
#28 0x0041cc2e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:133

 

 

 

Rusty Brooks

Mar 1, 2012, 9:56:07 AM
to Porter, Caroline, aolserv...@lists.sourceforge.net
This rings a bell, not from AOLServer, but from using threaded tcl in general.  I'm sorry it's a bit fuzzy but I seem to recall it happening from some kind of race condition causing a teardown procedure to happen more than once (and the second one causes the core dump because all the things it's trying to free are gone).  I was using a threaded tclkit at the time that I saw this a lot and I think it largely went away when I upgraded to a newer version of tcl, but it's been a couple years since I dealt with it, so I really don't remember.

If you google "called Tcl_FindHashEntry on deleted table" you will find a LOT of stuff.

Rusty

aolserver-talk mailing list
aolserv...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/aolserver-talk

Victor Guerra

Mar 1, 2012, 10:35:53 AM
to Porter, Caroline, aolserv...@lists.sourceforge.net
Which version of tcl are you running? 





--
-vg

Porter, Caroline

Mar 1, 2012, 1:43:24 PM
to Victor Guerra, aolserv...@lists.sourceforge.net

tcl 8.5.9

 

Caroline

Jim Davidson

Mar 2, 2012, 8:36:05 AM
to Porter, Caroline, aolserv...@lists.sourceforge.net, Victor Guerra


Hi,

It appears this crash is Tcl trying to free some per-thread context in a thread that's exiting after Aolserver is done with its cleanup of Tcl.  Checking the latest Aolserver source on GitHub shows a final call to Tcl_Finalize in nsd/nsmain.c just before the return.  If you're running this version, try commenting it out to see if the crash goes away.



More detail on what MAY be happening....


What ns_shutdown does is send a signal for the main thread to initiate shutdown.  The main thread then sends signals to all the subsystems (conn threads, scheduler, etc.) to shutdown, waits for those shutdowns to complete (i.e., threads to exit), and then does some final cleanup. As threads exit, they call various per-thread cleanup handlers which rely on per-process state including the Tcl core.
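
A minimal C sketch of that sequence (signal, wait for threads to exit, then final cleanup), assuming plain pthreads rather than Aolserver's own thread API. The `Subsystem` struct, `worker`, and `shutdown_demo` are hypothetical names for illustration and are not taken from the Aolserver source:

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_WORKERS 16

/* Hypothetical subsystem state; the names are illustrative only. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            stop;     /* set by the main thread to request shutdown */
    int             cleaned;  /* workers that have run their exit cleanup */
} Subsystem;

static void *worker(void *arg)
{
    Subsystem *s = arg;

    pthread_mutex_lock(&s->lock);
    while (!s->stop) {
        pthread_cond_wait(&s->cond, &s->lock);  /* idle until signalled */
    }
    s->cleaned++;                               /* stand-in for per-thread cleanup */
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Start up to MAX_WORKERS threads, signal shutdown, wait for every thread
 * to exit, and return how many ran their cleanup. Process-wide cleanup
 * would be safe only after the joins have returned. */
int shutdown_demo(int n)
{
    Subsystem s;
    pthread_t tids[MAX_WORKERS];
    int i, started = 0;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.cond, NULL);
    s.stop = false;
    s.cleaned = 0;

    for (i = 0; i < n && i < MAX_WORKERS; i++) {
        if (pthread_create(&tids[started], NULL, worker, &s) != 0) {
            break;
        }
        started++;
    }

    pthread_mutex_lock(&s.lock);
    s.stop = true;                      /* the "shutdown pending" signal */
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.lock);

    for (i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);    /* wait for the thread to exit */
    }

    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return s.cleaned;
}
```

The point of the sketch is the ordering: anything the per-thread cleanup depends on has to stay alive until the last join returns, which only works for threads the main thread knows about.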

From your stack trace it looks like some thread possibly created outside of one of these subsystems is exiting after Aolserver thinks shutdown is complete and after Aolserver has called Tcl_Finalize to tear down the Tcl core.  Such a thread could be created by 3rd party code that calls pthread_create directly (or even Ns_ThreadCreate directly), and then later calls into Aolserver.  While Aolserver attempts to carefully manage all the threads it knows about, and there's considerable code to gracefully signal and wait for these threads, it can't really control when, if, and how, these other threads exit.
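
The suspected ordering can be imitated without Tcl at all. The following is only a sketch under stated assumptions: `core_finalized` is a stand-in flag for "Tcl_Finalize has run", `stray_thread` plays the thread nobody waited for, and all names are hypothetical. It shows the per-thread cleanup running after the "core" is gone, which is the point where the real code panics:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static pthread_key_t   key;
static int             per_thread_data;  /* dummy per-thread value */
static bool            core_finalized;   /* stand-in for "Tcl_Finalize has run" */
static bool            late_cleanup;     /* set if cleanup ran after finalize */

/* Per-thread cleanup. In the real crash this is where Tcl gets called
 * after its core has already been torn down. */
static void tls_cleanup(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    if (core_finalized) {
        late_cleanup = true;
    }
    pthread_mutex_unlock(&lock);
}

/* A thread the main thread does not wait for before "finalizing". */
static void *stray_thread(void *unused)
{
    (void)unused;
    pthread_setspecific(key, &per_thread_data);

    pthread_mutex_lock(&lock);
    while (!core_finalized) {
        pthread_cond_wait(&cond, &lock);  /* outlive the finalize step */
    }
    pthread_mutex_unlock(&lock);
    return NULL;                          /* tls_cleanup runs as the thread exits */
}

/* Returns true if the per-thread cleanup ran after the finalize step. */
bool late_cleanup_demo(void)
{
    pthread_t tid;
    bool seen;

    core_finalized = false;
    late_cleanup = false;
    if (pthread_key_create(&key, tls_cleanup) != 0) {
        return false;
    }
    if (pthread_create(&tid, NULL, stray_thread, NULL) != 0) {
        pthread_key_delete(key);
        return false;
    }

    pthread_mutex_lock(&lock);
    core_finalized = true;                /* "finalize" while the thread is alive */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);              /* only so the demo can read the result */
    seen = late_cleanup;
    pthread_key_delete(key);
    return seen;
}
```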

It turns out the Aolserver API is designed to attempt to handle this situation a bit, but Tcl generally is not.  This is a symptom of the different approaches to thread cleanup in Tcl and Aolserver.  Aolserver follows the pthread model, which calls registered cleanup routines in order and then tries again a few times if necessary, in case some cleanup accidentally re-initializes some resource (see the comment in NsCleanupTls in nsthread/tls.c).
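
The retry behaviour is easy to see with raw pthread keys. This is a sketch of the POSIX mechanism, not of NsCleanupTls itself: a destructor that puts its value back into the slot is called again on the next pass, for up to PTHREAD_DESTRUCTOR_ITERATIONS passes. The names `cleanup` and `destructor_calls_demo` are made up for the example:

```c
#include <pthread.h>

static pthread_key_t key;
static int slot_value;   /* dummy object stored in the per-thread slot */
static int calls;        /* how many times the cleanup ran */

static void cleanup(void *arg)
{
    calls++;
    if (calls == 1) {
        /* Simulate a cleanup that accidentally re-initializes the slot. */
        pthread_setspecific(key, arg);
    }
}

static void *thread_main(void *unused)
{
    (void)unused;
    pthread_setspecific(key, &slot_value);
    return NULL;  /* destructors run now: once, then again for the re-set slot */
}

/* Returns the number of destructor calls made for one thread exit,
 * or -1 if the key or thread could not be created. */
int destructor_calls_demo(void)
{
    pthread_t tid;

    calls = 0;
    if (pthread_key_create(&key, cleanup) != 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, thread_main, NULL) != 0) {
        pthread_key_delete(key);
        return -1;
    }
    pthread_join(tid, NULL);
    pthread_key_delete(key);
    return calls;
}
```

On a POSIX-conforming system the cleanup runs twice here: once for the original value and once more because the first call stored a value again.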

Tcl instead provides various callback mechanisms for cleanup, and there's much care and coordination in the Tcl core to ensure things are cleaned up in the right order.  However, as Tcl is designed to be embedded in other code, this level of care cannot be guaranteed outside the core.  My opinion is that it was always unfortunate Tcl chose this model given its goals and constraints.  Another way to look at it is that in Aolserver, the correct order is a matter of optimization whereas in Tcl it's a necessity.

The pthread model, while not perfect, in practice always seemed more robust for far fewer lines of code.  Admittedly, Aolserver doesn't care so much, as exit really is about graceful shutdown of transaction processing threads -- the rest is just aesthetics, as the _exit() will evaporate memory, open files, etc., efficiently and accurately.  Evidently Tcl has long operated in some embedded systems where cleanup needed to be an actual cleanup.  These use cases pre-dated threaded Tcl, and the old cleanup interfaces were extended for threaded code instead of introducing a new model.

Anyway, as I could never really get Tcl cleanup to operate in a reliable way and because it didn't really matter to Aolserver, the call to Tcl_Finalize had been commented out for years. As this has become a recurring problem, I'd suggest now it should be a config option, default off.  In the off-chance someone really needs Tcl_Finalize, they could set the option on.
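
A rough sketch of what such an option could look like, default off. Everything here is hypothetical: `finalize_enabled`, `maybe_finalize`, and `fake_finalize` are invented for the example, the accepted values are an assumption, and real code would read the setting through Aolserver's config API and pass Tcl_Finalize as the callback:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option parser: a missing or unrecognized value means "off". */
bool finalize_enabled(const char *value)
{
    if (value == NULL) {
        return false;
    }
    return strcmp(value, "on") == 0
        || strcmp(value, "true") == 0
        || strcmp(value, "1") == 0;
}

/* Call the finalize routine only when the option is on.
 * Returns true if the routine was called. */
bool maybe_finalize(const char *value, void (*finalize)(void))
{
    if (!finalize_enabled(value)) {
        return false;   /* default: let process exit reclaim everything */
    }
    finalize();
    return true;
}

/* Stand-in for Tcl_Finalize so the sketch builds without Tcl. */
int finalize_calls;

void fake_finalize(void)
{
    finalize_calls++;
}
```

With the default left off, `maybe_finalize(NULL, fake_finalize)` does nothing; only an explicit `"on"` (or the other accepted spellings assumed here) triggers the call.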

Of course you could have some other problem.  If this doesn't help, you could try compiling with symbols and poking around in the core dump for some more clues.


Cheers,
-Jim

Maurizio Martignano

Mar 1, 2012, 11:49:20 PM
to Porter, Caroline, Victor Guerra, aolserv...@lists.sourceforge.net

Dear Caroline,

I use Aolserver in Windows environments.

 

There the shutdown is always a problem if nsmain.c calls Tcl_Finalize.

To avoid any problem I had to comment out this call:

#ifndef _WIN32
    Tcl_Finalize();
#endif

I believe that if you do the same, that is, if you comment out this call, you may avoid the race conditions you are experiencing.

Please note that not calling Tcl_Finalize is not an issue, because the Aolserver process is about to die anyway and the operating system cleans up all allocated resources.

 

Hope it helps,

Maurizio

Gustaf Neumann

Mar 5, 2012, 5:29:01 AM
to aolserv...@lists.sourceforge.net
Dear Caroline,

We had a long discussion here in the list whether or not one should call Tcl_Finalize() during cleanup. While i am not in favor of commenting Tcl_Finalize() out on a unix-like os, i think that the possible harm of doing so is limited. A crash during finalize can hint at a problem sitting somewhere else.

Concerning your crash-case: do you experience crashes on every ns_shutdown?

i would recommend to upgrade to tcl 8.5.11 if possible.

best regards
-gustaf neumann

Maurizio Martignano

Mar 5, 2012, 10:28:24 AM
to Gustaf Neumann, aolserv...@lists.sourceforge.net

Dear Gustav,

 

I believe you should also comment on what Jim wrote:

 

Anyway, as I could never really get Tcl cleanup to operate in a reliable way and because it didn't really matter to Aolserver, the call to Tcl_Finalize had been commented out for years. As this has become a recurring problem, I'd suggest now it should be a config option, default off.  In the off-chance someone really needs Tcl_Finalize, they could set the option on.

 

 

All the best,

Maurizio

Jim Davidson

Mar 5, 2012, 12:11:45 PM
to Maurizio Martignano, aolserv...@lists.sourceforge.net

Howdy,

The more I think about it, the config option makes good sense as a compromise.  Gustaf's point about a crash hinting at other problems is a strong reason to leave it working in some fashion instead of just dumping altogether as I had in the past.


/* BEGIN OPINION ... consider enjoying Facebook instead of reading this babble...  */

But I will reiterate one thing:  the manner in which Tcl handles cleanup makes it technically impossible to avoid all possible crashes, assuming you allow Tcl to be used in other programs where you do not control all of the code.

To be clear, the pthread model that Aolserver follows (i.e., iterative calls to cleanups) cannot be guaranteed to work either -- you could conceive of a condition where thread cleanup A re-initializes B which re-initializes A, etc.  That's why there's a retry count, set to 5.  It attempts to catch the normal case of going through the cleanups just once, tries again in the off-chance something got re-initialized, and once again in the very off-chance something got re-initialized again.  And, it requires users of the interface to understand how it's done -- the biggest confusion is ignoring the object passed as a pointer to the cleanup and instead referencing it through the per-thread interface when that slot is expected to remain null.  In practice, that problem is more severe and caught during development, avoiding a run-time confusion between unrelated interfaces.
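
The confusion about the cleanup argument can be shown with raw pthread keys, used here as an analogue of the Aolserver interface rather than the interface itself. POSIX clears the slot before calling the destructor, so the object is reachable only through the argument. The names below are hypothetical:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_key_t key;
static int  object;          /* the per-thread object being cleaned up */
static bool arg_was_object;  /* the cleanup received the object as its argument */
static bool slot_was_null;   /* the slot itself was already cleared */

static void cleanup(void *arg)
{
    /* Correct: use the pointer that was passed in. */
    arg_was_object = (arg == &object);
    /* Looking the object up through the slot finds nothing at this point. */
    slot_was_null = (pthread_getspecific(key) == NULL);
}

static void *thread_main(void *unused)
{
    (void)unused;
    pthread_setspecific(key, &object);
    return NULL;
}

/* Returns true if the cleanup saw the object in its argument while the
 * per-thread slot was already NULL. */
bool cleanup_arg_demo(void)
{
    pthread_t tid;

    arg_was_object = false;
    slot_was_null = false;
    if (pthread_key_create(&key, cleanup) != 0) {
        return false;
    }
    if (pthread_create(&tid, NULL, thread_main, NULL) != 0) {
        pthread_key_delete(key);
        return false;
    }
    pthread_join(tid, NULL);
    pthread_key_delete(key);
    return arg_was_object && slot_was_null;
}
```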

The Tcl model instead has a good deal of careful code to manage the cleanup and many checks for cases in which things are out of order.  Because you can't own *all* the code that may call into Tcl, despite best efforts to clean up and unload Tcl you can in practice get tripped up either calling in before initialization or calling in after teardown.

An alternative for Tcl would be to reference all resources strictly through per-thread structures which reference global structures as necessary.  Reference counts and locks would initialize these per-thread and per-process interfaces as needed.  The last cleanup of the per-thread interface would tear down the global resources.  If things got re-initialized later, so be it.  Aolserver generally follows this model.  I think if Tcl did not pre-date threaded code, it may have more naturally followed this model.  But Tcl has a long legacy -- well over 20 years -- and it's made incremental progress towards threading over the past 10 or so years, leveraging the previous cleanup models.  Plus, Tcl operates in Windows of various DLL flavors, further complicating this (you can check the Aolserver code for how Windows is handled there -- a bit more shaky given Windows' limitations but reasonably close to pthreads).
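
A small sketch of that idea, under the assumption that each thread calls an attach routine on first use and a detach routine from its per-thread cleanup. The names `core_attach`, `core_detach`, and `global_table` are invented for illustration and do not correspond to Tcl or Aolserver internals:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static int *global_table;  /* stand-in for a process-wide resource */
static int  global_refs;   /* number of threads currently attached */

int inits;                 /* times the global resource was created */
int teardowns;             /* times the global resource was destroyed */

/* Called when a thread first needs the core: the first caller creates
 * the global resource, later callers only take a reference. */
void core_attach(void)
{
    pthread_mutex_lock(&global_lock);
    if (global_refs++ == 0) {
        global_table = calloc(16, sizeof *global_table);
        inits++;
    }
    pthread_mutex_unlock(&global_lock);
}

/* Called from a thread's cleanup: the last detach tears the global
 * resource down. A later attach simply builds it again. */
void core_detach(void)
{
    pthread_mutex_lock(&global_lock);
    if (global_refs > 0 && --global_refs == 0) {
        free(global_table);
        global_table = NULL;
        teardowns++;
    }
    pthread_mutex_unlock(&global_lock);
}
```

With this shape there is no separate process-wide finalize step to get out of order with: a thread that shows up after the last detach just re-creates what it needs.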

Now some would argue that the mechanics of per-thread resources pointing to per-process resources with locks, etc., to handle initialization and tear down is complicated and/or inefficient.  And, that the optimization benefits of code streamlined for single threading are important enough to not use this model (i.e., access globals directly instead of a stub per-thread structure for the single threaded stuff).  On both points I disagree.  And, I'm now an old curmudgeon who won't be convinced otherwise :)

Caveat:  I haven't looked directly at the code in some time, but this was a significant interest of mine years back given the problems on Aolserver.  I played around with re-structuring Tcl in this way with the possibility of submitting a TIP for it but never had time to see it through (because, well, there's so much complicated cleanup code in there already to purge).

/* ...END OPINION */


In the end, if it is Tcl_Finalize that's causing your crash, it's pretty likely you can just disable it for now and get back to business.  I hope this helps,


Cheers,
-Jim








Gustaf Neumann

Mar 6, 2012, 8:00:08 AM
to aolserv...@lists.sourceforge.net
On 05.03.12 16:27, Maurizio Martignano wrote:

Dear Gustav,

 

                I believe you should also comment Jim’s writing:

 

Anyway, as I could never really get Tcl cleanup to operate in a reliable way and because it didn't really matter to Aolserver, the call to Tcl_Finalize had been commented out for years. As this has become a recurring problem, I'd suggest now it should be a config option, default off. 


The case for windows is special, since the crash happens there during the unload of the DLLs, which is highly platform specific. The situation on unixes is different. In our configurations (mostly linux with 32 bit and 64 bit, rhel/fedora/ubuntu, as well as some mac os x), the shutdown has worked fine for many years (although we do not normally call it from the control port).

It is not the case, that Tcl_Finalize() in recent Tcl versions is inherently broken and "needs to be fixed". It is the situation as Jim points out, that some modules/packages might register handlers that have some bugs, or - which would be my primary suspect - some loaded module/package has a bug (overwritten some memory, double frees, ...) which manifests during cleanup. Cleanup is highly sensitive to bugs in the memory management.

The source of the problem should be fixed, not the symptom. It is not unlikely that the same bug will hit you in some other cases as well...

-gustaf

