Quick status update on my issues: Sometime around 6 last night (EDT), my
server did an automatic server restart (for those not familiar with
Proliant servers, there is a circuit which monitors the system response,
and if it appears to be hung - runaway process, etc. - for a certain
amount of time, it will perform what is called an ASR: an automatic cold
reboot). I logged in to check on the box about four or five hours later,
to find that httpd.exe was hogging both CPUs (again), and not responding
to requests. I was able to kill Apache and restart it, and all has since
returned to normal. Uptime on the box was about 6 days, with Apache
restarting itself anytime between every five minutes and every two or
three hours, and a cron job to manually bounce the process once per day.
I don't know if this (the CPU hogging) is directly related to the PHP
issues or not.
As soon as I'm able to take the time, I'm going to build an identical
setup just for testing purposes. I'm going to have some rack space
available soon.
Cheers/2
--
Lewis
-------------------------------------------------------------
Lewis G Rosenthal, CNA, CLP, CLE
Rosenthal & Rosenthal, LLC www.2rosenthals.com
Need a managed Wi-Fi hotspot? www.hautspot.com
Secure, stable, operating system www.ecomstation.com
-------------------------------------------------------------
massimo s.
http://www.ecomstation.it/
Lewis G Rosenthal wrote:
<snip>
Hi Lewis,
No luck so far duplicating this here. Since this is not a public server,
I haven't yet found a way to simulate sufficient traffic. The SCOUG
server just keeps chugging along, but it's a simple setup. No php and
lots of REXX CGI.
I may need to experiment with wget or ab.
>returned to normal. Uptime on the box was about 6 days, with Apache
>restarting itself anytime between every five minutes and every two or
>three hours, and a cron job to manually bounce the process once per day.
>I don't know if this (the CPU hogging) is directly related to the PHP
>issues or not.
We need to get a better picture of what is happening at the time of the
trap.
Please send me copies of the .conf files you can share via private mail.
I might have some more ideas. The config.sys I have is from July, so a
fresh copy of this might be useful too.
What I recommend for now is to enable log forensic and bump the logging level
to debug. To combat the log file size, add some code to the restart
wrapper to archive the log files before restarting httpd.
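A minimal sketch of the log-archiving step suggested above, written in Python purely for illustration (the restart wrapper in this thread is a REXX script). The function name `archive_logs`, the directory arguments, and the file-name patterns are assumptions, not anything taken from the server's configuration. It is meant to run only while httpd is stopped, so that no log file is open when it is moved.

```python
import shutil
import time
from pathlib import Path


def archive_logs(log_dir, archive_dir, patterns=("*_log", "*.log"), stamp=None):
    """Move matching log files into archive_dir, appending a timestamp to each name.

    Hypothetical helper: call it after httpd has exited and before it is restarted.
    """
    log_dir = Path(log_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # One timestamp per restart keeps the files from a single run together.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")

    # Collect matches from every pattern, skipping duplicates and directories.
    matches = {p for pattern in patterns for p in log_dir.glob(pattern) if p.is_file()}

    moved = []
    for log in sorted(matches):
        target = archive_dir / f"{log.name}.{stamp}"
        shutil.move(str(log), str(target))
        moved.append(target)
    return moved
```

Because each restart gets its own timestamp suffix, the debug-level logs from one crash stay together and the live log directory starts empty every time.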
>As soon as I'm able to take the time, I'm going to build an identical
>setup just for testing purposes. I'm going to have some rack space
>available soon.
I suspect this is only going to help if you can generate the right
traffic.
Steven
--
----------------------------------------------------------------------
"Steven Levine" <ste...@earthlink.net> eCS/Warp/DIY etc.
www.scoug.com www.ecomstation.com
----------------------------------------------------------------------
what kind of firewall do you have between your server(s) and the internet? i'd
be real interested in what snort with a certain set of rules would be sounding
alerts on if anything... in other words, i'm wondering if you may be subject to
effects of certain attacks being tried...
how did you kill apache and all the children?
On 10/03/09 05:36 pm, Steven Levine thus wrote :
> In <4AC77B3F...@2rosenthals.com>, on 10/03/09
> at 12:26 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi Lewis,
>
> No luck so far duplicating this here. Since this is not a public server,
> I haven't yet found a way to simulate sufficient traffic. The SCOUG
> server just keeps chugging along, but it's a simple setup. No php and
> lots of REXX CGI.
>
> I may need to experiment with wget or ab.
>
>
Indeed, I saw a noticeable uptick in these occurrences *after*
incorporating the VOICE stuff. Whether it is traffic-related in general
or something specific to code introduced at the time of the migration
(i.e., prior to that, I had no MediaWiki or Mantis sites configured), I
can't say.
>> returned to normal. Uptime on the box was about 6 days, with Apache
>> restarting itself anytime between every five minutes and every two or
>> three hours, and a cron job to manually bounce the process once per day.
>> I don't know if this (the CPU hogging) is directly related to the PHP
>> issues or not.
>>
>
> We need to get a better picture of what is happening at the time of the
> trap.
>
>
Indeed.
> Please send me copies of the .conf files you can share via private mail.
> I might have some more ideas. The config.sys I have is from July, so a
> fresh copy of this might be useful too.
>
>
I shall gather them right now.
> What I recommend for now is to enable log forensic and bump the logging level
> to debug. To combat the log file size, add some code to the restart
> wrapper to archive the log files before restarting httpd.
>
>
I have bumped PHP to E_ALL & E_STRICT. I can move Apache back to debug
for a while (which would surely give us more specifics on the exe). I
shall see to that shortly.
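A side note on the PHP setting, offered as a sketch rather than a diagnosis. It assumes the value was entered literally as `E_ALL & E_STRICT`, and that PHP 5.2.x defines E_ALL as 6143 and E_STRICT as 2048 (the values the PHP manual gives for that series). Under those assumptions a bitwise AND selects no error levels at all, while a bitwise OR selects every level. Python is used here only to show the arithmetic.

```python
# Assumed constants for PHP 5.2.x, as listed in the PHP manual.
E_ALL_PHP52 = 6143  # every level except E_STRICT in the 5.2 series
E_STRICT = 2048

# Bitwise AND keeps only the bits set in BOTH operands.
and_mask = E_ALL_PHP52 & E_STRICT
# Bitwise OR keeps the bits set in EITHER operand.
or_mask = E_ALL_PHP52 | E_STRICT

print(and_mask)  # 0
print(or_mask)   # 8191
```

If the intent is to log everything including strict notices, the OR form is the one that does so on a 5.2 build.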
>> As soon as I'm able to take the time, I'm going to build an identical
>> setup just for testing purposes. I'm going to have some rack space
>> available soon.
>>
>
> I suspect this is only going to help if you can generate the right
> traffic.
>
>
Indeed. The box is not heavily loaded, however. server-status routinely
shows things like "2 requests currently being processed, 16 idle
workers" which is hardly strenuous. As you say, it's the *right* traffic
we need to see the problem (and that was what led me to the discussion
of logging timestamps: trying to find a correlation between the Apache
log and the PHP one).
Stand by for those files.
On 10/03/09 10:58 pm, waldo kitty thus wrote :
> Lewis G Rosenthal wrote:
>
>> I was able to kill Apache and restart it, and all has since
>> returned to normal. Uptime on the box was about 6 days, with Apache
>> restarting itself anytime between every five minutes and every two or
>> three hours, and a cron job to manually bounce the process once per day.
>> I don't know if this (the CPU hogging) is directly related to the PHP
>> issues or not.
>>
>
> what kind of firewall do you have between your server(s) and the internet?
I have Novell Security Manager (based on Astaro version 6) running on
the border. One of these days, I'll upgrade it to Astaro v7...
> i'd
> be real interested in what snort with a certain set of rules would be sounding
> alerts on if anything... in other words, i'm wondering if you may be subject to
> effects of certain attacks being tried...
>
>
Tons; and it does run snort. Here's something typical:
Intrusion Protection Alert
An intrusion has been detected. The packet has been dropped automatically.
You can toggle this rule between "drop" and "alert only" in WebAdmin.
Details about the intrusion alert:
Message........: WEB-PHP remote include path
Details........: http://www.snort.org/pub-bin/sigs.cgi?sid=2002
Time...........: 2009:10:03-15:57:13
Packet dropped.: yes
Priority.......: 1 (high)
Classification.: Web Application Attack
IP protocol....: 6 (TCP)
Source IP address: 92.50.143.90 (92.50.143.90.static.ufanet.ru)
- http://www.dnsstuff.com/tools/ptr.ch?ip=92.50.143.90
- http://www.ripe.net/perl/whois?query=92.50.143.90
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=92.50.143.90
- http://cgi.apnic.net/apnic-bin/whois.pl?search=92.50.143.90
Source port: 60449
Destination IP address: 192.168.100.2 (hawking.randr)
- http://www.dnsstuff.com/tools/ptr.ch?ip=192.168.100.2
- http://www.ripe.net/perl/whois?query=192.168.100.2
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=192.168.100.2
- http://cgi.apnic.net/apnic-bin/whois.pl?search=192.168.100.2
Destination port: 80 (http)
Now, this happened to be dropped before it reached the server. An
example of an attack which was not dropped:
Intrusion Protection Alert
An intrusion has been detected. The packet has *not* been dropped.
If you want to block packets like this one in the future,
set the corresponding intrusion protection rule to "drop" in WebAdmin.
Be careful not to block legitimate traffic caused by false alerts though.
Details about the intrusion alert:
Message........: WEB-PHP xmlrpc.php post attempt
Details........: http://www.snort.org/pub-bin/sigs.cgi?sid=3827
Time...........: 2009:04:02-04:06:21
Packet dropped.: no
Priority.......: 1 (high)
Classification.: Web Application Attack
IP protocol....: 6 (TCP)
Source IP address: 80.240.220.83 (80-240-220-83.dnat.migtel.ru)
- http://www.dnsstuff.com/tools/ptr.ch?ip=80.240.220.83
- http://www.ripe.net/perl/whois?query=80.240.220.83
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=80.240.220.83
- http://cgi.apnic.net/apnic-bin/whois.pl?search=80.240.220.83
Source port: 51457
Destination IP address: 192.168.100.2 (hawking.randr)
- http://www.dnsstuff.com/tools/ptr.ch?ip=192.168.100.2
- http://www.ripe.net/perl/whois?query=192.168.100.2
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=192.168.100.2
- http://cgi.apnic.net/apnic-bin/whois.pl?search=192.168.100.2
Destination port: 80 (http)
(As the Snort reference on this one is too old, here are details: http://www.securityfocus.com/bid/14088/info .)
and this, from the past couple of days:
Intrusion Protection Alert
An intrusion has been detected. The packet has *not* been dropped.
If you want to block packets like this one in the future,
set the corresponding intrusion protection rule to "drop" in WebAdmin.
Be careful not to block legitimate traffic caused by false alerts though.
Details about the intrusion alert:
Message........: WEB-MISC Phorecast remote code execution attempt
Details........: http://www.snort.org/pub-bin/sigs.cgi?sid=1391
Time...........: 2009:10:04-04:25:29
Packet dropped.: no
Priority.......: 1 (high)
Classification.: Web Application Attack
IP protocol....: 6 (TCP)
Source IP address: 213.175.211.6
- http://www.dnsstuff.com/tools/ptr.ch?ip=213.175.211.6
- http://www.ripe.net/perl/whois?query=213.175.211.6
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=213.175.211.6
- http://cgi.apnic.net/apnic-bin/whois.pl?search=213.175.211.6
Source port: 55015
Destination IP address: 192.168.100.2 (hawking.randr)
- http://www.dnsstuff.com/tools/ptr.ch?ip=192.168.100.2
- http://www.ripe.net/perl/whois?query=192.168.100.2
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=192.168.100.2
- http://cgi.apnic.net/apnic-bin/whois.pl?search=192.168.100.2
Destination port: 80 (http)
( http://www.selfsecurity.org/TrendMap/signature/jpn/352.htm ) or
Intrusion Protection Alert
An intrusion has been detected. The packet has *not* been dropped.
If you want to block packets like this one in the future,
set the corresponding intrusion protection rule to "drop" in WebAdmin.
Be careful not to block legitimate traffic caused by false alerts though.
Details about the intrusion alert:
Message........: WEB-PHP PHPLIB remote command attempt
Details........: http://www.snort.org/pub-bin/sigs.cgi?sid=1254
Time...........: 2009:09:28-09:05:12
Packet dropped.: no
Priority.......: 1 (high)
Classification.: Attempted User Privilege Gain
IP protocol....: 6 (TCP)
Source IP address: 217.160.74.224 (p15169201.pureserver.info)
- http://www.dnsstuff.com/tools/ptr.ch?ip=217.160.74.224
- http://www.ripe.net/perl/whois?query=217.160.74.224
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=217.160.74.224
- http://cgi.apnic.net/apnic-bin/whois.pl?search=217.160.74.224
Source port: 53300
Destination IP address: 192.168.100.2
- http://www.dnsstuff.com/tools/ptr.ch?ip=192.168.100.2
- http://www.ripe.net/perl/whois?query=192.168.100.2
- http://ws.arin.net/cgi-bin/whois.pl?queryinput=192.168.100.2
- http://cgi.apnic.net/apnic-bin/whois.pl?search=192.168.100.2
Destination port: 80 (http)
AFAICT, none of the above (et seq) directly correlate to the times of bad Apache behavior and/or the PHP crashes we've been seeing. Most of the attempts are prefixed [CRIT-852] (dropped) vs 850 (allowed), and from all that I can see (having just again reviewed a hundred or so entries), these exploits are rather old and *should* be easily survivable by PHP > 4.40 and/or Apache > 2.2.1 or 2.0.49 or so.
> how did you kill apache and all the children?
>
>
When I did it manually? That was easy: Surface the Apache window from
which the daemon is started, and Ctrl-Brk a couple of times to break out
of my script (which normally restarts Apache when it fails). When the
window closed, I went to TOP to confirm that there were no more httpd
processes running.
My nightly kill cron job is:
0 3 * * * start "Apache Restart" /n /min c:\os2\apps\pgmcontrol\pgmcntrl /kill /exename:httpd.exe
which uses Christian Langanke's PGMCNTRL app to kill all instances of
httpd.exe by name, regardless of the PID(s). The script:
/**/
Do forever
/*
The master environment says:
'SET TZ=EST5EDT,3,2,0,7200,11,1,0,7200,3600'
which is not POSIX-compliant, so:
'SET TZ=EST5EDT4,M3.2,M11.1'
which doesn't play nicely with a bunch of things,
so let's keep it simple for DST:
*/
'SET TZ=EST4'
'set beginlibpath=j:\Apps\apache2\bin;j:\Apps\apache2\modules'
'bin\httpd.exe -d . 2>&1'
Call Syssleep 2
end
exit
will restart Apache after the cron job finishes.
However, as the system actually hung in less than 24 hours, the cron job
is apparently not helping (Apache restarts enough on its own and via the
script, I suppose).
Hi,
>incorporating the VOICE stuff. Whether it is traffic-related in general
>or something specific to code introduced at the time of the migration
>(i.e., prior to that, I had no MediaWiki or Mantis sites configured), I
>can't say.
The problems are very likely php related, but suspecting this is only the
first step to finding and fixing the issue. We really do not yet have a
clear idea as to which http request is failing. This is why I suggested
that log forensic might help.
>>> returned to normal. Uptime on the box was about 6 days, with Apache
>>> restarting itself anytime between every five minutes and every two or
>>> three hours, and a cron job to manually bounce the process once per day.
>>> I don't know if this (the CPU hogging) is directly related to the PHP
>>> issues or not.
Just to be sure I understand your meaning of restarting, do you mean that the
main apache process dies and is restarted by your wrapper script?
>I shall gather them right now.
Got 'em.
>Indeed. The box is not heavily loaded, however. server-status routinely
>shows things like "2 requests currently being processed, 16 idle
>workers" which is hardly strenuous.
This is essentially an idle server.
Hi guys,
>> how did you kill apache and all the children?
>When I did it manually? That was easy: Surface the Apache window from
>which the daemon is started, and Ctrl-Brk a couple of times to break out
>of my script (which normally restarts Apache when it fails).
FWIW, there was a time when Ctrl-C tended to hang httpd in the exit list,
but if you are not seeing this, there's no problem using Ctrl-C.
Out of habit, I use a version of apachectl.cmd which finds the pid file
and uses
apache_kill -TERM
which is supposed to provide a more graceful shutdown.
> 'bin\httpd.exe -d . 2>&1'
I recommend you add
say 'Restarting httpd at' date() time()
'bin\httpd.exe -d . 2>&1'
say 'httpd stopped at' date() time()
This tells you where to start looking in the logs.
>However, as the system actually hung in less than 24 hours, the cron job
>is apparently not helping (Apache restarts enough on its own and via the
>script, I suppose).
Agreed.
On 10/04/09 05:02 pm, Steven Levine thus wrote :
> In <4AC8D0AA...@2rosenthals.com>, on 10/04/09
> at 12:43 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>> incorporating the VOICE stuff. Whether it is traffic-related in general
>> or something specific to code introduced at the time of the migration
>> (i.e., prior to that, I had no MediaWiki or Mantis sites configured), I
>> can't say.
>>
>
> The problems are very likely php related, but suspecting this is only the
> first step to finding and fixing the issue. We really do not yet have a
> clear idea as to which http request is failing. This is why I suggested
> that log forensic might help.
>
>
Indeed.
>>>> returned to normal. Uptime on the box was about 6 days, with Apache
>>>> restarting itself anytime between every five minutes and every two or
>>>> three hours, and a cron job to manually bounce the process once per day.
>>>> I don't know if this (the CPU hogging) is directly related to the PHP
>>>> issues or not.
>>>>
>
> Just to be sure I understand your meaning of restarting, do you mean that the
> main apache process dies and is restarted by your wrapper script?
>
>
It's hard to tell. I will add your suggested lines to the wrapper, which
should make it easier to determine which was an "internal" daemon
restart and which was caused by the wrapper determining that httpd.exe
was no longer running. There is no discernible difference between the
two happenings when reading the Apache error log or POPUPLOG.OS2 (unless
the latter only shows the wrapper restarting...I haven't actually
*counted* the restarts in one vs the other).
>> I shall gather them right now.
>>
>
> Got 'em.
>
>
>> Indeed. The box is not heavily loaded, however. server-status routinely
>> shows things like "2 requests currently being processed, 16 idle
>> workers" which is hardly strenuous.
>>
>
> This is essentially an idle server.
>
>
My thoughts exactly.
Current stats:
Current Time: Sunday, 04-Oct-2009 17:33:04
Restart Time: Sunday, 04-Oct-2009 15:52:45
Parent Server Generation: 0
Server uptime: 1 hour 40 minutes 18 seconds
Total accesses: 580 - Total Traffic: 7.6 MB
CPU Usage: u3583.05 s0 cu0 cs0 - 59.5% CPU load
.0964 requests/sec - 1329 B/second - 13.5 kB/request
2 requests currently being processed, 12 idle workers
Not a very busy day, yet I have a slew of restarts today.
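As a cross-check on how light this load is, the server-status figures above can be recomputed from the raw counters. This is a sketch under two assumptions: that the "CPU load" percentage is simply CPU seconds divided by uptime (the numbers above are consistent with that), and that "7.6 MB" means 7.6 x 1024 x 1024 bytes. The function names are illustrative only.

```python
def uptime_seconds(hours=0, minutes=0, seconds=0):
    """Convert the 'Server uptime' line into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def status_rates(accesses, traffic_bytes, cpu_seconds, uptime):
    """Recompute the derived server-status figures from the raw counters."""
    return {
        "requests_per_sec": accesses / uptime,
        "bytes_per_sec": traffic_bytes / uptime,
        "bytes_per_request": traffic_bytes / accesses,
        "cpu_load_pct": 100.0 * cpu_seconds / uptime,
    }


# Figures from the status page above: 1 hour 40 minutes 18 seconds of uptime,
# 580 accesses, 7.6 MB of traffic, u3583.05 CPU seconds.
up = uptime_seconds(hours=1, minutes=40, seconds=18)
rates = status_rates(580, 7.6 * 1024 * 1024, 3583.05, up)

print(up)                                    # 6018
print(round(rates["requests_per_sec"], 4))   # 0.0964
print(round(rates["cpu_load_pct"], 1))       # 59.5
```

The byte rates come out slightly below the page's values because the traffic total is already rounded to 7.6 MB. The point of the exercise is the mismatch: under one request every ten seconds, yet more than half of the CPU time consumed.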
The Apache error log shows:
[Sun Oct 04 00:14:58 2009] [error] (OS 10035)Resource temporarily
unavailable: apr_socket_accept
followed by four of the usual stuff we've come to expect (child PID
shutting down - all the same PID, too), down to:
Killed by SIGSEGV
pid=0x025e ppid=0x025d tid=0x0001 slot=0x00ed pri=0x0200 mc=0x0001
J:\APPS\APACHE2\BIN\HTTPD.EXE
cs:eip=005b:1c892f1c ss:esp=0053:0022fc3c ebp=0022fc68
ds=0053 es=0053 fs=150b gs=0000 efl=00210202
eax=1c892f1c ebx=00e62a0c ecx=00000000 edx=004f1ae0 edi=0000000a
esi=003755a8
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
Now, POPUPLOG.OS2 says:
10-04-2009 00:15:14 SYS3175 PID 025f TID 000e Slot 00e5
J:\APPS\APACHE2\BIN\HTTPD.EXE
c0000005
1ffb1b65
P1=00000001 P2=00000b14 P3=XXXXXXXX P4=XXXXXXXX
EAX=00000b14 EBX=2003b840 ECX=024fffd4 EDX=024fffd4
ESI=024fffd4 EDI=0000000e
DS=0053 DSACC=f0f3 DSLIM=ffffffff
ES=0053 ESACC=f0f3 ESLIM=ffffffff
FS=150b FSACC=00f3 FSLIM=00000030
GS=0000 GSACC=**** GSLIM=********
CS:EIP=005b:1ffb1b65 CSACC=f0df CSLIM=ffffffff
SS:ESP=0053:024fff94 SSACC=f0f3 SSLIM=ffffffff
EBP=024fff94 FLG=00010213
DOSCALL1.DLL 0002:00001b65
The Apache log shows the identical behavior two minutes later, but
POPUPLOG has no further entry until 4:33am. So, I *might* conclude from
that that the only entries in POPUPLOG are the ones when the exe crashed
and could not recover, forcing the wrapper to restart it, vs the ones
where httpd was able to restart itself, reported in the Apache error
log. Would you concur?
And of course, none of these show any specific correlation to the high
CPU usage issue, unfortunately. <sigh>
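One way to test that conclusion is to pair each POPUPLOG.OS2 trap with the Apache error-log entries that immediately precede it. The sketch below assumes the two timestamp layouts shown above (`[Sun Oct 04 00:14:58 2009]` in the Apache log, and `MM-DD-YYYY HH:MM:SS` in POPUPLOG), and that both logs use the same clock and time zone. The function names and the 60-second window are illustrative choices, not anything from the thread.

```python
import re
from datetime import datetime, timedelta

# Apache error log: "[Sun Oct 04 00:14:58 2009] [error] ..."
APACHE_TS = re.compile(r"^\[(\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4})\] \[error\]")
# POPUPLOG.OS2: "10-04-2009  00:15:14  SYS3175  PID 025f ..."
POPUP_TS = re.compile(r"^(\d{2}-\d{2}-\d{4})\s+(\d{2}:\d{2}:\d{2})\s+SYS\d+")


def apache_error_times(lines):
    """Return the timestamps of all [error] lines in an Apache error log."""
    times = []
    for line in lines:
        m = APACHE_TS.match(line.strip())
        if m:
            times.append(datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y"))
    return times


def popuplog_times(lines):
    """Return the timestamps of all SYSnnnn entries in POPUPLOG.OS2."""
    times = []
    for line in lines:
        m = POPUP_TS.match(line.strip())
        if m:
            stamp = f"{m.group(1)} {m.group(2)}"
            times.append(datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S"))
    return times


def correlate(apache_times, popup_times, window_seconds=60):
    """Pair each trap with the Apache errors logged within the window before it."""
    window = timedelta(seconds=window_seconds)
    pairs = []
    for trap in popup_times:
        nearby = [t for t in apache_times if timedelta(0) <= trap - t <= window]
        pairs.append((trap, nearby))
    return pairs
```

With the two entries quoted above, the 00:15:14 trap pairs with the 00:14:58 apr_socket_accept error, 16 seconds earlier. Apache errors that never pair with a trap would be the candidates for restarts that httpd handled on its own.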
Hi,
>It's hard to tell. I will add your suggested lines to the wrapper, which
>should make it easier to determine which was an "internal" daemon
>restart and which was caused by the wrapper determining that httpd.exe
>was no longer running. There is no discernible difference between the
>two happenings when reading the Apache error log or POPUPLOG.OS2 (unless
>the latter only shows the wrapper restarting...I haven't actually
>*counted* the restarts in one vs the other).
Whether or not the exception shows up in popuplog is determined by which
exception handler processes the exception. If it's the libc handler we
will not see a popuplog entry. If it's some other exception handler, it
depends. Many of the OS/2 components establish an exception handler while
they are processing a request. These handlers typically clean up internal
resources and ask to die. This results in a typical popuplog.
>The Apache error log shows:
> [Sun Oct 04 00:14:58 2009] [error] (OS 10035)Resource temporarily
> unavailable: apr_socket_accept
This is something I've been wondering about. Recall the traps we used to
get when error codes were not properly reported. I wonder if this is
another of these cases.
>followed by four of the usual stuff we've come to expect (child PID
>shutting down - all the same PID, too), down to:
> Killed by SIGSEGV
> pid=0x025e ppid=0x025d tid=0x0001 slot=0x00ed pri=0x0200 mc=0x0001
> J:\APPS\APACHE2\BIN\HTTPD.EXE
> cs:eip=005b:1c892f1c ss:esp=0053:0022fc3c ebp=0022fc68
> ds=0053 es=0053 fs=150b gs=0000 efl=00210202
> eax=1c892f1c ebx=00e62a0c ecx=00000000 edx=004f1ae0 edi=0000000a
> esi=003755a8
> Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
This is the libc handler catching an exception in some code it
controls.
>Now, POPUPLOG.OS2 says:
> 10-04-2009 00:15:14 SYS3175 PID 025f TID 000e Slot 00e5
> J:\APPS\APACHE2\BIN\HTTPD.EXE
> c0000005
> 1ffb1b65
> P1=00000001 P2=00000b14 P3=XXXXXXXX P4=XXXXXXXX
> EAX=00000b14 EBX=2003b840 ECX=024fffd4 EDX=024fffd4
> ESI=024fffd4 EDI=0000000e
> DS=0053 DSACC=f0f3 DSLIM=ffffffff
> ES=0053 ESACC=f0f3 ESLIM=ffffffff
> FS=150b FSACC=00f3 FSLIM=00000030
> GS=0000 GSACC=**** GSLIM=********
> CS:EIP=005b:1ffb1b65 CSACC=f0df CSLIM=ffffffff
> SS:ESP=0053:024fff94 SSACC=f0f3 SSLIM=ffffffff
> EBP=024fff94 FLG=00010213
> DOSCALL1.DLL 0002:00001b65
This is probably the kernel catching the exception caused by the
DosUnsetExceptionHandler call.
I'll verify this when I return from errands.
>The Apache log shows the identical behavior two minutes later, but
>POPUPLOG has no further entry until 4:33am. So, I *might* conclude from
>that that the only entries in POPUPLOG are the ones when the exe crashed
>and could not recover, forcing the wrapper to restart it, vs the ones
>where httpd was able to restart itself, reported in the Apache error
>log. Would you concur?
What it says is that only some exceptions also cause the DosUnsetExceptionHandler
trap. If we are dealing with memory corruption or something similar, this
is not unexpected.
>And of course, none of these show any specific correlation to the high
>CPU usage issue, unfortunately. <sigh>
I doubt very much the issue is usage related. Most likely it is some httpd
request that invokes some php code that's behaving badly.
Once you get what looks like a reasonable set of log data, please zip it
up and send it my way along with the popuplogs. I want to merge the data
for analysis.
One thing that's occurred to me as a possible cause of the problems, that
I'm hoping Steven can comment on is:
Apache2 uses TCPIP v4
modphp uses TCPIP v4
PHP uses TCPIP v4.1
MySQL uses TCPIP v4.1
Can the mixing of TCPIP versions be causing the stack corruption?
If there's any chance of this, I guess I should get around to trying to
move Apache2 to TCPIP v4.1
Cheers,
Paul
Hi Paul,
>One thing that's occurred to me as a possible cause of the problems, that
>I'm hoping Steven can comment on is:
I can comment, but not with anything I consider significant expertise.
>Can the mixing of TCPIP versions be causing the stack corruption?
I guess it's possible, but I don't think it's related to the problem we
are working on.
The builds that Lewis is currently using don't seem to be exhibiting the
stack overflow that we saw previously.
To recap, there are several failure modes
- httpd unresponsive
- httpd high CPU load
- httpd stack overflow
- trap in php
- trap in DosUnsetExceptionHandler
The one I am focusing on is #4 (the trap in php) because it occurs most often and
Lewis is good at getting me the analysis data I need. The others are intermittent
or, in the case of the DosUnsetExceptionHandler trap, secondary failures.
Steven Levine wrote:
> In <4AC93E2...@smedley.id.au>, on 10/05/09
> at 09:15 AM, Paul Smedley <pa...@smedley.id.au> said:
>
> Hi Paul,
>
>> One thing that's occurred to me as a possible cause of the problems, that
>> I'm hoping Steven can comment on is:
>
> I can comment, but not with anything I consider significant expertise.
>
>> Can the mixing of TCPIP versions be causing the stack corruption?
>
> I guess it's possible, but I don't think it's related to the problem we
> are working on.
>
> The builds that Lewis is currently using don't seem to be exhibiting the
> stack overflow that we saw previously.
>
> To recap, there are several failure modes
>
> - httpd unresponsive
> - httpd high CPU load
> - httpd stack overflow
> - trap in php
> - trap in DosUnsetExceptionHandler
>
> The one I am focusing on is #4 (the trap in php) because it occurs most often and
> Lewis is good at getting me the analysis data I need. The others are intermittent
> or, in the case of the DosUnsetExceptionHandler trap, secondary failures.
Do the PHP traps occur if the cgi version of php.exe is in use? or only
when using modphp5.dll?
That would at least help prove/disprove my tcpip v4/v4.1 theory :)
Cheers,
Paul
On 10/04/09 07:24 pm, Steven Levine thus wrote :
> In <4AC91848...@2rosenthals.com>, on 10/04/09
> at 05:48 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>> It's hard to tell. I will add your suggested lines to the wrapper, which
>> should make it easier to determine which was an "internal" daemon
>> restart and which was caused by the wrapper determining that httpd.exe
>> was no longer running. There is no discernible difference between the
>> two happenings when reading the Apache error log or POPUPLOG.OS2 (unless
>> the latter only shows the wrapper restarting...I haven't actually
>> *counted* the restarts in one vs the other).
>>
>
> Whether or not the exception shows up in popuplog is determined by which
> exception handler processes the exception. If it's the libc handler we
> will not see a popuplog entry. If it's some other exception handler, it
> depends. Many of the OS/2 components establish an exception handler while
> they are processing a request. These handlers typically clean up internal
> resources and ask to die. This results in a typical popuplog.
>
>
Ah, so... The lightbulb atop my head is flickering as it comes on...
This makes sense. Of course, what's tricky is knowing which app is using
which exception handler.
>> The Apache error log shows:
>>
>
>
>> [Sun Oct 04 00:14:58 2009] [error] (OS 10035)Resource temporarily
>> unavailable: apr_socket_accept
>>
>
> This is something I've been wondering about. Recall the traps we used to
> get when error codes were not properly reported. I wonder if this is
> another of these cases.
>
>
Hmmm... 10035 is a Winsock error on M$ systems:
Resource temporarily unavailable.
This error is returned from operations on non-blocking sockets
that cannot be completed immediately, for example recv when no data
is queued to be read from the socket. It is a nonfatal error, and
the operation should be retried later. It is normal for
WSAEWOULDBLOCK to be reported as the result from calling connect on
a non-blocking SOCK_STREAM socket, since some time must elapse for
the connection to be established.
Have a quick look at
http://httpd.apache.org/docs/2.2/misc/perf-tuning.html , in particular,
the portion concerning resource starvation in sockets implementation
when listening on multiple ports (I currently listen on 80 and 81;
perhaps the issue comes into play when we are serving on 80 and a random
request comes in on 81?). Might AcceptMutex have any bearing on this? As
a simple test (for the multiple Listen directives, at least), I am going
to disable port 81 (I can always deflect to 80 via the firewall, anyway;
this is an old holdover from the days of my original cable connection
when I could not get business-class service, and thus, 80 was blocked,
and I knew nothing at the time of PAT).
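To illustrate what the "would block" condition quoted above means in practice, here is a small sketch in Python; it says nothing about how Apache or APR handle the error internally on OS/2. On a non-blocking listening socket, `accept()` fails immediately when no connection is queued (Python reports this as `BlockingIOError`, the counterpart of EWOULDBLOCK / Winsock 10035), and the usual response is to wait until the socket is readable and try again. The helper name `accept_with_retry` and its parameters are hypothetical.

```python
import select
import socket


def accept_with_retry(listener, timeout=1.0, attempts=5):
    """Accept on a non-blocking listener, retrying while nothing is pending.

    Returns (connection, address) on success, or None if every attempt
    found the accept queue empty.
    """
    for _ in range(attempts):
        try:
            return listener.accept()
        except BlockingIOError:
            # "Resource temporarily unavailable": not fatal, nothing is queued.
            # Wait up to `timeout` seconds for the listener to become readable.
            select.select([listener], [], [], timeout)
    return None
```

The error by itself is therefore expected on a quiet non-blocking socket; it only becomes a problem if the caller treats it as fatal instead of retrying.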
>> followed by four of the usual stuff we've come to expect (child PID
>> shutting down - all the same PID, too), down to:
>>
>
>
>> Killed by SIGSEGV
>> pid=0x025e ppid=0x025d tid=0x0001 slot=0x00ed pri=0x0200 mc=0x0001
>> J:\APPS\APACHE2\BIN\HTTPD.EXE
>> cs:eip=005b:1c892f1c ss:esp=0053:0022fc3c ebp=0022fc68
>> ds=0053 es=0053 fs=150b gs=0000 efl=00210202
>> eax=1c892f1c ebx=00e62a0c ecx=00000000 edx=004f1ae0 edi=0000000a
>> esi=003755a8
>> Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
>>
>
> This is the libc handler catching an exception in some code it
> controls.
>
>
Which is why the entries appear without the usual Apache timestamps?
>> Now, POPUPLOG.OS2 says:
>>
>
>
>> 10-04-2009 00:15:14 SYS3175 PID 025f TID 000e Slot 00e5
>> J:\APPS\APACHE2\BIN\HTTPD.EXE
>> c0000005
>> 1ffb1b65
>> P1=00000001 P2=00000b14 P3=XXXXXXXX P4=XXXXXXXX
>> EAX=00000b14 EBX=2003b840 ECX=024fffd4 EDX=024fffd4
>> ESI=024fffd4 EDI=0000000e
>> DS=0053 DSACC=f0f3 DSLIM=ffffffff
>> ES=0053 ESACC=f0f3 ESLIM=ffffffff
>> FS=150b FSACC=00f3 FSLIM=00000030
>> GS=0000 GSACC=**** GSLIM=********
>> CS:EIP=005b:1ffb1b65 CSACC=f0df CSLIM=ffffffff
>> SS:ESP=0053:024fff94 SSACC=f0f3 SSLIM=ffffffff
>> EBP=024fff94 FLG=00010213
>>
>
>
>> DOSCALL1.DLL 0002:00001b65
>>
>
> This is probably the kernel catching the exception caused by the
> DosUnsetExceptionHandler call.
>
>
We've discussed this before. As these are consistent (read: always a
c0000005 error, resulting in an apparent DOSCALL1 failure), I've just
discounted them as reporting that the barn door has been left open after
the horse has run out, while the Apache logs tell us *which* horse just
ran out. ;-)
> I'll verify this when I return from errands.
>
>
>> The Apache log shows the identical behavior two minutes later, but
>> POPUPLOG has no further entry until 4:33am. So, I *might* conclude from
>> that that the only entries in POPUPLOG are the ones when the exe crashed
>> and could not recover, forcing the wrapper to restart it, vs the ones
>> where httpd was able to restart itself, reported in the Apache error
>> log. Would you concur?
>>
>
> What it says is that only some exceptions also cause the DosUnsetExceptionHandler
> trap. If we are dealing with memory corruption or something similar, this
> is not unexpected.
>
>
Gotcha.
>> And of course, none of these show any specific correlation to the high
>> CPU usage issue, unfortunately. <sigh>
>>
>
> I doubt very much the issue is usage related. Most likely it is some httpd
> request that invokes some php code that's behaving badly.
>
> Once you get what looks like a reasonable set of log data, please zip it
> up and send it my way along with the popuplogs. I want to merge the data
> for analysis.
>
Will do!
Thanks!!
On 10/05/09 12:54 am, Paul Smedley thus wrote :
I can't get PHP 5.2.11 to run as CGI under Apache 2.2.13 (even your fixed
httpd which should run CGI). I'd love to try!
> That would at least help prove/disprove my tcpip v4/v4.1 theory :)
>
>
Indeed!
Paul, please note my previous response to Steve. How does 2.2.13 handle
the sockets issue by default? If we accept as truth the (OS
10035)Resource temporarily unavailable: apr_socket_accept message, how
might we best approach mitigating such a socket issue? Your mention of
the differences between TCP 4.0 and 4.1 in the various components caused
me to start thinking about sockets. Could we really be temporarily
running out of sockets, and instead of gracefully waiting for a
non-blocking socket to become available, some PHP function (likely
related to MySQL, as that seems to be the module reporting the trouble -
as seen under Theseus) just freezes?
By default, OS/2 AF_INET sockets implementation allocates 75 4K clusters
for sockets and 80 64K blocks which the stack (SOCKETS.SYS) may
allocate. Both of these parameters are tunable. I am not approaching
this as a programmer, due to my limited abilities in that regard. As an
engineer, I might try to adjust something in the stack to make the app
more comfortable (stable). Am I barking up the wrong tree?
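
For scale, the default figures quoted above work out as follows. This is only arithmetic on the numbers in this message (75 clusters of 4K, 80 blocks of 64K); the constant names are made up for the illustration and are not stack parameters.

```python
# Default figures quoted in the message above; both are described as tunable.
CLUSTER_COUNT = 75          # 4K clusters for sockets
CLUSTER_SIZE = 4 * 1024
BLOCK_COUNT = 80            # 64K blocks the stack may allocate
BLOCK_SIZE = 64 * 1024


def pool_kib(count, size_bytes):
    """Total size of a pool in KiB."""
    return count * size_bytes // 1024


cluster_pool_kib = pool_kib(CLUSTER_COUNT, CLUSTER_SIZE)
block_pool_kib = pool_kib(BLOCK_COUNT, BLOCK_SIZE)

print(f"clusters: {cluster_pool_kib} KiB, blocks: {block_pool_kib} KiB")
```

Neither pool is large by itself, which is why raising the defaults looks like a cheap experiment.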
On 10/04/09 11:33 pm, Lewis G Rosenthal thus wrote :
> Hi!
>
> On 10/04/09 07:24 pm, Steven Levine thus wrote :
>
>> In <4AC91848...@2rosenthals.com>, on 10/04/09
>> at 05:48 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>>
<snip>
Done. Thanks to the magic of port address translation, it was a
two-click change in my firewall rules. I commented the Listen 81
directive in httpd.conf and Ctrl-Brk'd (note that since I added your
suggested timestamp code to the wrapper, this was the third time the
time was posted to the window; the initial start, the last Ctrl-Brk -
when I edited the ExecCGI options, and this time).
<snip>
+ system unresponsive
> - httpd high CPU load
> - httpd stack overflow
> - trap in php
> - trap in DosUnsetExceptionHandler
cu, Dieter
While PHP is a very good candidate there seems to be more:
[Thu Oct 01 17:45:43 2009] [error] [client 222.73.103.178] File does
not exist: X:/1/homepages/eigen/phpMyAdmin
apr_canonical_error: Unknown OS/2 error code 10004
3.178] File does not exist: X:/1/homepages/eigen/phpMyAdmin
cu, Dieter
On 10/05/09 12:05 am, Lewis G Rosenthal thus wrote :
The cron script to restart Apache runs at 3am (I'll disable that for
tonight), however, it is now 7:23am EDT. Here is the Apache status:
Current Time: Monday, 05-Oct-2009 07:23:09
Restart Time: Monday, 05-Oct-2009 03:00:09
Parent Server Generation: 0
Server uptime: 4 hours 23 minutes
Total accesses: 1316 - Total Traffic: 19.0 MB
CPU Usage: u18016.1 s0 cu0 cs0 - 114% CPU load
.0834 requests/sec - 1262 B/second - 14.8 kB/request
1 requests currently being processed, 20 idle workers
I don't remember when Apache stayed up over 4 hours straight. I'm not
ready to pass out cigars yet, but this is surely interesting.
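
The derived figures in that status report can be cross-checked from the raw ones. A minimal sketch, assuming the "u" value is CPU seconds consumed, that "MB" means 1024*1024 bytes, and that the load percentage is simply CPU seconds over uptime; the function name is hypothetical.

```python
def status_figures(uptime_s, accesses, traffic_mb, cpu_seconds):
    """Recompute the derived lines of the status page from the raw values."""
    traffic_bytes = traffic_mb * 1024 * 1024
    return {
        "requests_per_sec": accesses / uptime_s,
        "bytes_per_sec": traffic_bytes / uptime_s,
        "kb_per_request": traffic_bytes / accesses / 1024,
        "cpu_load_pct": cpu_seconds / uptime_s * 100,
    }


# Values copied from the report above: 03:00:09 to 07:23:09 is 4 h 23 min.
uptime_s = 4 * 3600 + 23 * 60
figures = status_figures(uptime_s, accesses=1316, traffic_mb=19.0,
                         cpu_seconds=18016.1)

for name, value in figures.items():
    print(f"{name}: {value:.4g}")
```

Under those assumptions the recomputed values agree with the reported .0834 requests/sec, 1262 B/second, 14.8 kB/request and 114% CPU load.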
On 10/05/09 05:55 am, Dieter Ringhofer thus wrote :
Does this happen consistently, or just after some period of uptime?
Possibly running out of file handles?
I am not seeing this particular one in my logs now, though I have seen
the 10004 before...
>> not exist: X:/1/homepages/eigen/phpMyAdmin
>> apr_canonical_error: Unknown OS/2 error code 10004
>> 3.178] File does not exist: X:/1/homepages/eigen/phpMyAdmin
>>
>>
> Does this happen consistently, or just after some period of uptime?
Normally server runs with very little load one or two weeks before I
restart it. This happened about 20 hours after a reboot.
> Possibly running out of file handles?
No, not in this case for sure.
cu, Dieter
Hi Dieter,
>>> not exist: X:/1/homepages/eigen/phpMyAdmin
>>> apr_canonical_error: Unknown OS/2 error code 10004
>>> 3.178] File does not exist: X:/1/homepages/eigen/phpMyAdmin
>Normally server runs with very little load one or two weeks before I
>restart it. This happened about 20 hours after a reboot.
>> Possibly running out of file handles?
>No, not in this case for sure.
I'm not yet convinced. IIRC, 10004 is the APR's mapped version of OS/2
error 4 which is
#define ERROR_TOO_MANY_OPEN_FILES 4 /* MSG%OUT_OF_HANDLES */
Next time this occurs, open up Theseus and check the number of open files
for the failing httpd instance.
FWIW, one problem with the APR mapping is that this error code is the same
numerical value as the socket error
#define SOCEINTR (SOCBASEERR+4) /* Interrupted system call */
However, the message content implies that ERROR_TOO_MANY_OPEN_FILES is the
more likely actual error.
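
The ambiguity can be made concrete with a small lookup. This is a sketch of the reading given in this thread, not of APR's actual code: it assumes SOCBASEERR is 10000 (which follows from 10035 being SOCBASEERR+35) and that the mapped OS/2 errors use the same 10000 offset. Only the codes mentioned in the thread are in the tables.

```python
OFFSET = 10000  # assumption: SOCBASEERR, and the offset of the mapped OS/2 errors

# Only the codes quoted in this thread; the real headers define many more.
OS2_ERRORS = {4: "ERROR_TOO_MANY_OPEN_FILES"}
SOCKET_ERRORS = {4: "SOCEINTR", 35: "SOCEAGAIN"}


def candidates(code):
    """Return every known name that a reported code could stand for."""
    base = code - OFFSET
    names = []
    if base in OS2_ERRORS:
        names.append(OS2_ERRORS[base])
    if base in SOCKET_ERRORS:
        names.append(SOCKET_ERRORS[base])
    return names


print(10004, candidates(10004))
print(10035, candidates(10035))
```

With these tables 10004 has two possible meanings while 10035 has one, so for 10004 the surrounding message text has to break the tie.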
Regards,
Hi,
>Ah, so... The lightbulb atop my head is flickering as it comes on...
>This makes sense. Of course, what's tricky is knowing which app is using
>which exception handler.
It can be. I have collected a list. :-)
>>> The Apache error log shows:
>>> [Sun Oct 04 00:14:58 2009] [error] (OS 10035)Resource temporarily
>>> unavailable: apr_socket_accept
>Hmmm... 10035 is a Winsock error on M$ systems:
TCP/IP in our case.
#define SOCEAGAIN (SOCBASEERR+35) /* Resource temporarily unavailable */
See ?:\Toolkit\H\NERRNO.H and tcppr.inf
Since WinSock is derived from TCP/IP sockets, much of the API is similar.
>Have a quick look at
>http://httpd.apache.org/docs/2.2/misc/perf-tuning.html , in particular,
>the portion concerning resource starvation in sockets implementation
>when listening on multiple ports (I currently listen on 80 and 81;
>perhaps the issue comes into play when we are serving on 80 and a random
>request comes in on 81?).
I doubt it. The docs say the description is a bit out of date. The
solution described appears to ensure that the listen sockets all get equal
time and it avoids the flurry of activity that would occur if all the
threads woke up at the same time to compete for the new connection.
>>> Killed by SIGSEGV
>>> pid=0x025e ppid=0x025d tid=0x0001 slot=0x00ed pri=0x0200 mc=0x0001
>>> J:\APPS\APACHE2\BIN\HTTPD.EXE
>>> cs:eip=005b:1c892f1c ss:esp=0053:0022fc3c ebp=0022fc68
>>> ds=0053 es=0053 fs=150b gs=0000 efl=00210202
>>> eax=1c892f1c ebx=00e62a0c ecx=00000000 edx=004f1ae0 edi=0000000a
>>> esi=003755a8
>>> Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
>> This is the libc handler catching an exception in some code it
>> controls.
>Which is why the entries appear without the usual Apache timestamps?
It's also missing some other useful data that a normal popuplog includes.
When the time comes for a new libc, I plan to discuss this with Paul.
>We've discussed this before. As these are consistent (read: always a
>c0000005 error, resulting in an apparent DOSCALL1 failure), I've just
>discounted them as reporting that the barn door has been left open after
>the horse has run out, while the Apache logs tell us *which* horse just
>ran out. ;-)
Sorta. The secondary trap is useful in that it implies a certain type of
problem may have occurred (i.e. stack corruption). The exception handler
chain is kept on the stack. The head of the chain is pointed to by the
TIB field tib_pexchain.
Hi,
>>> Killed by SIGSEGV
>>> pid=0x025e ppid=0x025d tid=0x0001 slot=0x00ed pri=0x0200 mc=0x0001
>>> J:\APPS\APACHE2\BIN\HTTPD.EXE
>>> cs:eip=005b:1c892f1c ss:esp=0053:0022fc3c ebp=0022fc68
>>> ds=0053 es=0053 fs=150b gs=0000 efl=00210202
>>> eax=1c892f1c ebx=00e62a0c ecx=00000000 edx=004f1ae0 edi=0000000a
>>> esi=003755a8
>>> Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
I forgot to ask, which httpd.exe are you using?
What does lxlite -c:exemap -vf- httpd.exe say about the stack size and
stack base?
What I see for Paul's latest build is
Start obj:EIP: 1:00000000 Stack obj:ESP: 3:00200000
## Base Size
R W E Res Dis Shr Pre Inv Swp Rsd Loc A16 32B Cnf IOP
3 00030000 00200000 ¹ ¹ ¹
If so this confirms we still have stack overflow.
Is the rosenthal_apache2_stack_pdump_018.zip still accurate for memory
layout? If so, the culprit is pdf.dll. If not, I need a new process dump
file.
In <4AC96906...@2rosenthals.com>, on 10/04/09
at 11:33 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
Hi,
<isaid>
If so this confirms we still have stack overflow.
</isaid>
I think this is not the case. I misread the esp value and dropped a zero.
This trap is something else. Possibly another variant of a DLL unloading
too soon. I'll know better once I have a current dump file.
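
The corrected reading is easy to verify by hand. A minimal sketch, assuming the stack object reported by lxlite (base 00030000, size 00200000) is the stack in use by tid 1 in the trap report and that the stack grows down from base plus size:

```python
STACK_BASE = 0x00030000   # from the lxlite output quoted earlier
STACK_SIZE = 0x00200000
ESP = 0x0022FC3C          # ss:esp from the SIGSEGV report

stack_top = STACK_BASE + STACK_SIZE      # the stack grows down from here
in_range = STACK_BASE <= ESP < stack_top
bytes_used = stack_top - ESP
bytes_free = ESP - STACK_BASE

print(f"top={stack_top:08x} in_range={in_range} "
      f"used={bytes_used} free={bytes_free}")
```

Under those assumptions esp sits well inside the stack object with nearly all of the 2 MB still free, which fits the conclusion that this trap is not a stack overflow.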
>>>> not exist: X:/1/homepages/eigen/phpMyAdmin
>>>> apr_canonical_error: Unknown OS/2 error code 10004
>>>> 3.178] File does not exist: X:/1/homepages/eigen/phpMyAdmin
>
>>Normally server runs with very little load one or two weeks before I
>>restart it. This happened about 20 hours after a reboot.
>>> Possibly running out of file handles?
>>No, not in this case for sure.
>
> I'm not yet convinced. IIRC, 10004 is the APR's mapped version of
> OS/2
> error 4 which is
Steven, don't get me wrong, please, but, this server is doing almost
nothing - in theory I could switch it off. phpMyAdmin does not exist
there as well, btw. This error has nothing to do with PHP so far.
Database access, file access and much more happens on systems running
BSD, Linux, Solaris and ... Windows. The only real load related to
HTTP which could be caused on this very machine from extern are
several static HTML files (no SSI) and several zip files as long as
you don't know some specials. When somebody tries to use those
specials, the reason for this error might be "access denied" or "file not
found" in case he is doing something wrong, but the error message in
error_log would be wrong then.
> #define ERROR_TOO_MANY_OPEN_FILES 4 /*
> MSG%OUT_OF_HANDLES */
Sounds logical ...
If your assumption is true, the system runs out of file handles as soon
as something happens, while normally only five almost completely sleeping
httpd tasks are running on top of the operating system.
> Next time this occurs, open up Theseus and check the number of open
> files
> for the failing httpd instance.
I will first look for another place for the machine so that I can
attach some equipment like a monitor and avoid influencing the system
while trying to find something. :-) I hope to be able to
reproduce it.
> FWIW, one problem with the APR mapping is that this error code is
> the same
> numerical value as the socket error
>
> #define SOCEINTR (SOCBASEERR+4) /*
> Interrupted
> system call */
>
> However, the message content implies that ERROR_TOO_MANY_OPEN_FILES
> is the
> more likely actual error.
Sorry, but I can't imagine it. Nevertheless, I will have a look at
it.
cu, Dieter
On Mon, 05 Oct 2009 11:21:31 -0800 "Steven Levine" <ste...@earthlink.net>
wrote:
>
>
>In <4AC96906...@2rosenthals.com>, on 10/04/09
> at 11:33 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
>Hi,
>
>
><isaid>
>If so this confirms we still have stack overflow.
></isaid>
>
>I think this is not the case. I misread the esp value and dropped a zero.
>
Oops. Happens to the best of us. ;-)
>This trap is something else. Possibly another variant of a DLL unloading
>too soon. I'll know better once I have a current dump file.
>
I'll get you one in the next day or so. Right now, after removing Listen
81, we're up over 12 hours & responsive, with no CPU hog.
FYI: Build is Paul's latest 2.2.13 (after the CGI fix), with the larger
stack size. Memory layout should be consistent with the 018 dump, as I
haven't added or removed anything else.
Still think this couldn't be caused by a socket issue when listening on
multiple ports simultaneously? That change and the ExecCGI removal were the
*only* things I tweaked in the conf. No reboot, either.
___
Lewis G Rosenthal
Rosenthal & Rosenthal, LLC
Sent with SnapperMail
On 10/05/09 03:07 pm, Dieter Ringhofer thus wrote :
How many virtuals are you running, and do you have separate logfiles
open for each one? With 30 virtuals (I don't run MySQL on the same box,
so I'm not sure of your file handle requirements in that regard), I had
to bump from 60 *additional* handles (I went to 120; currently 200).
Apache would not start at all, as it was unable to even open the log
files. Here's a sample error (from Apache 2.2.11):
[Sat Jan 31 23:36:33 2009] [error] (OS 4)OS/2 error 4: could not
open mime types config file J:/APPS/apache2/conf/mime.types.
Sure enough, I was out of file handles (mime.types did indeed exist).
On 10/05/09 12:59 pm, Steven Levine thus wrote :
> In <4AC96906...@2rosenthals.com>, on 10/04/09
> at 11:33 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>>>> Killed by SIGSEGV
>>>> pid=0x025e ppid=0x025d tid=0x0001 slot=0x00ed pri=0x0200 mc=0x0001
>>>> J:\APPS\APACHE2\BIN\HTTPD.EXE
>>>> cs:eip=005b:1c892f1c ss:esp=0053:0022fc3c ebp=0022fc68
>>>> ds=0053 es=0053 fs=150b gs=0000 efl=00210202
>>>> eax=1c892f1c ebx=00e62a0c ecx=00000000 edx=004f1ae0 edi=0000000a
>>>> esi=003755a8
>>>> Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
>>>>
>
> I forgot to ask, which httpd.exe are you using?
>
> What does lxlite -c:exemap -vf- httpd.exe say about the stack size and
> stack base?
>
> What I see for Paul's latest build is
>
> Start obj:EIP: 1:00000000 Stack obj:ESP: 3:00200000
>
> ## Base Size
> R W E Res Dis Shr Pre Inv Swp Rsd Loc A16 32B Cnf IOP
> 3 00030000 00200000 ¹ ¹ ¹
>
> If so this confirms we still have stack overflow.
>
>
Indeed, I can confirm that this one is the same.
<snip>
FYI: Current stats:
Current Time: Monday, 05-Oct-2009 19:38:50
Restart Time: Monday, 05-Oct-2009 03:00:09
Parent Server Generation: 0
Server uptime: 16 hours 38 minutes 41 seconds
Total accesses: 247 - Total Traffic: 2.6 MB
CPU Usage: u1732.68 s0 cu0 cs0 - 2.89% CPU load
.00412 requests/sec - 45 B/second - 10.8 kB/request
1 requests currently being processed, 14 idle workers
The error log *only* has debug messages since the 3am restart:
[Mon Oct 05 03:00:09 2009] [notice] Apache/2.2.13 (OS/2) PHP/5.2.11
configured -- resuming normal operations
[Mon Oct 05 03:00:09 2009] [info] Server built: Sep 27 2009 10:25:33
[Mon Oct 05 03:00:09 2009] [debug] proxy_util.c(1814): proxy:
initialized plain memory in child 1096 for worker proxy:reverse
[Mon Oct 05 03:00:10 2009] [debug] proxy_util.c(1814): proxy:
initialized plain memory in child 1095 for worker proxy:reverse
[Mon Oct 05 03:00:10 2009] [debug] proxy_util.c(1902): proxy:
initialized worker 1 in child 1095 for (*) min=0 max=495465840
smax=495465840
[Mon Oct 05 03:00:10 2009] [debug] proxy_util.c(1902): proxy:
initialized worker 1 in child 1096 for (*) min=0 max=495465840
smax=495465840
[Mon Oct 05 03:30:20 2009] [debug] proxy_util.c(1814): proxy:
initialized plain memory in child 1106 for worker proxy:reverse
*No* restarts listed in the error log, POPUPLOG.OS2, or the OS/2 window.
I have turned off the cron job to restart the server at 3am, and will
just let it run for now.
If the multiple listening addresses is indeed the issue, it is possible
that those people running SSL may be experiencing the same symptom,
simply as a result of listening on 80 and 443 simultaneously.
Uptime now 21:43. Total traffic 14.1MB.
Gentlemen, I am starting to believe that we actually hit on the problem.
<snip>
One, and only one. It is the main server. I defined it so that I can
easily add more whenever I need vhosts.
> and do you have separate logfiles open for each one?
No, nothing separated - I don't need statistics at this machine.
cu, Dieter
On 10/06/09 02:31 am, Dieter Ringhofer thus wrote :
> Hi Lewis!
>
>> How many virtuals are you running,
>>
>
> One, and only one. It is the main server. I defined it to add more
> easy whenever I need vhosts.
>
>
Then the only thing I can think of which might be consuming file handles
would be MySQL. As I say, I run it on NetWare, not on OS/2, so I haven't
experienced a lack of file handles due to the db engine. As Steve
mentioned, Theseus will tell you how many files are being opened by each
process.
>> and do you have separate logfiles open for each one?
>>
>
> No, nothing separated - I don't need statistics at this machine.
>
>
Well, even if you did, with only one virtual, you'd hardly be pushing
the limit there. Still, the error you're seeing implies a lack of
available file handles.
In my case, we knew there was some resource starvation, but until Paul
mentioned the IP stack, I hadn't taken the error message literally
(apr_socket_accept). When I finally came around to thinking that
"sometimes, a cigar is just a cigar," it shed a completely new light on
the problem. It now looks as though I really *did* have a socket issue,
after all, even though there should be scads of available sockets (like
dying of thirst in the middle of the Atlantic)...
Besides Apache, PHP, and MySQL, what else do you have running on the
box? SVN? Mail?
Hi,
>I'll get you one in the next day or so. Right now, after removing Listen
>81, we're up over 12 hours & responsive, with no CPU hog.
Interesting. Easy enough to try a second listen here. I do find this
somewhat unexpected, since listening to 80 and 8080 is a pretty typical
setup.
>FYI: Build is Paul's latest 2.2.13 (after the CGI fix), with the larger
>stack size. Memory layout should be consistent with the 018 dump, as I
>haven't added or removed anything else.
For some reason, this does not appear to be the case.
>Still think this couldn't be caused by a socket issue when listening on
>multiple ports simultaneously? That change and the ExecCGI removal were
>the *only* things I tweaked in the conf. No reboot, either.
You should try restoring the listen. Somehow I would expect ExecCGI to
cause more problems. There was a time, before my time as the SCOUG
webmaster, where the server would occasionally end up with several java
instances running for no apparent reason. As always, there was a reason.
Hi,
> Current Time: Monday, 05-Oct-2009 19:38:50
> Restart Time: Monday, 05-Oct-2009 03:00:09
> Parent Server Generation: 0
> Server uptime: 16 hours 38 minutes 41 seconds
> Total accesses: 247 - Total Traffic: 2.6 MB
> CPU Usage: u1732.68 s0 cu0 cs0 - 2.89% CPU load
> .00412 requests/sec - 45 B/second - 10.8 kB/request
> 1 requests currently being processed, 14 idle workers
>The error log *only* has debug messages since the 3am restart:
> [Mon Oct 05 03:00:09 2009] [notice] Apache/2.2.13 (OS/2) PHP/5.2.11
> configured -- resuming normal operations
> [Mon Oct 05 03:00:09 2009] [info] Server built: Sep 27 2009 10:25:33
> [Mon Oct 05 03:00:09 2009] [debug] proxy_util.c(1814): proxy:
> initialized plain memory in child 1096 for worker proxy:reverse
> [Mon Oct 05 03:00:10 2009] [debug] proxy_util.c(1814): proxy:
> initialized plain memory in child 1095 for worker proxy:reverse
> [Mon Oct 05 03:00:10 2009] [debug] proxy_util.c(1902): proxy:
> initialized worker 1 in child 1095 for (*) min=0 max=495465840
> smax=495465840
> [Mon Oct 05 03:00:10 2009] [debug] proxy_util.c(1902): proxy:
> initialized worker 1 in child 1096 for (*) min=0 max=495465840
> smax=495465840
> [Mon Oct 05 03:30:20 2009] [debug] proxy_util.c(1814): proxy:
> initialized plain memory in child 1106 for worker proxy:reverse
>If the multiple listening addresses is indeed the issue, it is possible
>that those people running SSL may be experiencing the same symptom,
>simply as a result of listening on 80 and 443 simultaneously.
It turns out that my local setup already listens to both 80 and 8080 which
is a pretty typical setup.
Since the DosUnsetExceptionHandler dump file was not very revealing, I
took the opportunity to build some rexx scripts for the pmdf REXX
subcommand handler. One can list all the symbols found on the stack.
This is helpful because the k command does not generate useful results
after the DosUnsetExceptionHandler trap. The script reported lots of php
activity, but nothing stood out as a possible failure mode. The script
is limited in that it shows symbols that may or may not be on the current
call chain. I need to build something smarter that can find the current
call chain.
On 10/06/09 11:22 am, Steven Levine thus wrote :
> In <3932-SnapperMsg4EDDFFFDC6EFF9E0@[10.128.42.229]>, on 10/05/09
> at 03:28 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>> I'll get you one in the next day or so. Right now, after removing Listen
>> 81, we're up over 12 hours & responsive, with no CPU hog.
>>
>
> Interesting. Easy enough to try a second listen here. I do find this
> somewhat unexpected, since listening to 80 and 8080 is a pretty typical
> setup.
>
>
I agree, and surely, I didn't *always* have this problem (though when it
actually began is now quite foggy; it got worse after the VOICE stuff
came on board, though, back at the end of January). And again, platforms
being different, I don't see anything like this on NetWare or Linux,
where I routinely listen on multiple ports on the same IP.
>> FYI: Build is Paul's latest 2.2.13 (after the CGI fix), with the larger
>> stack size. Memory layout should be consistent with the 018 dump, as I
>> haven't added or removed anything else.
>>
>
> For some reason, this does not appear to be the case.
>
>
Weird. We can confirm the memory config when I get a fresh dump in a few
days.
>> Still think this couldn't be caused by a socket issue when listening on
>> multiple ports simultaneously? That change and the ExecCGI removal were
>> the *only* things I tweaked in the conf. No reboot, either.
>>
>
> You should try restoring the listen. Somehow I would expect ExecCGI to
> cause more problems. There was a time, before my time as the SCOUG
> webmaster, where the server would occasionally end up with several java
> instances running for no apparent reason. As always, there was a reason.
>
>
I'll restore it tomorrow or Thursday, when I will be in a better
position to monitor things. Right now, I'm sort of enjoying the ability
to breathe easy and not wait for a phone call from someone complaining
that "I can't get to my website..." :-) I do want to do this, however,
as we need to be certain which of the two changes made the difference. I
just don't know what CGI might have been in the document tree which
could have been executed, though at this point *anything* is possible.
I'll poke around in the access logs and the directory structure to see
what's there. How might you account for the crashes in the PHP MySQL
module when running PHP as module vs CGI, if triggered by the execution
of some CGI script?
MySQL is NOT installed on this machine. NO database is installed on
this machine.
>>> and do you have separate logfiles open for each one?
>>
>> No, nothing separated - I don't need statistics at this machine.
>>
> Well, even if you did, with only one virtual, you'd hardly be
> pushing
> the limit there. Still, the error you're seeing implies a lack of
> available file handles.
It implies it, but I would be more than astonished if that were the real
error.
> Besides Apache, PHP, and MySQL, what else do you have running on the
> box? SVN? Mail?
Nothing. Simply nothing. That's the reason why I would be more than
astonished when this very system runs out of file handles.
The webserver normally serves static HTML - no SSI, nothing. Sometimes
I use it for testing popular open source software. But
with those problems at the backends of popular CMS and blog software, I
don't even use it for development.
cu, Dieter
Hi,
>I don't see anything like this on NetWare or Linux,
>where I routinely listen on multiple ports on the same IP.
It's clearly a platform-specific issue with our distribution. Problems
like this are unlikely to make it into production systems on other
platforms. There's just too many folks running test systems on the new
builds.
>Weird. We can confirm the memory config when I get a fresh dump in a few
>days.
Sure, no rush on any of this until we need to look at a dump file again.
>Right now, I'm sort of enjoying the ability
>to breathe easy and not wait for a phone call from someone complaining
>that "I can't get to my website..." :-)
I know what you mean.
>I'll poke around in the access logs and the directory structure to see
>what's there. How might you account for the crashes in the PHP MySQL
>module when running PHP as module vs CGI, if triggered by the execution
>of some CGI script?
The dump will tell us once we figure out what we are looking at. I really
don't have much of an idea what events led up to the
DosUnsetExceptionHandler trap.
I've just got this feeling that what I have been shown so far is omitting
something that is relevant. Perhaps the ReportingApacheErrors.txt I sent
you to review will help us ensure we gather the relevant data.
Regarding php as a httpd module or cgi, the php extension modules are used
in either case. The difference is how php is invoked by httpd.
Hi Dieter,
>>>>> not exist: X:/1/homepages/eigen/phpMyAdmin
>>>>> apr_canonical_error: Unknown OS/2 error code 10004
>>>>> 3.178] File does not exist: X:/1/homepages/eigen/phpMyAdmin
>> I'm not yet convinced. IIRC, 10004 is the APR's mapped version of
>> OS/2
>> error 4 which is
>Steven, don't get me wrong, please, but, this server is doing almost
>nothing - in theory I could switch it off.
I understand all this. Not all defects are load related.
However, my experience is that it is best to take the error message at
face value until there is something that implies the message is bogus.
Error report defects have occurred in older httpd versions, so it's not
impossible the error report you got is bogus, but that is unproven at the
moment.
>phpMyAdmin does not exist
>there as well, btw.
Something generated the access request and the error is not the typical
file not found report.
>This error has nothing to do with PHP so far.
I would tend to agree with this. So far, it appears to be an odd response
to a GET request.
>When your assumption is true system runs out of file handles as soon as
>something happens while normally five almost completely sleeping httpd
>tasks are running on top of operating system only.
I can only draw conclusions based on the information you provide.
>I will have a look to find another place for the machine first to be
>able to attach some equipment like a monitor to be able to not to take
>influence while trying to find something. :-) I hope to be able to
>reproduce it.
Reproduction is the first step. Gathering appropriate data is the next.
Lewis and I are working on a document that should make this process a bit
more effective.
>> However, the message content implies that ERROR_TOO_MANY_OPEN_FILES
>> is the
>> more likely actual error.
>Sorry but: I can't imagine it. Nevertheless I will have a look about it.
It is odd, but then again the failure mode itself is odd.
Regards,
On 10/06/09 12:29 pm, Dieter Ringhofer thus wrote :
> Hi Lewis!
>
>>>> How many virtuals are you running,
>>>>
>>>>
>>> One, and only one. It is the main server. I defined it to add more
>>> easy whenever I need vhosts.
>>>
>>>
>> Then the only thing I can think which might be consuming file
>> handles
>> would be MySQL.
>>
>
> MySQL is NOT installed on this machine. NO database is installed on
> this machine.
>
>
Oops... My error! I had this image in my head of your first follow-up to
this mentioning MySQL... LOL! I think my assumption was based upon your
mention of trouble with Joomla 1.5...while I run MySQL on another box,
so many people run the whole AMP stack on one, that I jumped (wrongly!)
to the conclusion that you were, too.
>>>> and do you have separate logfiles open for each one?
>>>>
>>> No, nothing separated - I don't need statistics at this machine.
>>>
>>>
>> Well, even if you did, with only one virtual, you'd hardly be
>> pushing
>> the limit there. Still, the error you're seeing implies a lack of
>> available file handles.
>>
>
> It implies it but, I would be more than astonished it being the real
> error.
>
>
:-)
>> Besides Apache, PHP, and MySQL, what else do you have running on the
>> box? SVN? Mail?
>>
>
> Nothing. Simply nothing. That's the reason why I would be more than
> astonished when this very system runs out of file handles.
>
>
Wow... Indeed, odd.
> The webserver serves static HTML normally - no SSI, nothing. Sometimes
> I use it for testing purposes of popular open sourced software. But,
> with those problems at backends of popular CMS and blog I even don't
> use it for development.
>
>
Well, this is surely different, then, than what I've seen.
What does your httpd-mpm.conf look like?
On 10/06/09 01:57 pm, Steven Levine thus wrote :
> In <4ACB6402...@2rosenthals.com>, on 10/06/09
> at 11:36 AM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>> I don't see anything like this on NetWare or Linux,
>> where I routinely listen on multiple ports on the same IP.
>>
>
> It's clearly a platform-specific issue with our distribution. Problems
> like this are unlikely to make it into production systems on other
> platforms. There's just too many folks running test systems on the new
> builds.
>
>
You are surely correct. Our limited installed base works to our
disadvantage in such matters.
>> Weird. We can confirm the memory config when I get a fresh dump in a few
>> days.
>>
>
> Sure, no rush on any of this until we need to look at a dump file again.
>
>
;-)
>> Right now, I'm sort of enjoying the ability
>> to breathe easy and not wait for a phone call from someone complaining
>> that "I can't get to my website..." :-)
>>
>
> I know what you mean.
>
>
Server uptime: 1 day 13 hours 42 minutes 2 seconds
I haven't been this happy since the last stock split I had. :-)
>> I'll poke around in the access logs and the directory structure to see
>> what's there. How might you account for the crashes in the PHP MySQL
>> module when running PHP as module vs CGI, if triggered by the execution
>> of some CGI script?
>>
>
> The dump will tell us once we figure out what we are looking at. I really
> don't have much of an idea what events led up to the
> DosUnsetExceptionHandler trap.
>
>
Understood.
> I've just got this feeling that what I have been shown so far is omitting
> something that is relevant. Perhaps the ReportingApacheErrors.txt I sent
> you to review will help us ensure we gather the relevant data.
>
>
I will adjust the contents of the bundle I put together for you this
trip to conform to the document. That should help determine whether you
feel the need to add anything else or change the suite of reports.
> Regarding php as a httpd module or cgi, the php extension modules are used
> in either case. The difference is how php is invoked by httpd.
>
>
My thought was that as the PHP MySQL module was the object identified by
the memory address in question, and that as PHP itself is running as
module under httpd.exe (the latter, of course, brought down by the
failing MySQL dll), and that as PHP is *not* currently executed as CGI
(thus not reliant upon the ExecCGI directive), that ExecCGI should have
no direct bearing upon running a MySQL query via PHP, and thus, should
not be related to the resource starvation (unless the CGI was running
and consuming sockets so that when the MySQL query attempted to request
one - or more - there were none available *at that time*...?). Yes, I
see your point now. The error itself does not point to the trigger, only
the bullet at the end of the barrel. ;-) I just had to think it through.
It may indeed *not* be the MySQL query which is consuming more than a
normal number of sockets and running out, but there may simply not be
enough sockets available, even *before* the query is executed.
Hi,
>Server uptime: 1 day 13 hours 42 minutes 2 seconds
>I haven't been this happy since the last stock split I had. :-)
I assume you mean a 2 for 1 split and not a 1 for 10. :-)
>I'll poke around in the access logs and the directory structure to see
>what's there. How might you account for the crashes in the PHP MySQL
>module when running PHP as module vs CGI, if triggered by the execution
>of some CGI script?
I did some more analysis of the 9/24/2009 dump files and have learned a
bit more about the failure.
As I mentioned before DosUnsetExceptionHandler is trapping because the
exception registration record chain is munged.
The chain is anchored in the TIB, which looks like this
# dd %00240210 (tib)
regrec stklo stkhi
%00240210 0246ff70 02450000 02470000 00240228
%00240220 00000014 00000115 0000000b 00000200
The first word points to the first registration record, which happens to
be on the stack (see stklo, stkhi). Walking the chain we get.
# dd 0246ff70
%0246ff70 20039498 20030150 00000b20 00000001
This registration record has been trashed. It turns out that the address
is valid, so we can chain to the next record.
# dd 20039498
0053:20039498 00000b14 20030150 00000000 00000000
However, b14 is not a valid pointer and DosUnsetExceptionHandler traps.
The question is what code found a way to overwrite the registration
record.
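
The walk above can be replayed as a small sketch. The dwords are the ones from the dump; everything else is an assumption made for the illustration: that a healthy chain ends with -1 (0xFFFFFFFF), that "readable" can be modelled as "present in the table", and that flagging records outside stklo..stkhi is a useful heuristic. It is not how OS/2 itself validates the chain.

```python
END_OF_CHAIN = 0xFFFFFFFF  # assumption: the last record points to -1

# Dwords from the dump above: record address -> (next record, handler)
RECORDS = {
    0x0246FF70: (0x20039498, 0x20030150),
    0x20039498: (0x00000B14, 0x20030150),
}
STKLO, STKHI = 0x02450000, 0x02470000  # stack limits from the TIB


def walk_chain(head, records, stklo, stkhi, limit=32):
    """Follow the 'next' pointers and note where the chain stops making sense."""
    report = []
    addr = head
    for _ in range(limit):  # guard against a corrupted chain that loops
        if addr == END_OF_CHAIN:
            report.append((addr, "end of chain"))
            break
        if addr not in records:
            report.append((addr, "not a readable record"))
            break
        on_stack = stklo <= addr < stkhi
        report.append((addr, "on stack" if on_stack else "off stack"))
        addr = records[addr][0]
    return report


for addr, note in walk_chain(0x0246FF70, RECORDS, STKLO, STKHI):
    print(f"{addr:08x}  {note}")
```

The first record is on the stack, the second is not, and the third pointer is unusable, which is the same picture as the manual walk: the damage is already visible one step before the trap.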
If we look at the data in the vicinity of 0246ff70 we find
# dd 0246ff70-40
frame0 ret from apr_allocator_destroy
0053:0246ff30 00350d58 0246ffb4 1da8efe7 00fb16c0
0053:0246ff40 00000008 00000000 00000000 00000000
0053:0246ff50 00000b14 0246ff9b 20039498 20030150
bork?
0053:0246ff60 20030000 0246ff94 1f286a4e 2003013c
bork? bork? bork? bork?
0053:0246ff70 20039498 20030150 00000b20 00000001
bork? bork? bork?
0053:0246ff80 2003013c 00000001 200394a0 0246ffd4
frame1 ret from DOS32UNSETEXCEPTIONHANDLER
0053:0246ff90 0000000b 0246ffb4 1f25006b 0246ffd4
0053:0246ffa0 00000001 1ab5ba80 200394a0 0246ffd4
The addresses (i.e. 20039498 etc.) are valid and in high memory. With
sysvm dump data I could tell who owns it. You are going to need to use
Theseus.
>My thought was that as the PHP MySQL module was the object identified by
>the memory address in question, and that as PHP itself is running as
>module under httpd.exe (the latter, of course, brought down by the
>failing MySQL dll), and that as PHP is *not* currently executed as CGI
>(thus not reliant upon the ExecCGI directive), that ExecCGI should have
>no direct bearing upon running a MySQL query via PHP, and thus, should
>not be related to the resource starvation (unless the CGI was running
>and consuming sockets so that when the MySQL query attempted to request
>one - or more - there were none available *at that time*...?). Yes, I
>see your point now. The error itself does not point to the trigger, only
>the bullet at the end of the barrel. ;-) I just had to think it through.
There's more to it. Since it's pretty clear we have memory corruption,
this might not be the only location stomped upon. Get rid of the
corruption and all sorts of odd problems might disappear.
We need something that will stop httpd when it writes to the wrong
location. This is a job for a debugger.
>>> Then the only thing I can think which might be consuming file
>>> handles would be MySQL.
>>>
>>
>> MySQL is NOT installed on this machine. NO database is installed on
>> this machine.
>>
> Oops... My error! I had this image in my head of your first
> follow-up to
> this mentioning MySQL... LOL!
<g> I run databases on dedicated servers to have machines "free" to
perform development and tests of applications. It doesn't make sense
for me to risk losing any data because of a typo.
> What does your http-mpm.conf look like?
puhh ... I assume it's default with file date August 05 2006.
Only "specials" at this server are several aliases and multiple deny
rules to block access from Google and the like. It should be a private
machine even when there is chance to get hands on it from outside. ;)
cu, Dieter
On 10/06/09 09:21 pm, Steven Levine thus wrote :
> In <4ACBAD91...@2rosenthals.com>, on 10/06/09
> at 04:50 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>> Server uptime: 1 day 13 hours 42 minutes 2 seconds
>>
>
>
>> I haven't been this happy since the last stock split I had. :-)
>>
>
> I assume you mean a 2 for 1 split and not a 1 for 10. :-)
>
>
Indeed! :-)
We'll need to see with the next dump whether the address has changed.
>
>> My thought was that as the PHP MySQL module was the object identified by
>> the memory address in question, and that as PHP itself is running as
>> module under httpd.exe (the latter, of course, brought down by the
>> failing MySQL dll), and that as PHP is *not* currently executed as CGI
>> (thus not reliant upon the ExecCGI directive), that ExecCGI should have
>> no direct bearing upon running a MySQL query via PHP, and thus, should
>> not be related to the resource starvation (unless the CGI was running
>> and consuming sockets so that when the MySQL query attempted to request
>> one - or more - there were none available *at that time*...?). Yes, I
>> see your point now. The error itself does not point to the trigger, only
>> the bullet at the end of the barrel. ;-) I just had to think it through.
>>
>
> There's more to it. Since it's pretty clear we have memory corruption,
> this might not be the only location stomped upon. Get rid of the
> corruption and all sorts of odd problems might disappear.
>
>
True enough; makes sense.
> We need something that will stop httpd when it writes to the wrong
> location. This is a job for a debugger.
>
>
Shouldn't the OS stop a process which attempts to access memory which it
does not own (GPF/SegFault)? Or do you mean that we need to stop httpd
*before* it attempts to do this, thus before causing the SegFault?
Hi,
>> The addresses (i.e. 20039498 etc.) are valid and in high memory. With
>> sysvm dump data I could tell who owns it. You are going to need to use
>> Theseus.
>We'll need to see with the next dump whether the address has changed.
I suspect that locations such as 20039498 are pointers to some heap in
upper memory, probably owned by php or mysql. It's very likely that the
exact address will change, but the general location will not.
Does Theseus tell you who owns the addresses I label borked (i.e.
20039498)?
>> We need something that will stop httpd when it writes to the wrong
>> location. This is a job for a debugger.
>Shouldn't the OS stop a process which attempts to access memory which it
>does not own (GPF/SegFault)?
That's exactly what happened.
If you think about it, httpd can legally write to the majority of the
linear memory in its address space. To see this, use Theseus. Select an
httpx instance and then Process->Page table. Any page marked read/write,
user and r/w is exactly that. As always with Theseus, you can MB2 on the
window and select Help->Explanation for detailed column and field
descriptions.
>Or do you mean that we need to stop httpd
>*before* it attempts to do this, thus before causing the SegFault?
Sorta. We need to stop httpd at the time it writes to where it should not
be writing.
Bad luck has it that addresses such as 20039498 are valid and that the
exception handler chain is accessed infrequently. Otherwise, the trap
would have occurred at the time the bad write occurred or soon after.
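
The difference between trapping at the bad write and trapping at some later read can be shown with a toy model. The sketch below is not a debugger and has nothing to do with the real OS/2 tools; it only simulates a block of memory with one watched range, and every name in it is hypothetical.

```python
# Sketch: a toy "write watchpoint" over a simulated address range.
# It only illustrates stopping at the moment of the bad write instead of
# at a later read of the damaged data. All names are hypothetical.


class WatchpointHit(Exception):
    """Raised when a write lands inside the watched range."""


class WatchedMemory:
    def __init__(self, base, size):
        self.base = base
        self.data = bytearray(size)
        self.watch = None  # (start, end) absolute addresses, end exclusive

    def set_watch(self, start, length):
        self.watch = (start, start + length)

    def write(self, addr, payload, who="?"):
        end = addr + len(payload)
        # A write outside the block is what the OS itself would trap.
        if addr < self.base or end > self.base + len(self.data):
            raise MemoryError(f"{who}: access violation at {addr:08x}")
        # A write inside the block is legal, so only the watchpoint sees it.
        if self.watch and addr < self.watch[1] and end > self.watch[0]:
            raise WatchpointHit(
                f"{who} wrote {len(payload)} bytes at {addr:08x}")
        off = addr - self.base
        self.data[off:off + len(payload)] = payload


if __name__ == "__main__":
    stack = WatchedMemory(0x02450000, 0x20000)  # stklo..stkhi from the TIB
    stack.set_watch(0x0246FF70, 8)              # the registration record
    stack.write(0x0246FF00, b"\x00" * 16, who="normal_code")
    try:
        stack.write(0x0246FF60, b"\xff" * 24, who="buggy_copy")
    except WatchpointHit as hit:
        print("stopped:", hit)
```

The point of the model is the middle case: the overrun from 0246ff60 stays inside memory the process owns, so nothing traps unless something is watching the record itself.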
On 10/07/09 02:13 pm, Steven Levine thus wrote :
> In <4ACCC137...@2rosenthals.com>, on 10/07/09
> at 12:26 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> Hi,
>
>
>>> The addresses (i.e. 20039498 etc.) are valid and in high memory. With
>>> sysvm dump data I could tell who owns it. You are going to need to use
>>> Theseus.
>>>
>
>
>> We'll need to see with the next dump whether the address has changed.
>>
>
> I suspect that locations such as 20039498 are pointers to some heap in
> upper memory, probably owned by php or mysql. It's very likely that the
> exact address will change, but the general location will not.
>
> Does Theseus tell you who owns the addresses I label borked (i.e.
> 20039498)?
>
>
Description of Linear Object 20039498, PID = 0446, name = 'HTTPX':
It was allocated by LIBC063.
The address is at offset 00009498 into the memory object.
Arena Record:
har pages linear flg next prev link hash hob hal hco / Decoded flags
1785 00001000 20030000 161 11F8 16BE 0000 0000 185F 0000 1B1E / Write User Read
Object Record:
hob har next flgs ownr hmte sown,cnt lt st / owner / decoded_flags
185F 1785 0000 422C 1B1E 082A 0000 00 00 00 / p-HTTPX / API_allocated_object zero-init read user write
Context Record(s):
hco next ptda flgs / process / decoded_flags
1B1E 32D9 FFD2 16 / / read user write
>>> Invalid hco = 32D9. Return code = 65537.
32D9 0000 0000 00 / /
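
The offset Theseus reports can be checked by hand from the arena record's linear base (20030000). A small Python sketch of that arithmetic follows; the helper name is hypothetical, and the only outside fact it relies on is the 4 KiB x86 page size.

```python
# Sketch: locate an address inside a memory object, given the object's base.
# The base comes from the "linear" column of the arena record above.

PAGE_SIZE = 0x1000  # 4 KiB pages on x86


def locate(addr, object_base):
    """Return (offset, page_index, offset_in_page) of addr within the object."""
    offset = addr - object_base
    if offset < 0:
        raise ValueError("address is below the object base")
    page, within = divmod(offset, PAGE_SIZE)
    return offset, page, within


if __name__ == "__main__":
    base = 0x20030000
    # The values labelled "bork?" in the earlier stack dump.
    for addr in (0x20039498, 0x20030150, 0x2003013C, 0x200394A0):
        offset, page, within = locate(addr, base)
        print(f"{addr:08x}: offset {offset:08x}, page {page}, +{within:03x}")
```

For 20039498 this gives offset 00009498, which agrees with the Theseus line above. The other suspect values from the stack dump are all at or above the same base, within 0x94a0 bytes of it.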
>
>>> We need something that will stop httpd when it writes to the wrong
>>> location. This is a job for a debugger.
>>>
>
>
>> Shouldn't the OS stop a process which attempts to access memory which it
>> does not own (GPF/SegFault)?
>>
>
> That's exactly what happened.
>
> If you think about it, httpd can legally write to the majority of the
> linear memory in its address space. To see this, use Theseus. Select an
> httpx instance and then Process->Page table. Any page marked read/write,
> user and r/w is exactly that. As always with Theseus, you can MB2 on the
> window and select Help->Explanation for detailed column and field
> descriptions.
>
>
Ah, so. Quite a bit of space, indeed.
>> Or do you mean that we need to stop httpd
>> *before* it attempts to do this, thus before causing the SegFault?
>>
>
> Sorta. We need to stop httpd at the time it writes to where it should not
> be writing.
>
>
Gotcha.
> Bad luck has it that addresses such as 20039498 are valid and that the
> exception handler chain is accessed infrequently. Otherwise, the trap
> would have occurred at the time the bad write occurred or soon after.
>
>
Per the above, could the problem have actually exposed a bug in LIBC, or
do we still suspect something in Apache or PHP? And, what does "Invalid
hco=32D9" mean? Does context record = pointer, thus implying an invalid
pointer?
Hi,
>Description of Linear Object 20039498, PID = 0446, name = 'HTTPX': It was
>allocated by LIBC063.
>The address is at offset 00009498 into the memory object.
That's pretty much what I expected. It was allocated in hi memory by
libc's malloc, which eventually calls OS/2's DosAllocMem.
>Context Record(s):
> hco next ptda flgs / process / decoded_flags
>1B1E 32D9 FFD2 16 / / read user write
> >>> Invalid hco = 32D9. Return code = 65537.
>32D9 0000 0000 00 / /
I wouldn't worry about this too much. Hcos list the processes that are
mapped into the same address space. Hcos are somewhat dynamic, so the hco
might have been valid when Theseus got the initial list and it was gone by
the time Theseus requested the hco details.
>Ah, so. Quite a bit of space, indeed.
The situation was a bit different for OS/2 1.2 and still is for device
drivers. These use segmented addressing. Requiring a selector for each
chunk of memory gives much finer grain control over what memory is
addressable. However, it's much more compute intensive, which is why the
flat model is preferred these days.
>Per the above, could the problem have actually exposed a bug in LIBC, or
>do we still suspect something in Apache or PHP? And, what does "Invalid
>hco=32D9" mean?
It's very unlikely to be a libc bug. It's much more likely to be some bad
pointer math in the php or mysql module/extension.
On 10/07/09 07:05 pm, Steven Levine thus wrote :
> In <4ACD0DF1...@2rosenthals.com>, on 10/07/09
> at 05:53 PM, Lewis G Rosenthal <lgros...@2rosenthals.com> said:
>
> HI,
>
>
>> Description of Linear Object 20039498, PID = 0446, name = 'HTTPX': It was
>> allocated by LIBC063.
>> The address is at offset 00009498 into the memory object.
>>
>
> That's pretty much what I expected. It was allocated in hi memory by
> libc's malloc, which eventually calls OS/2's DosAllocMem.
>
>> Context Record(s):
>> hco next ptda flgs / process / decoded_flags
>> 1B1E 32D9 FFD2 16 / / read user write
>>
>>>>> Invalid hco = 32D9. Return code = 65537.
>>>>>
>> 32D9 0000 0000 00 / /
>>
>
> I wouldn't worry about this too much. Hcos list the processes that are
> mapped into the same address space. Hcos are somewhat dynamic, so the hco
> might have been valid when Theseus got the initial list and it was gone by
> the time Theseus requested the hco details.
>
>
I see. Makes sense, now that you put it that way.
>> Ah, so. Quite a bit of space, indeed.
>>
>
> The situation was a bit different for OS/2 1.2 and still is for device
> drivers. These use segmented addressing. Requiring a selector for each
> chunk of memory gives much finer grain control over what memory is
> addressable. However, it's much more compute intensive, which is why the
> flat model is preferred these days.
>
>
Interesting.
>> Per the above, could the problem have actually exposed a bug in LIBC, or
>> do we still suspect something in Apache or PHP? And, what does "Invalid
>> hco=32D9" mean?
>>
>
> It's very unlikely to be a libc bug. It's much more likely to be some bad
> pointer math in the php or mysql module/extension.
>
Okay; I'll take your word for it. :-)
BTW, I added back the second Listen directive and bounced the daemon. It
re-hupped itself within 60 seconds, and again, a couple minutes
thereafter. I removed the second Listen directive, and restarted it at
22:01:48 EDT. So far, it is staying up again. Looks like it is indeed
the second listening port which is causing the problem. I am going to
set up for a process dump, switch it back again, and get what you need.
I'll force a dump as it stands now (single listening port) for
comparison purposes.
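
For anyone who wants to log when each listener stops answering while switching the second Listen directive on and off, here is a minimal Python sketch. It is an assumption-laden helper, not something used in this thread: the function name is hypothetical, it sends a plain HTTP `HEAD /` to each port, and the host and the ports (80 and 81) are only examples to be adjusted.

```python
# Sketch: ask each listening port for "HEAD /" and report what comes back.
# Host, ports, and timeout are examples; the function name is hypothetical.

import http.client
import time


def check_listener(host, port, timeout=5.0):
    """Return the HTTP status of HEAD / on host:port, or None if no answer."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Refused, timed out, reset, or a malformed reply.
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    for port in (80, 81):
        status = check_listener("localhost", port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} port {port}: {status if status is not None else 'no answer'}")
```

A full request is used rather than a bare TCP connect because a hung httpd may still have connections accepted on its behalf while never answering them. Run from cron or a loop, the timestamps can be lined up against the Apache error log and the dump files.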
Hi,
>Okay; I'll take your word for it. :-)
If it was a libc bug, we'd have a lot more apps with issues.
>BTW, I added back the second Listen directive and bounced the daemon. It
>re-hupped itself within 60 seconds, and again, a couple minutes
>thereafter.
When you get a chance, try adding Listen 8080. That's what I have here
and I'd like to know if the port matters or if I just don't have the right
apps installed.
I'm going to try 81 here and see what happens.
i'm almost wondering if this may be due to more than one process trying to use
the same libraries or such... the main thing i note is that your setup is
trying to do https on the second port whereas steve's is not... 80/443 on yours
vs 80/8080 on steves... what happens if you switch to non-ssl on your secondary
port??
Doubt it helps but I have
Listen 40
Listen 80
And all seems fine but then I don't run php. One vhost uses 40 and the
rest 80.
--
Regards
Dave Saville
On Thu, 08 Oct 2009 00:02:38 -0400 "waldo kitty" <wkit...@windstream.net>
wrote:
I don't run SSL on this server. Both listeners are already unencrypted.
My ports:
80
81
Steve's:
80
8080
On 10/08/09 03:56 am, Dave Saville thus wrote :
I think in my case that the PHP stuff must make the difference.
I've switched the second port on and off a couple of times, and it's
fairly consistent with the restarts every few minutes to a few hours
with both ports enabled.
I haven't tried shuffling modules, although PHP is loaded last by
default (as it gets loaded from an included conf (httpd-php.conf), at
the end of httpd.conf).
Hi Lewis,
>I think in my case that the PHP stuff must make the difference.
I'm sure that is part of it. Hopefully, it is also tied to a specific
HTTP request which we have not yet identified.
>I haven't tried shuffling modules, although PHP is loaded last by
>default (as it gets loaded from an included conf (httpd-php.conf), at
>the end of httpd.conf.
This is not likely to avoid the failure, although with luck it might make
the failure occur at a better time for analysis. Since we went through
the effort of editing the attached, let's use it next time you get the
urge to watch apache die.
Perhaps seeing the dump file in context with the apache logs will help.
[trim]
> I don't run SSL on this server. Both listeners are already unencrypted.
my bad... i thought you had stated that you were using ssl on the second port :?
this really sounds like something to do with sharing stuff between the multiple
listening ports... like something is in blocking mode when it should not be or
possibly even something similar...
> I haven't tried shuffling modules, although PHP is loaded last by
> default (as it gets loaded from an included conf (httpd-php.conf), at
> the end of httpd.conf.
i really, from an extremely basic standpoint, do not believe that it has
anything to do with module loading order but something much more basic in nature...
sadly, though, my main apache server is basically nothing more than a reverse
proxy frontend that only serves several hundred static html pages and farms all
other traffic off to backend apache servers for active/dynamic pages on the main
domain as well as shuffling everything off to the other servers for the
name-based virtual domains that i operate... i also do not do ssl here and i'm
also still on the old apache stuffs since my frontline server is (still!) OS2
Warp 3 Connect (which i'm pretty proud of in this day and age) :)
this is why i mentioned snort in one of my earlier posts in this thread... on my
system, since i implemented a much better firewall than i had before, i do not
see the problems that i used to get... those problems were "random lockups" and
crashes which, when reviewing the available logging stuffs i attributed to "flaw
breakers" which are used to try to p0wn servers and workstations... since i
implemented the firewall i run and snort with the VRT and ET rules sets, i do
not see these occurrences any more... at least not manifested as they were...
now i see them in the intrusion protection logs and also note that the
"protectorate" has set a block for the IP issuing the request... other than
that, my server(s) all just keep on ticking and ticking and ticking...
my participation in this thread is mainly as an outside observer and "thought
master" :)
>> I haven't tried shuffling modules, although PHP is loaded last by
>> default (as it gets loaded from an included conf (httpd-php.conf), at
>> the end of httpd.conf.
>
> This is not likely to avoid the failure, although with luck it might make
> the failure occur at a better time for analysis. Since we went through
> the effort of editing the attached, let's use it next time you get the
> urge to watch apache die.
>
> Perhaps seeing the dump file in context with the apache logs will help.
i do hope you guys do find this problem and get it taken care of as i hope to
one day move up to Warp 4 or even to ECS as long as the basic functionality my
services require is not lost... even then, though, i can run things with
forwarding and still get the job done :)
On 10/08/09 10:37 pm, waldo kitty thus wrote :
No worries.
Way back when, I was only able to get residential cable at my home
office where the data center is located. I couldn't even get DSL in
those days. Residential cable accounts have ports 80 & 25 blocked. I,
being rather naive about such things at the time, instead of trying port
8080, bumped to port 81 (and 24 for SMTP, which wasn't such a bad choice).
Naturally, things being what they are on the net, Google cached a bunch
of links containing ":81" which I didn't want to lose, so when I finally
got DSL (and now, with business class cable), I started listening on
both ports, not thinking that I could quite easily use my firewall to
handle the port translation.
SSL on OS/2 used to be a pain. 1.3, as you know, requires a different
exe altogether (which Brian provided for us), but 2.0 (which was what I
used in the beginning) uses mod_ssl. I don't recall offhand whether
Brian's builds provided this, but I do recall that it at least seemed
rather involved to set up (now, I do that sort of configuration on other
platforms all the time). So, it just seemed logical to run all of my SSL
stuff on NetWare, which was already set up to do that "out of the box."
Now, I have no less than four separate Apache instances running on
NetWare, each listening on several ports, and all doing SSL as well as
clear. I'll probably migrate at least some of these to OS/2 at some
point when we get this issue sorted out, if only to better load balance
things (I'd also like to be able to not directly expose *any* services
on the NetWare servers to the outside, though NetWare provides such a
great db platform that I will surely continue running MySQL on it -
unfortunately, the NetWare Postgres port is now quite dated, and nobody
seems to have picked up the gauntlet on that).
Anyway, this was the saga of how I ended up with 81 & (finally) 80 on my
OS/2 box. :-)