Siege 4.1.3

Paul Smedley

Jun 26, 2022, 5:07:45 AM
to Apache HTTP Server for OS/2
I managed to get 4.1.3 building....
https://smedley.id.au/tmp/siege-4.1.3-os2-20220626.zip

David McKenna

Jun 26, 2022, 9:38:20 AM
to Apache for OS/2
Thanks Paul! This one works as well as the last, with all the same quirks...

Regards,

Steven Levine

Jun 26, 2022, 12:12:54 PM
to apa...@googlegroups.com
In <9e31d806-6acf-4d9e...@googlegroups.com>, on 06/26/22
at 06:38 AM, David McKenna <davidmc...@gmail.com> said:

Hi,

>Thanks Paul! This one works as well as the last, with all the same
>quirks...

I pushed an interim patch that should convince siege to respond better to
Ctrl-C kill requests.

Your ticket #3177 has not gotten much attention recently. You might want
to post a note to the ticket asking for a status update.

Steven

--
----------------------------------------------------------------------
"Steven Levine" <ste...@earthlink.net> Warp/DIY/BlueLion etc.
www.scoug.com www.arcanoae.com www.warpcave.com
----------------------------------------------------------------------

Paul Smedley

Jun 27, 2022, 5:03:34 AM
to apa...@googlegroups.com

Hey all,

On 27/6/22 01:36, Steven Levine wrote:
> In <9e31d806-6acf-4d9e...@googlegroups.com>, on 06/26/22
> at 06:38 AM, David McKenna <davidmc...@gmail.com> said:
>
> Hi,
>
>> Thanks Paul! This one works as well as the last, with all the same
>> quirks...
>
> I pushed an interim patch that should convince siege to respond better to
> Ctrl-C kill requests.
https://smedley.id.au/tmp/siege-4.1.3-os2-20220627.zip includes this
patch- seems to work OK here.

Cheers,

Paul

David McKenna

Jun 27, 2022, 6:54:35 AM
to Apache for OS/2
Thanks Paul! This one is a mixed bag here - <CTRL>C does work, but I can't get it to use my URLS.TXT file - when I try 'siege -f C:\siege\etc\urls.txt' (like I always do) it comes back with a list of options as if I typed something wrong. It also still does not honor the 'time' directive (I set to 2 minutes) and I wonder if it is not honoring the 'failures' directive now too. Here is the result of 'siege 192.168.21.2' after running about 4 minutes I hit <CTRL>C:

[C:\siege\bin]siege 192.168.21.2
** SIEGE 4.1.3
** Preparing 25 concurrent users for battle.
The server is now under siege...
Lifting the server siege...siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:                 460718 hits
Availability:                  99.94 %
Elapsed time:                 236.22 secs
Data transferred:             172.23 MB
Response time:                  0.01 secs
Transaction rate:            1950.38 trans/sec
Throughput:                     0.73 MB/sec
Concurrency:                   24.64
Successful transactions:      460718
Failed transactions:             280
Longest transaction:            1.03
Shortest transaction:           0.00

LOG FILE: /var/log/siege.log
You can disable this log file notification by editing
D:\HOME/.siege/siege.conf and changing 'show-logfile' to false.

 Notice it says 'siege aborted due to excessive socket failure' AFTER I hit <CTRL>C. Maybe the <CTRL>C is the cause of the failures?...

Regards,

David McKenna

Jun 27, 2022, 7:11:58 AM
to Apache for OS/2
 Failures set to '512':

[C:\siege\bin]siege 192.168.21.2
** SIEGE 4.1.3
** Preparing 25 concurrent users for battle.
The server is now under siege...
Lifting the server siege...siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:                  11565 hits
Availability:                  95.57 %
Elapsed time:                   6.27 secs
Data transferred:               4.32 MB
Response time:                  0.01 secs
Transaction rate:            1844.50 trans/sec
Throughput:                     0.69 MB/sec
Concurrency:                   24.66
Successful transactions:       11565
Failed transactions:             536
Longest transaction:            1.17
Shortest transaction:           0.00

LOG FILE: /var/log/siege.log
You can disable this log file notification by editing
D:\HOME/.siege/siege.conf and changing 'show-logfile' to false.

   and failures set to 128:-)

[C:\siege\bin]siege 192.168.21.2
** SIEGE 4.1.3
** Preparing 25 concurrent users for battle.
The server is now under siege...
Lifting the server siege...siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:                   8746 hits
Availability:                  98.29 %
Elapsed time:                   5.00 secs
Data transferred:               3.27 MB
Response time:                  0.01 secs
Transaction rate:            1749.20 trans/sec
Throughput:                     0.65 MB/sec
Concurrency:                   21.91
Successful transactions:        8746
Failed transactions:             152
Longest transaction:            1.10
Shortest transaction:           0.00

LOG FILE: /var/log/siege.log
You can disable this log file notification by editing
D:\HOME/.siege/siege.conf and changing 'show-logfile' to false.

 So it appears that 'failures' is what is actually stopping the siege...

Regards,

Steven Levine

Jun 27, 2022, 12:25:21 PM
to apa...@googlegroups.com
In <5c4029f6-3b8a-4a53...@googlegroups.com>, on 06/27/22
at 04:11 AM, David McKenna <davidmc...@gmail.com> said:

Hi David,

> So it appears that 'failures' is what is actually stopping the siege...

As I mentioned to Paul, this is just the 1st cut at getting the
termination code to work without a working pthread_cancel.

It's sorta working. In response to the Ctrl-C all but one of the crew
threads terminated as expected. The stuck thread prevents the process
from terminating when failures = 0 is configured.

Paul Smedley

Jun 29, 2022, 3:45:12 AM
to apa...@googlegroups.com
Hey Guys,

On 28/6/22 01:46, Steven Levine wrote:
> In <5c4029f6-3b8a-4a53...@googlegroups.com>, on 06/27/22
> at 04:11 AM, David McKenna <davidmc...@gmail.com> said:
>
> Hi David,
>
>> So it appears that 'failures' is what is actually stopping the siege...
>
> As I mentioned to Paul, this is just the 1st cut at getting the
> termination code to work without a working pthread_cancel.
>
> It's sorta working. In response to the Ctrl-C all but one of the crew
> threads terminated as expected. The stuck thread prevents the process
> from terminating when failures = 0 is configured.

The following build has additional fixes from Steven -
https://github.com/psmedley/siege-os2/pull/2 plus some fixes to make it
link from me.

https://smedley.id.au/tmp/siege-4.1.3-os2-20220629.zip

Cheers,

Paul

Steven Levine

Jun 29, 2022, 10:42:13 AM
to 'Paul Smedley' via Apache for OS/2
In <0ce48cdd-5175-e25a...@smedley.id.au>, on 06/29/22
at 05:15 PM, "'Paul Smedley' via Apache for OS/2"
<apa...@googlegroups.com> said:

Hi Paul,

>The following build has additional fixes from Steven -
>https://github.com/psmedley/siege-os2/pull/2 plus some fixes to make it
>link from me.

>https://smedley.id.au/tmp/siege-4.1.3-os2-20220629.zip

Sorry about the sloppy coding.

We are getting closer. This build stops on Ctrl-C, unless all the socket
connects fail. I thought I had the needed patches in place for this, but
I guess not. pr#3 coming. :-)

Paul, when you get a moment, please add a copy of readme.os2 to the repo
along with a reference copy of your config.h.

Thanks,

Steven Levine

Jun 29, 2022, 11:18:53 AM
to 'Paul Smedley' via Apache for OS/2
In <0ce48cdd-5175-e25a...@smedley.id.au>, on 06/29/22
at 05:15 PM, "'Paul Smedley' via Apache for OS/2"
<apa...@googlegroups.com> said:

Hi all,

Looks like the required code is in place, but the signal is getting lost
when there's too much output to the console.

I'll try putting a sleep call after the socket connect failures and see if
this is sufficient.

Steven Levine

Jun 29, 2022, 5:09:18 PM
to 'Paul Smedley' via Apache for OS/2
In <0ce48cdd-5175-e25a...@smedley.id.au>, on 06/29/22
at 05:15 PM, "'Paul Smedley' via Apache for OS/2"
<apa...@googlegroups.com> said:

Hi all,

Turns out the lost Ctrl-C requests are only a problem when running
concurrent = 1.

Even two crew threads allow the Ctrl-C to be recognized. I was running
with one thread because it makes it vastly easier to step through the code
in the debugger. This is not a standard configuration. With one thread, I
can avoid the Ctrl-C failures with:

delay = 0.001

so, I don't see the need for additional patches at this time.

It's not clear exactly why Ctrl-C requests are not getting passed to the
signal handler thread (thread 2), but I have some ideas. I suspect the
signal gets discarded as a side effect of some other kernel processing
related to the almost continuous screen output.

Running

siege -f urls.txt >nul 2>&1

suppresses all screen output and also allows Ctrl-C to be effective.

So siege-on and let us know if you run into any issues we need to look at.

Paul Smedley

Jun 30, 2022, 6:00:46 AM
to apa...@googlegroups.com
Hey Steven,

On 30/6/22 00:03, Steven Levine wrote:
> In <0ce48cdd-5175-e25a...@smedley.id.au>, on 06/29/22
> at 05:15 PM, "'Paul Smedley' via Apache for OS/2"
> <apa...@googlegroups.com> said:
>> The following build has additional fixes from Steven -
>> https://github.com/psmedley/siege-os2/pull/2 plus some fixes to make it
>> link from me.
>
>> https://smedley.id.au/tmp/siege-4.1.3-os2-20220629.zip
>
> Sorry about the sloppy coding.
No problem:)

> Paul, when you get a moment, please add a copy of readme.os2 to the repo
> along with a reference copy of your config.h.
Done.

Cheers,

Paul

David McKenna

Jun 30, 2022, 6:29:26 AM
to Apache for OS/2
 Thanks guys for the new version and all the work you put into siege! This one seems to work well - only issue is the 'time' directive is still ignored.

  The latest Apache and php8.1 builds hold up against siege very well here - haven't had them crash yet. Can't say the same about AFINETK and SOCKETSK, it seems siege really finds their weaknesses and crashes the system. Running ACPI.PSD /MAXCPU=1 cures that, but otherwise I get a trap in one or the other (on either the server computer or the siege computer) at least once every time running siege with SMP. Setting siege itself to single-processor mode helps, but still get traps even that way occasionally.

Regards,

Paul Smedley

Jun 30, 2022, 6:52:24 AM
to apa...@googlegroups.com
Hi Dave,

On 30/6/22 19:59, David McKenna wrote:
>  Thanks guys for the new version and all the work you put into siege!
> This one seems to work well - only issue is the 'time' directive is
> still ignored.

You mean this parameter?
-t NUMm, --time=NUMm
This option is similar to --reps but instead of specifying the number
of times each user should run, it specifies the amount of time each
should run.

The value format is “NUMm”, where “NUM” is an amount of time and the “m”
modifier is either S, M, or H for seconds, minutes and hours. To run
siege for an hour, you could select any one of the following
combinations: -t3600S, -t60M, -t1H. The modifier is not case sensitive,
but it does require no space between the number and itself.

I'll try to investigate.

Cheers,

Paul

David McKenna

Jun 30, 2022, 7:06:04 AM
to Apache for OS/2
Hi Paul,

  I have to admit I didn't try it on the command line. I have 'time = 2M' in my siege.conf file, and when running siege, it never stops at 2 minutes (or any other value I try). Not a big deal, but would be nice if it worked.

Regards,

Steven Levine

Jun 30, 2022, 11:09:13 AM
to 'Paul Smedley' via Apache for OS/2
In <82f7bc8e-f9ee-caca...@smedley.id.au>, on 06/30/22
at 08:22 PM, "'Paul Smedley' via Apache for OS/2"
<apa...@googlegroups.com> said:

Hi guys,

>The value format is NUMm , where NUM is an amount of time and the m
>modifier is either S, M, or H for seconds, minutes and hours. To run
>siege for an hour, you could select any one of the following
>combinations: -t3600S, -t60M, -t1H. The modifier is not case sensitive,
> but it does require no space between the number and itself.

>I'll try to investigate.

This fails for the same reason as the Ctrl-C kill failed.

With debug enabled, you will get the message:

if(my.debug){ printf("TIMED OUT!!\n"); fflush(stdout); }

but since siege_timer also uses the unimplemented

pthread_kill(handler, SIGTERM);

nothing useful happens. Patch coming, or more accurately, a modified patch
coming that sets os2_pthread_cancel_requested as well as testing it.
If this is not sufficient, we will have to expose timer_mutex so that
crew_cancel can post it if needed.

We will try the simpler solution first.

Paul Smedley

Jun 30, 2022, 5:13:10 PM
to apa...@googlegroups.com
Hey All,

On 1/7/22 00:31, Steven Levine wrote:
> In <82f7bc8e-f9ee-caca...@smedley.id.au>, on 06/30/22
> at 08:22 PM, "'Paul Smedley' via Apache for OS/2"
> <apa...@googlegroups.com> said:
>
> Hi guys,
>
>> The value format is NUMm , where NUM is an amount of time and the m
>> modifier is either S, M, or H for seconds, minutes and hours. To run
>> siege for an hour, you could select any one of the following
>> combinations: -t3600S, -t60M, -t1H. The modifier is not case sensitive,
>> but it does require no space between the number and itself.
>
>> I'll try to investigate.
>
> This fails for the same reason as the Ctrl-C kill failed.
>
> With debug enabled, you will get the message:
>
> if(my.debug){ printf("TIMED OUT!!\n"); fflush(stdout); }
>
> but since siege_timer also uses the unimplemented
>
> pthread_kill(handler, SIGTERM);
>
> nothing useful happens. Patch coming, or more accurately, a modified patch
> coming that sets os2_pthread_cancel_requested as well as testing it.
> If this is not sufficient, we will have to expose timer_mutex so that
> crew_cancel can post it if needed.
>
> We will try the simpler solution first.

Simple seems to work -
https://smedley.id.au/tmp/siege-4.1.3-os2-20220701.zip

Using:
siege.exe -t 15s https://os2ports.smedley.id.au

I get:
Transactions: 1003 hits
Availability: 83.93 %
Elapsed time: 17.49 secs
Data transferred: 6.24 MB
Response time: 0.34 secs
Transaction rate: 57.35 trans/sec
Throughput: 0.36 MB/sec
Concurrency: 19.75
Successful transactions: 418
Failed transactions: 192
Longest transaction: 1.29
Shortest transaction: 0.01

Cheers,

Paul

David McKenna

Jun 30, 2022, 5:49:15 PM
to Apache for OS/2
Yup... time=2M in siege.conf:

[C:\siege\bin]siege -f c:\siege\etc\urls.txt

** SIEGE 4.1.3
** Preparing 25 concurrent users for battle.
The server is now under siege...siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:                  11712 hits
Availability:                  93.83 %
Elapsed time:                 126.89 secs
Data transferred:             434.81 MB
Response time:                  0.25 secs
Transaction rate:              92.30 trans/sec
Throughput:                     3.43 MB/sec
Concurrency:                   23.03
Successful transactions:       11238
Failed transactions:             770
Longest transaction:            6.20
Shortest transaction:           0.00

 Although it says aborted due to socket failure, I tried different times, and they all worked. Thanks!

Regards,

Steven Levine

Jun 30, 2022, 6:24:59 PM
to 'Paul Smedley' via Apache for OS/2
In <9c8f3e88-4c41-bf01...@smedley.id.au>, on 07/01/22
at 06:43 AM, "'Paul Smedley' via Apache for OS/2"
<apa...@googlegroups.com> said:

Hi,

>Using:
>siege.exe -t 15s https://os2ports.smedley.id.au

>I get:
>Transactions: 1003 hits
>Availability: 83.93 %
>Elapsed time: 17.49 secs
>Data transferred: 6.24 MB
>Response time: 0.34 secs
>Transaction rate: 57.35 trans/sec
>Throughput: 0.36 MB/sec
>Concurrency: 19.75
>Successful transactions: 418
>Failed transactions: 192
>Longest transaction: 1.29
>Shortest transaction: 0.01

Looks good here too. There's a minor nit with the Failed transactions
count. Running with 50 threads, I always see:

Failed transactions: 50

regardless of the duration. What is probably happening is that when
os2_pthread_cancel_requested is TRUE, __request correctly returns FALSE
here:

browser.c:312
if ((ret = __request(this, tmp))==FALSE) {
    __increment_failures();
}

So, the request that never happened is counted as an error. Patching this
to

if ((ret = __request(this, tmp))==FALSE &&
    !os2_pthread_cancel_requested) {
    __increment_failures();
}

will correct the counting.

Now, I guess it's back to trying to expose the remaining httpd/php issues.

If anyone gets the urge, they might want to see how well ab.exe, which
ships with httpd, is working these days. It's always good to have an
alternate testing tool available.

It may also be time for me to understand what "guru meditation" means in
Massimo's world.

On a somewhat related note, I have an update to deadman.exe v0.6 almost
ready to release. It adds the ability to monitor multiple log files,
which may be useful if running multiple vhosts. v0.5 added support for
rebooting on request. This should be more robust than the typical setboot
/b method which can fail if the system is low or out of resources. A
deadman requested reboot probably can still fail, but the probability is
much lower than other reboot methods.

On a less related note, I finally got the urge to take another look at
cvs2git. Cvs2git converts a CVS repository to a git repository. It's a
python app, and at one time, our python port was not up to running the
code. Once the python port issues got resolved, I was unable to
configure the cvs2svn options to export the files with proper DOS line
endings. In cvs speak, we need to do cvs co -kkv for text files and cvs co
-kb for binary files. This gives the expected CR/LF line endings for text
files, expands the keywords, and leaves CRs in binary files unsullied.
Expanding the keywords during conversion makes sense because git does not
do keyword expansion. The expanded keywords serve as a historical
comment.

I'm still not able to set the options to make this happen, so I added some
temporary code to override the selected options to do what I wanted them to
do. Perhaps Michael Haggerty, the cvs2svn maintainer, can tell me what
options I should be using.

Steven Levine

Jun 30, 2022, 9:55:38 PM
to apa...@googlegroups.com
In <70d4b3d2-52b0-4870...@googlegroups.com>, on 06/30/22
at 02:49 PM, David McKenna <davidmc...@gmail.com> said:

Hi,

>Yup... time=2M in siege.conf:

>[C:\siege\bin]siege -f c:\siege\etc\urls.txt
>** SIEGE 4.1.3
>** Preparing 25 concurrent users for battle.
>The server is now under siege...siege aborted due to excessive socket
>failure; you
>can change the failure threshold in $HOME/.siegerc

>Transactions: 11712 hits
>Availability: 93.83 %
>Elapsed time: 126.89 secs
>Data transferred: 434.81 MB
>Response time: 0.25 secs
>Transaction rate: 92.30 trans/sec
>Throughput: 3.43 MB/sec
>Concurrency: 23.03
>Successful transactions: 11238
>Failed transactions: 770
>Longest transaction: 6.20
>Shortest transaction: 0.00

> Although it says aborted due to socket failure, I tried different times,
> and they all worked. Thanks!

That's because the timer was ignored. The message means that siege died
because the failure limit was exceeded, rather than the timer expiring.

The message is misleading because the counter increments for any error,
not just socket errors.

Paul Smedley

Jul 1, 2022, 2:26:09 AM
to apa...@googlegroups.com

Paul Smedley

Jul 1, 2022, 2:30:26 AM
to apa...@googlegroups.com
Latest results for a 2 minute test here (apache2 running on Ubuntu FYI)

Transactions: 12280 hits
Availability: 100.00 %
Elapsed time: 121.41 secs
Data transferred: 29.39 MB
Response time: 0.24 secs
Transaction rate: 101.14 trans/sec
Throughput: 0.24 MB/sec
Concurrency: 24.33
Successful transactions: 2013
Failed transactions: 0
Longest transaction: 4.39
Shortest transaction: 0.01

Steven Levine

Jul 1, 2022, 11:00:34 AM
to 'Paul Smedley' via Apache for OS/2
In <21f6f9ac-5b1f-0096...@smedley.id.au>, on 07/01/22
at 03:56 PM, "'Paul Smedley' via Apache for OS/2"
<apa...@googlegroups.com> said:

Hi all,
This build counts better, but there's one more patch coming.

With -t5s and with the server not running I get the expected:

...
[error] socket: unable to connect sock.c:282: Invalid argument

Transactions: 0 hits
Availability: 0.00 %
Elapsed time: 4.99 secs
Data transferred: 0.00 MB
Response time: 0.00 secs
Transaction rate: 0.00 trans/sec
Throughput: 0.00 MB/sec
Concurrency: 0.00
Successful transactions: 0
Failed transactions: 2456
Longest transaction: 0.00
Shortest transaction: 0.00

Fun fact. Running with the stderr redirected, we get:

[ [1;33merror [0m] socket: unable to connect sock.c:282: Invalid argument
siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions: 0 hits
Availability: 0.00 %
Elapsed time: 0.53 secs
Data transferred: 0.00 MB
Response time: 0.00 secs
Transaction rate: 0.00 trans/sec
Throughput: 0.00 MB/sec
Concurrency: 0.00
Successful transactions: 0
Failed transactions: 10049
Longest transaction: 0.00
Shortest transaction: 0.00

which shows how much excessive screen output can affect performance. It
also tells me that I am still not suppressing all the spurious error
counts.

Redirecting output with

>siege -t5s -f urls.txt 2>tmp.out

and failures = 0, we get:

[ [1;33merror [0m] socket: unable to connect sock.c:282: Invalid argument

Transactions: 0 hits
Availability: 0.00 %
Elapsed time: 4.25 secs
Data transferred: 0.00 MB
Response time: 0.00 secs
Transaction rate: 0.00 trans/sec
Throughput: 0.00 MB/sec
Concurrency: 0.00
Successful transactions: 0
Failed transactions: 84922
Longest transaction: 0.00
Shortest transaction: 0.00

which is significantly more failures per second. :-)

Steven Levine

Jul 16, 2022, 1:18:39 PM
to apa...@googlegroups.com
In <2023fc86-fd07-4da4...@googlegroups.com>, on 06/30/22
at 03:29 AM, David McKenna <davidmc...@gmail.com> said:

Hi David,

It's time to collect my notes for your ticket 3177.

You get intermittent trap E's in afinetk or socketsk if accessing via
www.davemckenna.com and not running /MAXCPU=1.

Running /MAXCPU=1 or accessing via 192.168.21.2 avoids the traps.

Is this correct?

BTW, there should be a new E1000B build available soon. When it appears,
you might as well test it. It's unlikely to avoid the traps, but when/if
we get back to debugging this issue, we will want to be testing against
current binaries.

Another BTW, does your MB have a usable serial port connector and do you
happen to have a 2nd machine with a usable serial port connector? If so,
kernel debugging the issue is a possibility.

David McKenna

Jul 16, 2022, 5:34:58 PM
to Apache for OS/2
Hi Steven,

  Ticket 3177 (at ArcaNoae Mantis) is for my server computer which is running Apache 2.4.53 server with php 8.1. It has an Intel NIC. If I run full SMP, then 'siegeing' the apache server (whether www.davemckenna.com or 192.168.21.2) from another computer hooked to the same router switch (usually my desktop computer) will eventually produce a trap in AFINETK, an example of which I uploaded to ArcaNoae. If I set that computer to run /MAXCPU=1, I don't seem to get the traps in AFINETK. I have never got a trap in SOCKETSK on that computer.

  On the other hand, my desktop computer where I run 'siege' from, has seen traps in both AFINETK and SOCKETSK (and other things too) when I 'siege' using www.davemckenna.com, but rarely (but not never) when using 192.168.21.2. It has a Realtek NIC. If I set 'siege' to run in single processor mode using 'Execmode -sp' then I can use www.davemckenna.com as well as 192.168.21.2 (but even then will very occasionally get a trap). Never tried the desktop computer using /MAXCPU=1. I never reported this to ArcaNoae because of the experimental nature of getting siege working here.

  Hope this clarifies my setup/situation and sorry if there is any confusion. I do have a serial port on both the server computer and desktop computer, and can collect debugging data from them (with a laptop). Also already updated the Intel NIC driver from the ArcaNoae 'Experimental Builds' download link.

Regards,

Steven Levine

Jul 18, 2022, 1:08:52 AM
to apa...@googlegroups.com
In <66bc229c-be3e-411e...@googlegroups.com>, on 07/16/22
at 02:34 PM, David McKenna <davidmc...@gmail.com> said:

Hi David,

Thanks. I think I now have a better understanding of the moving parts.

>Never tried the desktop computer using /MAXCPU=1.

It might be a worthwhile test. Execmode -sp is functionally quite
different from /MAXCPU=1. Execmode -sp forces all the process's threads
to run on the same CPU, but network traffic for the process can still
occur on another CPU.

>I never reported this
>to ArcaNoae because of the experimental nature of getting siege working
>here.

We had porting issues to resolve, but we did nothing I would call
experimental. Siege is a typical TCP/IP application that follows the
standards.

I have copies of both Ticket3177-20210828-dump.7z and
Ticket3177-10152021.7z. The latter is a bit easier to work with, but both
traps occurred for the same reason - the edx content is borked. The trap
is at:

# %ln -m %f2523443 (eip)
!afinetk ip_insertoptions + F

# u %f2523443
%f2523443 8b4a08 mov ecx,dword ptr [edx+08] ; trap here

The traps occur because EDX=0000836c, which is not a valid 32-bit pointer.
I have some ideas how this might have occurred, but I need to spend more
time with the dump file and try to figure out how and when the pointer
went bad.

The afinetk and socketsk drivers differ in a number of ways from the legacy
afinet and sockets drivers. The K drivers are mostly 32-bit and use the
KEE interface. When the KEE drivers are in use, the ring3 part of the
TCP/IP stack sets up a dynamic call gate and calls directly into the
socketsk (IIRC) driver passing ready-to-use 32-bit pointers.

At the time of the trap, the ring3 state of the active thread was:

Current slot number: 00a5
Slot Pid Ppid Csid Ord Sta Pri pTSD pPTDA pTCB Disp SG Name
*00a5# 0185 0183 0185 0007 run 0200 f8ca4000 f9502228 f944ebac 0e88 14 HTTPX

eax=00000079 ebx=00000185 ecx=00000000 edx=02a1f970 esi=02a1fa18 edi=02a1f9d8
eip=1e480027 esp=02a1f930 ebp=02a1fa34 iopl=0 -- -- -- nv up ei pl zr na pe nc
cs=005b ss=0053 ds=0053 es=0053 fs=150b gs=0000 cr2=00000000 cr3=00211000 p=00
005b:1e480027 c3 retd

# %ln -m %1e480027
CallGate + 8

CallGate is a function in tcpip32.dll which, as its name implies, is how we
entered the driver.

The bad news is we don't have sources for the socketsk and afinetk
drivers. The good news is that the freebsd sources are available and they
are close enough to what's in the OS/2 drivers to be useful.

David McKenna

Jul 18, 2022, 6:51:24 AM
to Apache for OS/2
Hi Steven,

  Thanks for looking at the trap file - hope it contains the key to a fix, although it's not clear to me how you can fix it without the OS/2 source...

  I do have a system dump from the desktop computer running siege, so I guess I'll create a ticket at ArcaNoae once I've done a little more testing (especially with /MAXCPU=1). Hope you don't mind if I drop your name :-)

Regards,

Steven Levine

Jul 22, 2022, 1:10:42 PM
to apa...@googlegroups.com
In <b14a05fe-71b1-49f2...@googlegroups.com>, on 07/18/22
at 03:51 AM, David McKenna <davidmc...@gmail.com> said:

Hi David,

> Thanks for looking at the trap file - hope it contains the key to a
>fix, although it's not clear to me how you can fix it without the OS/2
>source...

It's a skill set kind of thing. I'm in my 70's so I come from the days
when source code did not exist as we think of it today. While my machine
language skills are rusty compared to what they were back when I started
with computers, they are still good enough for this kind of debugging.

PM coming.

David McKenna

Jul 22, 2022, 6:36:08 PM
to Apache for OS/2
Hi Steven,

  Got your PM and downloaded the file. Installed it on both the Apache (server) computer and the siege (desktop) computer. Both computers are running full SMP. In no case did I get a system trap on the Apache computer. I did get one on the siege computer.
Did these tests:

1st run: siege is set for single processor mode and use 192.168.21.2 to address the apache server. Set up siege for 15 minutes, but it stopped due to errors (set to 512) before then with about a 98% success rate. The apache server was still running, although there were a couple exceptq files there - attached. No system traps.

2nd run: siege is set for multi-processor mode and use 192.168.21.2 to address the Apache server. Set for 15 minutes, but it stopped due to the Apache server crashing after about 10 minutes. Many exceptq files and POPUPLOG on the server. Attached. No system traps.

3rd run: siege is set for multi-processor mode and use davemckenna.com to address the apache server (with davemckenna.com defined in the HOSTS file of the siege computer). Set for 15 minutes, but it stopped due to the Apache server crashing after about 10 minutes. Many exceptq files and POPUPLOG on the server. Attached. No system traps.

4th run: siege is set for multi-processor mode and davemckenna.com is used to address the Apache server (davemckenna.com NOT defined in the HOSTS file of the siege computer). Siege computer traps after about 30 seconds in 'SOFFICE', Apache computer continues running with no errors. I have a (old) dump file from this situation if needed.

Regards,
2nd run.7z
3rd run.7z
1st run.7z

Steven Levine

Jul 22, 2022, 9:05:50 PM
to apa...@googlegroups.com
In <465c2468-925f-40b2...@googlegroups.com>, on 07/22/22
at 03:36 PM, David McKenna <davidmc...@gmail.com> said:

Hi David,

> Got your PM and downloaded the file. Installed it on both the Apache
>(server) computer and the siege (desktop) computer. Both computers are
>running full SMP. In no case did I get a system trap on the Apache
>computer. I did get one on the siege computer.

Thanks for the testing. The results sound promising. Without the system
traps, we have more opportunities for other residual issues to show up.

>1st run: siege is set for single processor mode and use 192.168.21.2 to
>address the apache server. Set up siege for 15 minutes, but it stopped
>due to errors (set to 512) before then with about a 98% success rate.
>The apache server was still running, although there were a couple
>exceptq files there - attached. No system traps.

Offhand these look like unhandled malloc failures in the php code base.
Nothing platform specific. We should be able to detect these and avoid
the traps.

>2nd run: siege is set for multi-processor mode and use 192.168.21.2 to
>address the Apache server. Set for 15 minutes, but it stopped due to the
>Apache server crashing after about 10 minutes. Many exceptq files and
>POPUPLOG on the server. Attached. No system traps.

These were mostly libc lock issues. We have seen these before. The libc heap
code exits while holding the lock. I need to check my notes and see what
the status is. I may need to submit a ticket to bitwiseworks for this.

>3rd run: siege is set for multi-processor mode and use davemckenna.com to
> address the apache server (with davemckenna.com defined in the HOSTS
>file of the siege computer). Set for 15 minutes, but it stopped due to
>the Apache server crashing after about 10 minutes. Many exceptq files
>and POPUPLOG on the server. Attached. No system traps.

This is more libc heap corruption followed by libc exiting while holding
the heap lock.

>4th run: siege is set for multi-processor mode and davemckenna.com is
>used to address the Apache server (davemckenna.com NOT defined in the
>HOSTS file of the siege computer). Siege computer traps after about 30
>seconds in 'SOFFICE', Apache computer continues running with no errors.
>I have a (old) dump file from this situation if needed.

Let's open an arcanoae mantis ticket for this one and upload the dump to
the AN FTP. I need to look at the dump file to decide what's next for this one.

Speaking of tickets, perhaps we should open a siege testing ticket on
Paul's mantis? This will avoid cluttering the list with file attachments that
are going to bounce when sent to the other gmail users. It's possible
Paul's zoho provider bounced your message. If they reject zip files, they
may reject .7z files too.

Paul Smedley

Jul 22, 2022, 9:20:24 PM7/22/22
to apa...@googlegroups.com
Hey guys,

On 23/7/22 09:51, Steven Levine wrote:
> Speaking of tickets, perhaps we should open a siege testing ticket on
> Paul's mantis? This will avoid cluttering the list with file attachments that
> are going to bounce when sent to the other gmail users. It's possible
> Paul's zoho provider bounced your message. If they reject zip files, they
> may reject .7z files too.

Happy either way - FYI I did get the 7z attachments.

Cheers,

Paul

David McKenna

Jul 23, 2022, 9:14:40 AM7/23/22
to Apache for OS/2
OK, I created the ticket (3300) at ArcaNoae and uploaded the dump file. 

Regards,

Steven Levine

Jul 23, 2022, 3:00:12 PM7/23/22
to apa...@googlegroups.com
In <14df4ce2-35d8-4ffc...@googlegroups.com>, on 07/23/22
at 06:14 AM, David McKenna <davidmc...@gmail.com> said:

Hi David,

>OK, I created the ticket (3300) at ArcaNoae and uploaded the dump file.

The dump file tells us that the trap is at the same location as the other
afinetk traps. The code path is slightly different so the patch does not
get to do its thing.

I will tweak the patch to handle this.

David A surprised me by jumping in immediately on the ticket. I
suggested he wait until I had a chance to get some answers from the dump
file.

Steven Levine

Jul 24, 2022, 9:23:16 PM7/24/22
to apa...@googlegroups.com
In <14df4ce2-35d8-4ffc...@googlegroups.com>, on 07/23/22
at 06:14 AM, David McKenna <davidmc...@gmail.com> said:

Hi David,

>OK, I created the ticket (3300) at ArcaNoae and uploaded the dump file.

A closer look at the dump file tells us it was supposed to trap. The code
path looks a bit different because the trap occurred before you installed
the patched afinetk.sys on the siege client.

I will mark the ticket as a duplicate of 3177.

David McKenna

Jul 24, 2022, 10:59:55 PM7/24/22
to Apache for OS/2
Hi Steven,

  OK, I just uploaded a new dump file from a trap while using the test afinetk.sys you gave me. This trap happened about 4 minutes into a siege using URL's in SMP mode. The server was unaffected.

Regards,

Steven Levine

Jul 25, 2022, 2:52:59 PM7/25/22
to apa...@googlegroups.com
In <12629d98-4734-4e55...@googlegroups.com>, on 07/24/22
at 07:59 PM, David McKenna <davidmc...@gmail.com> said:

Hi David,

> OK, I just uploaded a new dump file from a trap while using the test
>afinetk.sys you gave me. This trap happened about 4 minutes into a siege
>using URL's in SMP mode.

As expected, this trap occurred for a different reason. Initial analysis
implies an SMP related failure, but I need to spend some more time to
better understand what the running threads were doing at the time of the trap.

Can you check if running the client with the patched afinetk.sys and
/MAXCPU=1 avoids the trap?

We should probably do the same check and determine if just execmode -sp is
sufficient to avoid the trap.

To avoid uploading duplicate dump files, you should probably check if the
traps are different. Do you know how to do this?
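
One rough way to make that comparison, sketched in Python: pull the module name, trap number and CS:EIP out of the trap screen text and compare only those, since general registers such as EAX can differ between two instances of the same trap. The function name `trap_signature` and the choice of these three fields are assumptions made for this sketch, not a documented procedure, and the sample text is abbreviated.

```python
import re


def trap_signature(trap_screen: str):
    """Extract (module, trap number, CS:EIP) from trap screen text."""
    module = re.search(r"Exception in module:\s*(\S+)", trap_screen)
    trap = re.search(r"TRAP\s+([0-9a-fA-F]{4})\b", trap_screen)
    cs_eip = re.search(r"CS:EIP=([0-9a-fA-F]+:[0-9a-fA-F]+)", trap_screen)
    # A field that is missing from the text is reported as None.
    return tuple(m.group(1) if m else None for m in (module, trap, cs_eip))


# Abbreviated, illustrative trap screens that differ only in EAX.
first = (
    "Exception in module: SOFFICE\n"
    "TRAP 000e ERRCD=0000 CPU=01\n"
    "EAX=f3654114\n"
    "CS:EIP=0168:00000000\n"
)
second = first.replace("EAX=f3654114", "EAX=f36f3914")

same_trap = trap_signature(first) == trap_signature(second)
```

Two dumps with the same signature may still differ in useful ways, so this is only a first filter.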

Thanks,

Steven Levine

Jul 25, 2022, 3:54:30 PM7/25/22
to apa...@googlegroups.com
at 03:36 PM, David McKenna <davidmc...@gmail.com> said:

Hi David,

I've reviewed the data in your *run*.7z files.

The initial cause of each set of failures is reported in the error log files
as:

[Fri Jul 22 15:44:16 2022] zend_mm_free_heap detected heap corrupted for
pid:120 (78) tid:1 chunk->heap 0x25200040 heap 0x21400040

The error reports and traps are simply side-effects of this initial
failure.

Each chunk is owned by a heap and the chunk header points to the heap that
owns the chunk. This relationship is created when the chunk is allocated.
For some as yet unknown reason and at some as yet unknown location in the
code, the code attempts to free the chunk from the wrong heap. Both the
heap and the chunk pointers look valid so I don't think the problem is
that something clobbered the pointer in the chunk header.
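
To illustrate the ownership check described above, here is a toy Python model. It is not the php allocator code; the class and function names are invented for this sketch, and the heap names simply reuse the two addresses from the log line.

```python
class Heap:
    """Toy heap that is identified only by its name."""

    def __init__(self, name):
        self.name = name

    def alloc(self):
        # The owning heap is recorded in the chunk header at allocation time.
        return Chunk(owner=self)


class Chunk:
    def __init__(self, owner):
        self.heap = owner  # back-pointer stored in the chunk header


class HeapCorrupted(Exception):
    pass


def free_chunk(heap, chunk):
    # Freeing through any heap other than the recorded owner is reported
    # as corruption, even though both pointers are individually valid.
    if chunk.heap is not heap:
        raise HeapCorrupted(f"chunk->heap {chunk.heap.name} heap {heap.name}")
    chunk.heap = None


heap_a = Heap("0x25200040")
heap_b = Heap("0x21400040")
chunk = heap_a.alloc()

try:
    free_chunk(heap_b, chunk)  # wrong heap: detected
except HeapCorrupted as exc:
    mismatch = str(exc)

free_chunk(heap_a, chunk)  # owning heap: succeeds
```

The point of the model is that the check can fire without any pointer being clobbered; it only takes a free request routed to the wrong heap.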

Can we check if this error can still occur with /MAXCPU=1? My notes don't
say one way or another.

Thanks,

David McKenna

Jul 25, 2022, 5:13:47 PM7/25/22
to Apache for OS/2
Hi Steven,

  I have tested with the 'siege' computer running /MAXCPU=1 with URL's and have never got a trap yet. I have tested with 'siege' set to single processor mode via 'EXECMODE -sp' with URL's and SMP and got a trap once (probably the one from the earlier dump I'm guessing). I always get a trap on the siege computer when I run siege with URL's and full SMP.

  The error files are all from the Apache server computer - are you saying the 'heap corrupted' errors from Apache are causing the siege computer to trap?

Regards,

David McKenna

Jul 25, 2022, 5:16:53 PM7/25/22
to Apache for OS/2
BTW - since you gave me that test afinetk.sys, I have not got one single trap on the Apache server computer...

Regards,

Steven Levine

Jul 25, 2022, 5:51:39 PM7/25/22
to apa...@googlegroups.com
In <06652502-ef47-4b9d...@googlegroups.com>, on 07/25/22
at 02:16 PM, David McKenna <davidmc...@gmail.com> said:

Hi,

>BTW - since you gave me that test afinetk.sys, I have not got one single
>trap on the Apache server computer...

This is good. I assume you mean system traps in afinetk. Unfortunately
trap is a generic term so context helps. We have system traps and process
traps. Process traps are what get us popuplog entries and exceptq
reports.

David McKenna

Jul 25, 2022, 6:00:09 PM7/25/22
to Apache for OS/2
Hi Steven,

  Yes, system traps. I have gotten process traps (as seen in the siege test run 7zips I uploaded earlier). Thanks for setting me straight on that...

Regards,

Steven Levine

Jul 25, 2022, 6:19:49 PM7/25/22
to apa...@googlegroups.com
In <ea1a754b-c051-4a7b...@googlegroups.com>, on 07/25/22
at 02:13 PM, David McKenna <davidmc...@gmail.com> said:

Hi David,

> I have tested with the 'siege' computer running /MAXCPU=1 with URL's
>and have never got a trap yet. I have tested with 'siege' set to single
>processor mode via 'EXECMODE -sp' with URL's and SMP and got a trap once
>(probably the one from the earlier dump I'm guessing).

Makes sense. The older system dump was definitely the result of the
now-patched afinetk.sys issue. The current system dump is something else.

>I always get a
>trap on the siege computer when I run siege with URL's and full SMP.

I assume you mean when running with the patched afinetk.sys too. If so,
this is expected because the reason for the system trap does not yet
appear to have anything to do with afinetk.sys.

> The error files are all from the Apache server computer - are you
>saying the 'heap corrupted' errors from Apache are causing the siege
>computer to trap?

No such luck. We have two different issues to resolve. One is on the
server where it gets into a state that it corrupts the heap. This is most
likely an out of memory issue that's not handled properly. The good news
is that it results in a process trap rather than a system trap. I am
thinking about how best to capture more data at the time the error is
detected. There's nothing you can do yet for this. I need to instrument
php7.dll to generate an exceptq report and/or a process dump rather than
the limited log entry we get now.

The other issue is on the client. Since the client does not trap when
running /MAXCPU=1, the implication is that the existing locking is not
sufficient when running in SMP mode. Just to be sure, please do a bit
more testing with execmode -sp. The results will help me determine where
I should be looking.

FWIW, when I look at the dump there's some evidence that you are running
with one of those large hosts files that resolves known spammers to
0.0.0.0. If so, we might want to test a small hosts file that omits these
entries.

Thanks,

Steven
Regards,

Steven Levine

Jul 25, 2022, 6:28:40 PM7/25/22
to apa...@googlegroups.com
In <e6e94034-461a-437c...@googlegroups.com>, on 07/25/22
at 03:00 PM, David McKenna <davidmc...@gmail.com> said:

Hi,

> Yes, system traps. I have gotten process traps (as seen in the siege
>test run 7zips I uploaded earlier). Thanks for setting me straight on
>that...

NP. It's a messy subject. If you're curious, grab a copy of the OS/2
Debugging Handbook from

http://www.edm2.com/index.php/The_OS/2_Debugging_Handbook

unless you already have one and browse the sections that discuss traps and
exceptions and such.

David McKenna

Jul 25, 2022, 7:30:42 PM7/25/22
to Apache for OS/2
Hi Steven,

  Yes, the siege computer traps with the test afinetk.sys when siege is run full SMP with URL's every time. I am indeed using an anti-ad HOSTS file. I'll move it away and test without to see if there is any effect. I'll also set siege to run in single processor mode and see if I get any more system traps that way... might take a while to be sure. And thanks for the link to the debugging guide!

Regards,

David McKenna

Jul 25, 2022, 9:03:51 PM7/25/22
to Apache for OS/2
 Well, an unexpected result...  I set my HOSTS file on the siege computer to contain only '127.0.0.1 localhost' and ran siege using URL's and full SMP for 15 minutes 3 times back-to-back. No system traps! I would only see a couple 'time out' alerts while it ran and all ended normally showing 99.85% success rate with ~25000 hits. I also noticed that the CPU meter on the task bar would only hit at most 25% on 1 CPU, but mostly stayed around 6% with occasionally a second CPU coming on and hitting maybe 5%.

 Put back in the old anti-ad HOSTS file (from https://www.github.com/StevenBlack/hosts), ran siege and within a couple of minutes the system trapped like before. All 4 CPU's were pegged the whole time. I would see several errors while running like 'temporary resolution error' and 'Failed to make an SSL connection' followed by 'SSL_write() failed (syscall)'. So it seems the HOSTS file is an accomplice in this situation...

 Apache didn't flinch during any of this - no process traps and the error log only showed a few 'resource temporarily unavailable' notices.  I have the dump from this system trap if you want to see it, but to me looks just like the other one already uploaded.

Regards,

Steven Levine

Jul 26, 2022, 2:36:54 AM7/26/22
to apa...@googlegroups.com
In <cf699463-25cd-4e9a...@googlegroups.com>, on 07/25/22
at 06:03 PM, David McKenna <davidmc...@gmail.com> said:

Hi,

> Well, an unexpected result...

Not totally. I've seen a couple of instances where unexpectedly large
internal tables cause problems in the stack. I don't recall this one
specifically, but I have seen dns queries with exceptionally large answer
lists take out the stack.

>showing 99.85% success rate with ~25000 hits. I also noticed that the
>CPU meter on the task bar would only hit at most 25% on 1 CPU, but
>mostly stayed around 6% with occasionally a second CPU coming on and
>hitting maybe 5%.

It's interesting that siege does not appear to be able to keep all four
CPUs active.

>'SSL_write() failed (syscall)'. So it seems the HOSTS file is an
>accomplice in this situation...

I'm sure the size of the HOSTS file was responsible for a lot of this.
With the large HOSTS file, each thread is going to spend a lot of CPU
cycles reparsing the HOSTS file. IIRC, the code that does this is naive
and does little or no caching.
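
For a feel of why that matters, here is a small Python sketch contrasting a lookup that rescans the whole file on every call with one that parses the file once into a table. It is an illustration only and says nothing about how the OS/2 stack actually parses the file; the function names and the sample entries are made up.

```python
import io


def lookup_uncached(hosts_text, name):
    """Scan the whole hosts text on every lookup (the naive approach)."""
    for line in io.StringIO(hosts_text):
        fields = line.split("#", 1)[0].split()  # drop comments, split fields
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


def build_cache(hosts_text):
    """Parse once into a dict so later lookups avoid rescanning."""
    cache = {}
    for line in io.StringIO(hosts_text):
        fields = line.split("#", 1)[0].split()
        for alias in fields[1:]:
            cache.setdefault(alias, fields[0])  # first entry wins
    return cache


sample = "127.0.0.1 localhost\n0.0.0.0 ads.example.com  # blocked\n"
cache = build_cache(sample)
```

With a multi-megabyte file, the uncached version does work proportional to the file size on every lookup, in every thread, while the cached version pays that cost once.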

>I
>have the dump from this system trap if you want to see it, but to me
>looks just like the other one already uploaded.

When you say the same, do you mean that the trap screen looks the same
when viewed with pmdf? How are you comparing the dump files?

What I see with pmdf is:

# TRAP SCREEN INFORMATION
OS/2 Kernel Revision 14.203_SMP
Exception in module: SOFFICE
TRAP 000e ERRCD=0000 ERACC=**** ERLIM=******** CPU=01
EAX=f3654114 EBX=f9c46996 ECX=f3654114 EDX=00000000 ESI=00000000
EDI=f3654114 EBP=000052b8 FLG=00010286
CS:EIP=0168:00000000 CSACC=c09b CSLIM=ffffffff
SS:ESP=1530:00005244 SSACC=1097 SSLIM=0000449f
DS=0160 DSACC=c093 DSLIM=ffffffff CR0=8001003b
ES=0160 ESACC=c093 ESLIM=ffffffff CR2=00000000
FS=0000 FSACC=**** FSLIM=********
GS=0000 GSACC=**** GSLIM=********

The lack of an EIP that points anywhere useful is not helpful. :-)

Scanning the stack for interesting symbols, we find:

# %findsym 1530:00005244 l500 (ss:sp)
Finding from %fef6cda4 to %fef6d2a4 by dword
1530:52c4 fef6ce24: ffec2ea8 = _TICKSpinLock
1530:534c fef6ceac: fff0f667 = _FormatRegisters
1530:5360 fef6cec0: fff0f667 = _FormatRegisters
1530:5368 fef6cec8: fff14d67 = MPHaltFF + 86
1530:5370 fef6ced0: fff14452 = _IPIRouter + 66
1530:5398 fef6cef8: ffebaed4 = _PCBFirst
1530:53a4 fef6cf04: 001209ab = _apiop + 7df
1530:53b0 fef6cf10: fff11543 = spin_4_lock + 6
1530:53e4 fef6cf44: fff08af2 = EndIntHookOuter + 171
1530:53f8 fef6cf58: fff50053 = _VMAllocKHB + 5bf
1530:53fc fef6cf5c: 17f4036c = aSharememService + b8
1530:5424 fef6cf84: 1e130719 = strpbrk + 21
Invalid linear address: %fef6d000
Scan stopped at fef6d000 stopaddr fef6d2a4

The IPIRouter reference typically means that the trap occurred on another
CPU and was sent to CPU 1 for reporting. It's not entirely clear which
thread on which CPU trapped. Currently I suspect it's CPU 3 which is
trying to read the hosts file at the time of the trap.

Please upload the dump file. Even if it's the same trap, with luck the
associated data will differ enough to provide some additional clues.

David McKenna

Jul 26, 2022, 6:14:31 AM7/26/22
to Apache for OS/2
Steven,

  OK... trap file uploaded....

Regards,

Steven Levine

Jul 26, 2022, 1:42:29 PM7/26/22
to apa...@googlegroups.com
In <df9e936a-4990-40cc...@googlegroups.com>, on 07/25/22
at 04:30 PM, David McKenna <davidmc...@gmail.com> said:


Hi,

>I'll move it away and test without to see if there is any effect.

If you want to experiment, you might want to try different subsets of the
anti-ads hosts file. Perhaps there is a size that avoids the traps. The
code that reads the host files is 32-bit code, but there are still parts
of the stack that have 16-bit dependencies.
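
If it helps with building the subsets, here is a minimal Python sketch that cuts a hosts file down to a size limit on whole-line boundaries, taking lines from either end. The function name `trim_hosts` and the sample entries are hypothetical; run it against a copy of the file, not the live one.

```python
def trim_hosts(data: bytes, max_bytes: int, from_end: bool = False) -> bytes:
    """Return whole lines from one end of data totalling at most max_bytes."""
    lines = data.splitlines(keepends=True)
    if from_end:
        lines.reverse()
    kept, size = [], 0
    for line in lines:
        if size + len(line) > max_bytes:
            break  # never keep a partial line
        kept.append(line)
        size += len(line)
    if from_end:
        kept.reverse()  # restore original order
    return b"".join(kept)


sample = b"127.0.0.1 localhost\n0.0.0.0 a.example\n0.0.0.0 b.example\n"
head = trim_hosts(sample, 40)
tail = trim_hosts(sample, 40, from_end=True)
```

Note that a subset taken from the end will not include a localhost entry that sits at the top of the file, so it may need to be added back by hand.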

FWIW, the folks that implemented the 32-bit tcpip32.dll goofed and linked
the DLL single thread. This resulted in the expected issues when SMP
machines became more widely used. IBM's fix was to wrap some of the known
failing calls in a mutex. As to why they did not link the DLL multi-thread
we will probably never know. As often happens with band-aid fixes, IBM
did not wrap all the calls that needed wrapping. The DLL you are using
was patched many years ago to deal with this. However, it's always
possible that more patching is needed and that your traps are related.
The dump files will tell us more over time.

Steven Levine

Jul 26, 2022, 2:43:09 PM7/26/22
to apa...@googlegroups.com
In <9f01d198-d807-4ff8...@googlegroups.com>, on 07/26/22
at 03:14 AM, David McKenna <davidmc...@gmail.com> said:

Hi David,

> OK... trap file uploaded....

Thanks. We got lucky. The trap is at the same location

TRAP SCREEN INFORMATION
OS/2 Kernel Revision 14.203_SMP
Exception in module: SOFFICE
TRAP 000e ERRCD=0000 ERACC=**** ERLIM=******** CPU=01
EAX=f36f3914 EBX=f9c46996 ECX=f36f3914 EDX=00000000 ESI=00000000
EDI=f36f3914 EBP=000052b8 FLG=00010286
CS:EIP=0168:00000000 CSACC=c09b CSLIM=ffffffff
SS:ESP=1530:00005244 SSACC=1097 SSLIM=0000449f
DS=0160 DSACC=c093 DSLIM=ffffffff CR0=8001003b
ES=0160 ESACC=c093 ESLIM=ffffffff CR2=00000000
FS=0000 FSACC=**** FSLIM=********
GS=0000 GSACC=**** GSLIM=********

No Symbols Found

but it triggered on CPU 1, so dumping the ring0 stack we get

Location Address Symbol Description
-------- ------- ------ -----------
%f8a9ad2c 0108:8020 PDD Data Segment for ___HLP$
%f8a9ad32 1000:36bc _tkPanicMsgLen + 104
%f8a9ad38 %fff0f667 _FormatRegisters
%f8a9ad4c %fff0f667 _FormatRegisters
%f8a9ad52 0100:ea60 PDD Code Segment for ___HLP$
%f8a9ad58 %fff0e3f7 KernelFaultEntry + 183
%f8a9ada0 %00010285 SIEGE _array_pop + 1
%f8a9adf2 0400:0000 _PSDStack
%f8a9ae1c %f6105bf0 SOCKETS recvit + 19a
%f8a9ae3e %0001e564 SIEGE _load_conf + e0b
%f8a9ae48 %f6103440 SOCKETS soo_winsock_select + a49
%f8a9ae4e %000103ca SIEGE _array_print + 156
%f8a9aea0 %f6105950 SOCKETS recvmsg + 87
%f8a9aece %000103ca SIEGE _array_print + 156
%f8a9aee8 %f610ff40 SOCKETS GenIOCtrl32 + dc
%f8a9af00 %fff115cd _MPR0SubSysEnter + d
%f8a9af08 %f610fa27 SOCKETS KEECallgateEntry + 42
%f8a9af22 0b20:fe38 PDD Code Segment for UNICODE$
Stack Frame Address f8a9af44 reached

There is no IPI, which simplifies analysis.

Looking at the last call on the stack, we find

# %lnc recvit + 19a
Looking for call instruction related to recvit + 19a
SOCKETS soreceive:
%f610c951 55 push ebp
ln %f610c951
%f610c951 SOCKETS soreceive

This tells me I need to figure out where and how soreceive got lost.

BTW, I thought I had explained the SOFFICE reference, but I don't see
anything in mail archives. The trap address is reported to the kernel as
168:00000000 because something "jumped to zero." When the kernel attempts
to convert this to a module name it determines that 168 is a FLAT
selector, so it searches the 32-bit objects of the loaded modules for one
that includes offset 00000000 and since every .exe that contains 32-bit
code includes an object that contains offset 00000000 it reports the first
one it finds. The output is not technically correct, but no one noticed
it when Scott added the code. Other than looking odd, the output is not a
problem in practice.
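
A toy Python model of that first-match behavior, for illustration only. The real kernel code is not shown in this thread; the load order, object ranges and function name below are all invented.

```python
def module_for_offset(modules, offset):
    """Return the first module with an object covering offset."""
    for name, objects in modules:
        for base, size in objects:
            if base <= offset < base + size:
                return name  # first match wins, right or wrong
    return None


# Hypothetical load order and object ranges.
loaded = [
    ("SOFFICE", [(0x00000000, 0x10000)]),
    ("SIEGE", [(0x00000000, 0x10000)]),
]

reported = module_for_offset(loaded, 0x00000000)
```

Both modules cover offset 0 in this model, so whichever one is searched first gets the blame.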

The .p# command reports the correct module:

# .p#
Slot Pid Ppid Csid Ord Sta Pri pTSD pPTDA pTCB Disp SG Name
*009a# 0059 0054 0059 0014 run 0300 f8a9a000 fe3c83f8 fe380a80 0d88 14 SIEGE

and the r command p field shows us the CPU the thread was running on

# r
eax=00000050 ebx=00004f7b ecx=00000000 edx=0520ec64 esi=0520ed64 edi=0520ec90
eip=1e130027 esp=0520ec50 ebp=0520ec78 iopl=0 -- -- -- nv up ei ng nz na po cy
cs=005b ss=0053 ds=0053 es=0053 fs=150b gs=0000 cr2=00000000 cr3=00211000 p=00
005b:1e130027 c3 retd

The p values are off by one, so this is really CPU 1. This is a pmdf nit.

David McKenna

Jul 29, 2022, 3:46:55 PM7/29/22
to Apache for OS/2
 Had a chance to do some more testing last night. Here is what I did:

1. Changed all the 0.0.0.0 addresses in the HOSTS file (which is about 3.9MB in size) to 127.0.0.1, because that is what it used to always be, but was changed to 0.0.0.0 for some reason. Didn't help though, got a trap (in SOFFICE) in less than a minute that looks a lot like the last one I uploaded. This one says 'Call gate is not in a call' in 'SYSTEM SYNOPSIS'. Don't know if this matters, but I noticed that the call gate statement was different in each trap I got.

2. Set back to 0.0.0.0, then cut back the HOSTS file to just under 2MB in size. Ran a 15 minute siege, but it trapped in about 5 minutes. This trap is different in that it trapped in SIEGE instead of SOFFICE. This one is also different because in SYSTEM SYNOPSIS it has this towards the end:

Call Gate:
 is in a call to:
%1ffc08f8 DOSCALL1 DOS32ISETFILEPTR  

with a list of callers.
 
3. Trimmed HOSTS file to just under 1MB in size. Ran a 15 minute siege that trapped after about 10 minutes. Also trapped in SIEGE, and SYSTEM SYNOPSIS says the call gate: Code has been swapped out.

4. Trimmed HOSTS again to just under 512MB, then ran a 15 minute siege - it completed with a 99.85% success rate. Maybe I should increase the run time and try again....

 Anyway, I have all of the dumps for the trials that trapped if you are interested. Let me know if there are any other tests I can try...

Regards,

David McKenna

Jul 29, 2022, 3:49:14 PM7/29/22
to Apache for OS/2
 Forgot to mention that the CPU meter decreased in activity as the HOSTS file got smaller. At 512MB it even would go down to only 2 CPU's active occasionally...

Regards,

David McKenna

Jul 29, 2022, 3:51:10 PM7/29/22
to Apache for OS/2
 512KB... sorry...

Steven Levine

Jul 30, 2022, 1:51:08 AM7/30/22
to apa...@googlegroups.com
In <c49d41ac-0649-4dab...@googlegroups.com>, on 07/29/22
at 12:46 PM, David McKenna <davidmc...@gmail.com> said:

Hi David,

Thanks for testing. Personally, I find this kind of testing annoying, but
necessary.

>1. Changed all the 0.0.0.0 addresses in the HOSTS file (which is about
>3.9MB in size) to 127.0.0.1, because that is what it used to always be,
>but was changed to 0.0.0.0 for some reason. Didn't help though, got a
>trap (in SOFFICE) in less than a minute that looks a lot like the last
>one I uploaded.

This is pretty much what I would have expected. As you found there is
some undocumented size limit.

>This one says 'Call gate is not in a call' in 'SYSTEM
>SYNOPSIS'. Don't know if this matters, but I noticed that the call gate
>statement was different in each trap I got.

What you see is going to depend on what CPU traps and whether or not the
trapping CPU is CPU 1.

>Call Gate:
> is in a call to:
>%1ffc08f8 DOSCALL1 DOS32ISETFILEPTR

I have something similar in one of the traps. I've not done the drill to
convert the file handle to a file name, but it's almost surely the host
file.

>3. Trimmed HOSTS file to just under 1MB in size. Ran a 15 minute siege
>that trapped after about 10 minutes. Also trapped in SIEGE, and SYSTEM
>SYNOPSIS says the call gate: Code has been swapped out.

FWIW, pmdf is not telling the whole truth here. The message is a holdover
from the days when swapping actually occurred. The code is actually in
memory, but the page table mappings have not marked the page present
because they were not recently accessed. When running under the kernel
debugger, there is the .i command which can force the needed page table
updates to make the code accessible. Pmdf cannot do this.

> Anyway, I have all of the dumps for the trials that trapped if you are
>interested. Let me know if there are any other tests I can try...

Please hang on to them for a while. I don't need them yet.

Steven Levine

Jul 30, 2022, 2:09:24 AM7/30/22
to apa...@googlegroups.com
In <f4cb2caf-84c1-4e3b...@googlegroups.com>, on 07/29/22
at 12:49 PM, David McKenna <davidmc...@gmail.com> said:

Hi,

> Forgot to mention that the CPU meter decreased in activity as the HOSTS
>file got smaller. At 512MB it even would go down to only 2 CPU's active
>occasionally...

Reading and parsing a 512KB file repeatedly is going to keep even a fast
CPU busy.

I don't recall exactly what triggers the HOSTS file to be read, but I do
recall thinking that it happens more often than is needed. It may occur
for every domain to IP address conversion.

As to why this would result in the memory corruption we see is an open
question. Under the kernel debugger I can set a breakpoint at the address
that traps on your system. This allowed me to understand what the pointer
should point to and what it does. I still don't have any good ideas why
the big hosts file causes the pointer to be destroyed.

David McKenna

Jul 30, 2022, 12:24:20 PM7/30/22
to Apache for OS/2
 Did a little more testing this morning. Here is what I did:

1. Ran siege again using the 512kb HOSTS file for 20 minutes. It didn't trap.

2. Wondered if somehow, something about the contents of the HOSTS file might influence the result, so created a new 512kb HOSTS file using the end of the original one (instead of the beginning as before). It trapped in about 3 minutes.

3. Ran siege again with the 1st 512kb HOSTS file. It trapped in about 5 minutes.

4. Ran again with the 512kb HOSTS file from the end of the original one. It didn't trap in 20 minutes.

 So it seems there is not some magic cut-off value for the HOSTS file size, but the chance of a trap decreases as the HOSTS file gets smaller...

Regards,
