Bug with auto-detecting the number of workers


Edward Dore

Apr 17, 2015, 10:13:27 AM
to openlitespee...@googlegroups.com
I've installed OpenLiteSpeed 1.3.10 from the yum repository on a server with 16 cores/32 threads and have run into some rather strange behaviour out of the box, as soon as it is launched with the default configuration provided by the RPM.

Basically, the parent openlitespeed process, which runs as root, sits permanently at 100% CPU usage from the moment it is launched.

The admin web interface works, but anything that interacts with the service itself, such as a graceful restart or toggling the debug log, just sits loading the page for ages before eventually timing out.

The /usr/local/lsws/logs/error.log file will then contain five of the following entries:

2015-04-17 13:48:28.072 [NOTICE] [AdminPHP] Send SIGTERM to process [3889].

These are followed by the entry below, printed once a second indefinitely:

2015-04-17 13:48:33.091 [NOTICE] [AdminPHP] Send SIGKILL to process [3889] that won't stop.

The service is also completely unresponsive to the stop or restart commands via /usr/local/lsws/bin/lswsctrl, and the only way to get it to stop is kill -9 on both the parent process and worker #01.

I eventually tracked this down to some very odd behaviour around the configured number of workers. The value for this option is blank by default, which I guess means that it is supposed to be auto-detected?

The help for this option says that it should be set to an integer between 1 and 16, but I guess the auto-detection based on the number of processor cores/threads is failing for some reason and setting it to a larger number.
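
I assume the detection is based on something like sysconf(_SC_NPROCESSORS_ONLN), which on this box would return 32 (16 cores with Hyper-Threading), i.e. double the documented maximum. A minimal sketch of what I imagine PCUtil::getNumProcessors() does - this is just my guess, not the actual OpenLiteSpeed code:

#include <unistd.h>  // sysconf, _SC_NPROCESSORS_ONLN
#include <cstdio>

// Hypothetical stand-in for PCUtil::getNumProcessors(): report the
// number of logical processors currently online.
static int guessNumProcessors()
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n > 0) ? (int)n : 1;  // fall back to 1 if detection fails
}

int main()
{
    // On this 16-core/32-thread server this prints 32, which is above
    // the 1-16 range that the help text documents for the option.
    printf("detected processors: %d\n", guessNumProcessors());
    return 0;
}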

As far as I can tell, with the number of workers option left blank, it always spawns 20 workers, numbered somewhere between #01 and #32. The exact numbers differ each time, but the first one is always #01 and the last one is always #32.

An example from "ps auxf" (I'm not sure how well this will paste!):

root      3974 99.3  0.0  38140  3292 ?        R    14:19   1:16 openlitespeed (lshttpd - main)
root      3976  0.0  0.0  35292   944 ?        S    14:19   0:00  \_ openlitespeed (lscgid)
nobody    3978  0.0  0.0  37548  3260 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #01)
nobody    3979  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #02)
nobody    3980  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #03)
nobody    3981  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #06)
nobody    3982  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #08)
nobody    3983  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #09)
nobody    3984  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #12)
nobody    3985  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #14)
nobody    3986  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #16)
nobody    3987  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #19)
nobody    3988  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #20)
nobody    3989  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #21)
nobody    3990  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #23)
nobody    3991  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #25)
nobody    3992  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #26)
nobody    3993  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #27)
nobody    3994  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #28)
nobody    3995  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #29)
nobody    3996  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #30)
nobody    3997  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #31)
nobody    3998  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #32)

As soon as I set the number of workers option down to 16, all of the strange behaviour goes away and everything works as normal.
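
For anyone else hitting this: I changed the value under Server > General in the WebAdmin console. If you'd rather edit the config file directly, I believe the setting corresponds to an httpdWorkers element in /usr/local/lsws/conf/httpd_config.xml (treat the exact placement as a guess on my part):

<httpdWorkers>16</httpdWorkers>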

I've tried this on the latest versions of both CentOS 6 and 7 (both clean installs), so I don't think it's anything kernel-specific around detecting the number of processors. This server has a pair of Intel Xeon E5-2667 v3 processors.

Regards,
Edward

Kevin Fwu

Apr 24, 2015, 10:47:24 AM
to openlitespee...@googlegroups.com
Hi Edward,

Just to update you, I was able to reproduce this error and will be working on fixing it.  

I have no ETA, but I will be sure to update you when the problem is resolved.

Kevin

Kevin Fwu

Apr 24, 2015, 11:30:40 AM
to openlitespee...@googlegroups.com
Hi Edward,

The patch at the bottom should fix the default number of workers issue.

The strange behavior itself is still a bug and will require more time to fix, but if you just want to use the default number of workers, this patch should take care of it.

Kevin


Patch:

diff --git src/main/httpserver.cpp src/main/httpserver.cpp
index 4e8ed94..b9dc89b 100644
--- src/main/httpserver.cpp
+++ src/main/httpserver.cpp
@@ -2407,9 +2407,11 @@ int HttpServerImpl::configServerBasics(int reconfig, const XmlNode *pRoot)
         procConf.setPriority(ConfigCtx::getCurConfigCtx()->getLongValue(pRoot,
                                      "priority", -20, 20, 0));
 
+        int iNumProc = PCUtil::getNumProcessors();
+        iNumProc = (iNumProc > 8 ? 8 : iNumProc);
         HttpServerConfig::getInstance().setChildren(
                 ConfigCtx::getCurConfigCtx()->getLongValue(pRoot,
-                        "httpdWorkers", 1, 16, PCUtil::getNumProcessors()));
+                        "httpdWorkers", 1, 16, iNumProc));
 
         const char *pGDBPath = pRoot->getChildValue("gdbPath");
 

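In other words, with this patch the auto-detected default becomes the processor count capped at 8, while an explicitly configured httpdWorkers value is still accepted in the 1 to 16 range as before (that is my reading of the min/max arguments to getLongValue).
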
Kevin Fwu

Apr 24, 2015, 12:19:28 PM
to openlitespee...@googlegroups.com
Hi Edward,

The following patch should fix the 100% CPU problem if you still want to be able to use 32 or more worker processes.

Let me know if either/both of these fix the issues!
Kevin

Patch: 

diff --git src/main/lshttpdmain.cpp src/main/lshttpdmain.cpp
index 86ef1af..61640c1 100644
--- src/main/lshttpdmain.cpp
+++ src/main/lshttpdmain.cpp
@@ -1318,8 +1318,11 @@ int LshttpdMain::guardCrash()
     struct pollfd   pfds[2];
     HttpSignals::init(sigchild);
     if (iNumChildren >= 32)
+    {
         m_pProcState = (int *)malloc(((iNumChildren >> 5) + 1) * sizeof(
                                          int));
+        memset(m_pProcState, 0, ((iNumChildren >> 5) + 1) * sizeof(int));
+    }
     startAdminSocket();
     pfds[0].fd = m_fdAdmin;
     pfds[0].events = POLLIN;

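For anyone curious about the root cause: when there are 32 or more children, the per-child state bitmask was allocated with malloc() but never zeroed, so the guard loop could see garbage state bits and spin. A stripped-down illustration of the fix - hypothetical names, not the actual lshttpdmain.cpp code:

#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    int iNumChildren = 32;
    // One int holds the state bits for 32 children, hence the >> 5.
    size_t nInts = (iNumChildren >> 5) + 1;

    // Before the patch: malloc() leaves the block uninitialized, so
    // every child's state bits start out as garbage.
    int *pProcState = (int *)malloc(nInts * sizeof(int));

    // The patch zeroes the block right after allocating it.
    // (calloc(nInts, sizeof(int)) would do both steps in one call.)
    memset(pProcState, 0, nInts * sizeof(int));

    // Example: set and test the bit for child #7.
    int child = 7;
    pProcState[child >> 5] |= (1 << (child & 31));
    printf("child 7 flagged: %d\n",
           (pProcState[child >> 5] >> (child & 31)) & 1);

    free(pProcState);
    return 0;
}
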
Edward Dore

Apr 24, 2015, 5:11:12 PM
to Kevin Fwu, openlitespee...@googlegroups.com
Hi Kevin,

Thanks for the patch. I'm afraid I can't test it right now as we're using the RPM version of OpenLiteSpeed from the yum repository and I don't really want to try and build it from source on what is now a production server.

We're fine running 16 processes - I'm sure it's far more than we will ever need anyway :)

I'll hopefully have another server with a large number of processor cores available for testing next week, so I'll try and give the patch a go then and see if it fixes the problem.

Edward


Edward Dore

Apr 27, 2015, 11:27:51 AM
to Kevin Fwu, openlitespee...@googlegroups.com
Hi Kevin,

I've tried this on a new server with 48 cores today. The stock OpenLiteSpeed 1.3.10 and 1.4.7 source tarballs both exhibit exactly the same behaviour, but once I applied your patch to OpenLiteSpeed 1.4.7, it worked as expected - the parent process was using 0% CPU and 48 workers were spawned.

Edward

Kevin Fwu

Apr 27, 2015, 11:32:10 AM
to openlitespee...@googlegroups.com, kf...@litespeedtech.com
Hi Edward,

Perfect, thanks for the update!

Also, thanks for letting us know about the bug, always appreciated.

Kevin