Bug with auto-detecting the number of workers


Edward Dore

Apr 17, 2015, 10:13:27 AM
to openlitespee...@googlegroups.com
I've installed OpenLiteSpeed 1.3.10 from the yum repository on a server with 16 cores/32 threads and have run into some rather strange behaviour out of the box, as soon as it is launched with the default configuration provided by the RPM.

Basically, the parent openlitespeed process, which runs as root, sits permanently at 100% CPU usage from the moment it is launched.

The admin web interface works, but anything that interacts with the service itself, such as a graceful restart or toggling the debug log, just sits loading the page for ages before eventually timing out.

The /usr/local/lsws/logs/error.log file will then contain five of the following entries:

2015-04-17 13:48:28.072 [NOTICE] [AdminPHP] Send SIGTERM to process [3889].

These are followed by the entry below, printed once a second indefinitely:

2015-04-17 13:48:33.091 [NOTICE] [AdminPHP] Send SIGKILL to process [3889] that won't stop.

The service is also completely unresponsive to the stop or restart commands via /usr/local/lsws/bin/lswsctrl, and the only way to get it to stop is kill -9 on both the parent process and worker #01.

I eventually tracked this down to some very odd behaviour around the configured number of workers. The value for this option is blank by default, which I guess means that it is supposed to be auto-detected?

The help for this option says that it should be set to an integer between 1 and 16, but I guess the auto-detection based on the number of processor cores/threads is failing for some reason and setting it to a larger number.
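
I assume the detection is based on something like sysconf(_SC_NPROCESSORS_ONLN), which on this box would return 32 (16 cores with Hyper-Threading), i.e. double the documented maximum. A minimal sketch of what I imagine PCUtil::getNumProcessors() does - this is just my guess, not the actual OpenLiteSpeed code:

#include <unistd.h>  // sysconf, _SC_NPROCESSORS_ONLN
#include <cstdio>

// Hypothetical stand-in for PCUtil::getNumProcessors(): report the
// number of logical processors currently online.
static int guessNumProcessors()
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n > 0) ? (int)n : 1;  // fall back to 1 if detection fails
}

int main()
{
    // On this 16-core/32-thread server this prints 32, which is above
    // the 1-16 range that the help text documents for the option.
    printf("detected processors: %d\n", guessNumProcessors());
    return 0;
}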

As far as I can tell, with the number of workers option left blank, it always spawns 20 workers, numbered somewhere between #01 and #32. The exact numbers differ each time, but the first one is always #01 and the last one is always #32.

An example from "ps auxf" (I'm not sure how well this will paste!):

root      3974 99.3  0.0  38140  3292 ?        R    14:19   1:16 openlitespeed (lshttpd - main)
root      3976  0.0  0.0  35292   944 ?        S    14:19   0:00  \_ openlitespeed (lscgid)
nobody    3978  0.0  0.0  37548  3260 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #01)
nobody    3979  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #02)
nobody    3980  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #03)
nobody    3981  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #06)
nobody    3982  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #08)
nobody    3983  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #09)
nobody    3984  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #12)
nobody    3985  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #14)
nobody    3986  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #16)
nobody    3987  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #19)
nobody    3988  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #20)
nobody    3989  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #21)
nobody    3990  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #23)
nobody    3991  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #25)
nobody    3992  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #26)
nobody    3993  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #27)
nobody    3994  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #28)
nobody    3995  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #29)
nobody    3996  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #30)
nobody    3997  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #31)
nobody    3998  0.0  0.0  37468  1720 ?        S    14:19   0:00  \_ openlitespeed (lshttpd - #32)

As soon as I set the number of workers option down to 16, all of the strange behaviour goes away and everything works as normal.
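
For anyone else hitting this: I changed the value under Server > General in the WebAdmin console. If you'd rather edit the config file directly, I believe the setting corresponds to an httpdWorkers element in /usr/local/lsws/conf/httpd_config.xml (treat the exact placement as a guess on my part):

<httpdWorkers>16</httpdWorkers>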

I've tried this on the latest versions of both CentOS 6 and 7 (both clean installs), so I don't think it's anything kernel-specific around detecting the number of processors. This server has a pair of Intel Xeon E5-2667 v3 processors.

Regards,
Edward

Kevin Fwu

Apr 24, 2015, 10:47:24 AM
to openlitespee...@googlegroups.com
Hi Edward,

Just to update you, I was able to reproduce this error and will be working on fixing it.  

I have no ETA, but I will be sure to update you when the problem is resolved.

Kevin

Kevin Fwu

Apr 24, 2015, 11:30:40 AM
to openlitespee...@googlegroups.com
Hi Edward,

The patch at the bottom should fix the default number of workers issue.

The strange behavior itself is still a bug and will require more time to fix, but if you just want to use the default number of workers, this patch should take care of it.

Kevin


Patch:

diff --git src/main/httpserver.cpp src/main/httpserver.cpp
index 4e8ed94..b9dc89b 100644
--- src/main/httpserver.cpp
+++ src/main/httpserver.cpp
@@ -2407,9 +2407,11 @@ int HttpServerImpl::configServerBasics(int reconfig, const XmlNode *pRoot)
         procConf.setPriority(ConfigCtx::getCurConfigCtx()->getLongValue(pRoot,
                                      "priority", -20, 20, 0));
 
+        int iNumProc = PCUtil::getNumProcessors();
+        iNumProc = (iNumProc > 8 ? 8 : iNumProc);
         HttpServerConfig::getInstance().setChildren(
                 ConfigCtx::getCurConfigCtx()->getLongValue(pRoot,
-                        "httpdWorkers", 1, 16, PCUtil::getNumProcessors()));
+                        "httpdWorkers", 1, 16, iNumProc));
 
         const char *pGDBPath = pRoot->getChildValue("gdbPath");
 

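In other words, with this patch the auto-detected default becomes the processor count capped at 8, while an explicitly configured httpdWorkers value is still accepted in the 1 to 16 range as before (that is my reading of the min/max arguments to getLongValue).
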
Kevin Fwu

Apr 24, 2015, 12:19:28 PM
to openlitespee...@googlegroups.com
Hi Edward,

The following patch should fix the 100% CPU problem if you still want to be able to use 32 or more worker processes.

Let me know if either/both of these fix the issues!
Kevin

Patch: 

diff --git src/main/lshttpdmain.cpp src/main/lshttpdmain.cpp
index 86ef1af..61640c1 100644
--- src/main/lshttpdmain.cpp
+++ src/main/lshttpdmain.cpp
@@ -1318,8 +1318,11 @@ int LshttpdMain::guardCrash()
     struct pollfd   pfds[2];
     HttpSignals::init(sigchild);
     if (iNumChildren >= 32)
+    {
         m_pProcState = (int *)malloc(((iNumChildren >> 5) + 1) * sizeof(
                                          int));
+        memset(m_pProcState, 0, ((iNumChildren >> 5) + 1) * sizeof(int));
+    }
     startAdminSocket();
     pfds[0].fd = m_fdAdmin;
     pfds[0].events = POLLIN;

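For anyone curious about the root cause: when there are 32 or more children, the per-child state bitmask was allocated with malloc() but never zeroed, so the guard loop could see garbage state bits and spin. A stripped-down illustration of the fix - hypothetical names, not the actual lshttpdmain.cpp code:

#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    int iNumChildren = 32;
    // One int holds the state bits for 32 children, hence the >> 5.
    size_t nInts = (iNumChildren >> 5) + 1;

    // Before the patch: malloc() leaves the block uninitialized, so
    // every child's state bits start out as garbage.
    int *pProcState = (int *)malloc(nInts * sizeof(int));

    // The patch zeroes the block right after allocating it.
    // (calloc(nInts, sizeof(int)) would do both steps in one call.)
    memset(pProcState, 0, nInts * sizeof(int));

    // Example: set and test the bit for child #7.
    int child = 7;
    pProcState[child >> 5] |= (1 << (child & 31));
    printf("child 7 flagged: %d\n",
           (pProcState[child >> 5] >> (child & 31)) & 1);

    free(pProcState);
    return 0;
}
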
Edward Dore

Apr 24, 2015, 5:11:12 PM
to Kevin Fwu, openlitespee...@googlegroups.com
Hi Kevin,

Thanks for the patch. I'm afraid I can't test it right now as we're using the RPM version of OpenLiteSpeed from the yum repository and I don't really want to try and build it from source on what is now a production server.

We're fine running 16 processes - I'm sure it's far more than we will ever need anyway :)

I'll hopefully have another server with a large number of processor cores available for testing next week, so I'll try and give the patch a go then and see if it fixes the problem.

Edward


Edward Dore

Apr 27, 2015, 11:27:51 AM
to Kevin Fwu, openlitespee...@googlegroups.com
Hi Kevin,

I've tried this on a new server with 48 cores today. The stock OpenLiteSpeed 1.3.10 and 1.4.7 source tarballs both exhibit exactly the same behaviour, but once I applied your patch to OpenLiteSpeed 1.4.7, it worked as expected - the parent process was using 0% CPU and 48 workers were spawned.

Edward

Kevin Fwu

Apr 27, 2015, 11:32:10 AM
to openlitespee...@googlegroups.com, kf...@litespeedtech.com
Hi Edward,

Perfect, thanks for the update!

Also, thanks for letting us know about the bug, always appreciated.

Kevin