Hi there, it's me again ;)
For the last two evenings I have been thinking about the threading model of the cf_server. Earlier versions followed a rather naive model (a managed queue with a pool of worker threads, where one worker handles a complete session). For the 4.0 release I have been considering a slightly modified approach, but I'm not really sure whether it is a good one. Please let me know what you think.
In the 4.0 release we also changed connection management a lot: we made the protocol stateless and RESTful so that we can handle keep-alive and put the connection back into the queue once the request has been served. This is a good basic prerequisite.
I was wondering which part of serving a request takes the most time for the worker thread. This is surely the I/O over the socket, especially if it is an IP socket (by the way: we also support IPv6 now ;). And there is not really a way to avoid it: the data has to be sent.
But what we can do is avoid idle time in the worker threads. BSD and Linux both provide a fast and reliable API for non-blocking multiplexed I/O (kqueue and epoll). So what I have in mind is a read buffer and a write buffer per connection: the main loop thread reads the header of an incoming request without blocking and then dispatches the request to a worker. The worker may read a request body, but I think this has to happen in blocking mode. Then it creates a response, fills the write buffer with it and returns. The main loop thread now writes without blocking until the buffer is empty, then waits for more data to read and either dispatches again or closes the connection. With this approach we may minimize the time a worker needs to handle a request, but we also increase the memory footprint.
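To make the idea above more concrete, here is a minimal sketch of the write side, assuming Linux and epoll. The names `struct conn`, `conn_queue_response`, `conn_flush` and `loop_once` are made up for illustration and are not taken from the cf_server code; the fixed 4096-byte buffer is also just an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-connection state: the pending response and how much
 * of it has already been sent. Not the real cf_server structure. */
struct conn {
    int fd;
    char wbuf[4096];
    size_t wlen; /* bytes queued by the worker */
    size_t wpos; /* bytes already written to the socket */
};

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Called by a worker: copy the response into the write buffer and
 * return immediately, without touching the socket. */
static int conn_queue_response(struct conn *c, const char *data, size_t len) {
    assert(c != NULL && data != NULL);
    if (len > sizeof(c->wbuf) - c->wlen) return -1; /* buffer full */
    memcpy(c->wbuf + c->wlen, data, len);
    c->wlen += len;
    return 0;
}

/* Called by the main loop when the socket is writable.
 * Returns 1 if the buffer was sent completely, 0 if the socket would
 * block (try again on the next EPOLLOUT), -1 on error. */
static int conn_flush(struct conn *c) {
    while (c->wpos < c->wlen) {
        ssize_t n = send(c->fd, c->wbuf + c->wpos, c->wlen - c->wpos,
                         MSG_NOSIGNAL);
        if (n > 0) { c->wpos += (size_t)n; continue; }
        if (n < 0 && errno == EINTR) continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return 0;
        return -1;
    }
    c->wpos = c->wlen = 0; /* buffer is empty again */
    return 1;
}

/* One iteration of the main loop: wait for a ready connection and
 * flush its write buffer. Returns the result of conn_flush, 0 on
 * timeout, or -1 if epoll_wait fails. */
static int loop_once(int epfd, int timeout_ms) {
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);
    if (n <= 0) return n;
    struct conn *c = ev.data.ptr;
    if (ev.events & EPOLLOUT) return conn_flush(c);
    return 0;
}
```

In this sketch the worker only calls `conn_queue_response` and goes back to the pool; the main loop registers the connection for `EPOLLOUT` with `epoll_ctl` (storing the `struct conn` pointer in `ev.data.ptr`) and calls `loop_once` until the buffer is empty. Reading request headers and synchronizing the buffer between worker and main loop are left out.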
I hope you can understand what I mean. What do you think? Is this approach silly? Or may we gain some performance improvements?
Greetings,
CK
Sorry for the late answer. I have seen your mail but simply had no time
to sit down and type a reply.
I noticed that you have already begun implementing these new workers, but
I'll answer your mail anyway. :)
> I hope you can understand what I mean. What do you think? Is this
> approach silly? Or may we gain some performance improvements?
To be honest, I haven't really tested the pre-4.0 approach for
performance. So I don't know if it actually works well under high load ...
Personally I like the idea of having a worker process the complete
request. Nice and simple. And I don't really see why switching to your
suggestion would result in a performance improvement. ATM the workers will
have to sleep if the socket is in use already. Making them write their
output to a central buffer instead allows the workers to avoid this idle
time but doesn't get the data sent more quickly in the end, does it? And
instead of having a couple of sleeping workers occupying memory we'll have
this buffer occupying memory. Or am I getting this all wrong [again ;)]?
--
Alex
Sorry for the delay. My wife and I moved to Switzerland last weekend, so we have been quite busy over the last few weeks ;-)
On 19.09.2010 at 22:54, Alexander Nitsch wrote:
> I noticed that you have already begun implementing these new workers, but I'll answer your mail anyway. :)
Yeah, since no one answered I thought I'd just try it out. Changing back is no problem thanks to a VCS ;-)
>> I hope you can understand what I mean. What do you think? Is this approach silly? Or may we gain some performance improvements?
>
> To be honest, I haven't really tested the pre-4.0 approach for performance. So I don't know if it actually works well under high load ...
It worked very badly. In fact, it worked so badly that I had to add the shared memory segment feature.
> Personally I like the idea of having a worker process the complete request. Nice and simple.
I don't want to change that. I only want to keep read() and write() calls out of the workers so that threads don't sit uselessly in IOWAIT. When we avoid IOWAIT, we have more (CPU) time for serving more requests. This is basically the success story of asynchronous I/O (as used in e.g. lighttpd) and event-based programming (as used e.g. by node.js). But keep in mind that it would only be an advantage under high load. In a situation where only a few requests have to be served there will be no advantage.
> And I don't really see why switching to your suggestion would result in a performance improvement. ATM the workers will have to sleep if the socket is in use already. Making them write their output to a central buffer instead allows the workers to avoid this idle time but doesn't get the data sent more quickly in the end, does it?
No, it doesn't. To reduce the time needed to serve a single request we can only do a few things: optimize the locking strategy, do as much as possible asynchronously (that is, avoid locking where we can and lock at a fine granularity where we can't) and optimize the algorithms for runtime performance. The buffered approach would "only" free up some time to serve more requests. Nothing more, nothing less.
Best regards,
CK
--
http://ck.kennt-wayne.de/
> I hope you can understand what I mean. What do you think? Is this approach
> silly? Or may we gain some performance improvements?
Personally, I don't think that I/O has been the main problem for the
previous versions of the CForum. While a single worker is waiting
for I/O, there are other workers that actually do some work - if you
make the total number of worker threads high enough, that isn't really a
problem.
That being said, I do think that a stateless protocol will improve
performance just because of the simple fact that with a stateful
protocol the round-trip time plays an important role.
With respect to the other issues, I see locking as the main performance
killer. In the old code, lots of things are locked to ensure consistency
- which severely degrades performance if there are a lot of writes in
the system (which is the case in the forum). I think one can make a good
case that locking should be avoided as much as possible (which will have
to lead to a very careful design of the data structures to avoid
bottlenecks and race conditions), and I would also suggest having a
look at the atomic intrinsics of processors, which allow some
operations to be performed in a lock-free manner. For details, see:
http://gcc.gnu.org/wiki/Atomic
http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
In particular, see __sync_bool_compare_and_swap and __sync_val_compare_and_swap.
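As a rough illustration of what these two builtins make possible, here is a minimal sketch assuming GCC (or a compatible compiler). The `struct node` type and the functions `stack_push` and `counter_inc_below` are hypothetical examples and not part of the forum code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node of a singly linked list, e.g. of pending items. */
struct node {
    struct node *next;
    int value;
};

/* Lock-free push: read the current head, link the new node in front of
 * it, and publish it with compare-and-swap. If another thread changed
 * the head in the meantime, the CAS fails and the loop retries. */
static void stack_push(struct node *volatile *head, struct node *n) {
    struct node *old;
    do {
        old = *head;
        n->next = old;
    } while (!__sync_bool_compare_and_swap(head, old, n));
}

/* Lock-free bounded increment: raise the counter by one unless it has
 * already reached max. Returns 1 if the counter was incremented and 0
 * if the limit was hit. */
static int counter_inc_below(volatile long *counter, long max) {
    for (;;) {
        long cur = *counter;
        if (cur >= max) return 0;
        /* The builtin returns the value that was in *counter before the
         * operation; the swap happened only if that equals cur. */
        if (__sync_val_compare_and_swap(counter, cur, cur + 1) == cur)
            return 1;
    }
}
```

Only the push side is shown on purpose: a lock-free pop written the same way runs into the ABA problem and needs extra care, which is one reason the data structures have to be designed carefully.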
I think that we can gain much more by reducing the amount of locking
than by somehow managing the I/O - since that is a thing that the kernel
already does for us. Of course, one can always consider implementing
your idea in addition if I/O really turns out to be a major
bottleneck.
Regards,
Christian
On 17.10.2010 at 17:24, Christian Seiler wrote:
>> I hope you can understand what I mean. What do you think? Is this approach
>> silly? Or may we gain some performance improvements?
>
> Personally, I don't think that I/O has been the main problem for the
> previous versions of the CForum.
No, it wasn't. I never said that :-) All I was looking for were possible performance improvements.
But since neither of you believes that this implementation will improve performance, I'll revert the changes and go back to the old read-write workers.
> With respect to the other issues, I see locking as the main performance
> killer.
Indeed. As I already said, I was thinking of a locking policy: no thread should hold more than one or two locks at once (I don't really know the right number yet). This reduces the risk of deadlocks and race conditions and reduces the absolute number of locks. Of course, this is not enough, and your idea of using atomic intrinsics is very good.
> (which will have
> to lead to a very careful design of the data structures to avoid
> bottlenecks and race conditions)
Yes, this would be very nice. But I'm not really sure if this is possible. The old forum already locked at a fairly fine granularity (one lock for index access, one lock for access to threads). This was how I tried to avoid bottlenecks. But the one big lock on the index seems like a must-have to me if we want to hold all data in memory. To avoid it we could, of course, go back to using files which we read on demand. Then we wouldn't really need an index in memory to gain access to the data. But I don't really know… we have to try, I think.
Greetings,
CK