Ulrich Windl wrote:
> On 3 Nov 2009 at 2:54, guy keren wrote:
>> Hi,
>> the logging code in open-iscsi uses a "logarea" structure in shared
>> memory protected by a SysV semaphore (using semop system calls) - and
>> also places the sembuf array structure used in the semop calls in this
>> same shared-memory area (i.e. inside the logarea struct that is
>> allocated in shared memory at function logarea_init).
>> as a result, both the iscsid logging process and the iscsid control
>> process attempt to use this structure in a non-synchronized manner,
> The fact that the logging and the control process use the shared memory in a
> unsynchronized way seems unrelated to the fact that both structures are located in
> the same memory area, or I didn't understand your statement. For performance
> reasons it seems wise to locate the controlling semaphores close to the area being
> controlled.
>> which is racy and may result either a deadlock or data corruption (we
>> saw these deadlocks several times).
>> the relevant code of the logging process is in usr/log.c, function
>> log_flush(). the relevant code of the control process is in the same
>> file, function dolog().
>> the deadlock senario:
>> 1. the logging process has the semaphore held. the control process is
>> doing some work.
>> 2. the logging process is about to release the semaphore. it sets
>> the sem_op parameter in the sembuf structure to '1'.
>> 3. the control process now wants to add a logging record. it sets
>> the sem_op parameter in the sembuf structure to '-1'.
> Ah, I understand: not the semaphore structure is in shared memory, but the
> parameter structure for calling the semop(). OK, that's bad. Probably those
> structures should be local (on the stack). I was confused with POSIX semaphores
> where shared memory is required.
>> 4. the control process invokes semop and gets blocked (because the
>> semaphore is held by the logging process).
>> 5. the logging process invokes semop and also gets blocked for the
>> same reason.
>> we're in deadlock.
> Good spotting!
> Ulrich
>> to get a data corruption, we'll need a slightly different scheduling -
>> i.e. that the process that wants to take the lock will update the sembuf
>> struct first - and then the process releasing the lock would modify
>> sem_op to '1' - and we'll have both processes increasing the semaphore's
>> value instead of one increasing and one decreasing it - and thus the
>> semaphore's value will later allow both processes to grab the semaphore
>> at the same time.
>> solution:
>> to solve this, i have created a local variable on the stack of both
>> processes, that is used with the semop calls. another possible solution
>> is to move the sembuf structure to a global variable that is NOT placed
>> in shared memory.
>> before i send a patch - is there a preference either way? or some other way?
>> thanks,
>> --guy
(finally) attached is the patch: