
Solving M producers / N consumers scalability problem


Anatol Pomozov

Nov 1, 2013, 4:50:02 PM
Hi,

I am looking at a fuse scalability issue that I hit recently on my
multi-socket server machines:
http://permalink.gmane.org/gmane.comp.file-systems.fuse.devel/13490

The short description is that we have several thousand threads making
filesystem accesses to a fuse fs. Fuse spawns several threads (usually
4-8) that read requests from the kernel, process them, and write
responses back to the kernel.

The problem is that the fuse kernel module uses a global list where it
keeps all active requests. But producers and consumers run on
different CPUs, and taking the lock to access the list is a very
expensive operation. I have a test case showing that ~35% of system
time is spent in _raw_spin_lock when accessing this global list. I
want to solve this scalability problem.

One idea is not to use the spin_lock at all. It is the 'fair' (ticket)
spin_lock that has scalability problems:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
Maybe lockless data structures can help here?

Another idea is to avoid global data structures, but I have a few
questions here. Let's say we want to use per-CPU lists. The problem
is that producers and consumers are not evenly distributed across
CPUs: some CPU might have too many producers, while another might
have no consumers at all. So we need some kind of migration from a
hot CPU to a cold one. What is the best way to achieve that? Are
there any examples of how to do this? Any other ideas?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andi Kleen

Nov 4, 2013, 2:00:03 PM
Anatol Pomozov <anatol....@gmail.com> writes:
>
> One idea is not to use the spin_lock at all. It is the 'fair' (ticket)
> spin_lock that has scalability problems:
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
> Maybe lockless data structures can help here?

The standard spin lock is already improved.
But better locks only give you a small advantage; they don't
solve the real scaling problem.

>
> Another idea is to avoid global data structures, but I have a few
> questions here. Let's say we want to use per-CPU lists. The problem
> is that producers and consumers are not evenly distributed across
> CPUs: some CPU might have too many producers, while another might
> have no consumers at all. So we need some kind of migration from a
> hot CPU to a cold one. What is the best way to achieve that? Are
> there any examples of how to do this? Any other ideas?

Per-CPU is the standard approach, but usually overkill. It also
requires complex code to drain the queues, etc.

Some older patches use per-node instead, but that works very poorly
these days (nodes are far too big).

One way I like is to simply use a global (allocated) array of queues,
sized based on the total number of possible cpus (but significantly
smaller than that count), and use the cpu number as a hash into the array.

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Anatol Pomozov

Nov 4, 2013, 3:30:03 PM
Hi

Thanks for your reply

On Mon, Nov 4, 2013 at 10:53 AM, Andi Kleen <an...@firstfloor.org> wrote:
> Anatol Pomozov <anatol....@gmail.com> writes:
>>
>> One idea is not to use the spin_lock at all. It is the 'fair' (ticket)
>> spin_lock that has scalability problems:
>> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>> Maybe lockless data structures can help here?
>
> The standard spin lock is already improved.
> But better locks only give you a small advantage; they don't
> solve the real scaling problem.

Do you know in which version the improvement happened? I use kernel
3.3 and could backport the changes to my custom kernel to improve the
situation.

>> Another idea is to avoid global data structures, but I have a few
>> questions here. Let's say we want to use per-CPU lists. The problem
>> is that producers and consumers are not evenly distributed across
>> CPUs: some CPU might have too many producers, while another might
>> have no consumers at all. So we need some kind of migration from a
>> hot CPU to a cold one. What is the best way to achieve that? Are
>> there any examples of how to do this? Any other ideas?
>
> Per-CPU is the standard approach, but usually overkill. It also
> requires complex code to drain the queues, etc.
>
> Some older patches use per-node instead, but that works very poorly
> these days (nodes are far too big).
>
> One way I like is to simply use a global (allocated) array of queues,
> sized based on the total number of possible cpus (but significantly
> smaller than that count), and use the cpu number as a hash into the array.

This solution is pretty much equivalent to per-CPU data structures,
and in this case there will also be "hot" nodes and "cold" nodes. So
my question remains: what is the best way to implement data migration
between nodes? Is there a standard solution for this? Examples?

Andi Kleen

Nov 4, 2013, 3:30:04 PM
> This solution is pretty much equivalent to per-CPU data structures,

No, it's not: it doesn't require one queue per CPU.

A CPU these days isn't really a CPU anymore, but often a CPU thread,
which is much more lightweight. So having a queue per CPU is often
total overkill, and it amplifies any per-queue costs (like invalidation).

-Andi