Running Scylla with a limited cpuset


sid@arista.com
Jan 6, 2017, 1:31:45 PM
to ScyllaDB users, Krishnanand Thommandra
Hi,
We are trying to set up Scylla with a limited cpuset in our 6-node configuration.

Here is what the default cpuset looks like with all 48 CPUs in the node:

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47


I want to reduce the cpuset to half by picking one CPU from each core. These are the steps I am following:


1. Stop scylla server

2. update cpuset.conf

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23

3. run iotune with the new cpuset to update io.conf

iotune --evaluation-directory /root/data_io --format envfile --options-file /etc/scylla.d/io.conf --cpuset "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23"

4. restart scylla server.
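For reference, after step 2 our /etc/scylla.d/cpuset.conf contains a single line like the one below (the CPUSET variable name is what the stock file uses, as far as I can tell; adjust if your file differs):

CPUSET="--cpuset 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23"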


Are these steps enough?


Thanks, Sid

Dor Laor <dor@scylladb.com>
Jan 6, 2017, 1:45:00 PM
to ScyllaDB users, Krishnanand Thommandra
In general, yes. The only thing I'm not sure about is whether the CPUs you manually selected are
aligned with one hyperthread per core. It could be that you just reduced the core count.
You'll need to check the hardware topology of your machine. I don't know how to do it offhand;
I bet a simple Google search will do, or someone with the knowledge will jump on the thread soon.
 



sid@arista.com
Jan 6, 2017, 1:54:03 PM
to ScyllaDB users, kthommandra@arista.com
Inline.


On Friday, January 6, 2017 at 10:45:00 AM UTC-8, Dor Laor wrote:


In general, yes. The only thing I'm not sure about is whether the CPUs you manually selected are
aligned with one hyperthread per core. It could be that you just reduced the core count.
You'll need to check the hardware topology of your machine. I don't know how to do it offhand;
I bet a simple Google search will do, or someone with the knowledge will jump on the thread soon.
I did the check :). The CPUs are selected in such a way that only one HT per core is selected.
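For reference, the sibling pairing can be read straight from sysfs (a generic Linux check, nothing Scylla-specific):

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # prints e.g. "0,24" - the two HTs sharing core 0
lscpu -e                                                         # the same CPU/core/socket mapping as a table

Picking only one CPU from each such pair gives one HT per core.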
A follow-up question would be:
Do I have to build a new schema, or can I just restart Scylla and carry on with whatever schema is in there?
 

Thanks, Sid


Dor Laor <dor@scylladb.com>
Jan 6, 2017, 1:57:30 PM
to ScyllaDB users, Krishnanand Thommandra
On Fri, Jan 6, 2017 at 10:54 AM, sid via ScyllaDB users <scyllad...@googlegroups.com> wrote:
Inline.

On Friday, January 6, 2017 at 10:45:00 AM UTC-8, Dor Laor wrote:

In general, yes. The only thing I'm not sure about is whether the CPUs you manually selected are
aligned with one hyperthread per core. It could be that you just reduced the core count.
You'll need to check the hardware topology of your machine. I don't know how to do it offhand;
I bet a simple Google search will do, or someone with the knowledge will jump on the thread soon.
I did the check :). The CPUs are selected in such a way that only one HT per core is selected.

Good :)
 
A follow-up question would be:
Do I have to build a new schema, or can I just restart Scylla and carry on with whatever schema is in there?

No need for any change, just restart it. Scylla will realize it has fewer shards than it used to have
and will consolidate (in this case) the different sstables into fewer ones.
The opposite case works too, where Scylla splits common sstables into multiple ones; it requires
more work for Scylla but none for the user.
 
 


Krishnanand Thommandra <kthommandra@arista.com>
Jan 9, 2017, 7:27:16 PM
to Dor Laor, ScyllaDB users
As part of scylla-server startup, the scylla-prepare script also sets up the NIC (e.g. IRQ affinity, setup_xps, etc.). These routines by default ignore the cpuset of the Scylla server and just use all the visible CPUs (e.g. hwloc-distrib <nic irq count | xps count>, etc.).

I'm guessing that this may not be ideal but it _may_ not be _severely_ detrimental to the performance either. We may have to perform certain experiments to understand if this is really a concern or not in our setup. 

Note that the scenario is that the Scylla server is operating on certain cores only, while the NIC configuration considers all CPUs.
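For what it's worth, the current placement can be inspected directly (the interface name below is just an example):

grep eth0 /proc/interrupts                     # IRQ numbers used by the NIC
cat /proc/irq/<irq number>/smp_affinity        # CPU mask each of those IRQs is allowed on
cat /sys/class/net/eth0/queues/tx-0/xps_cpus   # XPS mask of the first Tx queue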

Any comments/suggestions?

-krishna


Dor Laor <dor@scylladb.com>
Jan 9, 2017, 7:38:28 PM
to Krishnanand Thommandra, Vladislav Zolotarov, ScyllaDB users
On Mon, Jan 9, 2017 at 4:27 PM, Krishnanand Thommandra <kthom...@arista.com> wrote:
As part of scylla-server startup, the scylla-prepare script also sets up the NIC (e.g. IRQ affinity, setup_xps, etc.). These routines by default ignore the cpuset of the Scylla server and just use all the visible CPUs (e.g. hwloc-distrib <nic irq count | xps count>, etc.).

I'm guessing that this may not be ideal but it _may_ not be _severely_ detrimental to the performance either. We may have to perform certain experiments to understand if this is really a concern or not in our setup. 


Good point. Vlad, can we easily improve the MQ path in the script to utilize certain cores? It's just a bitwise operation.

Vladislav Zolotarov <vladz@scylladb.com>
Jan 9, 2017, 9:35:46 PM
to Dor Laor, ScyllaDB users, Krishnanand Thommandra


On Jan 9, 2017 7:38 PM, "Dor Laor" <d...@scylladb.com> wrote:

Good point. Vlad, can we easily improve the MQ path in the script to utilize certain cores? It's just a bitwise operation.

Yes, it shouldn't be a problem. Tomorrow I'll send a version of posix_net_conf.sh that spreads the IRQs among the specified cores according to the given mask.

Vlad Zolotarov <vladz@scylladb.com>
Jan 10, 2017, 1:42:29 PM
to Dor Laor, ScyllaDB users, Krishnanand Thommandra

Here is the script with the new feature.

Invoke it as follows:

./posix_net_conf.sh <interface name, e.g. eth0> --use-cpu-mask <cpu mask, e.g. 0x55> -mq
posix_net_conf.sh
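For example, for the 24-CPU set discussed above (CPUs 0-23) the mask is 0xffffff, so the call would be (interface name assumed):

./posix_net_conf.sh eth0 --use-cpu-mask 0xffffff -mq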

Krishnanand Thommandra <kthommandra@arista.com>
Jan 10, 2017, 1:57:45 PM
to Vlad Zolotarov, Dor Laor, ScyllaDB users
Thanks Vlad.

In addition to this change, I think for our Intel NIC we'll need to set up the RSS table appropriately so that the proper Rx queues are used. The Tx queues would get limited and set up appropriately by the script.

-krishna

Sidhartha Agrawal <sid@arista.com>
Jan 10, 2017, 2:56:24 PM
to scylladb-users@googlegroups.com, Vlad Zolotarov, Dor Laor
A couple more questions on the issue of cpuset:

1. Running iotune takes some time (2-4 mins). To save this time when I am trying to play with different cpusets, once I know what the parameters in io.conf look like for a particular cpuset, is it okay to just update io.conf and cpuset.conf directly and restart scylla-server, without running iotune over and over?

2. When I run iotune for 6 CPUs
iotune --evaluation-directory /root/data_io --format envfile --options-file /etc/scylla.d/io.conf --cpuset "0,1,2,12,13,14"

io.conf only has

SEASTAR_IO="--max-io-requests=51"
i.e. num-io-queues is missing; is this expected?

-Sid


Vlad Zolotarov <vladz@scylladb.com>
Jan 10, 2017, 3:25:54 PM
to Krishnanand Thommandra, Dor Laor, ScyllaDB users



On 01/10/2017 01:57 PM, Krishnanand Thommandra wrote:
Thanks Vlad.

In addition to this change, I think for our Intel NIC we'll need to set up the RSS table appropriately so that the proper Rx queues are used. The Tx queues would get limited and set up appropriately by the script.

If you are using Intel's 10G NIC managed by the ixgbe driver, you need to know that its RSS is limited to 16 queues, which means that higher queues will not get RSS-filtered traffic.
The script attached to my previous email is going to spread all IRQs among the cores you provide. So there is no need to tweak the RSS table - only the number of Rx queues: limit them to 16 and all of them are going to be RSS queues that will be properly configured by default.
Our script configures XPS on all present queues for egress, so you are covered in that respect.
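Limiting the Rx queue count can usually be done with ethtool; the channel type is driver-dependent ("combined" is assumed here) and eth0 is just an example:

ethtool -L eth0 combined 16   # cap the number of queues
ethtool -x eth0               # show the resulting RSS indirection table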

Vlad Zolotarov <vladz@scylladb.com>
Jan 10, 2017, 4:00:53 PM
to Sidhartha Agrawal, scylladb-users@googlegroups.com, Dor Laor



On 01/10/2017 02:56 PM, Sidhartha Agrawal wrote:
A couple more questions on the issue of cpuset:

1. Running iotune takes some time (2-4 mins). To save this time when I am trying to play with different cpusets, once I know what the parameters in io.conf look like for a particular cpuset, is it okay to just update io.conf and cpuset.conf directly and restart scylla-server, without running iotune over and over?

It should be OK most of the time. I'd expect the value iotune returns to depend on whether or not the mask you provide contains cores that are "near" your disks. I'd check if there are cores that are not near your disk(s), and if there are, run iotune three times:
  • for all cores - use the result for masks that include both close and far cores
  • for all cores near the disk - use the result for masks that include only close cores
  • for all cores far from the disks - use the result for masks that include only far cores

To get the cpumask of the cores near the disk(s):

hwloc-calc pci=<BDF of your disk controller>
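For example (the PCI address is hypothetical; take the real BDF from lspci):

lspci | grep -i -e nvme -e sata   # find the disk controller's BDF
hwloc-calc pci=5e:00.0            # prints the cpumask of the cores local to it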




2. When I run iotune for 6 CPUs
iotune --evaluation-directory /root/data_io --format envfile --options-file /etc/scylla.d/io.conf --cpuset "0,1,2,12,13,14"
io.conf only has
SEASTAR_IO="--max-io-requests=51"
i.e. num-io-queues is missing; is this expected?

I think it's ok.

Glauber Costa <glauber@scylladb.com>
Jan 10, 2017, 5:29:17 PM
to ScyllaDB users, Sidhartha Agrawal, Dor Laor
On Tue, Jan 10, 2017 at 4:00 PM, Vlad Zolotarov <vl...@scylladb.com> wrote:
>
>
> On 01/10/2017 02:56 PM, Sidhartha Agrawal wrote:
>
> A couple more questions on the issue of cpuset:
>
> 1. Running iotune takes some time (2-4 mins). To save this time when I am
> trying to play with different cpusets, once I know what the parameters in
> io.conf look like for a particular cpuset, is it okay to just update io.conf and
> cpuset.conf directly and restart scylla-server, without running iotune over
> and over?
>

What really matters in io.conf is the number of concurrent requests.
Scylla will refuse to boot (unless in dev mode) if your I/O depth
(nr_requests / num_io_queues) is lower than 4.

So if your disk is tuned to 40 concurrent requests with 5 cores, and
you move to 10 cores, you are fine and there is no need to update
anything. If you move to 11 cores and up, then the file needs to
change.

>
> It should be ok most of the time. I'd expect iotune returned value to depend
> on whether you have or have not core in your mask that are "near" your disks
> in the mask you provide. I'd check if there are cores that are not near your
> disk(s) and if there are - run iotune for 3 times:
>
> for all cores - use the result for masks that include both close and far
> cores
> for all cores near disk - use the result for masks that include only close
> cores
> for all cores far from disks - use for masks that include only far cores
>
>
> To get cpumask of cores near:
>
> hwloc-calc pci=<BDF of your disk controller>
>
>
>
> 2. When I run iotune for 6 CPUs
> iotune --evaluation-directory /root/data_io --format envfile --options-file
> /etc/scylla.d/io.conf --cpuset "0,1,2,12,13,14"
> io.conf only has
> SEASTAR_IO="--max-io-requests=51"
> i.e. the num-io-queues is missing, is this expected ?
>
>
> I think it's ok.
>

If it is missing, it means "one per available core".
That's the most flexible case because it will adapt to whatever number
of cores you have, so that's always the preferred option. We only add
the num-io-queues parameter if --max-io-requests is too low to allow
a minimum of 4 concurrent requests per I/O queue. With this setting,
you can go to 12 cores without needing to lower the number of io queues.
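Concretely, with the value above:

51 max I/O requests / 12 queues = 4.25 requests per queue (>= 4, OK)
51 max I/O requests / 13 queues ≈ 3.9 requests per queue (< 4, too low)

which is where the 12-core limit comes from.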

Vlad Zolotarov <vladz@scylladb.com>
Jan 13, 2017, 11:55:37 AM
to Krishnanand Thommandra, Dor Laor, ScyllaDB users



By the way, we have an experimental version of posix_net_conf.sh that should improve the performance on systems like yours. The main idea is that we set up RPS in addition to distributing IRQs on systems that have "too many" CPUs, which is clearly your case.
If you want, I can send you an early version of it to play with, or you may simply configure it yourself (look at how we do it in the -sf mode; see the setup_rps() function in posix_net_conf.sh).
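For reference, manual RPS setup boils down to writing a CPU mask into each Rx queue's rps_cpus file (the interface name and mask below are only examples):

echo ffffff > /sys/class/net/eth0/queues/rx-0/rps_cpus   # repeat for every rx-<n> queue of the interface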