Virtualizing ScyllaDB


Robbie Barton <thesunshinerain@gmail.com>
Mar 2, 2016, 1:39:24 PM
to ScyllaDB users
Need advice on how to virtualize ScyllaDB.

I have a cluster of 2-6 1U boxes with 4-8 SSD drives in each.

In Cassandra, I run Xen on each box and dedicate each SSD to its own virtual machine.
By associating Cassandra nodes with physical disks, I isolate a disk failure to a single node,
which can then be drained and removed from the cluster.

This works OK. I've tried Docker and other solutions to minimize the overhead of virtualization,
but haven't had as much luck as I'd like. I hope someday I can get rid of Xen
and run one node per drive on the raw hardware.

I'm also testing ScyllaDB and am very impressed with the performance, and I hope to use
it as a drop-in replacement after it reaches GA.
Since the Scylla developers came from virtualization backgrounds, maybe they can do better.

In other words, is it possible to create /var/lib/cassandra[1-n] filesystems on one machine and
run one ScyllaDB/Cassandra node per filesystem, without using Xen to virtualize each node?

Glauber Costa <glauber@scylladb.com>
Mar 2, 2016, 1:55:40 PM
to scylladb-users@googlegroups.com
Hi


On Wed, Mar 2, 2016 at 1:39 PM, Robbie Barton <thesuns...@gmail.com> wrote:
> Need advice on how to virtualize ScyllaDB.
>
> I have a cluster of 2-6 1U boxes with 4-8 SSD drives in each.
>
> In Cassandra, I run Xen on each box and dedicate each SSD to its own
> virtual machine. By associating Cassandra nodes with physical disks, I
> isolate a disk failure to a single node, which can then be drained and
> removed from the cluster.
>
> This works OK. I've tried Docker and other solutions to minimize the
> overhead of virtualization, but haven't had as much luck as I'd like.
> I hope someday I can get rid of Xen and run one node per drive on the
> raw hardware.
>
> I'm also testing ScyllaDB and am very impressed with the performance,
> and I hope to use it as a drop-in replacement after it reaches GA.

We are glad to hear that!

> Since the Scylla developers came from virtualization backgrounds, maybe
> they can do better.
>
> In other words, is it possible to create /var/lib/cassandra[1-n]
> filesystems on one machine and run one ScyllaDB/Cassandra node per
> filesystem, without using Xen to virtualize each node?

It should be totally possible to do that.

Here are the steps, assuming one instance per disk (X instances in total) and N
processors in the system. I am also assuming that you already have X IPs available somehow.

1) Create a directory like /var/lib/scylla-<i> for each instance i.
2) For each disk sdX in the system:
       mkfs.xfs /dev/sdX              # we currently only support XFS
       mount -o noatime /dev/sdX /var/lib/scylla-<i>
3) For each instance, generate a scylla.yaml file (you can just copy the /etc/
   one somewhere) and change it so that:
   - each instance has its own IP
   - the commitlog directory is /var/lib/scylla-<i>/commitlog
   - the data directory is /var/lib/scylla-<i>/data
4) Start scylla, adding the options:
       --smp <N/X> --cpuset <a disjoint set of your processors>
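
To make steps 2 and 3 concrete, here is a minimal sketch for one instance; the
device name, paths, and IP are placeholders you would adapt to your layout:

    # Minimal sketch for instance 1, which owns disk /dev/sdb (placeholder name).
    mkfs.xfs /dev/sdb                      # Scylla currently supports XFS only
    mkdir -p /var/lib/scylla-1
    mount -o noatime /dev/sdb /var/lib/scylla-1

    # Give the instance its own config; the destination path is arbitrary.
    cp /etc/scylla/scylla.yaml /etc/scylla/scylla-1.yaml
    # In scylla-1.yaml, set:
    #   listen_address / rpc_address:  this instance's own IP
    #   commitlog_directory:   /var/lib/scylla-1/commitlog
    #   data_file_directories: [ /var/lib/scylla-1/data ]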

You can probably still just use Docker for most of that. As long as
each container has sensible --smp and --cpuset options and a local
scylla.yaml that points to the correct location, it should all work.
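
For example, a sketch along those lines (the scylladb/scylla image name and
the bind-mount paths are assumptions to adapt):

    # Hypothetical sketch: instance 1 in a container pinned to cores 0-3.
    # Docker's --cpuset-cpus keeps each container on disjoint cores, and
    # --smp tells Scylla how many of them to use.
    docker run --name scylla-1 -d \
      --cpuset-cpus 0-3 \
      -v /var/lib/scylla-1:/var/lib/scylla \
      -v /etc/scylla/scylla-1.yaml:/etc/scylla/scylla.yaml \
      scylladb/scylla --smp 4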

Getting the --smp and --cpuset options right is of the utmost importance:
Scylla polls for all its I/O and inter-shard requests, so having an
instance share a processor with another instance can be really
detrimental to its performance.
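
As an illustration, on a hypothetical 16-core box hosting four instances
(so --smp 16/4 = 4 each), the cpusets must partition the processors with no
overlap, with --options-file pointing each instance at its own scylla.yaml:

    # Hypothetical 16-core box, four instances: 4 dedicated cores each,
    # and no core appears in more than one --cpuset.
    scylla --options-file /etc/scylla/scylla-1.yaml --smp 4 --cpuset 0-3
    scylla --options-file /etc/scylla/scylla-2.yaml --smp 4 --cpuset 4-7
    scylla --options-file /etc/scylla/scylla-3.yaml --smp 4 --cpuset 8-11
    scylla --options-file /etc/scylla/scylla-4.yaml --smp 4 --cpuset 12-15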

I myself run clusters on a single machine all the time (mainly for
testing), and if you really want to run it this way, I see no reason
why it wouldn't work in production as well.


Robbie Barton <thesunshinerain@gmail.com>
Mar 3, 2016, 12:17:22 PM
to ScyllaDB users
Thanks! That should work great!

Tzach Livyatan <tzach@scylladb.com>
Mar 3, 2016, 12:20:20 PM
to ScyllaDB users
Glauber, any reason to actually do that in production?


Glauber Costa <glauber@scylladb.com>
Mar 3, 2016, 12:26:46 PM
to scylladb-users@googlegroups.com
Maybe Robbie would want to tell us more about why he wants to do it -
I am merely providing the how.

In my understanding, running a single node per physical box is still
the best way to keep things isolated with minimal churn.

But as far as I understand, he seems to be okay with his deployment
protecting against disk failures only, and seems to be aware that he
isn't, that way, protecting himself against other kinds of failures.
Everything will still be in the same box, on the same controller and,
needless to say, in the same datacenter. Any failure beyond a
single-disk failure, and his whole cluster goes down. But again, it
seems to me that he is well aware of that.

I wouldn't call it the most common of deployments, but I don't see why
it wouldn't work.

Robbie Barton <thesunshinerain@gmail.com>
Mar 3, 2016, 1:02:07 PM
to ScyllaDB users
Obviously, the single node per physical box is best, but we don't always have that much space available in our datacenters.
I'd love to learn more about the kind of hardware that people are using to deploy this way. Blade servers?

I suspect many "single node per physical box" installations are actually running on Xen or VMware servers that are
really on the same box, sharing the same disk or RAID array. I'd rather just cut the overhead, eliminate the hypervisor, and
get as close to the real hardware as possible.

The reality is that density is on the rise. Datacenter space is expensive, and some of us do have budgetary concerns.
It's becoming very cost-effective to buy 1U boxes with several hot-swap SSD drives. It's not cost-effective to buy a bunch
of single-disk computers and try to house and manage them as bare metal.

So I ask this question:
Assume I have two boxes, each with four nodes (one datacenter, two racks).
I add a dataless quorum node on a third box to avoid split-brain.
Could I continue to read and write if one of the two boxes is shut down and I use consistency ONE or ANY?
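
For reference, here is roughly what I mean by consistency ONE, as a sketch
against a hypothetical keyspace ks (the table, values, and node IP are made up):

    # Hypothetical sketch: with consistency ONE, a write succeeds as long as
    # at least one replica of the row is up and acknowledges it.
    cqlsh 10.0.0.1 -e "
      CONSISTENCY ONE;
      INSERT INTO ks.events (id, payload) VALUES (1, 'x');"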

Avi Kivity <avi@scylladb.com>
Mar 3, 2016, 1:11:13 PM
to scylladb-users@googlegroups.com
On 03/03/2016 08:02 PM, Robbie Barton wrote:
> Obviously, the single node per physical box is best, but we don't always
> have that much space available in our datacenters. I'd love to learn more
> about the kind of hardware that people are using to deploy this way.
> Blade servers?
>
> I suspect many "single node per physical box" installations are actually
> running on Xen or VMware servers that are really on the same box, sharing
> the same disk or RAID array. I'd rather just cut the overhead, eliminate
> the hypervisor, and get as close to the real hardware as possible.


It makes more sense to run a single large node on that box.

You run multiple nodes in order to gain high availability and improved throughput. Splitting a large host into multiple small nodes has a negative impact on both.


> The reality is that density is on the rise. Datacenter space is expensive,
> and some of us do have budgetary concerns. It's becoming very
> cost-effective to buy 1U boxes with several hot-swap SSD drives. It's not
> cost-effective to buy a bunch of single-disk computers and try to house
> and manage them as bare metal.
>
> So I ask this question:
> Assume I have two boxes, each with four nodes (one datacenter, two racks).
> I add a dataless quorum node on a third box to avoid split-brain.
> Could I continue to read and write if one of the two boxes is shut down
> and I use consistency ONE or ANY?

Buy three cheaper nodes. Don't try to game high availability; it is tricky enough if you do it the old-fashioned way.


You can partition a host into multiple nodes, but don't run nodes from the same cluster on a single host.  As soon as you grow beyond the minimal three nodes, prefer physical nodes, although of course you can continue with virtual nodes.



Robbie Barton <thesunshinerain@gmail.com>
Mar 3, 2016, 1:21:01 PM
to ScyllaDB users
Thanks for the advice! I really appreciate it.

If I only run a single node, wouldn't I have to put all my SSDs in a RAID?
If I just threw them into a big volume group, one disk failure would take out the entire node.
I've lost a lot more disks than nodes, although these are SSDs, so I'm not sure about that.
I was thinking that Scylla's awareness of NUMA might help a bit.

Would it be better to define each box as a separate datacenter?

Avi Kivity <avi@scylladb.com>
Mar 3, 2016, 1:37:39 PM
to scylladb-users@googlegroups.com
On 03/03/2016 08:21 PM, Robbie Barton wrote:
> Thanks for the advice! I really appreciate it.
>
> If I only run a single node, wouldn't I have to put all my SSDs in a RAID?

Yes, for a single node, RAID10, RAID5, or RAID6 would be the choice.
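
For instance, a minimal sketch with Linux software RAID (device names are
placeholders):

    # Hypothetical sketch: build a RAID10 array out of four SSDs with mdadm,
    # then format it as XFS for a single Scylla node.
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
    mkfs.xfs /dev/md0
    mount -o noatime /dev/md0 /var/lib/scylla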



> If I just threw them into a big volume group, one disk failure would take
> out the entire node. I've lost a lot more disks than nodes, although these
> are SSDs, so I'm not sure about that. I was thinking that Scylla's
> awareness of NUMA might help a bit.

No. Without RAID or multiple nodes, you are vulnerable to a single point of failure.



> Would it be better to define each box as a separate datacenter?

No. Datacenters are best used when you have nodes in multiple physical datacenters. Designating a group of nodes as a datacenter allows Scylla to optimize latency by issuing fewer synchronous calls across datacenters, and to reduce the bandwidth used for replication.
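
For example, when your nodes really do span datacenters, replication is
declared per datacenter (the keyspace and datacenter names here are
hypothetical, and the DC names must match your snitch configuration):

    # Hypothetical sketch: a keyspace replicated 3 ways in each of two real DCs.
    cqlsh -e "CREATE KEYSPACE ks WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};"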
