O_DIRECT


AL

<avinash.lakshman@gmail.com>
Mar 18, 2017, 12:45:48 PM
to ScyllaDB users
Does scylladb use O_DIRECT for all disk I/O? Does this slow down flushes and compactions?

Regards
AL

Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 12:53:06 PM
to scylladb-users@googlegroups.com
Yes it does. From core/reactor.cc

future<file>
reactor::open_file_dma(sstring name, open_flags flags, file_open_options options) {
    static constexpr mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH; // 0644
    return _thread_pool.submit<syscall_result<int>>([name, flags, options, strict_o_direct = _strict_o_direct] {
        // We want O_DIRECT, except in two cases:
        //   - tmpfs (which doesn't support it, but works fine anyway)
        //   - strict_o_direct == false (where we forgive it being not supported)
        // Because open() with O_DIRECT will fail, we open it without O_DIRECT, try
        // to update it to O_DIRECT with fcntl(), and if that fails, see if we
        // can forgive it.
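
For anyone following along, the fallback those comments describe could be sketched roughly as below. This is not the Seastar code: open_prefer_direct and its parameters are names made up for illustration, and the tmpfs check mentioned in the comments is left out.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper (not Seastar's implementation): open `path` without
// O_DIRECT, then try to switch the descriptor to O_DIRECT with fcntl().
// Returns the descriptor, or -1 on failure. *direct_enabled reports whether
// O_DIRECT ended up set on the returned descriptor.
int open_prefer_direct(const char* path, int flags, bool strict_o_direct,
                       bool* direct_enabled) {
    assert(path != nullptr && direct_enabled != nullptr);
    *direct_enabled = false;

    int fd = ::open(path, flags & ~O_DIRECT, 0644);
    if (fd < 0) {
        return -1;
    }

    int current = ::fcntl(fd, F_GETFL);
    if (current >= 0 && ::fcntl(fd, F_SETFL, current | O_DIRECT) == 0) {
        *direct_enabled = true;
        return fd;
    }

    if (strict_o_direct) {
        // The caller insisted on direct I/O, so report the failure.
        int saved = errno;
        ::close(fd);
        errno = saved;
        return -1;
    }

    // Not strict: forgive the missing support and keep using the page cache.
    return fd;
}
```

With strict_o_direct set to false the call still returns a usable descriptor on filesystems that reject the flag, which is the "forgive it" case in the comments.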

Use grep command please.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-users+unsubscribe@googlegroups.com.
To post to this group, send email to scylladb-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/95e97109-8ea2-460c-bcbe-cd10083b0976%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Respect,
Shiv Shankar Dayal

Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 1:06:07 PM
to scylladb-users@googlegroups.com
From the man page:

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.

Further towards bottom:

O_DIRECT

The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. However there is currently no filesystem-independent interface for an application to discover these restrictions for a given file or filesystem. Some filesystems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).

Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the filesystem. Since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:

    blockdev --getss

O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).

The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.

O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some filesystems may not implement the flag and open() will fail with EINVAL if it is used.

Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.

The behavior of O_DIRECT with NFS will differ from local filesystems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will bypass the page cache only on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.

In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." -- Linus
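
To make the alignment part concrete, here is a small sketch of an aligned write and read. The 4096-byte alignment is an assumption that is generous on most setups, not a value queried from the device, and aligned_roundtrip is a name invented for this example. It falls back to buffered I/O when the filesystem rejects O_DIRECT (the EINVAL case mentioned in the man page).

```cpp
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <cassert>
#include <cstring>

// Assumed alignment for this sketch. The real requirement depends on the
// device and filesystem (see BLKSSZGET / blockdev --getss).
constexpr size_t kAlign = 4096;

// Writes one block to `path`, reads it back, and removes the file.
// `buf` must be kAlign bytes long and kAlign-aligned when O_DIRECT is used.
static bool roundtrip(const char* path, int extra_flags, char* buf) {
    int fd = ::open(path, O_CREAT | O_TRUNC | O_RDWR | extra_flags, 0644);
    if (fd < 0) {
        return false;
    }
    std::memset(buf, 'x', kAlign);
    bool ok = ::pwrite(fd, buf, kAlign, 0) == static_cast<ssize_t>(kAlign);
    std::memset(buf, 0, kAlign);
    ok = ok && ::pread(fd, buf, kAlign, 0) == static_cast<ssize_t>(kAlign);
    ok = ok && buf[0] == 'x' && buf[kAlign - 1] == 'x';
    ::close(fd);
    ::unlink(path);
    return ok;
}

// Tries the round trip with O_DIRECT first and with buffered I/O second.
// *used_direct tells the caller which path succeeded.
bool aligned_roundtrip(const char* path, bool* used_direct) {
    assert(path != nullptr && used_direct != nullptr);
    void* raw = nullptr;
    // Buffer address, transfer length, and file offset are all aligned.
    if (posix_memalign(&raw, kAlign, kAlign) != 0) {
        return false;
    }
    char* buf = static_cast<char*>(raw);
    *used_direct = roundtrip(path, O_DIRECT, buf);
    bool ok = *used_direct || roundtrip(path, 0, buf);
    free(raw);
    return ok;
}
```

A buffer from plain malloc() or the stack would usually not satisfy the alignment rule, which is why posix_memalign() is used here.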

I have done the work which you should have done.


Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 1:07:33 PM
to scylladb-users@googlegroups.com
Now, since ScyllaDB runs on XFS, you would find http://oss.sgi.com/archives/xfs/2004-09/msg00410.html useful. Go ahead and read it.

bllhastings@gmail.com

<bllhastings@gmail.com>
Mar 18, 2017, 6:40:53 PM
to ScyllaDB users, avinash.lakshman@gmail.com
Yes it does.

bllhastings@gmail.com

<bllhastings@gmail.com>
Mar 18, 2017, 6:43:25 PM
to ScyllaDB users
This reads like a hilarious exchange. But why bring what looks like a private feud to a public list? This looks like a private exchange between two folks who either know each other really well or abhor each other. If this had transpired on the list I would support the ban. Unless you are a Trump supporter. Sorry, couldn't resist.

On Saturday, March 18, 2017 at 2:39:53 PM UTC-7, Shiv Shankar Dayal wrote:
I ask for a ban on Avinash Lakshman for misbehavior. Given below is the transcript of what happened later:

Date: Sun, 19 Mar 2017 03:05:37 +0530
Subject: Re: Private message regarding: [scylladb-users] O_DIRECT
From: Shiv Shankar Dayal <shivshan...@gmail.com>
To: Avinash Lakshman <avinash....@gmail.com>

Good that you have shown your standards.

https://github.com/cockroachdb/cockroach/releases?after=beta-20160328 look
at that alpha release. That is 176k lines of golang code. Try doing it on
seastar code like ScyllaDB is done.

Now I definitely think that you are the Cassandra inventor if you got
minion assholes, motherfuckers like me to grep code for you.

I will keep replying. Never come on ScyllaDB mailing list else you will see
what I do to you or to any mailing list where I am present.

Do you want me to copy paste this there just to enhance your reputation?

On Sun, Mar 19, 2017 at 3:02 AM, Avinash Lakshman <
avinash....@gmail.com> wrote:

> I got minion assholes like you to grep shit for me motherfucker
> On Sat, Mar 18, 2017 at 2:31 PM Shiv Shankar Dayal <
> shivshan...@gmail.com> wrote:
>
>> Do you have a clue about the difficulty of the project? It is not a one
>> man task. Unless people help write code do you think it is punching
>> keyboard like we are doing on this email.
>>
>> Read the Seastar code and DPDK and SPDK. You will piss in your pant loser
>> who cannot grep a codebase.
>>
>> On Sun, Mar 19, 2017 at 2:58 AM, Avinash Lakshman <
>> avinash....@gmail.com> wrote:
>>
>> I do believe you are now talking like a fucktard. For a moment I thought
>> you were serious. When you say you are not a serious programmer and asking
>> others to build stuff that shows you are one hell of a pathetic loser. I
>> don't forgive fucks like you. Go build something and then we can pick up
>> this thread.
>>
>> On Sat, Mar 18, 2017 at 2:25 PM Shiv Shankar Dayal <
>> shivshan...@gmail.com> wrote:
>>
>> It is a mathematical theorem. You are blessed with ignorance. Forgiven.
>>
>> On Sun, Mar 19, 2017 at 2:54 AM, Avinash Lakshman <
>> avinash....@gmail.com> wrote:
>>
>>
>> Just do it CLP or CAP or whatever the new theme is. As for me I am living
>> the life I don't care which 3 letter acronym people drum up
>>
>> On Sat, Mar 18, 2017 at 2:22 PM Shiv Shankar Dayal <
>> shivshan...@gmail.com> wrote:
>>
>> Initial release will require like 200k lines of code. That is more than a
>> year.
>>
>> On Sun, Mar 19, 2017 at 2:51 AM, Shiv Shankar Dayal <
>> shivshan...@gmail.com> wrote:
>>
>> Which you did not know. :) LOL.
>>
>> On Sun, Mar 19, 2017 at 2:50 AM, Avinash Lakshman <
>> avinash....@gmail.com> wrote:
>>
>> See that's the spirit.
>>
>> On Sat, Mar 18, 2017 at 2:19 PM Shiv Shankar Dayal <
>> shivshan...@gmail.com> wrote:
>>
>> Cassandra is AP. A DB developer cannot leave consistency to developers.
>> The CAP theorem is not really CAP theorem but CLP theorem.
>> Consistency-Latency-Partition tolerance. A good database will minimize the
>> latency. Cassandra fails miserably at that. Alas, Avi Kivity is not a DB
>> guy but a system programmer.
>>
>> I have asked Dor Laor and Avi Kivity for a port of Cockroach to Seastar
>> after 5 years. I am on it.
>>
>> https://github.com/shivshankardayal/orderD
>>
>> I know that complaining does not help.
>>
>> On Sun, Mar 19, 2017 at 2:46 AM, Shiv Shankar Dayal <
>> shivshan...@gmail.com> wrote:
>>
>> Well, Cassandra is a very bad database. That is why I said that.
>>
>> On Sun, Mar 19, 2017 at 2:46 AM, Avinash Lakshman <
>> avinash....@gmail.com> wrote:
>>
>> Your sense of humor seems pretty pathetic as you. ROFL for what?
>>
>> On Sat, Mar 18, 2017 at 2:11 PM Shiv Shankar Dayal <
>> shivshan...@gmail.com> wrote:
>>
>> Damn. How I wished you were the inventor and I would have crushed you.
>> ROFL.
>>
>> On Sat, Mar 18, 2017 at 10:53 PM, AL <avinash....@gmail.com> wrote:
>>
>> I wish I were the inventor. LOL. Thanks for the links.

On Sat, Mar 18, 2017 at 10:40 PM, Shiv Shankar Dayal <shivshan...@gmail.com> wrote:
https://www.linkedin.com/in/avinashlakshman/ is this you? If yes, then you are Cassandra inventor, the DB written in Java. LOLs.


Dor Laor

<dor@scylladb.com>
Mar 18, 2017, 7:33:53 PM
to ScyllaDB users
We use O_DIRECT. Here is an explanation how it's better than the
Here is a more extensive video about Scylla with details about the cache, user space
schedule and compactions: 

I erased the private exchange from the list. Let's act as if it never happened and
not get dragged into it in the future either.



Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 8:33:09 PM
to scylladb-users@googlegroups.com
Of course. I was trying to be both serious and funny, but it got out of control. :D

But my question is this: I believe O_DIRECT is used together with ScyllaDB's own caching. Then why would it slow things down, and by how much? Are any benchmarks available for this, or easy to run?


Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 8:34:57 PM
to scylladb-users@googlegroups.com

I am sorry for having contributed towards pollution in the list. Thanks Dor for the informative part.

Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 9:24:38 PM
to scylladb-users@googlegroups.com
For OP, that file core/reactor.cc is part of Seastar and not of ScyllaDB.

https://github.com/scylladb/seastar/blob/master/core/reactor.cc

Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 18, 2017, 9:32:50 PM
to scylladb-users@googlegroups.com
In future, please keep Seastar questions separate from ScyllaDB questions. I do not think this question was well suited to this list. But people know less about Seastar and more about ScyllaDB.

The way to Scylla goes through Seastar.

Pekka Enberg

<penberg@scylladb.com>
Mar 19, 2017, 2:06:38 AM
to ScyllaDB users
On Sun, Mar 19, 2017 at 12:43 AM, <bllha...@gmail.com> wrote:
> This reads like a hilarious exchange.

No, it's not. Please keep the discussion on this list civil and
on-topic about ScyllaDB.

- Pekka

Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 19, 2017, 4:40:43 AM
to scylladb-users@googlegroups.com
Apologies.


Nadav Har'El

<nyh@scylladb.com>
Mar 19, 2017, 7:27:13 AM
to scylladb-users@googlegroups.com, avinash.lakshman@gmail.com
On Sat, Mar 18, 2017 at 6:45 PM, AL <avinash....@gmail.com> wrote:
Does scylladb use O_DIRECT for all disk I/O? Does this slow down flushes and compactions?

Hi Avinash, I am sorry about what has transpired on this list after you asked your question. I believe that mailing lists should be friendly places, where good questions are met with serious and courteous responses - not ad-hominem attacks.

So since your question was indeed a good one, I would like to try and offer a serious response.

You are right that all disk I/O in ScyllaDB is done with O_DIRECT, i.e., we bypass the kernel's page cache. You are asking about bulk read and write operations (memtable flush and compactions) and wondering how bypassing the page cache doesn't hurt their performance. I would like to explain why, far from hurting performance, it can actually improve it:

Traditionally, the page cache helps performance of bulk reads by "read-ahead": while the application reads previous data from memory (the page cache) the kernel ensures that the following file blocks are read from disk into the page cache, so that the next read calls will find the data already in memory. Similarly, the page cache helps bulk write performance by "write-behind": the writes (during memtable flush and compaction) are actually done to memory (the page cache), with the kernel flushing this memory to disk in parallel as quickly as possible.
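
As a point of comparison, this is roughly what the traditional page-cache path looks like from the application side. It is only a sketch: buffered_sequential_read is a made-up name, and the posix_fadvise() call is an optional hint that the access pattern is sequential, not something the kernel's read-ahead requires.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>

// Reads `path` front to back through the page cache and returns the number
// of bytes read, or -1 on error. The kernel's read-ahead decides how far
// ahead of the application to fetch; the application has no direct say.
long buffered_sequential_read(const char* path) {
    assert(path != nullptr);
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    // Advisory only, so the result is deliberately ignored.
    (void)::posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[64 * 1024];
    long total = 0;
    for (;;) {
        ssize_t n = ::read(fd, buf, sizeof buf);
        if (n < 0) {
            ::close(fd);
            return -1;
        }
        if (n == 0) {
            break;  // end of file
        }
        total += n;
    }
    ::close(fd);
    return total;
}
```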

Seastar (the asynchronous programming library on which Scylla is built) actually does exactly the same two things - read-ahead and write-behind - explicitly - instead of relying on the kernel to do them. For reads, as Seastar makes one disk block available to the application, it already submits asynchronous requests (Linux AIO) to read the next blocks from the disk. For writes, an application's write operation sends an asynchronous request, but does not necessarily wait for it to complete before allowing further writes to be sent on this file.
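
The bookkeeping behind application-level read-ahead can be pictured with a toy model like the one below. It is not Seastar's API: ReadAheadWindow, its depth parameter, and the idea of returning block indices to submit are all assumptions made for illustration, and no real disk requests are issued.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of read-ahead: whenever the application consumes a block, make
// sure reads for the next `depth` blocks have already been submitted.
class ReadAheadWindow {
public:
    ReadAheadWindow(std::size_t total_blocks, std::size_t depth)
        : total_(total_blocks), depth_(depth) {
        assert(depth > 0);
    }

    // The application asks for its next block. Returns the indices of the
    // block reads that should be submitted now to keep the window full.
    std::vector<std::size_t> consume_next() {
        std::vector<std::size_t> to_submit;
        if (consumed_ >= total_) {
            return to_submit;
        }
        // Cover the block being consumed plus `depth_` blocks after it.
        std::size_t want_until = std::min(total_, consumed_ + 1 + depth_);
        while (next_to_submit_ < want_until) {
            to_submit.push_back(next_to_submit_++);
        }
        ++consumed_;
        return to_submit;
    }

    bool done() const { return consumed_ >= total_; }

private:
    std::size_t total_;
    std::size_t depth_;
    std::size_t consumed_ = 0;
    std::size_t next_to_submit_ = 0;
};
```

In a real implementation each returned index would become an asynchronous read request, and the depth could be tuned from measurements of the disk instead of being fixed.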

So, if Seastar basically does the same things as the kernel in this respect, you may be wondering why we bothered with it at all - is there a *benefit* of using O_DIRECT instead of letting the kernel's page cache deal with those things? Yes, there definitely are benefits. Here are the first few that came to my mind:

1. O_DIRECT is necessary for using Linux's AIO (asynchronous I/O). AIO is what allowed Seastar (and therefore Scylla) to work with a single thread per core: As I'm sure you know, using blocking system calls or a blocking mmap'ed memory (which Cassandra uses) requires that you have multiple threads per core, which need to be switched when one of them blocks on a read or write. The single-thread-per-core design is critical for Scylla's performance - both its high throughput and low 99.9th percentile latency.

2. By doing the read-ahead and write-behind in our own code, we have full control over it. We can control its amount based on measurements we do on the parallelism of the actual disk. Even more interestingly, we have an opportunity for "I/O scheduling" - we can decide that, for example, compaction should only be given 10% of the disk bandwidth, and delay the execution of read-ahead or write-behind if we're over this quota, and execute more disk requests from, say, query execution. When the read-ahead code is in the kernel, we have little control over how many requests it will send to the disk and when.

3. The default behavior of Linux's page cache is very wasteful of memory. For example, as a compaction reads an entire sstable, the page cache might contain (at its own discretion) a copy of the entire sstable - even though we know we will not need to read this sstable ever again. Any memory which is used for this page cache could have been put to better use storing a bigger row cache, which would make request processing faster. So not having tight control over page cache use hurts request performance. Yes, this problem of page cache waste can be mitigated with careful use of madvise() and I think Cassandra does use that to some degree, but you still don't get as much control as we get by doing this caching ourselves.
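
To illustrate the I/O scheduling idea from point 2, here is a toy share-based dispatcher. It is a sketch under stated assumptions, not Scylla's scheduler: the class names, the 90/10 share values, and the equal-sized requests are all invented for the example. Unlike a hard quota, it is work-conserving: a class can use the whole disk when the others are idle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One class of I/O (for example "query" or "compaction") in the toy model.
struct IoClass {
    std::string name;
    unsigned shares;                // relative weight of disk bandwidth
    std::size_t pending;            // queued requests, all the same size
    unsigned long long dispatched;  // requests sent to the disk so far
};

class ShareScheduler {
public:
    void add_class(std::string name, unsigned shares, std::size_t pending) {
        assert(shares > 0);
        classes_.push_back(IoClass{std::move(name), shares, pending, 0});
    }

    // Dispatches one request from the backlogged class that has received the
    // least service relative to its shares. Returns the class index, or -1
    // when nothing is queued. Ties go to the class that was added first.
    int dispatch_one() {
        int best = -1;
        for (std::size_t i = 0; i < classes_.size(); ++i) {
            const IoClass& c = classes_[i];
            if (c.pending == 0) {
                continue;
            }
            if (best < 0 || less_served(c, classes_[best])) {
                best = static_cast<int>(i);
            }
        }
        if (best < 0) {
            return -1;
        }
        --classes_[best].pending;
        ++classes_[best].dispatched;
        return best;
    }

    // Dispatches up to `n` requests and returns how many each class got.
    std::vector<std::size_t> run(std::size_t n) {
        std::vector<std::size_t> counts(classes_.size(), 0);
        for (std::size_t i = 0; i < n; ++i) {
            int cls = dispatch_one();
            if (cls < 0) {
                break;
            }
            ++counts[cls];
        }
        return counts;
    }

private:
    // Compares dispatched/shares for a and b using integer arithmetic.
    static bool less_served(const IoClass& a, const IoClass& b) {
        return a.dispatched * b.shares < b.dispatched * a.shares;
    }

    std::vector<IoClass> classes_;
};
```

With a "query" class at 90 shares and a "compaction" class at 10 shares, both backlogged, 100 dispatches split 90 to 10. When the kernel drives read-ahead, the application has no comparable place to make this decision.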

Finally, not using the kernel's page cache for request caching, and instead caching parsed rows, also has its benefits (and Dor's reply offered a few links on this subject). But your question asked specifically about the bulk operations (flush and compaction), so I'll not expand on this issue.

I hope this helped,
Nadav.

Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 19, 2017, 1:51:59 PM
to scylladb-users@googlegroups.com
Thanks Nadav for the beautiful answer.


Shiv Shankar Dayal

<shivshankar.dayal@gmail.com>
Mar 19, 2017, 2:03:04 PM
to scylladb-users@googlegroups.com
Thanks Nadav for the beautiful answer.