Re: Improve Write performance with Relax durability.

Jia Zhai

Jun 1, 2016, 9:37:31 PM
to d...@bookkeeper.apache.org, distribut...@googlegroups.com
+ distributedlog-user
For more input and comments. :)

Thanks.

On Thu, Jun 2, 2016 at 9:34 AM, Jia Zhai <zhai...@gmail.com> wrote:
Hello all,

I am wondering whether you have any plans to support relaxed durability. Would it be a good feature to have in BookKeeper (and in DistributedLog)?

I am thinking of adding a new flag to bookkeeper#addEntry(..., Boolean sync), so that the application can control whether to sync for individual entries.

- On the write protocol, add a flag indicating whether this write should be synced to disk.
- On the bookie side, if the addEntry request has sync enabled, it goes through the original pipeline. If the addEntry request disables sync, the add callbacks complete after the entry is written to the journal file and before the journal is flushed.
- Add entries with sync disabled will be flushed to disk together with subsequent sync add entries.
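
The three points above can be sketched with a toy in-memory model. This is only an illustration of the proposed behavior, not BookKeeper code: the `ToyJournal` class and all of its fields and methods are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyJournal:
    """Toy in-memory model of a bookie journal; not the BookKeeper API."""
    buffered: List[int] = field(default_factory=list)  # written, not yet fsynced
    durable: List[int] = field(default_factory=list)   # covered by an fsync
    acked: List[int] = field(default_factory=list)     # add callbacks completed
    fsyncs: int = 0

    def add_entry(self, entry_id: int, sync: bool) -> None:
        # Every entry is appended to the journal file first.
        self.buffered.append(entry_id)
        if sync:
            # Original pipeline: flush, then complete the callback.
            self.flush()
        # With sync disabled, the callback completes before any flush.
        self.acked.append(entry_id)

    def flush(self) -> None:
        # One fsync covers everything appended so far, including
        # earlier entries that were added with sync disabled.
        self.durable.extend(self.buffered)
        self.buffered.clear()
        self.fsyncs += 1

    def crash(self) -> List[int]:
        # Entries acknowledged but never flushed are lost on power failure.
        lost = list(self.buffered)
        self.buffered.clear()
        return lost


journal = ToyJournal()
journal.add_entry(0, sync=False)
journal.add_entry(1, sync=False)
journal.add_entry(2, sync=True)    # flushes entries 0, 1 and 2 together
journal.add_entry(3, sync=False)   # acknowledged, still only buffered
lost = journal.crash()
```

In this model a single fsync makes entries 0 to 2 durable, while entry 3 has been acknowledged to the client but is lost in the simulated crash. That window is the durability being relaxed.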

For my use cases on DistributedLog, this feature can be used to support streams that don't have strong durability requirements.

What do you think? Shall I create a JIRA to implement this?

Thanks a lot
-Jia

Sijie Guo

Jun 2, 2016, 3:19:55 AM
to Jia Zhai, d...@bookkeeper.apache.org, distributedlog-user
This seems interesting to me. However, it might be safer to start with a flag configured per ledger rather than per entry. Also, it would be good to hear opinions from other people. JV, Matteo? (If I remember correctly, Matteo mentioned that Yahoo might be working on a similar thing.)

+1 for creating a BOOKKEEPER jira to track this.

- Sijie

--
You received this message because you are subscribed to the Google Groups "distributedlog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to distributedlog-...@googlegroups.com.
To post to this group, send email to distribut...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/distributedlog-user/CALsc%2BXpJj3YT47bognhmEhHmahJkCgJUUY6Un4HVczfK_1MxPQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Venkateswara Rao Jujjuri

Jun 2, 2016, 10:30:18 AM
to d...@bookkeeper.apache.org, Jia Zhai, distributedlog-user
I agree that we must make this a ledger property, not a per-entry write property.

But the biggest doubt in my mind is that this changes something fundamental: LAC (LastAddConfirmed).
Are we allowing sparse ledgers in failure scenarios? Handling the read side may become more complex.


--
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi


Venkateswara Rao Jujjuri

Jun 3, 2016, 11:58:07 AM
to d...@bookkeeper.apache.org, Jia Zhai, distributedlog-user
@sijie let me expand on what I mean by "this changes something fundamental".

Everything starts from the fact that we are not persisting. I also share a lot of the points raised by @Matteo.

- In theory, we could lose all copies of EntryId X but persist EntryId X+Y. How do reads, replication, and consistency cope with that?
- We could advance LAC but lose the last set of entries. What do we do? Do we adjust LAC? At what boundaries?
- One of the core principles of a log is that if entry X is there, all the entries up to X are available too. With this, we may need to deal with
   sparse / missing entries.
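
A small sketch of the failure scenario in the points above, assuming a made-up post-restart state. The bookie names, the entry sets, and the helper functions are all hypothetical; nothing here is BookKeeper's recovery logic.

```python
from typing import Dict, List, Set


def surviving_entries(bookies: Dict[str, Set[int]]) -> Set[int]:
    """Union of entry ids still readable from at least one bookie."""
    survivors: Set[int] = set()
    for entries in bookies.values():
        survivors |= entries
    return survivors


def holes_up_to(lac: int, survivors: Set[int]) -> List[int]:
    """Entry ids <= lac that no bookie can serve any more."""
    return [e for e in range(lac + 1) if e not in survivors]


def contiguous_prefix_end(survivors: Set[int]) -> int:
    """Last entry id of the gap-free prefix starting at 0, or -1 if entry 0 is gone."""
    end = -1
    while end + 1 in survivors:
        end += 1
    return end


# Hypothetical state after all three bookies restart: entry 3 was acknowledged
# but never fsynced anywhere, while later entries did reach disk.
bookies = {
    "bookie-a": {0, 1, 2, 4},
    "bookie-b": {0, 1, 2, 5},
    "bookie-c": {0, 1, 4, 5},
}
lac = 5  # the client had already advanced LAC to 5 before the failure

survivors = surviving_entries(bookies)
missing = holes_up_to(lac, survivors)
safe_end = contiguous_prefix_end(survivors)
```

Here LAC says entries 0 to 5 are readable, but entry 3 is gone and only entries 0 to 2 still form a gap-free prefix. A reader has to either tolerate the hole at 3 or treat the ledger as ending at 2, which is the boundary question raised above.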

I believe this is more of a direction towards making BookKeeper an in-memory log, but I am afraid it is more of a core change.

Thanks,
JV

On Fri, Jun 3, 2016 at 12:05 AM, Matteo Merli <mme...@apache.org> wrote:
I was interested in trying something in this area, but never actually got to do it.

A few random notes:

1. My suspicion, with no backing data at this point, is that simply skipping the fsync
    for "non-durable" ledgers might not give a big improvement: just a bit less latency
    for non-fsynced writes, but roughly the same throughput. Imagine a bookie
    receiving writes for 2 ledgers, 1 durable and the other non-durable.
    Since the entries are appended to the journal as they come in, the fsync() for the
    durable ledger write will also carry the data for the previous non-durable ledger
    write, causing more IOPS if that write spans a different disk block.
    Given that the bookie throughput is typically limited by the IOPS capacity of the
    journal device, having non-durable writes might not help that much.

2. The other options I was thinking of were:
    - Do not append the non-durable entries to the journal (redundancy is anyway given by
      writing to multiple bookies). In this case, though, a single bookie could lose more
      entries depending on flushTime, and could also lose entries even in case of a
      process crash, not just a kernel panic or power outage.

    - Use a separate journal for non-durable writes, which will not be fsynced.

    - Configure the durability at the bookie level and then use a placement/isolation
      policy to choose the appropriate set of bookies for a non-durable ledger.

3. How will bookie replication operate when it gets read errors?
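
Point 1 can be illustrated with a rough block-counting sketch. It assumes a 4 KB disk block and a single shared journal in which only durable appends trigger an fsync. The function name and the write sizes are made up for the example, and it is not a measurement of a real bookie.

```python
from typing import List, Tuple

BLOCK_SIZE = 4096  # assumed disk block size in bytes


def blocks_written_per_fsync(writes: List[Tuple[int, bool]]) -> List[int]:
    """For each fsync, count the journal blocks it has to write.

    `writes` is the arrival order of (size_in_bytes, durable) appends to one
    shared journal. Only durable appends trigger an fsync, but that fsync
    covers every byte appended since the previous one.
    """
    result = []
    offset = 0          # end of the journal so far
    synced_up_to = 0    # bytes already covered by an earlier fsync
    for size, durable in writes:
        offset += size
        if durable:
            first_block = synced_up_to // BLOCK_SIZE
            last_block = (offset - 1) // BLOCK_SIZE
            result.append(last_block - first_block + 1)
            synced_up_to = offset
    return result


# A 1 KB durable write on its own touches one block.
alone = blocks_written_per_fsync([(1024, True)])

# The same durable write arriving after 6 KB of non-durable data has to
# flush that data too, which here spans an extra block.
mixed = blocks_written_per_fsync([(6144, False), (1024, True)])
```

Under these assumptions the durable write costs one block on its own and two blocks when it follows non-durable data, so the journal device does not see less work just because some writes skipped their own fsync.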

Matteo

On Thu, Jun 2, 2016 at 11:09 PM Sijie Guo <si...@apache.org> wrote:

> I think if a ledger is configured to be non-durable, it is the
> application's responsibility to tolerate the data loss.
> So I don't think anything actually has to change on the bookkeeper
> client side.
>
> - Sijie

Sijie Guo

Jun 8, 2016, 1:40:08 AM
to Venkateswara Rao Jujjuri, d...@bookkeeper.apache.org, Jia Zhai, distributedlog-user
I think that's a fair consideration. However, I am thinking that if we allow non-durable ledgers, that means 1) the application needs to handle the missing entries; 2) re-replication should handle a non-durable ledger by ignoring entries that are missing.
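
A minimal sketch of point 2), assuming replicas are modeled as plain dictionaries from entry id to payload. The `rereplicate` function, its `non_durable` parameter, and `MissingEntryError` are hypothetical names for illustration and do not correspond to BookKeeper's replication worker.

```python
from typing import Dict, Iterable, List, Optional


class MissingEntryError(Exception):
    """Raised when a durable ledger is missing an entry on every replica."""


def rereplicate(entry_ids: Iterable[int],
                replicas: List[Dict[int, bytes]],
                non_durable: bool) -> Dict[int, bytes]:
    """Copy entries to a new replica, reading each from the first replica that has it."""
    target: Dict[int, bytes] = {}
    for entry_id in entry_ids:
        data: Optional[bytes] = None
        for replica in replicas:
            if entry_id in replica:
                data = replica[entry_id]
                break
        if data is None:
            if non_durable:
                continue  # tolerated hole: the application handles it
            raise MissingEntryError(f"entry {entry_id} not found on any replica")
        target[entry_id] = data
    return target


# Entry 1 was lost on both surviving replicas.
replicas = [{0: b"a", 2: b"c"}, {0: b"a"}]
copied = rereplicate(range(3), replicas, non_durable=True)
```

With `non_durable=True` the copy simply skips entry 1; with `non_durable=False` the same call raises, which matches today's expectation that a missing entry is an error.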

But let's see what Jia proposes.

- Sijie

Jia Zhai

Jun 9, 2016, 10:07:36 AM
to Sijie Guo, Venkateswara Rao Jujjuri, d...@bookkeeper.apache.org, distributedlog-user
Thanks a lot for all of your suggestions. I would like to give it a try; I will open a JIRA ticket and carry out the proposal, discussion, and testing there.

Jia Zhai

Aug 18, 2016, 11:56:58 AM
to d...@bookkeeper.apache.org, Enrico Olivelli, Venkateswara Rao Jujjuri, distributedlog-user
Thanks a lot for taking care and providing this use case.

On Wed, Aug 10, 2016 at 3:53 AM, Sijie Guo <si...@apache.org> wrote:
On Wed, Aug 3, 2016 at 12:51 PM, Enrico Olivelli <eoli...@gmail.com> wrote:

> Hi Jia,
> I have another similar use case for this feature.
> Consider a ledger used as a db transaction log.
> The client issues a sequence of data manipulation instructions inside the
> scope of the transaction; if everything goes well, a commit is finally added
> to the sequence. From the client's perspective it is important to wait for
> sync only for the last entry, that is, the 'commit'.
> In my case all the entries will be added with sync=false and then the last
> with sync=true. But it is important that the addEntry with sync returns
> only once all the previous entries of the same sequence or of the same ledger
> have been written to stable storage.
>
Yup, I think that's a common usage pattern.
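
This usage pattern can be sketched against a toy single-bookie ledger. The `ToyLedger` class, its `add_entry(data, sync)` method, and `write_transaction` are hypothetical names that mirror the proposed flag; they are not an existing client API.

```python
from typing import List


class ToyLedger:
    """Toy single-bookie ledger with a hypothetical per-entry sync flag."""

    def __init__(self) -> None:
        self.pending: List[bytes] = []   # appended, not yet on stable storage
        self.stable: List[bytes] = []    # covered by a sync
        self.sync_count = 0

    def add_entry(self, data: bytes, sync: bool) -> int:
        entry_id = len(self.stable) + len(self.pending)
        self.pending.append(data)
        if sync:
            # Returning from a sync add implies every earlier entry of the
            # ledger is on stable storage as well.
            self.stable.extend(self.pending)
            self.pending.clear()
            self.sync_count += 1
        return entry_id


def write_transaction(ledger: ToyLedger, statements: List[bytes]) -> int:
    """Append statements without sync, then a commit record with sync."""
    for statement in statements:
        ledger.add_entry(statement, sync=False)
    return ledger.add_entry(b"COMMIT", sync=True)


ledger = ToyLedger()
commit_id = write_transaction(ledger, [b"INSERT a", b"UPDATE b", b"DELETE c"])
```

The transaction pays for one sync instead of four, and once `write_transaction` returns, all four entries are on stable storage in this single-bookie model.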



> In this case I see the real challenge is that entries span multiple
> bookies and it will be very hard to coordinate such a sync
>

Does making ensemble size equal to ack quorum size work here?
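
To see why the quorum sizes matter here, the sketch below assumes entries are striped round-robin across the ensemble (entry e goes to bookies e, e+1, ... modulo the ensemble size), and that a sync add flushes only the bookies that receive it. Both the striping rule as written and the helper names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def write_set(entry_id: int, ensemble_size: int, write_quorum: int) -> List[int]:
    """Bookie indexes that receive an entry under round-robin striping."""
    return [(entry_id + i) % ensemble_size for i in range(write_quorum)]


def unsynced_after_commit(num_entries: int, ensemble_size: int,
                          write_quorum: int) -> Dict[int, Set[int]]:
    """Entries each bookie still holds unsynced after the last entry is added with sync.

    Entries 0..num_entries-2 are added with sync disabled; the last one is the
    sync add, which flushes only the bookies in its own write set.
    """
    held: Dict[int, Set[int]] = {b: set() for b in range(ensemble_size)}
    for entry_id in range(num_entries):
        for bookie in write_set(entry_id, ensemble_size, write_quorum):
            held[bookie].add(entry_id)
    for bookie in write_set(num_entries - 1, ensemble_size, write_quorum):
        held[bookie].clear()  # this bookie fsyncs everything it holds
    return {b: entries for b, entries in held.items() if entries}


striped = unsynced_after_commit(num_entries=3, ensemble_size=3, write_quorum=2)
full = unsynced_after_commit(num_entries=3, ensemble_size=3, write_quorum=3)
```

With an ensemble of 3 and a write quorum of 2, bookie 1 never sees the sync add and keeps entries 0 and 1 unsynced. When every bookie receives every entry, nothing is left unsynced, provided the sync add waits for acknowledgements from all of them, which is what an ack quorum equal to the ensemble size would give.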


> At the moment this is not very urgent for my projects, but I think that it could
> be a useful feature.
>
> Enrico

> -- Enrico Olivelli
