Async delete discussion


Rob Shepherd

Jan 25, 2013, 12:07:49 PM1/25/13
to jour...@googlegroups.com
Dear Journal.io group.

I have done some testing in a small application which is a primitive synthesis of what our main use case will be.

1. A number of web threads (producers) buffer data into the journal (sync write), then hand off the resulting "Location" to a worker (consumer).
2. The worker is a Runnable serviced by a fixed-size thread pool to deal with the journal entry (sync read).
3. The worker then deletes the entry.

If the application crashes, it can redo() the journal on the next boot to purge the remaining journal entries.
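The three steps above can be sketched as a minimal, self-contained Java program (an int stands in for a Location, and the actual journal read/write/delete calls are elided into comments):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HandoffSketch {
    public static void main(String[] args) throws InterruptedException {
        // Consumers: a fixed-size pool servicing the workers, as in step 2.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        ConcurrentLinkedQueue<Integer> processed = new ConcurrentLinkedQueue<>();

        // Producers: each sync write would return a Location; an int stands in here.
        for (int i = 0; i < 20; i++) {
            final int location = i; // journal.write(data, /* sync */ true) would return this
            workers.submit(() -> {
                // Worker: sync-read the entry at `location`, handle it, then delete it.
                processed.add(location);
            });
        }
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(processed.size());
    }
}
```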

I have noticed that the consumers of these journal locations always lag behind the producers, even with a consumer-to-producer ratio of 20:1.

I have reduced the lag to an almost negligible amount by introducing the following method into Journal.java:

public void deleteAsync(Location location) throws IOException, IllegalStateException {
    accessor.updateLocation(location, Location.DELETED_RECORD_TYPE, false);
}

This does not issue a "sync" whilst deleting.


Please may I ask - what are the dangers of using such a function?  
The most obvious would be that un-synced deletes remain in the journal upon application crash.
(my application can actually tolerate that circumstance)

Are there any other side effects? What else could go wrong?


Are these deletes then batched like async writes?



Many thanks

Rob

Sergio Bossa

Jan 25, 2013, 6:20:36 PM1/25/13
to jour...@googlegroups.com
Hi Rob,

the delete method does a sync every time to ensure it can find all the
latest writes: if you make it async, you may not find the location and
get an exception. In your case, by the way, it just works because all
your inserts are synced too.

There are several options to improve it:
1) Make an async delete version which enforces/checks there are no
inflight writes.
2) Make an async delete version which does actual deletes in batches.
3) Mark the locations as deleted when in memory waiting to be batched.

I'm more in favor of either #1 or #3: what do people think?
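A rough, self-contained sketch of option #1, guarding the unsynced delete behind an inflight-write counter (every name here is hypothetical, not actual Journal.IO internals; the real delete path appears only as a comment):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Option #1 sketch: only skip the sync when no writes are in flight.
public class GuardedDelete {
    private final AtomicInteger inflightWrites = new AtomicInteger();

    void beginWrite() { inflightWrites.incrementAndGet(); }
    void endWrite()   { inflightWrites.decrementAndGet(); }

    // Returns true if the delete could be done without a sync.
    boolean deleteAsyncIfSafe() {
        if (inflightWrites.get() == 0) {
            // Safe: no batch could still hold this location in memory.
            // accessor.updateLocation(location, Location.DELETED_RECORD_TYPE, false);
            return true;
        }
        // Fall back to the current behavior: sync first, then delete.
        return false;
    }

    public static void main(String[] args) {
        GuardedDelete journal = new GuardedDelete();
        System.out.println(journal.deleteAsyncIfSafe()); // no inflight writes
        journal.beginWrite();
        System.out.println(journal.deleteAsyncIfSafe()); // a write is in flight
    }
}
```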

By the way, it would be great if you could open an issue about
improving delete performance, so we can track it.

Cheers,

Sergio B.
--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Julien Eluard

Jan 28, 2013, 9:19:10 AM1/28/13
to jour...@googlegroups.com
Hi Sergio,

I am not too familiar with the internals of Journal.IO, so I am not really sure what the implications of those three options are.

Find below some remarks/comments:

>1) Make an async delete version which enforces/checks there are no inflight writes
This sounds like an improvement in all cases. Could this method also force a sync when there are inflight writes? If so, this would be a superset of the current delete?

>2) Make an async delete version which does actual deletes in batches
Does that remove the need for the sync? Would the batch improve performance over solution #1? Otherwise it does not seem worth it.

>3) Mark the locations as deleted when in memory waiting to be batched. 
I don't quite understand this one.

A related remark: in my scenario I could delete Locations per batch. Would a delete(Location[] locations) make sense? Or is this essentially the same as #2?

Julien

Sergio Bossa

Feb 1, 2013, 2:20:40 PM2/1/13
to jour...@googlegroups.com
Hi Julien,

sorry for this late response.

>>1) Make an async delete version which enforces/checks there are no inflight
>> writes
> This sounds like an improvement in all cases. Could this method also force
> a sync when there are inflight writes? If so, this would be a superset of
> the current delete?

Good point.

>>2) Make an async delete version which does actual deletes in batches
> Does that remove the need for the sync? Would the batch improve performance
> over solution #1? Otherwise it does not seem worth it.

It just means delete operations would be batched, as happens with
writes, so you don't pay for the sync on each delete and the cost
is amortized.

>>3) Mark the locations as deleted when in memory waiting to be batched.
> I don't quite understand this one.

The sync is needed because we have to flush unwritten changes, since the
location we're trying to delete may be among them: so this just means
we would inspect all unwritten changes (in memory) and discard the
deleted location there; in other words, avoid writing it altogether.

> Some related remark: in my scenario I could delete Location per batch. Would
> a delete(Location[] locations) make sense? Or is this essentially the same
> as #2?

Do you mean deleting an array of locations all together in the same operation?
It would be different from #2, and a good idea to investigate too.

Julien Eluard

Feb 2, 2013, 7:49:06 AM2/2/13
to jour...@googlegroups.com
Hi Sergio,

OK, it makes more sense now. It looks like option #3 is a nice improvement, but it would only apply to batches not yet written. Probably not that frequent? And it would not help for the sync write case (my current scenario :)).

>Do you mean deleting an array of locations all together in the same operation?
Yes, that's what I meant. But then I wonder whether people would expect atomicity, which could be hard to provide? BTW, how do batched writes work in that regard?
It does not sound like a good idea anyway, and it would break symmetry with write.

In my current scenario I want to keep the number of persisted Locations fixed. So at some point, every time I append something I will first delete the oldest Location. It does not really matter if the size is not strictly enforced at all times, and failures can be repaired easily at restart time, so async batches sound like a good option.
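That fixed-size scheme can be sketched as a small ring over stand-in Locations (hypothetical names throughout; the real journal calls appear only as comments):

```java
import java.util.ArrayDeque;

// Sketch of the fixed-size scenario: appending past capacity first deletes
// the oldest Location. Strings stand in for Locations.
public class BoundedJournalSketch {
    private final ArrayDeque<String> locations = new ArrayDeque<>();
    private final int capacity;

    BoundedJournalSketch(int capacity) { this.capacity = capacity; }

    void append(String location) {
        if (locations.size() == capacity) {
            String oldest = locations.removeFirst();
            // journal.delete(oldest) would go here; batching or async-ing
            // these deletes is what amortizes the per-delete sync cost.
        }
        locations.addLast(location); // a sync write would have produced `location`
    }

    public static void main(String[] args) {
        BoundedJournalSketch journal = new BoundedJournalSketch(3);
        for (int i = 0; i < 5; i++) journal.append("loc" + i);
        System.out.println(journal.locations);
    }
}
```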

Why are you not sold on #2? In my scenario, that looks like the way to go.

Julien

Sergio Bossa

Feb 2, 2013, 7:56:53 AM2/2/13
to jour...@googlegroups.com
On Sat, Feb 2, 2013 at 12:49 PM, Julien Eluard <julien...@gmail.com> wrote:

> In my current scenario I want to keep the number of persisted Location
> fixed. So at some point every time I append something I will first delete
> the first Location.

In this case, something similar to option #1 would actually help a
lot: older batches would certainly be on disk already, so we wouldn't
need to sync the in-memory ones.

> Why are you not sold on #2? Looks like in my scenario that would be the way
> to go.

Option #2 is unfortunately the most invasive: I included it for the
sake of completeness, and in case the others did not work, but I'd
rather avoid it :)

Julien Eluard

Feb 2, 2013, 12:20:07 PM2/2/13
to jour...@googlegroups.com
>Option #2 is unfortunately the most invasive
That's what I suspected. Better to keep the implementation simple then.

>In this case, something similar to option #1 would help a lot 
Great! So option #1 seems the most interesting to me!

Julien