Fwd: dynomite idea

Todd Lipcon

Dec 12, 2008, 1:03:48 PM
to dynomit...@groups.google.com
Sent the following to Cliff this morning, but he asked me to send to the group:

---------- Forwarded message ----------

Hey Cliff,

Just thought of an idea you might want to think about for reducing that 99th percentile latency:

I remember reading somewhere that the erlang distribution protocol doesn't interleave messages. So, if you're transferring a large value from one node to another, no control messages can come over that same pipe at the same time.

What if the actual value transfer were moved to an out-of-band protocol like TCP or possibly UDP for small packets? This would allow several values to be transferred concurrently if you allowed multiple TCP streams between nodes, and it would allow control packets to continue uninterrupted during transfer of large values.

I'm imagining something like a registered transfer_server process that keeps a registry of TCP connections to each node and listens for incoming connections. Then when a large value needs to be transferred, you do something like transfer_server:cast({transfer_to, DestNode, ProcessName, Value}).

Thoughts?

-Todd

cliffmoon

Dec 26, 2008, 1:09:42 PM
to Dynomite
Sorry, I've been out due to the holidays and whatnot. I really like
the idea of streaming values that go above a certain threshold. I'm
skeptical about setting up a sideband channel for this data, however.
I think it might be a good idea to spawn off a process on both sides
to send chunked packets of data simply using erlang messaging. That
would lessen the burden of dealing with and designing a sideband
protocol.
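
A minimal sketch of the chunked-messaging idea above, using only plain Erlang message passing. The module name, function names, message shapes, and the 5-second timeout are all hypothetical and are not taken from the dynomite source:

```erlang
-module(chunk_sketch).
-export([chunks/2, send_value/3, recv_value/0, roundtrip/2]).

%% Split a binary into pieces of at most Size bytes.
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    chunks(Bin, Size, []).

chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    lists:reverse([Bin | Acc]);
chunks(Bin, Size, Acc) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    chunks(Rest, Size, [Chunk | Acc]).

%% Sender side: a throwaway process that sends one message per chunk,
%% followed by an end marker. (Hypothetical message shapes.)
send_value(Receiver, Value, ChunkSize) ->
    spawn(fun() ->
        lists:foreach(fun(C) -> Receiver ! {chunk, C} end,
                      chunks(Value, ChunkSize)),
        Receiver ! done
    end).

%% Receiver side: collect chunks until the end marker arrives.
recv_value() ->
    recv_value([]).

recv_value(Acc) ->
    receive
        {chunk, C} -> recv_value([C | Acc]);
        done       -> iolist_to_binary(lists:reverse(Acc))
    after 5000 ->
        {error, timeout}
    end.

%% Send a value to ourselves in chunks and reassemble it.
roundtrip(Value, ChunkSize) ->
    send_value(self(), Value, ChunkSize),
    recv_value().
```

Because each chunk is a separate message, other traffic between the two nodes can be scheduled between chunks instead of waiting behind one large message. Note that these messages carry no stream identifier, so the receiver must be a process dedicated to a single stream.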

cliffmoon

Jan 1, 2009, 3:23:25 PM
to Dynomite
I did a really quick implementation to see what the results might look
like. It's on a branch here: http://github.com/cliffmoon/dynomite/tree/streaming

The chunk size is set to 5KB, not sure if that's the optimal sizing.
The initial impl might need to be refactored a bit too, I'm thinking
that all of the streaming logic should be moved into storage_server.
I checked it for correctness but not for performance, although it
seems to be pretty zippy. I'd be interested to see what it does to
Jason's 99% numbers.

Todd Lipcon

Jan 2, 2009, 12:09:37 PM
to dynomit...@googlegroups.com
Hey Cliff,

Just took a look at the source. I'm a little nervous about not using some kind of ref or tag with the streaming messages - am I missing something or would it be possible for messages from an old stream to end up interleaved with a new stream? I guess given that pmap spawns new processes for each storage_server get, it's safe enough, but I think using erlang:make_ref/0 would make things a little clearer, and might come in handy if we swap out the streaming implementation for something based on gen_tcp later.

Also looking forward to the benchmark results after this change.

-Todd

cliffmoon

Jan 2, 2009, 12:43:49 PM
to Dynomite
Right. In practice it wouldn't come up right now because at least one
end of the stream is a process spawned just for that stream. But
you're correct that it needs a ref to be a more generic stream
mechanism.
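
A minimal sketch of what tagging a stream with a ref could look like, assuming each stream gets its own erlang:make_ref/0 value. The module name, function names, and message tuples are hypothetical and are not the code on the streaming branch:

```erlang
-module(ref_stream_sketch).
-export([send_stream/3, recv_stream/1, demo/0]).

%% Send every chunk tagged with Ref, then a tagged end marker.
send_stream(Receiver, Ref, Chunks) ->
    lists:foreach(fun(C) -> Receiver ! {stream_chunk, Ref, C} end, Chunks),
    Receiver ! {stream_done, Ref},
    ok.

%% Selective receive: only messages carrying this Ref match, so chunks
%% from any other stream stay in the mailbox untouched.
recv_stream(Ref) ->
    recv_stream(Ref, []).

recv_stream(Ref, Acc) ->
    receive
        {stream_chunk, Ref, C} -> recv_stream(Ref, [C | Acc]);
        {stream_done, Ref}     -> {ok, iolist_to_binary(lists:reverse(Acc))}
    after 5000 ->
        {error, timeout}
    end.

%% A stale chunk from an older stream is already queued when the new
%% stream starts; the ref keeps it out of the reassembled value.
demo() ->
    Old = make_ref(),
    New = make_ref(),
    self() ! {stream_chunk, Old, <<"stale">>},
    send_stream(self(), New, [<<"he">>, <<"llo">>]),
    recv_stream(New).
```

With the ref in every message, a long-lived process could receive several streams without mixing them up, and the same tag could later identify a transfer carried over gen_tcp.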

JP

Jan 5, 2009, 10:19:01 AM
to Dynomite
On Jan 1, 3:23 pm, cliffmoon <moonpolys...@gmail.com> wrote:
> I'd be interested to see what it does to
> Jason's 99% numbers.

I'm just back from vacation, so this is going to be a busy week, but
I'll try to pull this change into my branch and run the ec2
performance test, if I have a chance.

JP

cliffmoon

Jan 5, 2009, 10:43:25 AM
to Dynomite
Great. Also, maybe a wiki page or a short explanation of how to run
the ec2 performance tests might be a good idea, so that in the future
we won't have to bother you to run tests for changes. Thanks.

JP

Jan 5, 2009, 6:08:18 PM
to Dynomite
Working on a wiki page now at: http://github.com/jpellerin/dynomite/wikis/ec2-performance-tests
-- let me know if it makes sense.

Results w/streaming are marginally better than without, though the
worst time was a lot worse on one test run out of 3. Here are the
results from that run:

get avg: 14.182125ms median: 7.943869ms 99.9: 228.282928ms
put avg: 19.963393ms median: 11.610031ms 99.9: 191.646814ms
gets:
10% < 1.790ms
20% < 2.918ms
30% < 4.905ms
40% < 6.386ms
50% < 7.941ms
60% < 10.555ms
70% < 13.275ms
80% < 18.824ms
90% < 29.427ms
100% < 3685.608ms
puts:
10% < 3.440ms
20% < 4.979ms
30% < 6.438ms
40% < 8.214ms
50% < 11.607ms
60% < 15.936ms
70% < 22.062ms
80% < 31.926ms
90% < 47.970ms
100% < 3778.776ms

The 100% times are much worse than they were pre-streaming, but only
in this test run. But the 90 and 99% times are better. I ran 2 other
runs after this with shorter run times and didn't see spikes nearly
that high. I need to retry with other storages (this was using
fs_storage) to see if it happens again.
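
A small sketch of why the 100% line can spike while the 90% line stays low: with a nearest-rank percentile, a single slow request moves only the maximum. The module and function names are hypothetical, the sample values are made up for illustration, and the method the benchmark script actually uses is not shown in this thread:

```erlang
-module(latency_sketch).
-export([percentile/2]).

%% Nearest-rank percentile: the smallest sample such that at least
%% P percent of all samples are less than or equal to it.
percentile(Samples, P) when Samples =/= [], P > 0, P =< 100 ->
    Sorted = lists:sort(Samples),
    N = length(Sorted),
    Rank = max(1, ceil(P * N / 100)),
    lists:nth(Rank, Sorted).
```

For example, with nine samples between 1 and 9 ms and one 3000 ms outlier, percentile(Samples, 90) still returns 9 while percentile(Samples, 100) returns 3000.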

JP

Todd Lipcon

Jan 5, 2009, 6:14:59 PM
to dynomit...@googlegroups.com
I wonder how much these tests are affected by our changes versus specific conditions on ec2. Does anyone have a bored test cluster they can run benches on that might give more consistent results?

http://blog.dbadojo.com/2008/03/sysbench-fileio-vs-ec2-part-1.html seems to indicate pretty bad 95th percentile times for a lot of the random io tests as well.

-Todd

cliffmoon

Jan 5, 2009, 6:24:03 PM
to Dynomite
Also, what was the average value size? The chunk size is set to 5k
which may or may not be an optimal chunk size. Making it the size of
a page or slightly smaller might be a good idea.
