Facebook kisses DRAM goodbye, builds memcached for flash

Yiftach Shoolman

Mar 6, 2013, 3:53:02 PM

to redi...@googlegroups.com

Though this is not the Memcached forum, it is still interesting to see whether this shift can also affect the future of Redis:

http://gigaom.com/2013/03/05/facebook-kisses-dram-goodbye-builds-memcached-for-flash/

--

Yiftach Shoolman
+972-54-7634621

Pedro Melo

Mar 6, 2013, 4:23:54 PM

to redi...@googlegroups.com

Hi,

On Wed, Mar 6, 2013 at 8:53 PM, Yiftach Shoolman
<yiftach....@gmail.com> wrote:
> Though this is not the Memcached forum, it is still interesting to see
> whether this shift can also affect the future of Redis:
>
> http://gigaom.com/2013/03/05/facebook-kisses-dram-goodbye-builds-memcached-for-flash/

This is one of those things that triggers an "oh, cool!" from my
inner geek, but my old graybeard UNIX hat still thinks that putting a
swap file on the flash drive, letting memcached swap to it, and letting
the VM keep the most recently used pages in memory would probably get
you 80 or 90% of the performance of this solution with zero coding.

Then again, maybe I'm just grumpy and tired.. :)

Bye,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org

Javier Guerra Giraldez

Mar 6, 2013, 4:30:00 PM

to redi...@googlegroups.com

On Wed, Mar 6, 2013 at 4:23 PM, Pedro Melo <me...@simplicidade.org> wrote:
> setting a
> swap file to the flash drive, and let memcached swap to it

I had similar thoughts, but not just "let memcached swap to it";
instead, do an mmap() to specifically map a file on the SSD. Not only
does the kernel have less to guess, but there are high-performance
PCI SSD cards that avoid all the disk-emulation layers and let the
CPU do real memory accesses to the flash chips.
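
For illustration, a minimal sketch of the mmap() idea in Python. The
function names are made up for this example, and the backing file lives
in a temporary directory so the sketch runs anywhere; in practice the
path would be assumed to sit on the SSD mount.

```python
import mmap
import os
import tempfile


def open_mapped_store(path, size):
    """Create (or extend) a file of `size` bytes and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, size)       # make sure the file is big enough
        return mmap.mmap(fd, size)   # read/write mapping backed by the file
    finally:
        os.close(fd)                 # the mapping keeps its own descriptor


def demo():
    # Hypothetical location; a real setup would use a path on the SSD.
    path = os.path.join(tempfile.mkdtemp(), "store.bin")
    store = open_mapped_store(path, 4096)
    store[0:5] = b"hello"            # looks like a plain memory write
    store.flush()                    # ask the kernel to write dirty pages back
    data = bytes(store[0:5])
    store.close()
    return data
```

Reads and writes go through ordinary memory accesses, and the kernel
decides which pages of the file stay resident.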

--
Javier

Salvatore Sanfilippo

Mar 6, 2013, 4:46:58 PM

to Redis DB

The problem with the swap approach is that you turn it into a
random-access business when you write. At least that's what the
authors of the project wrote. In theory, however, most SSDs
internally use a log-structured format anyway, so this may also work;
it is not clear to me.
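
To make the random-write versus log-structured distinction concrete,
here is a toy sketch in Python. It is not how McDipper, fatcache, or any
SSD firmware works; the class and its methods are invented for this
example. Every write is appended to the end of a file (sequential I/O)
and an in-memory dict remembers where the latest value of each key lives.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy key/value store: values are only ever appended, never overwritten."""

    def __init__(self, path):
        self.f = open(path, "a+b")   # append mode: writes always go to the end
        self.index = {}              # key -> (offset, length) of latest value

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)          # sequential write, old copy becomes garbage
        self.index[key] = (offset, len(value))

    def get(self, key):
        if key not in self.index:
            return None
        offset, length = self.index[key]
        self.f.flush()               # make buffered writes visible to the read
        self.f.seek(offset)
        return self.f.read(length)
```

The price of this layout is that overwritten values pile up as garbage
in the file, so a real system also needs some form of compaction.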

About Redis: I did a number of tests like this. The point is that
there is a huge difference between plain string GET/SET/DEL operations
and Redis. I bet the SSD approach already can't reach the 1 million
ops/sec you get with Redis using pipelining.
But even if it could, it's just GET/SET.

Redis uses memory in order to have a lot more freedom in the way it
can operate on data, so we have complex data structures and can do a
lot of indexing almost for free, like Redis Cluster's ability to return
the keys in a given hash slot, or random sampling to expire stuff, or
complex data structures like sorted sets that, from the user's point
of view, are still almost as fast as GET/SET.

It is surely possible to imagine a Redis-on-SSD reimplementation, and
this may make sense, but not with the current performance levels, and
not with constant-time writes. Now if we ask ourselves how people are
using Redis right now, that is, to save their asses when on-disk DBs
are too slow, what I see is this.

1) That it's a good idea for Redis to focus on memory for the near
future at least.
2) That if SSDs turn into this almost-memory device,
operating systems will soon be able to abstract it for us, with some
kind of advanced SSD-aware swap file.
3) That Redis on disk is still an interesting project that somebody
should try to do, because in Redis the value IMHO is not just in the
speed, but in the data model offered, which makes it simpler to address
certain types of problems.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.
— David Gelernter

Salvatore Sanfilippo

Mar 6, 2013, 4:47:42 PM

to Redis DB

p.s. tomorrow morning I'll put the swap file of my server on the SSD
drive, test Redis with datasets bigger than RAM, and report back
here.

Gleicon Moraes

Mar 6, 2013, 7:16:20 PM

to redi...@googlegroups.com

AFAIK it's way different from the swap-on-SSD stuff. If it's FusionIO drives, they have a k/v engine of their own that can be leveraged through the drivers. I have a good friend working there if you want to connect.

Salvatore Sanfilippo

Mar 6, 2013, 7:20:23 PM

to Redis DB

Hello Gleicon,

Honestly, I have to look at the project more closely to understand the
tradeoffs; for instance, I don't know if they use persistence.
If they are not interested in persistence, I can see how you can do
interesting things as a middleware between memcached and on-disk DBs,
that is, think L2 cache levels on CPUs.

However if we want more info I understand that the developer is the
same great guy that wrote Twemproxy, he is active on Twitter and is a
very kind guy, so we can get plenty of details, but probably a direct
investigation to form a proper idea is needed.
Still I've the feeling that this can work well only under two assumptions:

1) Data is throw-away (no persistence)
2) Data model is get/set/del, no complex Redis-alike ops.
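
A rough sketch of the "L2 cache" idea under exactly those two
assumptions (throw-away data, get/set/del only). This is a toy in
Python, not the design of fatcache or McDipper; the class name is
invented and a plain dict stands in for the flash tier.

```python
from collections import OrderedDict


class TwoTierCache:
    """Small LRU tier in RAM in front of a larger, slower tier."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()     # hot items, most recently used last
        self.flash = {}              # stand-in for the bigger, slower tier

    def set(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        self.flash.pop(key, None)    # drop any stale copy in the lower tier
        while len(self.ram) > self.ram_capacity:
            old_key, old_value = self.ram.popitem(last=False)
            self.flash[old_key] = old_value   # demote the coldest item

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.flash:
            value = self.flash.pop(key)
            self.set(key, value)     # promote on a lower-tier hit
            return value
        return None

    def delete(self, key):
        found = key in self.ram or key in self.flash
        self.ram.pop(key, None)
        self.flash.pop(key, None)
        return found
```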

Salvatore

Yiftach Shoolman

Mar 7, 2013, 1:49:22 AM

to redi...@googlegroups.com

My 2 cents:

Historically FB put almost everything they have in Memcached, including users' profiles. From my few short discussions with them, they were probably using ~1PB of Memcached. Now these guys went IPO and got pressure from their investors to be more efficient. This is why they have invested 20+ engineer-years in this project.
As for Redis - Salvatore, I totally agree with your point that the unique thing about Redis is that it processes complex commands fast (in addition to the simple GET/SET commands). That said, IMO if the community finds ways to make it efficient with the latest SSD technologies, it will only help Redis become more popular.

Yiftach Shoolman
+972-54-7634621

Andrea Campi

Mar 7, 2013, 1:59:28 AM

to redi...@googlegroups.com, redi...@googlegroups.com

On Mar 7, 2013, at 7:49 AM, Yiftach Shoolman <yiftach....@gmail.com> wrote:

> My 2 cents:
> Historically FB put almost everything they have in Memcached, including users' profiles. From my few short discussions with them, they were probably using ~1PB of Memcached. Now these guys went IPO and got pressure from their investors to be more efficient. This is why they have invested 20+ engineer-years in this project

That's so silly I LOLed into my morning coffee; I'm lucky I didn't make a mess. Thanks :D

Pierre Chapuis

Mar 7, 2013, 4:15:36 AM

to redi...@googlegroups.com

Le jeudi 7 mars 2013 01:20:23 UTC+1, Salvatore Sanfilippo a écrit :

> However if we want more info I understand that the developer is the
> same great guy that wrote Twemproxy, he is active on Twitter and is a
> very kind guy, so we can get plenty of details, but probably a direct
> investigation to form a proper idea is needed.

I think you're talking about Twitter's project fatcache [1] here.

It is not the same project as Facebook's McDipper [2], which is not
Open Source. That being said, from what I read, the general idea is
somewhat similar.

[1] https://github.com/twitter/fatcache

[2] https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920

Salvatore Sanfilippo

Mar 7, 2013, 7:11:33 AM

to Redis DB

On Thu, Mar 7, 2013 at 10:15 AM, Pierre Chapuis
<catwell...@catwell.info> wrote:
> I think you're talking about Twitter's project fatcache [1] here.
> It is not the same project as Facebook's McDipper [2], which is not
> Open Source. That being said, from what I read, the general idea is
> somewhat similar.

Yep, sorry, I was talking about fatcache, which seems similar, but
indeed the Facebook project appears to have received a lot of attention
from the company and is probably deployed at a larger scale. Thanks for
the clarification.

Salvatore Sanfilippo

Mar 7, 2013, 7:24:10 AM

to Redis DB

On Thu, Mar 7, 2013 at 7:49 AM, Yiftach Shoolman
<yiftach....@gmail.com> wrote:
> My 2 cents:
>
> Historically FB put almost everything they have in Memcached including
> users' profiles. From my few short discussions with them, they were probably
> using ~1PB of Memcached. Now these guys went IPO and got pressure from
> their investors to be more efficient. This is why they have invested 20+
> engineer-years in this project

Makes sense indeed; however, the paper does not make clear what real
performance they are able to get from the new system.
They say that if you add network latency, the performance starts
to be in the same order of magnitude, but it is not clear with what
kind of data access pattern (completely random write access is what
I'm interested in), or what the actual numbers are.

From the datasheet of Intel 320 SSDs:

Random Write (100% Span) 400 IOPS

That's what I expect the FB system to perform with random writes.
Maybe they have a mostly read-only data access for this cached data?
Or simply the system is not designed to persist, so that writes get
help from a write cache in order to be clustered and get closer to
this other value:

Random Write (8GB Span) 23000 IOPS

And incidentally this is exactly what I got in today's Redis tests
once the first GB of data set started to be swapped out.
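
As a back-of-the-envelope check, the numbers quoted in this thread can
be compared directly. The sketch below is plain arithmetic in Python on
the figures already mentioned (the two Intel 320 datasheet values and
the 1 million ops/sec pipelined Redis figure); the helper names are
invented and nothing here is a new measurement.

```python
# Figures quoted in this thread; nothing here is a new measurement.
RANDOM_WRITE_FULL_SPAN = 400      # IOPS, Intel 320, 100% span
RANDOM_WRITE_8GB_SPAN = 23_000    # IOPS, Intel 320, 8 GB span
REDIS_PIPELINED = 1_000_000       # ops/sec, in-memory Redis with pipelining


def ratio(faster, slower):
    """How many times more operations per second `faster` sustains."""
    return faster / slower


def microseconds_per_op(ops_per_second):
    """Average time budget per operation at the given rate."""
    return 1_000_000 / ops_per_second


if __name__ == "__main__":
    print(ratio(RANDOM_WRITE_8GB_SPAN, RANDOM_WRITE_FULL_SPAN))   # 57.5
    print(ratio(REDIS_PIPELINED, RANDOM_WRITE_FULL_SPAN))         # 2500.0
    print(round(ratio(REDIS_PIPELINED, RANDOM_WRITE_8GB_SPAN), 1))  # 43.5
    print(microseconds_per_op(RANDOM_WRITE_FULL_SPAN))            # 2500.0
```

So the gap between the two datasheet figures alone is more than 50x,
which is why the access pattern matters so much here.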

> As for Redis - Salvatore I totally agree with your point that the unique
> thing about Redis is that it processes complex commands fast (in addition to
> the simple GET/SET commands). That said, IMO if the community finds ways to
> make it efficient with the latest SSD technologies, it will only help Redis
> to become more popular

From the point of view of Redis, SSD and spinning disk are not too
different: in order to port Redis to these technologies you need to
rethink the system completely, reimplementing it to store/retrieve
data from disk: threaded, different data structures, and worst-case
write performance on par with what the disk can deliver.

It's a stimulating project, but in my opinion it is a different project.
I have a narrow mission with Redis: to provide the best DB experience
using memory, and we are still far from perfection from this point of
view. I'm pretty sure it's good to have other projects focusing on
SSD with the right tradeoffs...

Cheers,
Salvatore

Felix Gallo

Mar 7, 2013, 8:27:17 AM

to redi...@googlegroups.com

Here's a good picture which illustrates why RAM-based systems are
not likely to be kissed goodbye any time soon. Spoiler: there's a legend
at the very top of the image.

http://i.imgur.com/X1Hi1.gif

Facebook's approach to engineering is somewhat idiosyncratic -- sometimes
for good reasons, sometimes not. Anyone smaller than around a billion users
should be careful before cloning from their repos (or, e.g., declaring the death
of RAM); there are frequently pragmatic alternatives that stand on a foundation
of boring old fashioned unix engineering that will last you until then.

F.

Salvatore Sanfilippo

Mar 7, 2013, 9:50:58 AM

to Redis DB

Results -> http://antirez.com/news/52

Javier Guerra Giraldez

Mar 7, 2013, 10:43:00 AM

to redi...@googlegroups.com

On Thu, Mar 7, 2013 at 7:24 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> Random Write (8GB Span) 23000 IOPS
>
> And incidentally this is exactly what I got in today's Redis tests
> once the first GB of data set started to be swapped out.

While 23k IOPS is a dismal figure for Redis, it's still 100 times
better than a simplistic on-disk DB. Even PostgreSQL with several
magnetic drives struggles to reach 6-8k IOPS. I don't know what
'typical' numbers are for PostgreSQL on SSD.

So, I'd say that your results confirm that Redis-swapping-to-SSD fits
in the (huge) void between 'real Redis' and on-disk-DBs.

Now, I think that some DB with mmap()-optimized structures/allocators
and a Redis-like API would be a great option for bigger-than-ram
datasets.

--
Javier

Salvatore Sanfilippo

Mar 7, 2013, 10:51:31 AM

to Redis DB

On Thu, Mar 7, 2013 at 4:43 PM, Javier Guerra Giraldez
<jav...@guerrag.com> wrote:
> So, i'd say that your results confirm that Redis-swapping-to-SSD fits
> in the (huge) void between 'real Redis' and on-disk-DBs.

I don't think so, because performance was too erratic... that's a big
problem. OS virtual memory is not designed for that; maybe with advanced
syscalls that hint the system about what to swap and what not, we can
improve this.
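
One existing hint interface of this kind is madvise(). A minimal
sketch in Python, assuming a platform where mmap objects expose
madvise() (Python 3.8+ on systems that have the syscall); the function
name and the scratch file are made up for the example. The hints are
advisory only, and whether they would actually tame the erratic
behaviour described here is untested.

```python
import mmap
import tempfile


def map_with_hints(size, hot_bytes):
    """Map a scratch file and tell the kernel how it will be accessed.

    Returns the mapping and a flag saying whether hints were applied.
    """
    f = tempfile.TemporaryFile()
    f.truncate(size)
    region = mmap.mmap(f.fileno(), size)
    f.close()                        # the mapping holds its own descriptor
    hinted = (
        hasattr(region, "madvise")
        and hasattr(mmap, "MADV_RANDOM")
        and hasattr(mmap, "MADV_WILLNEED")
    )
    if hinted:
        # Access is random, so read-ahead would mostly waste I/O.
        region.madvise(mmap.MADV_RANDOM)
        # The first `hot_bytes` are expected to be used soon.
        region.madvise(mmap.MADV_WILLNEED, 0, hot_bytes)
    return region, hinted
```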

> Now, I think that some DB with mmap()-optimized structures/allocators
> and a Redis-like API would be a great option for bigger-than-ram
> datasets.

IMHO not with the current locality. Note that we started to have
trashy performance when there were just 2 GB on disk and 23 GB in
memory. It would be a tragedy with something like 50% and 50%.

If Redis used more cache-oblivious structures, then maybe...

Greg Andrews

Mar 7, 2013, 11:01:59 AM

to redi...@googlegroups.com

I like to illustrate the difference in speed for RAM vs. HD by scaling their access times up to human timescales. It gives a different kind of gut feel for the differences. Felix's picture lists a RAM access figure of 83ns and an HD access figure of 13.7ms.

Scale the RAM access up to 0.83 seconds (multiply by 10,000,000) and it's around the time a programmer would take to reach out and pick up his/her mobile phone from the desk. That's the human equivalent of a CPU fetching a piece of data from RAM to work with it.

Scale the HD access up by the same factor and you get 137,000 seconds, or just over 38 hours. That's like ordering a mobile phone on the Internet and receiving it via overnight delivery.
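
The scaling is easy to reproduce. A small Python sketch using only the
two figures and the factor given above (the function name is invented):

```python
SCALE = 10_000_000            # the factor used above
RAM_NS = 83                   # RAM access, 83 ns
HD_NS = 13_700_000            # HD access, 13.7 ms expressed in ns


def human_seconds(latency_ns, scale=SCALE):
    """Scale a latency in nanoseconds up to 'human' seconds."""
    return latency_ns * scale / 1_000_000_000


if __name__ == "__main__":
    print(human_seconds(RAM_NS))                      # 0.83
    print(human_seconds(HD_NS))                       # 137000.0
    print(round(human_seconds(HD_NS) / 3600, 1))      # 38.1 hours
```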

-Greg

Pierre Chapuis

Mar 7, 2013, 12:21:13 PM

to redi...@googlegroups.com

Le jeudi 7 mars 2013 17:01:59 UTC+1, GregA a écrit :

> I like to illustrate the difference in speed for RAM vs. HD by scaling their access times up to human timescales. It gives a different kind of gut feel for the differences. Felix's picture lists a ram access figure of 83ns and HD access figure of 13.7ms.

These figures are true for rotating HDs. SSDs have higher latency than RAM but much lower than rotating HDs. FusionIO advertises 30 to 45µs read latency, which is only about 50 times as much as RAM.

Felix Gallo

Mar 7, 2013, 12:53:39 PM

to redi...@googlegroups.com

FusionIO is not exactly an SSD and that 30µs latency figure is not random write at max throughput; they have a page-sized RAM buffer to hide the flash programming time, and presumably if the writes spill over a page then they lock you to the SSD flash programming interval before returning. And, 30µs is 30,000ns, so that would be 361 times slower than RAM, not 50.

Modern SSDs (e.g. the Intel X25M) have, e.g., a 225µs average write latency on a heavy random write workload (http://www.anandtech.com/show/2944/3). And that drive also had a 282ms spike in testing, presumably as it runs whatever GC it runs. Even ignoring the spike, that's 2710 times slower than RAM.
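
For anyone who wants to check the ratios, here is the arithmetic as a
short Python sketch. It uses the 83ns RAM figure from the picture as the
baseline and the latencies quoted in this thread; the function name is
invented.

```python
RAM_NS = 83                       # baseline RAM access from the picture


def times_slower(latency_ns, baseline_ns=RAM_NS):
    """How many times slower a latency is than the RAM baseline."""
    return latency_ns / baseline_ns


if __name__ == "__main__":
    print(int(times_slower(30_000)))     # 30µs advertised read  -> 361
    print(int(times_slower(45_000)))     # 45µs advertised read  -> 542
    print(int(times_slower(225_000)))    # 225µs average write   -> 2710
```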

F.

Pierre Chapuis

Mar 8, 2013, 4:01:14 AM

to redi...@googlegroups.com

Le jeudi 7 mars 2013 18:53:39 UTC+1, Felix a écrit :

> that would be 361 times slower than RAM, not 50.

Oops, you're right, I meant 500.
