Curiosity killed the `stats cachedump`


dormando

Jul 31, 2011, 3:03:03 PM
to memc...@googlegroups.com
Yo,

We've threatened to kill the `stats cachedump` command for probably five
years. I've daydreamed about randomizing the command name on every minor
release, every git push, ensuring that it stays around as a last ditch
debugging tool.

A lot of you continue to build programs which rely on stats cachedump.
This both confuses and enrages us. Removing it outright would feel like a
failure, though: your malevolent overlords deciding that this thing you
want and occasionally use should simply be taken away.

So instead I'd like to start a discussion which I'll seed with some
ideas; we want to shitcan this feature, but it should be a fair trade. If
we shitcan it, we first need to make you not want it anymore.

Here are some ideas I have for making you not want this feature anymore:

- Better documentation.

95% of the time when users want to use cachedump, they want to verify that
their application is working right. There are better ways to do this, but
they're clearly too hard to figure out.

- Better toolage.

That 95% of users overlaps with users who want better insight into what's
going on inside memcached. Our usual response is "restart in screen with
-vvv or point to a logfile or blah blah blah". This is unacceptable.
mk-query-digest helps, and I will hopefully be releasing a tool that does
the same for the binary protocol. This should allow you to watch or
summarize the flow of data, which is much more useful anyway.

- Streaming commands.

Instead of (or as well as) running tcpdump tools, we could add commands
(or simply use TAP? I'm not sure if it overlaps fully for this) which let
you either telnet in and start streaming some subset of information, or
run tools which act like varnishlog: tools that can show the command,
the return value, and also the hidden headers.

An off-the-cuff example:

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
watch every=5000,request=full,response=headers

The above would stream back one out of every 5000 requests, with the full
request, and the headers of the response, but not the full binary data.
I'm not promising to implement this as-is, but I could see it helping to
solve the issue.
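A sketch of how the server side of such a `watch every=N` knob might work: a per-connection counter that emits one request out of every N. The `watch_state` struct and function names here are hypothetical, not anything that exists in memcached.

```c
#include <stdbool.h>

/* Hypothetical per-connection state for a "watch every=N" subscriber. */
struct watch_state {
    unsigned every;  /* emit one request out of every N */
    unsigned seen;   /* requests observed since the last emission */
};

/* Called once per request on a watched connection; returns true when
 * this request should be streamed back to the watcher. */
static bool watch_should_emit(struct watch_state *w) {
    if (++w->seen >= w->every) {
        w->seen = 0;
        return true;
    }
    return false;
}
```

A real implementation would also need to cap the per-watcher output buffer so a slow telnet client can't back-pressure the worker threads.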

Astute readers will notice that this is my biased push on the TOPKEYS
feature; 1.6 already has a way to discover the most accessed keys, but I
feel strongly that its approach is too limited.

- Commands to poll the Head or Tail of the LRU

Probably the most controversial. It is much more efficient to pretend that
the head or the tail are nebulous, nefarious, malicious things. As
instances grow into the tens of millions of items, polling at the head or
the tail doesn't give you a consistent view of very much. I imagine this
would be immediately abused by people implementing queues (or perhaps
that's a good thing?).

It also weighs heavily on my mind that we reserve the right to make the
LRU looser or stricter as we evolve. It may not exist at all at some
point.

- Commands to stream the keys of evictions, or also reclaims or expired
items

People want cachedump so they can see what's still in there. This would be
an extension of (or replacement for) the previous streaming commands. You
would register for events with a set of flags, and when items expire or
are evicted or whatever else you decided to watch, it would copy a result
to the stream.

It is much, much more efficient to read the statistical counters to get
this information. But when people want to see what's in there, often
they're really wondering about what's no longer in there.
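The flag-based registration described above could be as simple as a bitmask per watcher. All names below are hypothetical, sketched only to illustrate the idea:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event classes a watcher could subscribe to. */
enum watch_event {
    WATCH_EVICTED   = 1 << 0,  /* pushed out by the LRU */
    WATCH_EXPIRED   = 1 << 1,  /* TTL ran out */
    WATCH_RECLAIMED = 1 << 2,  /* expired item's memory reused */
};

/* A watcher registers a bitmask of the events it cares about. */
struct watcher {
    uint32_t flags;
};

/* On each item event, copy the key to the watcher's stream only if it
 * subscribed to that event class. */
static bool watcher_wants(const struct watcher *w, enum watch_event ev) {
    return (w->flags & ev) != 0;
}
```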

---

I'm not really sold on any of these, and they're not the only ideas we
should consider; if you have better ones, share them. Please help
distribute this ML post as widely as possible so we have a better chance
of an intelligent discussion about it.

Thanks,
-Dormando

Dustin

Aug 1, 2011, 1:06:26 AM
to memcached

On Jul 31, 12:03 pm, dormando <dorma...@rydia.net> wrote:
> Yo,
>
> We've threatened to kill the `stats cachedump` command for probably five
> years. I've daydreamed about randomizing the command name on every minor
> release, every git push, ensuring that it stays around as a last ditch
> debugging tool.

I remain a huge supporter of killing this. Thanks for starting this
discussion.

> - Streaming commands.
>
> Instead of (or as well as) running tcpdump tools, we could add commands
> (or simply use TAP? I'm not sure if it overlaps fully for this) which lets
> you either telnet in and start streaming some subset of information, or
> run tools which act like varnishlog. Tools that can show the command,
> the return value, and also the hidden headers.

I owe all of you better tap documentation (the last couple of weeks
have really killed me). It does some pretty great stuff in this area
and has many practical uses.

> - Commands to stream the keys of evictions, or also reclaims or expired
> items
>
> People want cachedump so they can see what's still in there. This would be
> an extension (or instead of) the previous streaming commands. You would
> register for events with a set of flags, and when items expire or are
> evicted or whatever you decided to watch, it would copy a result to the
> stream.

This fits into the current tap protocol as well. It's basically a
protocol for anything where you want the server to come tell you stuff
instead of you going to ask it.

dormando

Aug 1, 2011, 1:12:32 AM
to memcached
> I owe all of you better tap documentation (the last couple of weeks
> have really killed me). It does some pretty great stuff in this area
> and has many practical uses.

Now would be a great time to sell us on it, then :)

Peter Portante

Aug 7, 2011, 8:49:23 PM
to memc...@googlegroups.com
How 'bout random sample request profiling?

The Alpha processor used to do this (still does, on EV6 or later) with a method called ProfileMe:

Alpha 21264A processors (and later) use a different method called "instruction sampling." PC sampling on out-of-order
execution engines like the Alpha 21264 smears and skews sample data and profile information cannot be precisely attributed
to specific instructions. Instruction sampling solves this problem by periodically selecting a specific instruction and
collecting data about it as it flows through the processor pipeline. The program counter is known precisely as well as the
execution history of the instruction. The problems of smear and skew are eliminated. Like PC sampling, the sampling period
is randomized to get a statistically meaningful estimate of program behavior.
[From http://h21007.www2.hp.com/portal/download/files/unprot/tru64/metrics.pdf, Section 1.1, third paragraph]

One could randomly sample requests in a similar manner, each one "profiled" to document all the choices made leading to the result of the request. A listener could then capture those samples, collect a batch of them, and sift through the data to find out what is happening. It means either adding branches on the fast path, or compiling two sets of routines, one that collects and one that doesn't, to avoid any hit on the fast path. Once that is done, clients can be built to analyze the performance offline.
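The fast-path coin-toss could look roughly like the sketch below. The sampling gap is randomized (as ProfileMe randomizes its sampling period) so the profiler doesn't lock onto periodic traffic patterns; names and structure are illustrative only, not from memcached:

```c
#include <stdlib.h>

/* Requests remaining before the next sample is taken. */
static unsigned skip_remaining;

/* Fast-path check, one decrement-and-compare per request.  When the
 * countdown hits zero, sample this request and draw a new random gap
 * whose mean is `rate`, giving ~1-in-rate sampling on average. */
static int should_sample(unsigned rate) {
    if (skip_remaining-- == 0) {
        skip_remaining = (unsigned)(rand() % (2 * rate)); /* mean ~= rate */
        return 1;
    }
    return 0;
}
```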

Or not.

-peter

Andrew O'Brien

Aug 7, 2011, 9:01:40 PM
to memc...@googlegroups.com
> From: memc...@googlegroups.com [mailto:memc...@googlegroups.com] On
> Behalf Of Peter Portante
> Sent: Monday, 8 August 2011 10:49 AM

>
> How 'bout random sample request profiling?

Profiling for monitoring and activity estimation purposes - isn't that the point of the sFlow set of patches mentioned a few times on list?

Cheers,

Andrew

dormando

Aug 7, 2011, 9:11:08 PM
to memc...@googlegroups.com

The sFlow patches bother me as I'd prefer to be able to generate sFlow
events from a proper internal system, as opposed to the inverse. You
shouldn't have to be an sFlow consumer, and it's much more difficult to
vary the type of data you'd be ingesting (full headers, vs partial, vs
item bodies, etc).

The internal statistical sampling would be the start, then come methods of
shipping it. You could send to listeners connected over a socket, or have
a plugin listen as an internal consumer to the samplings. The internal
consumer could provide "builtin" statistical summaries the same as an
external daemon could. Which could make everyone happy in this case.

I like the sFlow stuff, I'm just at a loss for why it's so important that
everything be generated on top of sFlow. So far nobody's addressed my
specific arguments as listed above.

-Dormando

Dustin

Aug 7, 2011, 11:33:51 PM
to memcached

On Aug 7, 6:01 pm, Andrew O'Brien <andr...@oriel.com.au> wrote:

> Profiling for monitoring and activity estimation purposes - isn't that the point of the sFlow set of patches mentioned a few times on list?

My opinion on the sFlow patches has been that they shouldn't require a
change to memcached itself.

What can we do to memcached to make it easy for someone to deploy
this at runtime on an existing system? (well, assuming the system
doesn't have dtrace, anyway)

The last conversation I had involved writing a shim engine to inject
the sFlow logic. I don't know that I've heard any reason that
wouldn't work. If there's something better we could do, let's do that
thing.

Neil Mckee

Aug 8, 2011, 2:34:52 PM
to memc...@googlegroups.com
I think it's clearer if we separate the requirements:

(A) continuous monitoring of the whole cluster, in production.
(B) troubleshooting a specific node, key, operation or client -- without impacting (A)

For (A) you want the most robust, scale-out measurement you can find that will not impact performance but still provide as much insight as possible. Packing random samples into XDR-encoded UDP datagrams (i.e. sFlow) is a good fit for this. Implemented carefully the overhead is roughly equivalent to adding one counter to the stats block. It's worth agreeing on a standard format because the cluster-wide analysis is necessarily "external" to memcached. There is no single node that can tell you the cluster-wide top-keys. Selecting sFlow as the format makes sense because the cluster also comprises networking, servers, hypervisors, web-servers etc. that can all export sFlow too (each providing their own perspective). This way the continuous monitoring can all be done by listening for standard UDP datagrams on a single UDP socket on a separate server (or servers).
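For reference, the XDR encoding sFlow uses is quite simple: every field is a 4-byte big-endian word, and opaque byte strings are length-prefixed and zero-padded to a 4-byte boundary. A minimal sketch of those two primitives (not the actual sFlow record layout, which carries more structure: datagram headers, sample records, and so on):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value in XDR (big-endian) form; returns bytes written. */
static size_t xdr_put_u32(uint8_t *buf, uint32_t v) {
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
    return 4;
}

/* Write a length-prefixed opaque string, padded to a 4-byte boundary. */
static size_t xdr_put_string(uint8_t *buf, const char *s, uint32_t len) {
    size_t n = xdr_put_u32(buf, len);
    memcpy(buf + n, s, len);
    n += len;
    while (n % 4)
        buf[n++] = 0;  /* zero padding */
    return n;
}
```

Assembling a sample is then just a sequence of these calls into one buffer, followed by a single sendto() on the UDP socket.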

For (B), I think the "tap" toolkit looks promising. It complements sFlow well. A stream-oriented protocol with options and filters to allow you to nose into all the dark corners. Perfect.

In other words. These two solutions are different in almost every way, and I think you want both.

There's nothing stopping you from consuming random samples "internally" (locally on one node) as well. I just can't think of an every-day reason to do that. Any analysis where you want to filter/aggregate/sort/chop/threshold on the populations of keys can be done at the sFlow collector server -- either for a particular node or for the whole cluster. You can change the query without making any change at all to the nodes in the cluster. Doing the same thing as a "builtin" just seems redundant, risks hurting performance (yes TOPKEYS, I mean you!), and results in a config change on the node every time you change the query. I suspect that many memcached users would break out in a cold sweat just changing the sFlow sampling rate from 1-in-1000 to 1-in-1001, so anything more invasive than that is likely to be a tough sell.

If your analysis occasionally calls for a local measurement that cannot be driven from the sFlow samples (e.g. because it needs to see the content of the value object, or has to see every transaction that matches a filter) then you can use "tap". I think you would still want as much as possible to be done "externally" in another process, though. The simpler and more robust the "tap" protocol is, the more likely it will be trusted and used.

Neil

P.S. Minor point: the current sFlow patch includes generic code that allows for multiple "agents", "samplers", "pollers" and "receivers" all with different sampling rates and polling intervals. In practice we only have one of each here, so the patch could be stripped down to a few hundred lines if we cut out all the layers of indirection and just assembled the UDP datagram inline. I'd be happy to do that if you think it would help.

Neil Mckee

Aug 8, 2011, 2:47:46 PM
to memc...@googlegroups.com
I looked pretty hard at the shim idea back in May, but the engine protocol is really a different protocol. There was not a 1:1 correspondence with the standard memcached operations. If we define a standard sFlow-MEMCACHE measurement then it should be something that any memcache daemon can export, so it's really tied to the external spec, and shouldn't reflect anything about the internal structure of the implementation.

Instead I went back and tried to minimize the number of source code lines that would have to be added to memcached.c and thread.c. It's just a handful now. Pretty minimal.

Neil

Dustin

Aug 8, 2011, 8:52:28 PM
to memcached

On Aug 8, 11:47 am, Neil Mckee <neil.mckee...@gmail.com> wrote:
> I looked pretty hard at the shim idea back in May,  but the engine protocol is really a different protocol.  There was not a 1:1 correspondence with the standard memcached operations.

Well, all the memcached operations are built on top of it... do you
mean specifically multiget might call into the engine multiple times
for a single "request"?

> If we define a standard sFlow-MEMCACHE measurement then it should be something that any memcache daemon can export,  so it's really tied to the external spec,  and shouldn't reflect anything about the internal structure of the implementation.
>
> Instead I went back and tried to minimize the number of source code lines that would have to be added to memcached.c and thread.c.   It's just a handful now.  Pretty minimal.

I can understand what you're getting at here, but if source
modification is required to integrate with your product, that sounds
like a failing from the memcached perspective. It seems like we can
reach a compromise such that your products can do all the cool stuff
they do with rich support from memcached, but without memcached having
to specifically support your products (unless I'm misunderstanding).

neilmckee

Aug 9, 2011, 1:00:49 AM
to memcached


On Aug 8, 5:52 pm, Dustin <dsalli...@gmail.com> wrote:
> On Aug 8, 11:47 am, Neil Mckee <neil.mckee...@gmail.com> wrote:
>
> > I looked pretty hard at the shim idea back in May,  but the engine protocol is really a different protocol.  There was not a 1:1 correspondence with the standard memcached operations.
>
>   Well, all the memcached operations are built on top of it... do you
> mean specifically multiget might call into the engine multiple times
> for a single "request"?

Yes. That's one example. I think there were others where the
memcache operation resulted in more than one engine transaction.

>
> > If we define a standard sFlow-MEMCACHE measurement then it should be something that any memcache daemon can export,  so it's really tied to the external spec,  and shouldn't reflect anything about the internal structure of the implementation.
>
> > Instead I went back and tried to minimize the number of source code lines that would have to be added to memcached.c and thread.c.   It's just a handful now.  Pretty minimal.
>
>   I can understand what you're getting at here, but if source
> modification is required to integrate with your product, that sounds
> like a failing from the memcached perspective.  It seems like we can
> reach a compromise such that your products can do all the cool stuff
> they do with rich support from memcached, but without memcached having
> to specifically support your products (unless I'm misunderstanding).

Although there are already 30+ companies and open-source projects with
sFlow collectors I fully expect most memcached users will write their
own collection-and-analysis tools once they can get this data! Don't
you agree? So it's not about any one collector, it's about
defining a useful, scalable measurement that everyone can feel
comfortable using, even in production, even on the largest clusters.

On a positive note, it does seem like there is some consensus on the
value of random-transaction-sampling here. But do we have agreement
that this feed should be made available for external consumption (i.e.
the whole cluster sends to one place that is not itself a memcached
node), and that UDP should be used as the transport? I'd like to
understand if we are on the same page when it comes to these broader
architectural questions.

Neil

Dustin

Aug 9, 2011, 4:50:08 PM
to memcached

On Aug 8, 10:00 pm, neilmckee <neil.mckee...@gmail.com> wrote:

> >   Well, all the memcached operations are built on top of it... do you
> > mean specifically multiget might call into the engine multiple times
> > for a single "request"?
>
> Yes.  That's one example.  I think there were others where the
> memcache operation resulted in more than one engine transaction.

Allocation is a separate engine request from linking. You can just
do whatever is sensible there, though. The binary protocol doesn't
necessarily have packet responses for many engine requests, but packet
requests map pretty well to engine requests. The text protocol has a
special-case "multiget" which behaves differently.

> Although there are already 30+ companies and open-source projects with
> sFlow collectors I fully expect most memcached users will write their
> own collection-and-analysis tools once they can get this data!   Don't
> you agree?   So it's not about any one collector,   it's about
> defining a useful, scalable measurement that everyone can feel
> comfortable using,  even in production,  even on the largest clusters.

I don't think I've ever said anything that sounds like a
disagreement with you. I just disagree that it's impossible to build
memcached such that sFlow collection is an externally produced
plugin. I could be wrong, but I don't understand why we can't do it
with the engine interface or why we can't design another interface
that would be useful.

> On a positive note,  it does seem like there is some consensus on the
> value of random-transaction-sampling here.   But do we have agreement
> that this feed should be made available for external consumption (i.e.
> the whole cluster sends to one place that is not itself a memcached
> node),  and  that UDP should be used as the transport?   I'd like to
> understand if we are on the same page when it comes to these broader
> architectural questions.

I think I do agree with that. The question is whether we do that by
making an sFlow interface or a sample interface?

(And why can't everyone just use dtrace?)

Neil Mckee

Aug 9, 2011, 7:25:24 PM
to memc...@googlegroups.com

On Aug 9, 2011, at 1:50 PM, Dustin wrote:

>
> On Aug 8, 10:00 pm, neilmckee <neil.mckee...@gmail.com> wrote:
>
>>> Well, all the memcached operations are built on top of it... do you
>>> mean specifically multiget might call into the engine multiple times
>>> for a single "request"?
>>
>> Yes. That's one example. I think there were others where the
>> memcache operation resulted in more than one engine transaction.
>
> Allocation is a separate engine request from linking. You can just
> do whatever is sensible there, though. The binary protocol doesn't
> necessarily have packet responses for many engine requests, but packet
> requests map pretty well to engine requests. The text protocol has a
> special-case "multiget" which behaves differently.
>
>> Although there are already 30+ companies and open-source projects with
>> sFlow collectors I fully expect most memcached users will write their
>> own collection-and-analysis tools once they can get this data! Don't
>> you agree? So it's not about any one collector, it's about
>> defining a useful, scalable measurement that everyone can feel
>> comfortable using, even in production, even on the largest clusters.
>
> I don't think I've ever said anything that sounds like a
> disagreement with you. I just disagree that it's impossible to build
> memcached such that sFlow collection is an externally produced
> plugin. I could be wrong, but I don't understand why we can't do it
> with the engine interface or why we can't design another interface
> that would be useful.

Well, I think we really just need a hook that announces the completion of a memcache-protocol operation, with args being:

(1) the connection object (so we can read out transport, socket details, protocol...)
(2) the operation (GET, SET, INCR etc.)
(3) the key and key-length
(4) the number of keys (usually 1, but >1 if this was part of a multi-get)
(5) the value bytes
(6) the status (STORED, NOT_FOUND, etc.)
(7) perhaps something about the expiration deadline of the key(?)

If timing data is ever interesting (which it can be in the new architecture) then we would want a start-of-memcache-operation hook too.

It might be helpful if a plugin could attach variables to the connection object, and also to the thread object. I don't know if that is strictly necessary. I'm just looking at how it works today and how it avoids using locking/atomic_ops by adding fields to each of those two structures.

There is also a special lockless fn to roll together the sample_pool counter in thread.c. I guess that means a threads_do(cb) iteration hook might be necessary too.

We would still need access to the counters. sFlow pushes them out every n seconds (typically n=20, but it's configurable). That also means it would be good to register for a 1-second tick callback, to avoid having to run a separate thread just for that.

So if the extra function calls don't hurt performance too much it might be a good way to do it in the future. However, I like your next suggestion better....
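To make the proposed argument list concrete, a hook interface along these lines might work. None of these names exist in memcached; this is only a sketch of the list above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record passed to a completion hook, covering the seven
 * arguments proposed: connection, operation, key, key count, value
 * size, status, and expiration. */
struct mc_op_sample {
    const void *conn;        /* connection: transport, socket, protocol */
    int         op;          /* GET, SET, INCR, ... */
    const char *key;
    size_t      nkey;
    unsigned    nkeys;       /* >1 when part of a multi-get */
    size_t      value_bytes;
    int         status;      /* STORED, NOT_FOUND, ... */
    int64_t     expiry;      /* expiration deadline, if known */
};

typedef void (*mc_op_complete_fn)(const struct mc_op_sample *s);

static mc_op_complete_fn op_hook;

static void mc_register_op_hook(mc_op_complete_fn fn) { op_hook = fn; }

/* The server would call this at the end of every protocol operation. */
static void mc_op_complete(const struct mc_op_sample *s) {
    if (op_hook)
        op_hook(s);
}

/* Example consumer: an sFlow-style plugin tallying sampled value bytes. */
static size_t observed_bytes;
static void count_bytes(const struct mc_op_sample *s) {
    observed_bytes += s->value_bytes;
}
```

A start-of-operation hook for timing would be the same shape, minus the result fields.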

>
>> On a positive note, it does seem like there is some consensus on the
>> value of random-transaction-sampling here. But do we have agreement
>> that this feed should be made available for external consumption (i.e.
>> the whole cluster sends to one place that is not itself a memcached
>> node), and that UDP should be used as the transport? I'd like to
>> understand if we are on the same page when it comes to these broader
>> architectural questions.
>
> I think I do agree with that. The question is whether we do that by
> making an sFlow interface or a sample interface?

Do you mean a hook that can be used by a plugin to receive randomly sampled transactions? That would allow you to inline the random-sampling and eliminate most of the overhead. An sFlow plugin would then just have to register for the feed; possibly sub-sample if the internal 1-in-N rate was more aggressive than the requested sFlow sampling-rate; marshall the samples into UDP datagrams, and send them to the configured destinations. I like this solution because it means the performance-critical part would be baked in by the experts and fully tested with every new release.

But if you've already done the hard work, and everyone is going to want the UDP feed, then why not offer that too? I probably made it look hard with my bad coding, but all you have to do is XDR-encode it and call sendto().

>
> (And why can't everyone just use dtrace?)

I looked at this too, but the dtrace macros are not called with all the fields we need, and I assumed they were immutable. (Also the SFLOW_SAMPLE() macro injects rather fewer lines of code when enabled).

Finally, I accept that the engine-pu branch is the focus of future development, but... any thoughts on what to do for the 1.4.* versions?

Neil

dormando

Aug 18, 2011, 8:19:52 PM
to memc...@googlegroups.com
> >
> >> On a positive note, it does seem like there is some consensus on the
> >> value of random-transaction-sampling here. But do we have agreement
> >> that this feed should be made available for external consumption (i.e.
> >> the whole cluster sends to one place that is not itself a memcached
> >> node), and that UDP should be used as the transport? I'd like to
> >> understand if we are on the same page when it comes to these broader
> >> architectural questions.
> >
> > I think I do agree with that. The question is whether we do that by
> > making an sFlow interface or a sample interface?
>
> Do you mean a hook that can be used by a plugin to receive randomly
> sampled transactions? That would allow you to inline the
> random-sampling and eliminate most of the overhead. An sFlow plugin
> would then just have to register for the feed; possibly sub-sample if
> the internal 1-in-N rate was more aggressive than the requested sFlow
> sampling-rate; marshall the samples into UDP datagrams, and send them
> to the configured destinations. I like this solution because it means
> the performance-critical part would be baked in by the experts and fully
> tested with every new release.
>
> But if you've already done the hard work, and everyone is going to want
> the UDP feed, then why not offer that too? I probably made it look hard
> with my bad coding, but all you have to do is XDR-encode it and call
> sendto().

We can ship plugins with the core codebase, so sFlow would still work "out
of the box"; it just wouldn't be what the system was based on.

On that note, how critical is it for sflow packets to contain timing data?
Benchmarking will show for sure, but history tells me that this should be
optional.

What would be pretty awesome is sflow-ish from libmemcached, since the
only place it *really* matters how long something took is from the
perspective of a client. Profiling the server is only going to tell me if
the box is swapping, as it's extremely uncommon to nail the locks.

> Finally, I accept that the engine-pu branch is the focus of future
> development, but... any thoughts on what to do for the 1.4.* versions?

I'm kicking out one release of 1.4.* monthly until 1.6 supersedes it. That
said I have a backlog of bugs and higher priority changes that will likely
keep me busy for a few months. Unless of course someone sponsors me to
spend more time on it :)

-Dormando

dormando

Aug 18, 2011, 8:25:39 PM
to memcached
>
> Although there are already 30+ companies and open-source projects with
> sFlow collectors I fully expect most memcached users will write their
> own collection-and-analysis tools once they can get this data! Don't
> you agree? So it's not about any one collector, it's about
> defining a useful, scalable measurement that everyone can feel
> comfortable using, even in production, even on the largest clusters.
>
> On a positive note, it does seem like there is some consensus on the
> value of random-transaction-sampling here. But do we have agreement
> that this feed should be made available for external consumption (i.e.
> the whole cluster sends to one place that is not itself a memcached
> node), and that UDP should be used as the transport? I'd like to
> understand if we are on the same page when it comes to these broader
> architectural questions.

Don't forget the original thread as well. I'm trying to solve two issues:

1) Sampling useful data out of a cluster.

2) Providing something useful for application developers

The second case is an OS X user who fires up memcached locally, writes
some rails code, then wonders what's going on under the hood. 1-in-1000
sampling there is counterproductive. Headers only is often useless.

stats cachedump is most often used for the latter, and everyone needs to
remember that users never get to 1) if they can't figure out 2). Maybe I
should flip those priorities around?

-Dormando

Neil Mckee

Aug 19, 2011, 12:46:32 AM
to memc...@googlegroups.com

Not critical at all. The duration_uS field can be set to -1 in the XDR output to indicate that it is not implemented. I added this measurement when porting to the 1.6 branch, where it makes more sense. I left it in when I updated the 1.4 branch because, well, the overhead seemed negligible and the numbers still seemed like they might be revealing something (though I wasn't sure what exactly).

The start-time field is currently used as the "we're going to sample this one" flag. However that could easily be changed to just set a bit instead. Two system calls per sample would be saved.

The practice of marking a transaction to be sampled at the beginning and then actually taking the sample at the end when the status is known could also be replaced by the old scheme from last year where we do both steps at the same time. However it was actually easier to implement with the two-step approach because of the way that there are only two or three ways that a transaction can start and a whole myriad of ways that it can end. So the first step (the coin-tossing) only has to happen in those two or three places and it's easier to know that you have counted everything once. Breaking it up like this also gives you the choice of accumulating details incrementally (the key, the status-code etc.) in whatever is the easiest place.
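The mark-then-sample scheme with a flag bit in place of a start time might look roughly like this (all names hypothetical; a sketch of the idea, not the actual patch):

```c
#include <stdbool.h>
#include <stdint.h>

#define CONN_SAMPLING 0x1  /* hypothetical flag bit on the connection */

struct sample_conn {
    uint32_t flags;
};

/* Step 1: the coin toss, done at the handful of places a transaction
 * can start.  A set bit replaces the start-time field, so no clock
 * system calls are needed. */
static void txn_start(struct sample_conn *c, bool coin_toss_hit) {
    if (coin_toss_hit)
        c->flags |= CONN_SAMPLING;
}

/* Step 2: called from every exit path once the status is known;
 * returns true if this transaction's details should be emitted,
 * and clears the mark for the next transaction. */
static bool txn_end(struct sample_conn *c) {
    if (c->flags & CONN_SAMPLING) {
        c->flags &= ~CONN_SAMPLING;
        return true;
    }
    return false;
}
```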

>
> What would be pretty awesome is sflow-ish from libmemcached, since the
> only place it *really* matters how long something took is from the
> perspective of a client. Profiling the server is only going to tell me if
> the box is swapping, as it's extremely uncommon to nail the locks.

Yes, a client might well offer sFlow-MEMCACHE transaction samples (as well as enclosing sFlow-HTTP transaction samples, if applicable). However you would probably still want to instrument at the server end to ensure that you were getting the full picture. There might be a whole menagerie of different C, Python, Perl and Java clients in use.

>
>> Finally, I accept that the engine-pu branch is the focus of future
>> development, but... any thoughts on what to do for the 1.4.* versions?
>
> I'm kicking out one release of 1.4.* monthly until 1.6 supersedes it. That
> said I have a backlog of bugs and higher priority changes that will likely
> keep me busy for a few months. Unless of course someone sponsors me to
> spend more time on it :)

In the meantime I could strip down the current patch and reduce its code footprint considerably - but would that help?

Neil


>
> -Dormando

dormando

Aug 19, 2011, 1:13:52 AM
to memc...@googlegroups.com
> Not critical at all. The duration_uS field can be set to -1 in the XDR
> output to indicate that it is not implemented. I added this measurement
> when porting to the 1.6 branch, where it makes more sense. I left it in
> when I updated the 1.4 branch because, well, the overhead seemed
> negligible and the numbers still seemed like they might be revealing
> something (though I wasn't sure what exactly). The start-time field is
> currently used as the "we're going to sample this one" flag. However
> that could easily be changed to just set a bit instead. Two system
> calls per sample would be saved. The practice of marking a transaction
> to be sampled at the beginning and then actually taking the sample at
> the end when the status is known could also be replaced by the old
> scheme from last year where we do both steps at the same time. However
> it was actually easier to implement with the two-step approach because
> of the way that there are only two or three ways that a transaction can
> start and a whole myriad of ways that it can end. So the first step
> (the coin-tossing) only has to happen in those two or three places and
> it's easier to know that you have counted everything once. Breaking it
> up like this also gives you the choice of accumulating details
> incrementally (the key, the status-code etc.) in whatever is the easiest
> place.

Not totally sure I follow. The system calls would be nice to avoid, since
we can't guarantee the system will use a vsyscall for the clock...
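For illustration, the bit-flag version of that "we're going to sample this one" marker might look something like this. The struct, field, and function names here are hypothetical stand-ins, not memcached's actual conn layout; the point is just that the two-step mark-then-sample scheme needs one bit, not a timestamp (and therefore no clock syscalls):

```c
#include <stdbool.h>
#include <stdint.h>

#define CONN_FLAG_SAMPLED (1u << 0)

/* Hypothetical connection state; real memcached keeps far more here. */
struct conn {
    uint32_t flags;
};

/* Step 1: the coin toss succeeded at one of the few places a
 * transaction can start, so mark the connection. */
static inline void mark_for_sampling(struct conn *c) {
    c->flags |= CONN_FLAG_SAMPLED;
}

/* Step 2: at whichever of the many exit paths the transaction ends,
 * emit the sample if the mark is set, and clear it for the next one. */
static inline bool take_sample_if_marked(struct conn *c) {
    if (c->flags & CONN_FLAG_SAMPLED) {
        c->flags &= ~CONN_FLAG_SAMPLED;
        return true;
    }
    return false;
}
```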

> Yes, a client might well offer sFlow-MEMCACHE transaction samples (as
> well as enclosing sFlow-HTTP transaction samples, if applicable).
> However you would probably still want to instrument at the server end to
> ensure that you were getting the full picture. There might be a whole
> menagerie of different C, Python, Perl and Java clients in use.

Many/most are based off of libmemcached. You can catch quite a bit with
that one.

> > I'm kicking out one release of 1.4.* monthly until 1.6 supersedes it. That
> > said I have a backlog of bugs and higher priority changes that will likely
> > keep me busy for a few months. Unless of course someone sponsors me to
> > spend more time on it :)
>
> In the meantime I could strip down the current patch and reduce its
> code footprint considerably, but would that help?

It could help to serve as a reference, but I'm not sure we can merge it.
Every time we merge something we make some sort of idiotic pinkie swear to
the planet to never touch that feature ever again forever. ASCII noreply
haunts me to this day.

Depending on how long 1.4 goes on though, it might end up having its own
internal sampler, and thus sflow could be slapped on top of it.

thanks,
-Dormando

Neil Mckee

unread,
Aug 19, 2011, 1:33:06 AM8/19/11
to memc...@googlegroups.com

I certainly agree that you want both of these features. However they are wildly different. (1) is for monitoring in production, and (2) is for testing and troubleshooting. The requirements are so divergent that there may not be any overlap at all in the implementation of each. In fact the more separate they are the better because there is a lot of pressure on (1) to be ultra-stable and never change, while you are likely to think of new ideas for (2) all the time.

So there's no need to hesitate if you can already do (1) today. Let's face it, you have been very successful and there are rather a lot of users who have already gotten past (2) :)

Neil


> -Dormando

dormando

unread,
Aug 19, 2011, 1:48:21 AM8/19/11
to memc...@googlegroups.com
> > 1) Sampling useful data out of a cluster.
> >
> > 2) Providing something useful for application developers
> >
> > The second case is an OS X user who fires up memcached locally, writes
> > some rails code, then wonders what's going on under the hood. 1-in-1000
> > sampling there is counterproductive. Headers only is often useless.
> >
> > stats cachedump is most often used for the latter, and everyone needs to
> > remember that users never get to 1) if they can't figure out 2). Maybe I
> > should flip those priorities around?
> >
>
> I certainly agree that you want both of these features. However they
> are wildly different. (1) is for monitoring in production, and (2) is
> for testing and troubleshooting. The requirements are so divergent that
> there may not be any overlap at all in the implementation of each. In
> fact the more separate they are the better because there is a lot of
> pressure on (1) to be ultra-stable and never change, while you are
> likely to think of new ideas for (2) all the time.
>
> So there's no need to hesitate if you can already do (1) today. Let's
> face it, you have been very successful and there are rather a lot of
> users who have already gotten past (2) :)

Okay, I'm kinda tired of that argument. Just because you say something
isn't possible doesn't mean we can't make it work anyway. If you believe
they're divergent, stop saying that they're divergent and prove it with
examples. However, I'd rather spend my time writing features than
pretending to know if a theoretical patch will work or not.

We want to work towards a system that can encompass a replacement for
"stats cachedump". If we can design something which generates sflow as a
subset, that'll be totally amazing! We can even use your patches as
reference for creating a core shipped plugin.

If people want to use sflow today, they can apply your patches and use it.
Such is the way with open source.

-Dormando

Neil Mckee

unread,
Aug 19, 2011, 2:07:25 AM8/19/11
to memc...@googlegroups.com


I didn't say it wasn't possible.... but never mind all that. A core-shipped plugin would be great. Let me know if there's anything I can do to help.

Neil

dormando

unread,
Aug 19, 2011, 3:56:28 AM8/19/11
to memc...@googlegroups.com
> >> So there's no need to hesitate if you can already do (1) today. Let's
> >> face it, you have been very successful and there are rather a lot of
> >> users who have already gotten past (2) :)
> >
> > Okay, I'm kinda tired of that argument. Just because you say something
> > isn't possible, doesn't mean we can't make it work anyway. If you believe
> > they're divergent, stop saying that they're divergent and prove it with
> > examples. However I'd rather spend my time writing features than
> > pretending to know if a theoretical patch will work or not.
> >
> > We want to work towards a system that can encompass a replacement for
> > "stats cachedump". If we can design something which generates sflow as a
> > subset, that'll be totally amazing! We can even use your patches as
> > reference for creating a core shipped plugin.
> >
> > If people want to use sflow today, they can apply your patches and use it.
> > Such is the way with open source.
> >
> > -Dormando
>
>
> I didn't say it wasn't possible.... but never mind all that. A
> core-shipped plugin would be great. Let me know if there's anything I
> can do to help.

That was more strongly worded than I intended, I apologize; I don't agree
that it's worth rushing. Not "rushing" is why we haven't already settled
on TOPKEYS the way it is. I don't really intend to throw something else in
there immediately.

Neil Mckee

unread,
Aug 19, 2011, 3:55:18 PM8/19/11
to memc...@googlegroups.com

No worries. I apologize for my impatience. You are right. There is no rush.

But you did ask for more specific examples, so for what it's worth, here are some reasons why I think features for (1) in-production cluster-wide sampling and (2) testing and troubleshooting should be kept as separate as possible:

A. They will rarely be used at the same time on the same node.
B. If they are used concurrently (e.g. troubleshooting a production node), then using (2) should have no effect on (1).
C. The cluster-wide configuration used for (1) is likely to be very different from the interactive configuration for (2).
D. Getting a feed of randomly-sampled transactions is probably the only thing they will have in common. After that, (1) will simply send the sample over UDP, while (2) might apply regex-filtering, value-field analysis, various tests on the expiration times and slab allocation and finally stream results out on a TCP connection - probably using some ASCII format instead of XDR.
E. Even on the part that they do have in common (1) is likely to want only a handful of samples per second per node (e.g. 1-in-10000), while (2) is much more likely to want a more aggressive feed such as 1-in-10, or even 1-in-1. It seems likely that this difference will impact the implementation. For example, that time-duration measurement would be unthinkable at 1-in-1, but could be quite OK at 1-in-50000.
F. Even if (2) may be considered higher priority, I think it's easier to see how (1) can be completed and tied in a bow. I should stress here that I'm not expecting anyone to use my code! I just think you guys could knock (1) out pretty easily and reap immediate benefits, while (2) could take a while to crystallize.


Getting unnecessarily detailed, let's say you implemented the plugin sampling something like this:

possibly_sample_transaction(connection, protocol, operation, key, value, status) {
    r = next_random(connection->thread);
    for (i = 0; i < num_sampling_plugins; i++) {
        consumer = sampling_plugins[i];
        if (r <= consumer->probability_threshold) {
            (*consumer->sample_callback)(connection, protocol, operation, key, value, status);
        }
    }
}

Compare that with the number of instructions and branches involved here:

possibly_sample_transaction(connection, protocol, operation, key, value, status) {
    if (next_random(connection->thread) <= probability_threshold) {
        take_sample(connection, protocol, operation, key, value, status);
    }
}

Or, if you allow one sampling_probability to be treated specially and turned into a countdown-to-next-sample, then you can do it this way and save more:

possibly_sample_transaction(connection, protocol, operation, key, value, status) {
    if (unlikely(--connection->thread->countdown == 0)) {
        connection->thread->countdown = compute_next_countdown();
        take_sample(connection, protocol, operation, key, value, status);
    }
}

At this point you could easily turn it into a macro so that there is no extra function-call in the critical path, just a decrement-and-test on the thread->countdown.

I don't know if it matters so much to shave a few dozen cycles off the critical path, but my point was just to illustrate that even in the small area of overlap between (1) and (2) you might still be grateful someday if you kept them entirely separate.

Thoughts?

Neil
