We've threatened to kill the `stats cachedump` command for probably five
years. I've daydreamed about randomizing the command name on every minor
release, every git push, ensuring that it stays around as a last ditch
debugging tool.
A lot of you continue to build programs which rely on stats cachedump.
This both confuses and enrages us. Removing it outright sounds like a
failure, though: your malevolent overlords have decided that this thing
you want and occasionally use should be taken away.
So instead I'd like to start a discussion which I'll seed with some
ideas; we want to shitcan this feature, but it should be a fair trade. If
we shitcan it, we first need to make you not want it anymore.
Here are some ideas I have for making you not want this feature anymore:
- Better documentation.
95% of the time when users want to use cachedump, they want to verify that
their application is working right. There are better ways to do this, but
they're clearly too hard to figure out.
- Better toolage.
That 95% of users overlaps with users who want to know more about what's
going on inside memcached. Our usual response is "restart in screen with
-vvv or point to a logfile or blah blah blah". This is unacceptable.
mk-query-digest helps, and I will hopefully be releasing a tool to do the
same for the binary protocol. This should allow you to watch or summarize
the flow of data, which is much more useful anyway.
- Streaming commands.
Instead of (or as well as) running tcpdump tools, we could add commands
(or simply use TAP? I'm not sure if it overlaps fully for this) which let
you either telnet in and start streaming some subset of information, or
run tools which act like varnishlog. Tools that can show the command,
the return value, and also the hidden headers.
An off-the-cuff example:
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
watch every=5000,request=full,response=headers
The above would stream back one out of every 5000 requests, with the full
request, and the headers of the response, but not the full binary data.
I'm not promising to implement this as-is, but I could see it helping to
solve the issue.
Astute readers will notice that this is my biased push on the TOPKEYS
feature; 1.6 already has a way to discover the most accessed keys, but I
feel strongly that its approach is too limited.
- Commands to poll the Head or Tail of the LRU
Probably the most controversial. It is much more efficient to pretend that
the head or the tail are nebulous, nefarious, malicious things. As
instances grow into the tens of millions of items, polling at the head or
the tail doesn't give you a consistent view of very much. I imagine this
would be immediately abused by people implementing queues (or perhaps
that's a good thing?)
It also weighs heavily on my mind, as we reserve the right to make the LRU
more loose or more strict as we evolve. It may not exist at all at some
point.
- Commands to stream the keys of evicted, reclaimed, or expired items
People want cachedump so they can see what's still in there. This would be
an extension of (or a replacement for) the previous streaming commands. You would
register for events with a set of flags, and when items expire or are
evicted or whatever you decided to watch, it would copy a result to the
stream.
It is much, much more efficient to read out of the statistical counters to
get the information. But as people want to see what's in there, often
they're really wondering about what's no longer in there.
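A flag-based registration scheme like that could be sketched as follows. This is only an illustration of the idea, not a proposed memcached API; every name here (the WATCH_* flags, the watcher struct, notify_watchers) is hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical event classes a watcher could subscribe to. */
#define WATCH_EVICTED   (1 << 0)
#define WATCH_EXPIRED   (1 << 1)
#define WATCH_RECLAIMED (1 << 2)

struct watcher {
    unsigned int flags;     /* which events this stream asked for */
    char last_key[64];      /* last key copied to the stream */
    unsigned int delivered; /* number of events delivered so far */
};

/* Called from the item lifecycle when a key leaves the cache.
 * Copies the key to every watcher whose flags match the event. */
static void notify_watchers(struct watcher *watchers, int n,
                            unsigned int event, const char *key)
{
    for (int i = 0; i < n; i++) {
        if (watchers[i].flags & event) {
            snprintf(watchers[i].last_key, sizeof(watchers[i].last_key),
                     "%s", key);
            watchers[i].delivered++;
        }
    }
}
```

A stream that registered with only WATCH_EVICTED would then see eviction keys and nothing else, which matches the "see what's no longer in there" use case.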
---
I'm not really sold on any of these, and they're not the only ideas we
should consider; if you have better ones, bring them. Please help distribute this ML
post around as much as possible so we can have a better chance of having
an intelligent discussion about it.
Thanks,
-Dormando
Now would be a great time to sell us on it, then :)
Profiling for monitoring and activity estimation purposes - isn't that the point of the sFlow set of patches mentioned a few times on list?
Cheers,
Andrew
The sFlow patches bother me as I'd prefer to be able to generate sFlow
events from a proper internal system, as opposed to the inverse. You
shouldn't have to be an sFlow consumer, and it's much more difficult to
vary the type of data you'd be ingesting (full headers, vs partial, vs
item bodies, etc).
The internal statistical sampling would be the start, then come methods of
shipping it. You could send to listeners connected over a socket, or have
a plugin listen as an internal consumer to the samplings. The internal
consumer could provide "builtin" statistical summaries the same as an
external daemon could. Which could make everyone happy in this case.
I like the sFlow stuff, I'm just at a loss for why it's so important that
everything be generated on top of sFlow. So far nobody's addressed my
specific arguments as listed above.
-Dormando
(A) continuous monitoring of the whole cluster, in production.
(B) troubleshooting a specific node, key, operation or client -- without impacting (A)
For (A) you want the most robust, scale-out measurement you can find that will not impact performance but still provide as much insight as possible. Packing random samples into XDR-encoded UDP datagrams (i.e. sFlow) is a good fit for this. Implemented carefully the overhead is roughly equivalent to adding one counter to the stats block. It's worth agreeing on a standard format because the cluster-wide analysis is necessarily "external" to memcached. There is no single node that can tell you the cluster-wide top-keys. Selecting sFlow as the format makes sense because the cluster also comprises networking, servers, hypervisors, web-servers etc. that can all export sFlow too (each providing their own perspective). This way the continuous monitoring can all be done by listening for standard UDP datagrams on a single UDP socket on a separate server (or servers).
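To make the "roughly one extra counter" overhead claim concrete, the per-transaction cost of this style of export can be as small as a decrement-and-test; the datagram assembly only happens on the rare sampled path. The sketch below is a generic countdown sampler, not the real sFlow patch: the function names are made up, export_sample is a stand-in for XDR-encoding the record and calling sendto(), and a real implementation would randomize the countdown rather than reset it to a constant:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t sampling_rate = 1000; /* 1-in-1000 */
static uint32_t countdown = 1000;     /* transactions until next sample */
static uint32_t samples_taken;

/* Stand-in for marshalling the sample into an XDR-encoded UDP
 * datagram and sendto()ing it to the collector. */
static void export_sample(const char *op, const char *key)
{
    (void)op; (void)key;
    samples_taken++;
}

/* Called once per completed transaction. The unsampled path is
 * just the decrement and compare. */
static void account_transaction(const char *op, const char *key)
{
    if (--countdown == 0) {
        countdown = sampling_rate; /* deterministic here for clarity */
        export_sample(op, key);
    }
}
```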
For (B), I think the "tap" toolkit looks promising. It complements sFlow well. A stream-oriented protocol with options and filters to allow you to nose into all the dark corners. Perfect.
In other words. These two solutions are different in almost every way, and I think you want both.
There's nothing stopping you from consuming random samples "internally" (locally on one node) as well. I just can't think of an every-day reason to do that. Any analysis where you want to filter/aggregate/sort/chop/threshold on the populations of keys can be done at the sFlow collector server -- either for a particular node or for the whole cluster. You can change the query without making any change at all to the nodes in the cluster. Doing the same thing as a "builtin" just seems redundant, risks hurting performance (yes TOPKEYS, I mean you!), and results in a config change on the node every time you change the query. I suspect that many memcached users would break out in a cold sweat just changing the sFlow sampling rate from 1-in-1000 to 1-in-1001, so anything more invasive than that is likely to be a tough sell.
If your analysis occasionally calls for a local measurement that cannot be driven from the sFlow samples (e.g. because it needs to see the content of the value object, or has to see every transaction that matches a filter) then you can use "tap". I think you would still want as much as possible to be done "externally" in another process, though. The simpler and more robust the "tap" protocol is, the more likely it will be trusted and used.
Neil
P.S. Minor point: the current sFlow patch includes generic code that allows for multiple "agents", "samplers", "pollers" and "receivers" all with different sampling rates and polling intervals. In practice we only have one of each here, so the patch could be stripped down to a few hundred lines if we cut out all the layers of indirection and just assembled the UDP datagram inline. I'd be happy to do that if you think it would help.
Instead I went back and tried to minimize the number of source code lines that would have to be added to memcached.c and thread.c. It's just a handful now. Pretty minimal.
Neil
>
> On Aug 8, 10:00 pm, neilmckee <neil.mckee...@gmail.com> wrote:
>
>>> Well, all the memcached operations are built on top of it... do you
>>> mean specifically multiget might call into the engine multiple times
>>> for a single "request"?
>>
>> Yes. That's one example. I think there were others where the
>> memcache operation resulted in more than one engine transaction.
>
> Allocation is a separate engine request from linking. You can just
> do whatever is sensible there, though. The binary protocol doesn't
> necessarily have packet responses for many engine requests, but packet
> requests map pretty well to engine requests. The text protocol has a
> special-case "multiget" which behaves differently.
>
>> Although there are already 30+ companies and open-source projects with
>> sFlow collectors I fully expect most memcached users will write their
>> own collection-and-analysis tools once they can get this data! Don't
>> you agree? So it's not about any one collector, it's about
>> defining a useful, scalable measurement that everyone can feel
>> comfortable using, even in production, even on the largest clusters.
>
> I don't think I've ever said anything that sounds like a
> disagreement with you. I just disagree that it's impossible to build
> memcached such that sFlow collection is an externally produced
> plugin. I could be wrong, but I don't understand why we can't do it
> with the engine interface or why we can't design another interface
> that would be useful.
Well, I think we really just need a hook that announces the completion of a memcache-protocol operation, with args being:
(1) the connection object (so we can read out transport, socket details, protocol...)
(2) the operation (GET, SET, INCR etc.)
(3) the key and key-length
(4) the number of keys (usually 1, but >1 if this was part of a multi-get)
(5) the value bytes
(6) the status (STORED, NOT_FOUND, etc.)
(7) perhaps something about the expiration deadline of the key(?)
If timing data is ever interesting (which it can be in the new architecture) then we would want a start-of-memcache-operation hook too.
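Sketched as a C signature, the completion hook described above might look like the following. Nothing like this exists in memcached today; the struct, field, and function names are all hypothetical, mapping one-to-one onto the seven items listed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument bundle for an end-of-operation hook. */
struct mc_op_event {
    const void *conn;        /* (1) connection object */
    const char *op;          /* (2) "GET", "SET", "INCR", ... */
    const char *key;         /* (3) key ... */
    size_t      nkey;        /*     ... and key length */
    unsigned    nkeys;       /* (4) >1 if part of a multi-get */
    size_t      value_bytes; /* (5) size of the value */
    int         status;      /* (6) STORED, NOT_FOUND, ... */
    long        exptime;     /* (7) expiration deadline, if any */
};

typedef void (*mc_op_hook)(const struct mc_op_event *ev);

static mc_op_hook on_complete; /* a real server would keep a list */

static void register_op_hook(mc_op_hook h) { on_complete = h; }

/* Called by the core once the outcome of an operation is known. */
static void fire_op_complete(const struct mc_op_event *ev)
{
    if (on_complete)
        on_complete(ev);
}

/* Example consumer: just counts invocations. */
static unsigned hook_calls;
static void count_hook(const struct mc_op_event *ev)
{
    (void)ev;
    hook_calls++;
}
```

A start-of-operation hook for timing would be a second registration point with the same connection argument.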
It might be helpful if a plugin could attach variables to the connection object, and also to the thread object. I don't know if that is strictly necessary. I'm just looking at how it works today and how it avoids using locking/atomic_ops by adding fields to each of those two structures.
There is also a special lockless fn to roll together the sample_pool counter in thread.c. I guess that means a threads_do(cb) iteration hook might be necessary too.
We would still need access to the counters. sFlow pushes them out every n seconds (typically n=20, but it's configurable). That also means it would be good to register for a 1-second tick callback, to avoid having to run a separate thread just for that.
So if the extra function calls don't hurt performance too much, this might be a good way to do it in the future. However I like your next suggestion better....
>
>> On a positive note, it does seem like there is some consensus on the
>> value of random-transaction-sampling here. But do we have agreement
>> that this feed should be made available for external consumption (i.e.
>> the whole cluster sends to one place that is not itself a memcached
>> node), and that UDP should be used as the transport? I'd like to
>> understand if we are on the same page when it comes to these broader
>> architectural questions.
>
> I think I do agree with that. The question is whether we do that by
> making an sFlow interface or a sample interface?
Do you mean a hook that can be used by a plugin to receive randomly sampled transactions? That would allow you to inline the random-sampling and eliminate most of the overhead. An sFlow plugin would then just have to register for the feed; possibly sub-sample if the internal 1-in-N rate was more aggressive than the requested sFlow sampling-rate; marshall the samples into UDP datagrams, and send them to the configured destinations. I like this solution because it means the performance-critical part would be baked in by the experts and fully tested with every new release.
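The sub-sampling arithmetic is simple: if the core feed is 1-in-N and the plugin was asked for a coarser 1-in-M (with M a multiple of N), the plugin keeps one in M/N of the samples handed to it, for an effective 1-in-M. A sketch under those assumptions, with all names hypothetical:

```c
#include <assert.h>

/* The core delivers one transaction in every `core_rate`. A plugin
 * that wants a coarser `plugin_rate` keeps 1 in (plugin_rate /
 * core_rate) of the samples it receives. */
struct subsampler {
    unsigned core_rate;   /* e.g. 1-in-10 internal feed */
    unsigned plugin_rate; /* e.g. 1-in-100 requested by sFlow */
    unsigned countdown;
    unsigned kept;
};

static void subsampler_init(struct subsampler *s,
                            unsigned core_rate, unsigned plugin_rate)
{
    s->core_rate = core_rate;
    s->plugin_rate = plugin_rate;
    s->countdown = plugin_rate / core_rate; /* assumes divisibility */
    s->kept = 0;
}

/* Callback the plugin registers with the core sampling feed. */
static void on_sample(struct subsampler *s)
{
    if (--s->countdown == 0) {
        s->countdown = s->plugin_rate / s->core_rate;
        s->kept++; /* here: XDR-encode into a UDP datagram and send */
    }
}
```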
But if you've already done the hard work, and everyone is going to want the UDP feed, then why not offer that too? I probably made it look hard with my bad coding, but all you have to do is XDR-encode it and call sendto().
>
> (And why can't everyone just use dtrace?)
I looked at this too, but the dtrace macros are not called with all the fields we need, and I assumed they were immutable. (Also the SFLOW_SAMPLE() macro injects rather fewer lines of code when enabled).
Finally, I accept that the engine-pu branch is the focus of future development, but... any thoughts on what to do for the 1.4.* versions?
Neil
We can ship plugins with the core codebase, so sflow would still work "out
of the box"; it just wouldn't be what the system was built on.
On that note, how critical is it for sflow packets to contain timing data?
Benchmarking will show for sure, but history tells me that this should be
optional.
What would be pretty awesome is sflow-ish from libmemcached, since the
only place it *really* matters how long something took is from the
perspective of a client. Profiling the server is only going to tell me if
the box is swapping, as it's extremely uncommon to nail the locks.
> Finally, I accept that the engine-pu branch is the focus of future
> development, but... any thoughts on what to do for the 1.4.* versions?
I'm kicking out one release of 1.4.* monthly until 1.6 supersedes it. That
said I have a backlog of bugs and higher priority changes that will likely
keep me busy for a few months. Unless of course someone sponsors me to
spend more time on it :)
-Dormando
Don't forget the original thread as well. I'm trying to solve two issues:
1) Sampling useful data out of a cluster.
2) Providing something useful for application developers
The second case is an OS X user who fires up memcached locally, writes
some rails code, then wonders what's going on under the hood. 1-in-1000
sampling there is counterproductive. Headers only is often useless.
stats cachedump is most often used for the latter, and everyone needs to
remember that users never get to 1) if they can't figure out 2). Maybe I
should flip those priorities around?
-Dormando
Not critical at all. The duration_uS field can be set to -1 in the XDR output to indicate that it is not implemented. I added this measurement when porting to the 1.6 branch, where it makes more sense. I left it in when I updated the 1.4 branch because, well, the overhead seemed negligible and the numbers still seemed like they might be revealing something (though I wasn't sure what exactly).
The start-time field is currently used as the "we're going to sample this one" flag. However that could easily be changed to just set a bit instead. Two system calls per sample would be saved.
The practice of marking a transaction to be sampled at the beginning and then actually taking the sample at the end when the status is known could also be replaced by the old scheme from last year where we do both steps at the same time. However it was actually easier to implement with the two-step approach because of the way that there are only two or three ways that a transaction can start and a whole myriad of ways that it can end. So the first step (the coin-tossing) only has to happen in those two or three places and it's easier to know that you have counted everything once. Breaking it up like this also gives you the choice of accumulating details incrementally (the key, the status-code etc.) in whatever is the easiest place.
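The two-step scheme (toss the coin at the two or three start points, take the sample at whichever of the many end points fires) can be sketched like this. The names are hypothetical and the deterministic 1-in-3 countdown stands in for the real coin toss:

```c
#include <assert.h>
#include <stdbool.h>

struct txn {
    bool sampled;    /* set at start: "we will sample this one" */
    const char *key; /* details can be filled in wherever convenient */
    int status;
};

static unsigned coin_countdown = 3; /* stand-in for the 1-in-N toss */
static unsigned samples;

/* Step 1: only two or three call sites exist for this, so it is
 * easy to be sure every transaction is counted exactly once. */
static void txn_start(struct txn *t)
{
    t->sampled = (--coin_countdown == 0);
    if (t->sampled)
        coin_countdown = 3;
    t->key = 0;
    t->status = 0;
}

/* Step 2: many possible end points, but each one just checks the
 * flag; key and status are known by the time it runs. */
static void txn_end(struct txn *t)
{
    if (t->sampled)
        samples++;
}
```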
>
> What would be pretty awesome is sflow-ish from libmemcached, since the
> only place it *really* matters how long something took is from the
> perspective of a client. Profiling the server is only going to tell me if
> the box is swapping, as it's extremely uncommon to nail the locks.
Yes, a client might well offer sFlow-MEMCACHE transaction samples (as well as enclosing sFlow-HTTP transaction samples, if applicable). However you would probably still want to instrument at the server end to ensure that you were getting the full picture. There might be a whole menagerie of different C, Python, Perl and Java clients in use.
>
>> Finally, I accept that the engine-pu branch is the focus of future
>> development, but... any thoughts on what to do for the 1.4.* versions?
>
> I'm kicking out one release of 1.4.* monthly until 1.6 supersedes it. That
> said I have a backlog of bugs and higher priority changes that will likely
> keep me busy for a few months. Unless of course someone sponsors me to
> spend more time on it :)
In the meantime I could strip down the current patch and reduce its code footprint considerably - but would that help?
Neil
>
> -Dormando
Not totally sure I follow. The system calls would be nice to avoid, since
we can't guarantee the system will use a vsyscall for the clock...
> Yes, a client might well offer sFlow-MEMCACHE transaction samples (as
> well as enclosing sFlow-HTTP transaction samples, if applicable).
> However you would probably still want to instrument at the server end to
> ensure that you were getting the full picture. There might be a whole
> menagerie of different C, Python, Perl and Java clients in use.
Many/most are based off of libmemcached. You can catch quite a bit with
that one.
> > I'm kicking out one release of 1.4.* monthly until 1.6 supersedes it. That
> > said I have a backlog of bugs and higher priority changes that will likely
> > keep me busy for a few months. Unless of course someone sponsors me to
> > spend more time on it :)
>
> In the mean time I could strip down the current patch and reduce it's
> code footprint considerably - but would that help?
It could help to serve as a reference, but I'm not sure we can merge it.
Every time we merge something we make some sort of idiotic pinkie swear to
the planet to never touch that feature ever again forever. ASCII noreply
haunts me to this day.
Depending on how long 1.4 goes on though, it might end up having its own
internal sampler, and thus sflow could be slapped on top of it.
thanks,
-Dormando
I certainly agree that you want both of these features. However they are wildly different. (1) is for monitoring in production, and (2) is for testing and troubleshooting. The requirements are so divergent that there may not be any overlap at all in the implementation of each. In fact the more separate they are the better because there is a lot of pressure on (1) to be ultra-stable and never change, while you are likely to think of new ideas for (2) all the time.
So there's no need to hesitate if you can already do (1) today. Let's face it, you have been very successful and there are rather a lot of users who have already gotten past (2) :)
Neil
> -Dormando
Okay, I'm kinda tired of that argument. Just because you say something
isn't possible doesn't mean we can't make it work anyway. If you believe
they're divergent, stop saying that they're divergent and prove it with
examples. However I'd rather spend my time writing features than
pretending to know whether a theoretical patch will work or not.
We want to work towards a system that can encompass a replacement for
"stats cachedump". If we can design something which generates sflow as a
subset, that'll be totally amazing! We can even use your patches as
reference for creating a core shipped plugin.
If people want to use sflow today, they can apply your patches and use it.
Such is the way with open source.
-Dormando
I didn't say it wasn't possible.... but never mind all that. A core-shipped plugin would be great. Let me know if there's anything I can do to help.
Neil
That was more strongly worded than I intended, I apologize; I don't agree
that it's worth rushing. Not "rushing" is why we haven't already settled
on TOPKEYS the way it is. I don't really intend to throw something else in
there immediately.
No worries. I apologize for my impatience. You are right. There is no rush.
But you did ask for more specific examples, so for what it's worth, here are some reasons why I think features for (1) in-production cluster-wide sampling and (2) testing and troubleshooting should be kept as separate as possible:
A. They will rarely be used at the same time on the same node.
B. If they are used concurrently (e.g. troubleshooting a production node), then using (2) should have no effect on (1).
C. The cluster-wide configuration used for (1) is likely to be very different from the interactive configuration for (2).
D. Getting a feed of randomly-sampled transactions is probably the only thing they will have in common. After that, (1) will simply send the sample over UDP, while (2) might apply regex-filtering, value-field analysis, various tests on the expiration times and slab allocation, and finally stream results out on a TCP connection - probably using some ASCII format instead of XDR.
E. Even on the part that they do have in common (1) is likely to want only a handful of samples per second per node (e.g. 1-in-10000), while (2) is much more likely to want a more aggressive feed such as 1-in-10, or even 1-in-1. It seems likely that this difference will impact the implementation. For example, that time-duration measurement would be unthinkable at 1-in-1, but could be quite OK at 1-in-50000.
F. Even if (2) may be considered higher priority, I think it's easier to see how (1) can be completed and tied in a bow. I should stress here that I'm not expecting anyone to use my code! I just think you guys could knock (1) out pretty easily and reap immediate benefits, while (2) could take a while to crystallize.
Getting unnecessarily detailed, let's say you implemented the plugin sampling something like this:
possibly_sample_transaction(connection, protocol, operation, key, value, status) {
    r = next_random(connection->thread);
    for (i = 0; i < num_sampling_plugins; i++) {
        consumer = sampling_plugins[i];
        if (r <= consumer->probability_threshold) {
            (*consumer->sample_callback)(connection, protocol, operation, key, value, status);
        }
    }
}
Compare that with the number of instructions and branches involved here:
possibly_sample_transaction(connection, protocol, operation, key, value, status) {
    if (next_random(connection->thread) <= probability_threshold) {
        take_sample(connection, protocol, operation, key, value, status);
    }
}
Or, if you allow one sampling_probability to be treated specially and turned into a countdown-to-next-sample, then you can do it this way and save more:
possibly_sample_transaction(connection, protocol, operation, key, value, status) {
    if (unlikely(--connection->thread->countdown == 0)) {
        connection->thread->countdown = compute_next_countdown();
        take_sample(connection, protocol, operation, key, value, status);
    }
}
At this point you could easily turn it into a macro so that there is no extra function-call in the critical path, just a decrement-and-test on the thread->countdown.
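The macro version might look like this sketch, with unlikely() spelled out as the usual GCC builtin and the struct and helper names invented for illustration:

```c
#include <assert.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

struct worker_thread {
    unsigned countdown; /* transactions until the next sample */
    unsigned sampled;   /* samples taken so far */
};

/* A real implementation would draw this from a distribution. */
static unsigned compute_next_countdown(void) { return 1000; }

static void take_sample(struct worker_thread *t) { t->sampled++; }

/* The only cost on the unsampled path is a decrement and a
 * well-predicted branch; no function call is made. */
#define POSSIBLY_SAMPLE(thread) do {                        \
        if (unlikely(--(thread)->countdown == 0)) {         \
            (thread)->countdown = compute_next_countdown(); \
            take_sample(thread);                            \
        }                                                   \
    } while (0)
```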
I don't know if it matters so much to shave a few dozen cycles off the critical path, but my point was just to illustrate that even in the small area of overlap between (1) and (2) you might still be grateful someday if you kept them entirely separate.
Thoughts?
Neil