> A little while ago a user asked in
http://groups.google.com/group/maatkit-discuss/browse_thread/thread/2...
> if it was possible to use mk-query-digest to analyze queries per
> hour. I responded with a skeleton script for use with --filter, but I
> didn't actually test this. Today, I filled out the script and tested
> it and found that it works. The script is available from trunk at:
> http://maatkit.googlecode.com/svn/trunk/mk-query-digest/t/samples/fil...
> The test file I'm using is available at:
> http://maatkit.googlecode.com/svn/trunk/common/t/samples/binlog005.txt
> The filter code does two things: it adds attributes called year,
> month, day and hour to each event, and it uses environment variables
> called YEAR, MONTH, DAY and HOUR to filter those newly added
> attributes. I'll show how this works later.
> The filter works best with binary logs because binlogs reliably
> timestamp events. If an event does not have a timestamp (as is often
> the case in a slowlog), then it gets values 0, 0, 0, 24 for year,
> month, day and hour respectively. Since 0 is a valid hour, 24 is used
> to indicate that the event had no hour.
> The basic usage is to group queries by hour. Let's say you want to
> see query stats for each hour. The command line is:
> mk-query-digest --type binlog binlog005.txt --filter filter-add-ymdh-
> attribs.txt --group-by hour
> Notice "--group-by hour". And the result is (truncated for brevity):
> #
> ########################################################################
> # Report grouped by hour
> #
> ########################################################################
> # Item 1: 1.50 QPS, 31.01kx concurrency, ID 0x0DB5E4B97FC2AF39 at byte
> 450
> # pct total min max avg 95% stddev
> median
> # Count 30 3
> # Exec time 30 62029s 20661s 20704s 20676s 19861s 0
> 19861s
> # Time range 2007-12-07 13:02:08 to 2007-12-07 13:02:10
> # bytes 23 81 27 27 27 27
> 0 27
> # day 25 21 7 7 7 7
> 0 7
> # error cod 0 0 0 0 0 0
> 0 0
> # month 27 36 12 12 12 12
> 0 12
> # year 27 21 7 7 7 7
> 0 7
> # Query_time distribution
> # 1us
> # 10us
> # 100us
> # 1ms
> # 10ms
> # 100ms
> # 1s
> # 10s+
> ################################################################
> 13
> ...
> # Item 2: 0.00 QPS, 0.71x concurrency, ID 0xAA27A0C99BFF6710 at byte
> 301 _
> # pct total min max avg 95% stddev
> median
> # Count 30 3
> # Exec time 29 62000s 20661s 20675s 20667s 19861s 0
> 19861s
> # Time range 2007-12-07 12:02:50 to 2007-12-08 12:12:12
> # bytes 46 163 22 87 40.75 84.10 25.86
> 26.08
> # day 37 30 7 8 7.50 7.70 0.36
> 7.70
> # error cod 0 0 0 0 0 0
> 0 0
> # month 36 48 12 12 12 12
> 0 12
> # year 36 28 7 7 7 7
> 0 7
> # Query_time distribution
> # 1us
> # 10us
> # 100us
> # 1ms
> # 10ms
> # 100ms
> # 1s
> # 10s+
> ################################################################
> 12
> ...
> # Rank Query ID Response time Calls R/Call Item
> # ==== ================== ================ ======= ========== ====
> # 1 0x 62029.0000 30.0% 3 20676.3333 13
> # 2 0x 62000.0000 30.0% 3 20666.6667 12
> # 3 0x 20661.0000 10.0% 1 20661.0000 23
> # 4 0x 20661.0000 10.0% 1 20661.0000 10
> # 5 0x 20661.0000 10.0% 1 20661.0000 08
> # 6 0x 20661.0000 10.0% 1 20661.0000 18
> Each item corresponds to the queries for that hour. Shown above are
> hours 13 (1pm) and 12 (noon). Then the profile gives you summarized
> information about each hour. From this fake binlog we see that 30% of
> queries occurred in the noon hour. (binlog005.txt is highly
> contrived; the values are just for demonstration.)
> Unless your logs are rotated daily, chances are there will be noon-
> hour queries for multiple days. If you want to see per-hour stats for
> one specific day, the filter can do this, too, by using environment
> variables. Filter scripts were not originally meant to accept user
> input, and having to modify values in the actual code isn't flexible,
> so the solution is to use environment variables. Here's how:
> DAY=7 mk-query-digest --type binlog binlog005.txt --filter filter-add-
> ymdh-attribs.txt --group-by hour
> The leading "DAY=7" temporarily sets the environment variable DAY only
> during the execution of mk-query-digest. This way you don't pollute
> your normal environment variables. The result is now (truncated
> again):
> # Item 2: 0.00 QPS, 12.24x concurrency, ID 0xAA27A0C99BFF6710 at byte
> 301
> # This item is included in the report because it matches --limit.
> # pct total min max avg 95% stddev
> median
> # Count 28 2
> # Exec time 28 41339s 20664s 20675s 20670s 20675s 8s
> 20670s
> # Time range 2007-12-07 12:02:50 to 2007-12-07 12:59:07
> # bytes 28 54 27 27 27 27
> 0 27
> # day 28 14 7 7 7 7
> 0 7
> # error cod 0 0 0 0 0 0
> 0 0
> # month 28 24 12 12 12 12
> 0 12
> # year 28 14 7 7 7 7
> 0 7
> # Query_time distribution
> # 1us
> # 10us
> # 100us
> # 1ms
> # 10ms
> # 100ms
> # 1s
> # 10s+
> ################################################################
> 12
> Notice that there are now only 2 queries in the noon hour and that the
> time range is only in 2007-12-07. Previously, there was a noon-hour
> query in 2007-12-08. Thus we know that the DAY filter worked.
> In this fashion, you can group and filter your log as you please. You
> can combine multiple filters like:
> DAY=7 HOUR=12 mk-query-digest --type binlog binlog005.txt --filter
> filter-add-ymdh-attribs.txt --group-by hour
> That will group and analyze only queries from the noon hour of the 7th
> (December 7, 2007 in this log). mk-query-digest is so flexible you
> can even do this:
> DAY=7 HOUR=12 mk-query-digest --type binlog binlog005.txt --filter
> filter-add-ymdh-attribs.txt --group-by hour --no-report --print
> That will suppress the query analysis and report and simply print all
> the queries from the noon hour of the 7th in pseudo-slowlog format.
> There are, of course, other ways to do this kind of per-time-unit
> query aggregation, analysis and reporting, but if all you have are pre-
> existing logs and mk-query-digest, then --filter can be used to
> accomplish this task, too.
> -Daniel