How to keep custom Add-on in sync for 'long' running command


Wes Vaske

Dec 16, 2015, 3:14:35 PM
to dim_STAT
I'm trying to integrate some custom add-ons and the commands are long-running.

(example: smartctl -x /dev/sda takes 5+ seconds on some drives)

Is there a best practice for capturing this data and keeping the time in sync?

I currently have a script that writes this data to a file as soon as it is able to capture it, and dim_STAT just cats this file at each collect interval.

Dimitri

Dec 18, 2015, 7:44:03 AM
to dim...@googlegroups.com
Hi Wes,

I'd say that if all your stats data arrive slower than, say, every N seconds, then don't collect on such a small time interval. There is no magic: if data are missing for a given period of time, or between two points in time, they are really missing, and you cannot assume that a missing value could be represented as an Avg of the values just before and after. Nobody knows what it was in reality, as it could have been very low or very high with the same probability ;-)

Personally, when I need stats on different time intervals, I just start 2 parallel collects from the same host: one collects all the "light" stats on, say, a 10sec interval, and the other collects the "heavy" stats on, say, a 30sec interval. In this case I can at least trust what I'm seeing ;-)

However, you may also adapt your Add-On script so that it gives you regular output even if the background stats are lagging:
  - for ex. your stat command delivers data at unpredictable time intervals, say between 5 and 45 sec (or more), you never know
  - but you need to see the corresponding stats printed every N sec
  - so you write a "wrapper" script which itself reads the data coming from the stat command(s) and prints "something" every N sec
  - this "something" is then one of two things:
        - real data (if the stats provided the info you need within the N sec interval)
        - estimated data (if no stats came within N sec)
  - for the estimation you can use, for ex., an Avg of the last 20 values (or more)
  - it's also important to have an additional column in the output used as a flag to say which kind of data is printed (say: 1 for "real", and 0 for "estimated"). After that it'll be very easy to tell how real your stats are by looking at the "data" graph aligned with the "flag" graph over the same time.
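
The decision the wrapper makes for each N sec slot can be sketched as follows. This is only a minimal sketch under stated assumptions: the helper name next_output, its arguments, and the choice to print 0 with flag 0 when there is no history yet are all hypothetical, and how the wrapper actually reads the slow command (a background process, a polled pipe, a file) is left out.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: decide what to print for one N-second slot.
#   $fresh   - value delivered by the stat command during this slot, or undef
#   $history - array ref holding the most recent *real* values
#   $window  - how many real values to average (20 in the suggestion above)
# Returns (value, flag): flag is 1 for "real" and 0 for "estimated".
sub next_output {
    my ($fresh, $history, $window) = @_;

    if (defined $fresh) {
        # Real data arrived in time: remember it and print it as-is.
        push @$history, $fresh;
        shift @$history while @$history > $window;
        return ($fresh, 1);
    }

    # Nothing arrived in time. With no history at all there is nothing
    # to average, so this sketch prints 0 (an assumption, still flagged 0).
    return (0, 0) unless @$history;

    # Estimate from the real values seen so far.
    return (sum(@$history) / @$history, 0);
}
```

Only real values are pushed into the history here, so an estimate is never averaged into later estimates. The wrapper would call next_output once per slot and print both returned columns on one line.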

Now: why couldn't dim_STAT do exactly the same? Simply because it's always much simpler to implement this at the Add-On script level than to invent a kind of "universal" solution ;-)

So, from the beginning there has been no time dependency in the code: each data measurement just gets a serial number (SNO), which is then correlated with the collect start time to show the corresponding time. Stats commands may lag for many reasons, or may just have buffered output, you never know, so it's always healthier to delegate this to the Add-On itself rather than debug a "general" logic in each broken case ;-)
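
To illustrate the idea of deriving the displayed time from a serial number, here is a small sketch. It is not dim_STAT's actual code: the function name is hypothetical, and it assumes that SNO counts from 1 and that the first measurement is stamped with the collect start time.

```perl
use strict;
use warnings;

# Illustrative only, not dim_STAT's implementation.
# Assumption: SNO starts at 1 and measurement 1 is shown at the start time.
sub sno_to_epoch {
    my ($start_epoch, $interval_sec, $sno) = @_;
    return $start_epoch + ($sno - 1) * $interval_sec;
}
```

With a mapping like this, a stat that silently skips one sample makes every later point appear one interval too early, which is why a lagging stat has to be re-aligned rather than simply left running.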

However, dim_STAT does still take care of lagging stats:
  - if there has been no output from the stat during 3 time intervals, the stat collection is restarted
  - if the connection was lost, the new SNO is adjusted according to the new time on re-connect
  - every hour there is a check for lagging measurements in every stat, and if several measurements are missing, the stat is restarted with a new SNO. Say you're collecting vmstat every 10 sec: within 1 hour you should find 360 measurements, and if that's not the case, vmstat was lagging somewhere for some reason (for ex. your server was too busy with 100% CPU usage for a long time, etc.), so vmstat is restarted and its output re-aligned to the current time
  - NOTE: every time there was a drop or a shift in time in the data, you'll see a vertical red dotted line marking it
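
The arithmetic behind the hourly check can be sketched like this. The function names and the $tolerance parameter are hypothetical; the post only says "several" missed measurements, so the real threshold is not known.

```perl
use strict;
use warnings;

# Number of measurements a stat should have produced over a period.
sub expected_samples {
    my ($period_sec, $interval_sec) = @_;
    return int($period_sec / $interval_sec);
}

# Hypothetical check: returns 1 when more than $tolerance measurements are
# missing over the period, 0 otherwise. The threshold is an assumption.
sub is_lagging {
    my ($seen, $period_sec, $interval_sec, $tolerance) = @_;
    my $missing = expected_samples($period_sec, $interval_sec) - $seen;
    return $missing > $tolerance ? 1 : 0;
}
```

For the vmstat example, expected_samples(3600, 10) gives the 360 measurements mentioned above.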

Let me know if you need any more info, etc..

Rgds,
-Dimitri

Alan Impink

Dec 18, 2015, 10:18:27 AM
to dim...@googlegroups.com
Hi Wes, Dimitri and all,

In addition to all the excellent information and advice Dimitri gave, I thought I would add a little.

Because collectors are written with code that performs a data gathering activity, followed by a sleep, you could (should?) also alter your sleep interval each iteration to compensate for the working time of the data capture.  

For example, if your sample interval is 5 seconds, and the time it takes to gather a set of metrics is 2 seconds, then if the sleep is coded to exactly the sample interval, the collector will spit data out every 7 seconds.  Over even a small bit of time, you will start seeing data gaps (red lines) from missed data samples.  A script might be coded this way in Perl:

# Naive version: each pass takes gather time + $timeout, so output drifts.
for (;;) {
    my $result = gather_metrics();

    print "$result\n";

    sleep $timeout;
}

To avoid this, re-compute the actual sleep time each collection interval.  Here's a Perl example:

for (;;) {
    my $start = time;    # when this pass began
    my $result = gather_metrics();

    print "$result\n";

    # Sleep only for what is left of the interval after gathering.
    if ((my $remaining = $timeout - (time - $start)) > 0) {
        sleep $remaining;
    }
}

This takes care of a potentially code-induced lagging issue.  However, as Dimitri stated, if the sample interval is shorter than the time it takes to actually gather the data, you will miss real data.  There's simply no solution for that, other than using a longer sample interval for expensive data gathering routines, or using one of Dimitri's suggestions for data estimation.

FWIW, I personally would go without data for a sample interval, rather than go with an estimate.  While it makes for nicer graphs if lines connect, I'd prefer not to guess what the system is doing.  

Best of luck,
Alan



Dimitri

Dec 18, 2015, 11:57:31 PM
to dim...@googlegroups.com
Hi Alan,

just a small remark:
- you should also check in your code that your $remaining time is
not bigger than $timeout (this could happen if the OS clock was
unexpectedly changed for some reason)

in fact I've spent the last months verifying all the scripts I have for
such conditions, that's why ;-)) The normal rule is to stop the
STAT-service first before any clock change on the system, but
many just forget about it.

NOTE: you should also always sleep at least 1sec in your script to
avoid a "fast pass" that comes back with zero difference vs the previous
measurement. This may give a wrong indication of frozen activity
(like network, for ex.), while in reality that is not the case.
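
Both remarks can be folded into one small helper, sketched below. The name sleep_time and its argument list are hypothetical, and the helper only bounds the sleep; it does not detect a clock change as such.

```perl
use strict;
use warnings;

# Hypothetical helper: how long to sleep after one collection pass.
#   $timeout - sample interval in seconds
#   $start   - time() taken before gathering
#   $now     - time() taken after gathering
sub sleep_time {
    my ($timeout, $start, $now) = @_;
    my $remaining = $timeout - ($now - $start);

    # The clock stepped backwards during the pass:
    # never sleep longer than one interval.
    $remaining = $timeout if $remaining > $timeout;

    # Gathering used up the whole interval (or the clock jumped forward):
    # still sleep at least 1 second to avoid a "fast pass".
    $remaining = 1 if $remaining < 1;

    return $remaining;
}
```

In the compensated loop shown earlier in the thread, the if block around sleep would then become a single call: sleep sleep_time($timeout, $start, time);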

Rgds,
-Dimitri