Possible bug in availability reports

Dale Marshall

Apr 10, 2013, 4:48:44 AM
to th...@googlegroups.com
Hi Sven,

I'm encountering a problem with the availability reports. When performing a 'last month' availability report for 'all hosts', the resulting figures do not match those from a report on 'hostgroups' or on a single host.

For instance:
  1. Performing a trend report for 'client a' of the 'last month' period, the resultant % Time Up is 95.038%
  2. Performing an availability report for 'client a' of the 'last month' period, the resultant % Time Up is 95.038%
  3. Performing an availability report for 'hostgroups' which includes 'client a' of the 'last month' period, the resultant % Time Up of 'client a' is 95.038% (95.038%)
  4. Performing an availability report for 'all hosts' of the 'last month' period, the resultant % Time Up of 'client a' is 0.145% (93.750%)
  5. Performing an availability report for 'all hosts' of the 'last month' period using the Nagios frontend, the resultant % Time Up of 'client a' is 95.085% (95.085%)

All available options are set identically across the above tests (whether the assumed initial state of the host is unspecified or up). Naturally this causes a loss of faith in the numbers being presented. I have not expanded my tests to periods other than 'last month'.
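
One way to read the figures in test 4, assuming Thruk follows the Nagios avail.cgi convention in which the first number is the share of total time and the number in parentheses is the share of known time (an assumption, not something confirmed in this thread): the gap between the two numbers implies how much of the period was counted as undetermined. A minimal Python sketch; the function name is made up for illustration.

```python
def implied_undetermined_pct(pct_total: float, pct_known: float) -> float:
    """Return the share of the report period (in percent) with no known state.

    Assumes pct_total = up / total and pct_known = up / (total - undetermined),
    which is how Nagios avail.cgi presents "x% (y%)". This is an assumption
    about how Thruk reports the same columns.
    """
    if pct_known <= 0:
        raise ValueError("pct_known must be positive")
    return 100.0 * (1.0 - pct_total / pct_known)


# Figures taken from the tests listed above.
print(round(implied_undetermined_pct(95.038, 95.038), 3))  # hostgroup report
print(round(implied_undetermined_pct(0.145, 93.750), 3))   # 'all hosts' report
```

Under that assumption, the 'all hosts' figures would mean that almost the entire month was treated as undetermined for 'client a', while the hostgroup report had no undetermined time at all.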

Please let me know if you would like any more specifics.

Thanks,

Dale

Sven Nierlein

Apr 18, 2013, 6:08:40 AM
to th...@googlegroups.com, Dale Marshall
Hi Dale,

I cannot reproduce that. Do you use a specific timeperiod or 24x7 for your report? Can you narrow the report down to a smaller timeframe where the problem still exists, and maybe even isolate the single logfile entries which result in that behaviour?

Regards,
Sven

bjornf

Apr 23, 2013, 7:37:27 AM
to th...@googlegroups.com, Dale Marshall
I think I have a similar problem. When doing an availability report via Thruk for a servicegroup, the time undetermined is around 45% regardless of backtracking or other parameters. When I run the same report via Nagios avail.cgi, the time undetermined is less than 1%.

Any hints on troubleshooting?

This is with:

Thruk version 1.68
Nagios 3.5.0

This worked well before, but I have since upgraded both Thruk and Nagios. Even the same time period that worked before no longer works.

Regards, Bjorn

Dale Marshall

May 1, 2013, 4:18:02 AM
to th...@googlegroups.com, Dale Marshall
Hi Sven,

Thank you for looking into the problem.

I tried again, specifying a 24x7 timeperiod to be sure: same result. Unfortunately I don't have a lot of time to spend on this at the moment; however, I will be migrating our entire network to a distributed Shinken installation soon. Once that is complete I will resume these tests on the new backend and provide any updates. I took a brief look at the code, so perhaps in the near future I'll try to take it further and debug.

A thought that has been put forward at work is that perhaps the multi-backend/hierarchical way we've set up Nagios is causing the problem, but one of the employees who makes regular use of these reports attests that the feature was in fact accurate in the past (well, at least the discrepancies weren't so obvious).

Thanks again,

Dale

Sven Nierlein

May 1, 2013, 5:12:44 AM
to th...@googlegroups.com
On 5/1/13 10:18, Dale Marshall wrote:
> A thought that has been put forward at work is that perhaps the multi-backend/hierarchical way we've setup Nagios is causing the problem but one of the employees who makes regular use of these reports attests that the feature was in fact accurate in the past (well, at the least the discrepancies weren't so obvious).

That's possible; there have been quite a lot of changes in the reporting module lately. I will try to reproduce that as soon as I have some time.

Sven

bjornf

May 31, 2013, 2:10:29 PM
to th...@googlegroups.com
It appears that it applies to the following report types:

- all services
- all hosts
- servicegroups
- hostgroups

But not to individual hosts or services. I get very high TIME_UNDETERMINED. The further back the report goes, the higher it gets.

It is also suspiciously quick: a complete month report for all services completes in about 1 minute. This is for 1.4GB of logs. I don't recall how fast it was when I didn't have this issue, though.

This was tested with thruk-1.70-3.

Regards, Bjorn

bjornf

Jul 28, 2013, 4:01:13 AM
to th...@googlegroups.com
This still appears to be an issue. Any chance of getting it fixed? High "TIME_UNDETERMINED" that is. 

Sven Nierlein

Jul 28, 2013, 7:24:17 AM
to th...@googlegroups.com
On 7/28/13 10:01, bjornf wrote:
> This still appears to be an issue. Any chance of getting it fixed? High "TIME_UNDETERMINED" that is.

I still cannot reproduce that. Do you have some detailed information on how to reproduce it?
Are there any differences in the "View full log entries"? Can it be pinned down to a small timeframe?


Regards,
Sven

bjornf

Jul 28, 2013, 6:47:11 PM
to th...@googlegroups.com
Above I wrote that it applies to hostgroups etc., but I'm not sure of that anymore. I can still reproduce it when doing a standard last 7 days report for a servicegroup. This is with the default options, not changing anything. If I do it for the last 24 hours, time undetermined is 0%. With the last 7 days it's around 65%.

Regards, Bjorn

Łukasz Krzeminski

Feb 20, 2014, 8:01:45 AM
to th...@googlegroups.com
I confirm this issue is present in 1.82 too. Is there anything I can do to help you narrow down the problem?

Thanks,

Lukasz

Dale Marshall

Feb 5, 2015, 4:24:42 PM
to th...@googlegroups.com
So it has been quite some time since I posted this and as stated then, I haven't had much time to dabble. Shinken turned out to be a bust and we have in fact gone with Naemon across our network.

The error still persists; however, I can share some findings that I have come across.

It appears that the problem occurs when the reports are generated by querying the backends via Livestatus. I had to go into the log files directly, where I found particular state changes that could not be located when querying the frontend. At that point I decided to use the MySQL logcache feature, which in turn had the same problem when performing logcacheupdate (which I assume uses Livestatus). I then tried importing from the logfiles directly using logcacheimport --local, and when performing reports using the frontend all information was then present.
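
For anyone wanting to compare what Livestatus returns against the raw logfiles, a minimal Python sketch of a query on the Livestatus `log` table follows. It talks to the UNIX socket directly rather than through a client library. The socket path passed to `fetch_log_entries`, the chosen column set and the example timestamps are assumptions for illustration, not values from this setup.

```python
import json
import socket


def build_log_query(host_name: str, start_ts: int, end_ts: int) -> str:
    """Build a Livestatus query for one host's log entries in [start_ts, end_ts)."""
    lines = [
        "GET log",
        "Columns: time type host_name state message",
        f"Filter: time >= {start_ts}",
        f"Filter: time < {end_ts}",
        f"Filter: host_name = {host_name}",
        "OutputFormat: json",
    ]
    # A Livestatus request is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def fetch_log_entries(socket_path: str, query: str) -> list:
    """Send the query to a Livestatus UNIX socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal the end of the request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical range: March 2013 (UTC) for the host named in the tests.
    print(build_log_query("client a", 1362096000, 1364774400))
```

Counting the rows returned for a given day and comparing that with the number of matching lines in that day's logfile would show whether entries go missing on the Livestatus side.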

bjornf, we also saw a lot of time undetermined in our reports (except when people explicitly chose to assume hosts were up at the start of the reporting period), which led me to search for the current state entries in the logfiles to ensure they were actually present. After I found that the entries were there and Thruk was not seeing them, I got suspicious and started trying a few things.
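
To illustrate why a missing current state entry inflates time undetermined, here is a minimal Python sketch that sums the time per state for one host from Nagios-style log lines. It is not how Thruk or avail.cgi compute their figures; the function name and the sample lines are made up, and the input is assumed to be sorted by timestamp.

```python
import re

# Matches e.g. "[1362096000] HOST ALERT: client a;DOWN;HARD;3;timeout"
LINE_RE = re.compile(
    r"^\[(\d+)\] (CURRENT HOST STATE|INITIAL HOST STATE|HOST ALERT): "
    r"([^;]+);([^;]+);"
)


def host_time_by_state(lines, host, start_ts, end_ts):
    """Sum seconds per state for `host` in [start_ts, end_ts).

    Time before the first matching entry is counted as UNDETERMINED,
    i.e. no initial state is assumed.
    """
    totals = {}
    state, since = "UNDETERMINED", start_ts
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(3) != host:
            continue
        ts = int(match.group(1))
        if ts >= end_ts:
            break
        if ts > since:
            # Close the interval spent in the previous state.
            totals[state] = totals.get(state, 0) + (ts - since)
            since = ts
        state = match.group(4)
    totals[state] = totals.get(state, 0) + (end_ts - since)
    return totals


# Made-up sample: the first line is the entry that went missing for us.
sample = [
    "[1000] CURRENT HOST STATE: client a;UP;HARD;1;PING OK",
    "[1600] HOST ALERT: client a;DOWN;HARD;3;PING CRITICAL",
    "[1700] HOST ALERT: client a;UP;HARD;1;PING OK",
]
print(host_time_by_state(sample, "client a", 1000, 2000))
print(host_time_by_state(sample[1:], "client a", 1000, 2000))
```

With the current state entry present, the whole period is accounted for; without it, everything up to the first alert is undetermined, so the longer a host stays in one state the larger that share becomes.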

In light of this, I have scripted bringing the previous day's logfile from each backend directly to our Thruk frontend server and performing "logcacheimport --local". This, coupled with the automatic logcacheupdate as well as an hourly cron job running logcacheupdate, appears to be working well and ensures any missing entries are imported.

As for going into Livestatus itself to determine if it is indeed at fault, perhaps we'll run out of things to do at the office and I'll be able to look into it. I currently make use of a few custom Python scripts I wrote which use the Livestatus Python library - such as retention importers, host renaming and logfile merging (for our move from Nagios to Shinken to Naemon) as well as a service running on Apache exposing a few endpoints - and have not had any problems in that regard.

Hope this helps anyone else.

Thanks again Sven,
Dale