analyzer gets stuck


Roee Zabari

Sep 17, 2014, 10:42:32 AM
to skyli...@googlegroups.com
Hi,
I'm facing a problem where the analyzer gets stuck after a while.
The process is still up, but in the log I can see that at some point it just stops analyzing, and it doesn't write anything else to the log either. I also notice that the CPU utilization on the machine drops when it stops functioning.
Horizon and the webapp are still working fine, though.
The only thing I tried was decreasing the ANALYZER_PROCESSES setting, because we only have 8 cores on the machine, but it didn't help.

Any ideas?

Thanks,
Roee.


Abe Stanway

Sep 17, 2014, 11:16:44 AM
to Roee Zabari, skyli...@googlegroups.com
Interesting. What is the Redis health status then? Does Horizon keep working and is the data in Redis recent?


Roee Zabari

Sep 17, 2014, 11:20:58 AM
to skyli...@googlegroups.com, ro...@liveperson.com
Yes.
I can tell by looking at the webapp: when I go over all the data points for the metrics shown there, I can see there are no gaps.
Also, the Horizon and Redis logs look fine.
Is there a way to start the analyzer in debug mode or something similar?

Abe Stanway

Sep 17, 2014, 11:24:25 AM
to Roee Zabari, skyli...@googlegroups.com
Unfortunately not. Have you gone through all of 

STALE_PERIOD
MIN_TOLERABLE_LENGTH
MAX_TOLERABLE_BOREDOM
BOREDOM_SET_SIZE

and made sure none of those conditions apply?
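
For anyone checking those conditions by hand, here is a rough sketch of how settings like these can gate a series before any algorithm runs on it. The values, the `classify` helper, and the exact comparisons are assumptions made for illustration and are not taken from Skyline's source, so compare them against your own settings before relying on them.

```python
import time

# Illustrative values only (assumptions); use the ones from your own settings.
STALE_PERIOD = 500            # max age in seconds of the newest datapoint
MIN_TOLERABLE_LENGTH = 1      # minimum number of datapoints
MAX_TOLERABLE_BOREDOM = 100   # how many trailing datapoints to inspect
BOREDOM_SET_SIZE = 1          # number of distinct values that counts as "boring"


def classify(timeseries, now=None):
    """Classify a list of (timestamp, value) pairs.

    Hypothetical helper: returns 'TooShort', 'Stale', 'Boring' or 'OK'.
    """
    now = time.time() if now is None else now
    if len(timeseries) < MIN_TOLERABLE_LENGTH:
        return "TooShort"
    # Newest datapoint is older than the allowed period.
    if now - timeseries[-1][0] > STALE_PERIOD:
        return "Stale"
    # Only the most recent datapoints are checked for variety.
    tail = timeseries[-MAX_TOLERABLE_BOREDOM:]
    if len(set(value for _, value in tail)) == BOREDOM_SET_SIZE:
        return "Boring"
    return "OK"
```

A series classified as anything other than 'OK' would be skipped rather than analyzed, which is different from the analyzer itself stopping.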



Roee Zabari

Sep 17, 2014, 11:50:02 AM
to skyli...@googlegroups.com, ro...@liveperson.com
Yes, but if I understand correctly, those settings aren't related to the problem.
For example, this is what I see in the analyzer log: 

2014-09-17 11:41:26 :: 15272 :: seconds to run    :: 215.02
2014-09-17 11:41:26 :: 15272 :: total metrics     :: 72972
2014-09-17 11:41:26 :: 15272 :: total analyzed    :: 19597
2014-09-17 11:41:26 :: 15272 :: total anomalies   :: 2
2014-09-17 11:41:26 :: 15272 :: exception stats   :: {'Boring': 30660, 'Stale': 22715}
2014-09-17 11:41:26 :: 15272 :: anomaly breakdown :: {'least_squares': 2, 'histogram_bins': 2, 'stddev_from_average': 2, 'stddev_from_moving_average': 2, 'median_absolute_deviation': 2, 'grubbs': 2, 'mean_subtraction_cumulation': 2}

I see this set of lines every few minutes. The problem is that at some point I no longer see new sets, nor anything else in the log, but the process is still up. The webapp then shows the anomalies from the last time the analyzer ran, and that's it. No new analyzer runs.
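
Since the only visible symptom is that these "seconds to run" lines stop appearing, one way to catch the stall early is to watch the log for the most recent one. A minimal sketch, assuming the log keeps the format shown above; the function names and the 900-second threshold are made up for illustration:

```python
import re
from datetime import datetime

# Matches lines such as:
# 2014-09-17 11:41:26 :: 15272 :: seconds to run    :: 215.02
RUN_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) :: \d+ :: seconds to run\s+:: ([\d.]+)"
)


def last_run(lines):
    """Return (timestamp, seconds_to_run) of the latest completed run, or None."""
    latest = None
    for line in lines:
        match = RUN_LINE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            latest = (stamp, float(match.group(2)))
    return latest


def is_stalled(lines, now, max_gap_seconds=900):
    """True if no run has completed within max_gap_seconds before `now`."""
    run = last_run(lines)
    if run is None:
        return True
    return (now - run[0]).total_seconds() > max_gap_seconds
```

Running this from cron against the analyzer log, with `now=datetime.now()`, would flag the stall within minutes. The threshold should be comfortably larger than a normal run, which takes about 215 seconds here.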

Abe Stanway

Sep 17, 2014, 2:29:26 PM
to Roee Zabari, skyli...@googlegroups.com
Huh, quite strange. Any chance you can run a strace or dtrace on the analyzer process and send that to me?

Roee Zabari

Sep 17, 2014, 3:27:58 PM
to skyli...@googlegroups.com, ro...@liveperson.com
Yes, here it is:

XXX:/opt/skyline/bin [root] > date ; strace -p 15272 ; date
Wed Sep 17 15:04:54 EDT 2014
Process 15272 attached - interrupt to quit
select(0, NULL, NULL, NULL, {81, 109625}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0})   = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0})   = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0})   = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0})   = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0})   = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0})   = 0 (Timeout)
select(0, NULL, NULL, NULL, {100, 0}^C <unfinished ...>
Process 15272 detached
Wed Sep 17 15:17:28 EDT 2014
XXX:/opt/skyline/bin [root] > 

After that I stopped it and started it again, and noticed that there are actually 5 analyzer-agent.py processes (with ANALYZER_PROCESSES = 3), but when it gets stuck only 1 process is left (this was the one I traced).
I'll now try to trace all 5 processes until they die again and send you the traces afterwards.

Roee Zabari

Sep 17, 2014, 3:37:59 PM
to skyli...@googlegroups.com, ro...@liveperson.com
I looked at it again, and I'll trace only the 2 main processes; the workers are less interesting, I guess, because they come and go.

Abe Stanway

Sep 17, 2014, 4:11:42 PM
to Roee Zabari, skyli...@googlegroups.com
Can you do the strace while it's still working and catch it when it breaks?

Roee Zabari

Sep 17, 2014, 4:32:21 PM
to skyli...@googlegroups.com, ro...@liveperson.com
Yes sure, I'm already running it. I'll post it here after it breaks.

Roee Zabari

Sep 18, 2014, 4:09:12 AM
to skyli...@googlegroups.com, ro...@liveperson.com
Hi,
I've attached the strace files of the parent process (16593) and the child (16594), which starts the analyzer processes.
16593 stays alive, and 16594 is the one that dies.
strace.rar
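
To record when a process like 16594 disappears without keeping strace attached, a small liveness probe can be run periodically. This is a sketch for Unix-like systems only; `pid_alive` and `dead_pids` are hypothetical helper names, and the PIDs you pass in are whatever `ps` shows for your own analyzer processes.

```python
import errno
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists.

    Signal 0 delivers nothing; it only asks the kernel to check the target.
    """
    try:
        os.kill(pid, 0)
    except OSError as err:
        # EPERM means the process exists but belongs to another user.
        return err.errno == errno.EPERM
    return True


def dead_pids(pids):
    """Return the subset of pids that no longer exist, in the order given."""
    return [pid for pid in pids if not pid_alive(pid)]
```

Calling `dead_pids([16593, 16594])` every minute and logging the result with a timestamp would show when the child went away, which can then be lined up with the last lines of the analyzer log.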