Unreliable counter values when the number of counters increases (timeouts)


Vincent

Mar 3, 2016, 2:58:05 PM
to cascading-user
Hi,

I am on Cascading 2.5.6 running a Cascade with multiple Flows in it on Hadoop 2.4. 

Each flow has multiple operations and each operation may log statistics to the FlowProcess via counters. 
I currently have ~95 counters. This works well, both locally and on the EMR cluster.
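
For context, the operations bump counters roughly like this (a sketch; the class name and the group/counter names here are just placeholders, the real ones vary per operation):

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;

public class StatsLoggingFunction extends BaseOperation implements Function
  {
  @Override
  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
    {
    // ... do the actual work / emit tuples ...

    // each operation logs a handful of statistics like this via the FlowProcess
    flowProcess.increment( "my-app-stats", "records-processed", 1 );
    }
  }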

However, when I increase the number of counters slightly above that (>100), I consistently run into problems.
After digging, it appears I am getting timeout exceptions when trying to retrieve the counters.

I'm seeing two different things related to this in the logs:

2016-03-03 02:46:15,985 WARN cascading.stats.hadoop.BaseHadoopStepStats (pool-10-thread-1): fetching counters timed out after: 5 seconds, attempts: 1


As well as:

2016-03-03 03:08:54,954 INFO org.apache.hadoop.ipc.Client (stats-futures): Retrying connect to server: ip-10-207-5-239/10.207.5.239:44747. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)

2016-03-03 03:08:55,955 INFO org.apache.hadoop.ipc.Client (stats-futures): Retrying connect to server: ip-10-207-5-239/10.207.5.239:44747. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)

2016-03-03 03:08:56,956 INFO org.apache.hadoop.ipc.Client (stats-futures): Retrying connect to server: ip-10-207-5-239/10.207.5.239:44747. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)


After looking at the Cascading code, it seems there are 3 (hardcoded) attempts to retrieve a counter, each governed by cascading.step.counter.timeout, after which you only get cached counter values. That makes the results completely unreliable: if timeout exceptions occur, you silently get the wrong values.
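
If I wanted to give the fetch more room, I assume the timeout can be raised through the properties passed to the connector; a minimal sketch, assuming the value is in seconds (matching the "timed out after: 5 seconds" log line above):

import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;

Properties properties = new Properties();

// assumption: the value is in seconds, as suggested by the log message above
properties.setProperty( "cascading.step.counter.timeout", "30" );

FlowConnector connector = new HadoopFlowConnector( properties );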

Are these timeouts more or less expected as the number of counters increases, or has anyone else experienced them?
Trying to get some insight here on why this happens. I know I'm getting dangerously close to Hadoop's default limit of 120 counters, but I still wouldn't expect things to behave this badly. I can always try increasing the timeout values, but it's still worrisome that I may silently get wrong values.
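
If it really is the counter limit, I could also try bumping it; a sketch, assuming the Hadoop 2 property is mapreduce.job.counters.max (the same value may also need to be raised cluster-side, e.g. in mapred-site.xml, to take effect everywhere):

// assumption: mapreduce.job.counters.max is the Hadoop 2 counter limit (default 120)
properties.setProperty( "mapreduce.job.counters.max", "200" );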

Should those exceptions at the very least be logged as ERROR? Right now it's hard to trust the counters, because you can get wrong results without any exceptions or errors surfacing. Should the number of attempts also be exposed as a JobConf property (similar to the timeout)?

Thanks.
Vincent

Chris K Wensel

Mar 3, 2016, 3:37:26 PM
to cascadi...@googlegroups.com
In Cascading 3 we have

/**
* Method getLastSuccessfulCounterFetchTime returns the time, in millis, the last moment counters
* were successfully retrieved.
* <p/>
* If -1, counter values were never successfully retrieved.
* <p/>
* If this return value is less than the {@link CascadingStats#getFinishedTime()} it is likely the
* counter service became unavailable.
*
* @return the moment counters were last successfully retrieved
*/
long getLastSuccessfulCounterFetchTime();
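
So after completion you can at least detect whether the values might be stale; roughly (a sketch, assuming the method is reachable from the flow's stats object alongside getFinishedTime()):

FlowStats flowStats = flow.getFlowStats();

long lastFetch = flowStats.getLastSuccessfulCounterFetchTime();

// -1 means counters were never successfully fetched; a fetch time earlier than
// the finish time suggests the counter service became unavailable
boolean countersReliable = lastFetch != -1 && lastFetch >= flowStats.getFinishedTime();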



--
Chris K Wensel




Vincent

Apr 1, 2016, 12:34:31 PM
to cascading-user
I see, thanks. I guess it's better than nothing to know that it failed but it generally makes counters very hard to rely on since there's no guaranteed way of retrieving stats accurately on every run. 

Vince