I am using the DataFu package in Pig, and after reworking a script I am now getting negative values for my variance. Here is the relevant part of the script:
user_days = GROUP aliased_user_times BY (uid, ldap_server, domain, year, month, day, day_of_week);
min_by_day = FOREACH user_days GENERATE group, MIN(aliased_user_times.miliseconds_into_day) AS min;
max_by_day = FOREACH user_days GENERATE group, MAX(aliased_user_times.miliseconds_into_day) AS max;
some_times_by_day = JOIN min_by_day BY $0, max_by_day BY $0;
all_times_by_day = FOREACH some_times_by_day GENERATE min_by_day::group AS grouped, max AS max, min AS min, max - min AS time_on;
times_by_day = FOREACH all_times_by_day GENERATE FLATTEN(grouped) AS (uid, ldap_server, domain, year, month, day, day_of_week), min, max, time_on;
times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
start_stats = FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
end_stats = FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles;
worked_stats = FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles;
Basically, I am gathering statistics on the earliest and latest timestamps per user per day, then summarizing them by day of week. This ran fine before. After I tweaked some of the earlier grouping, I started seeing negative values for the variance, which of course throws off my standard deviation as well. Any ideas what could cause this?
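One possibility that may be worth ruling out, stated here as an assumption rather than a confirmed diagnosis: if the variance is computed with the single-pass identity Var(x) = E[x^2] - E[x]^2, then large values with a small spread (such as milliseconds into the day) can lose precision when the two nearly equal terms are subtracted, and the result can drift away from the true value, possibly below zero. The sketch below is plain Python, not DataFu's code; the function names and sample values are hypothetical and only contrast the two textbook formulas.

```python
# Illustration only: NOT DataFu's implementation, just two textbook formulas.

def naive_variance(xs):
    """Population variance as mean(x^2) - mean(x)^2.

    Single pass, but subtracts two nearly equal large numbers,
    so it is prone to floating-point cancellation.
    """
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean


def two_pass_variance(xs):
    """Population variance as the mean squared deviation from the mean.

    Every term is a square, so the result cannot be negative.
    """
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# Hypothetical start times (milliseconds into the day), clustered near 09:00.
start_times = [32_400_000.25, 32_400_000.5, 32_400_000.75]

print("naive:   ", naive_variance(start_times))
print("two-pass:", two_pass_variance(start_times))
```

If the two numbers disagree on data shaped like mine, that would point at the formula rather than at the grouping change; if they agree, the cause is more likely in the earlier steps of the script.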