Negative variance...

29 views
Skip to first unread message

Adrian Landman

unread,
Jan 15, 2014, 1:50:12 PM1/15/14
to dat...@googlegroups.com
I am using the DataFu package in PIG and after changing a script around some I am now getting negative values for my variance.  

The relevant part of my script can be seen below

user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);
min_by_day = FOREACH user_days GENERATE group, MIN(aliased_user_times.miliseconds_into_day) AS min;
max_by_day = FOREACH user_days GENERATE group, MAX(aliased_user_times.miliseconds_into_day) AS max;
some_times_by_day = JOIN min_by_day BY $0, max_by_day BY $0;
all_times_by_day = FOREACH some_times_by_day GENERATE min_by_day::group AS grouped, max AS max, min AS min, max-min as time_on;
times_by_day = FOREACH all_times_by_day GENERATE FLATTEN(grouped) AS (user, year, month, day, day_of_week), min, max, time_on;


times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);

start_stats= FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
end_stats= FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles;
worked_stats= FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS   hours_quartiles;

Basically I am gathering statistics based on the earliest and latest timestamps for a user.  I had this running before and it was fine...I tweaked some minor earlier grouping and was seeing negative values for the variance which of course meant negative values for my standard deviation.  Any ideas what could cause this?

Matthew Hayes

unread,
Jan 15, 2014, 2:33:48 PM1/15/14
to dat...@googlegroups.com
One improvement you could make is to compute the min and max without joining.  This isn't your problem but your script will run faster :)

Essentially:

user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);
min_and_max_by_day = FOREACH user_days GENERATE group,
                                               MIN(aliased_user_times.miliseconds_into_day) AS min,
                                               MAX(aliased_user_times.miliseconds_into_day) AS max;

One possibility is that you are overflowing the range of long.  VAR will square each value in the bag and compute their sum.  Since you are dealing with milliseconds within a day this could end up being quite large.  Maybe you can try seconds or minutes into the day instead.

Also you FLATTEN as (user, year, month, day, day_of_week) but you grouped by (uid, ldap_server,domain, year, month, day, day_of_week).  I'm not sure what Pig will do as a result of this.  The tuple you are working with after may not be what you expect.



--
You received this message because you are subscribed to the Google Groups "DataFu" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datafu+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Adrian Landman

unread,
Jan 15, 2014, 2:41:16 PM1/15/14
to dat...@googlegroups.com
Sorry I changed some of this to post

user_days = GROUP aliased_user_times BY (user, year, month, day, day_of_week);
min_by_day = FOREACH user_days GENERATE group, MIN(aliased_user_times.miliseconds_into_day) AS min;
max_by_day = FOREACH user_days GENERATE group, MAX(aliased_user_times.miliseconds_into_day) AS max;
some_times_by_day = JOIN min_by_day BY $0, max_by_day BY $0;
all_times_by_day = FOREACH some_times_by_day GENERATE min_by_day::group AS grouped, max AS max, min AS min, max-min as time_on;
times_by_day = FOREACH all_times_by_day GENERATE FLATTEN(grouped) AS (user, year, month, day, day_of_week), min, max, time_on;


times_by_day_of_week = GROUP times_by_day BY (user, day_of_week);

start_stats= FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
end_stats= FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles;
worked_stats= FOREACH times_by_day_of_week GENERATE group, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS   hours_quartiles;

Is what it should say.  When I tried to get both the max and min without joining I would invariably heap so that wasn't good...

Matthew Hayes

unread,
Jan 15, 2014, 2:45:48 PM1/15/14
to dat...@googlegroups.com
I see, strange that you'd have heap issues with min and max together.  So the code you just posted is your current code that is giving you trouble right?  Can you try computing the variance of the seconds or minutes instead of milliseconds and see if this helps?  Given the way VAR is doing the computation I can't see how else the value would become negative.
Reply all
Reply to author
Forward
0 new messages