Hi all,
Attempting to sessionize some clickstream data. The issue I keep running into is that the bags produced by the GROUP are too large, and Pig begins proactively spilling records to disk during the reduce. Trying to understand if anyone has found a way around this. The script looks something like:
DEFINE ISOToMonth org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToMonth();
DEFINE ISOToDay org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
DEFINE Sessionize datafu.pig.sessions.Sessionize('30m');
raw_data = LOAD 'clickstream' USING PigStorage('\t') AS (timestamp:chararray, account_id:chararray, user_id:chararray);
trimmed = FOREACH raw_data GENERATE timestamp, ISOToDay(timestamp) AS day:chararray, ISOToMonth(timestamp) AS month:chararray, account_id, user_id;
grouped = GROUP trimmed BY (day, account_id, user_id);
sessionized = FOREACH grouped {
ordered = ORDER trimmed BY timestamp ASC;
GENERATE FLATTEN(Sessionize(ordered)) AS (timestamp, day, month, account_id, user_id, session_id);
};
The issue I keep hitting is that the Sessionize step fails: even fairly beefy machines struggle with the reduce because of the size of the bags created by grouped. Grouping by day is about as fine-grained as I can go, given that I'm specifying a session length of 30 minutes.
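For reference, I haven't touched spill tuning yet. My understanding (could be wrong) is that the fraction of heap Pig allows for in-memory bags before spilling is controlled by the pig.cachedbag.memusage property, so something like

SET pig.cachedbag.memusage 0.4;

at the top of the script might delay the spill, but that feels like a band-aid rather than a fix.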
Would pre-ordering the entire clickstream before the nested FOREACH be an optimization here? I'm not sure what's actually causing the spill to disk: the nested ORDER or the Sessionize UDF.
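To be concrete, by pre-ordering I mean replacing the nested FOREACH with something like this (a sketch, not tested):

sorted = ORDER trimmed BY day, account_id, user_id, timestamp ASC;
grouped = GROUP sorted BY (day, account_id, user_id);
sessionized = FOREACH grouped GENERATE FLATTEN(Sessionize(sorted)) AS (timestamp, day, month, account_id, user_id, session_id);

though I'm not even sure the bag contents are guaranteed to stay ordered after the GROUP shuffles them.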
Thoughts?