Sessionize spills records to disk


Mike Sukmanowsky

Dec 18, 2012, 11:04:12 AM
to dat...@googlegroups.com
Hi all,

I'm attempting to sessionize some clickstream data. The issue I keep running into is that the bags of clickstream records to sessionize seem to be too large, and Pig begins proactively spilling records to disk during the reduce. I'm trying to find out whether anyone has found a way around this. The script looks something like this:

DEFINE ISOToMonth org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToMonth();
DEFINE ISOToDay org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(); 
DEFINE Sessionize datafu.pig.sessions.Sessionize('30m');

raw_data = LOAD 'clickstream' USING PigStorage('\t') AS (timestamp:chararray, account_id:chararray, user_id:chararray);
trimmed = FOREACH raw_data GENERATE timestamp, ISOToDay(timestamp) AS day:chararray, ISOToMonth(timestamp) AS month:chararray, account_id, user_id;

grouped = GROUP trimmed BY (day, account_id, user_id);
sessionized = FOREACH grouped {
  ordered = ORDER trimmed BY timestamp ASC;
  GENERATE FLATTEN(Sessionize(ordered)) AS (timestamp, day, month, account_id, user_id, session_id);
};
The problem I keep hitting is that the Sessionize operation fails: even fairly beefy machines have difficulty with the reduce because of the size of the bag produced by grouped. Grouping by day is about the lowest granularity I'd like to use, given that I'm specifying a session length of 30 minutes.

Would pre-ordering the entire clickstream before the nested FOREACH be an optimization here? I'm not sure which is causing the spill to disk, the nested ORDER or the Sessionize UDF.
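For what it's worth, the sessionization logic itself only needs to compare each event against the previous event's timestamp, which is why the input has to be time-ordered. A minimal Python sketch of the idea (just the logic, not DataFu's actual implementation; timestamps and timeout below are made up):

```python
from datetime import datetime, timedelta

def sessionize(timestamps, timeout=timedelta(minutes=30)):
    """Assign a session id to timestamp-sorted events.

    A new session starts whenever the gap since the previous
    event exceeds the timeout.
    """
    session = 0
    prev = None
    for ts in timestamps:
        if prev is not None and ts - prev > timeout:
            session += 1
        prev = ts
        yield ts, session

# Hypothetical stream for one (day, account_id, user_id) group:
events = [datetime(2012, 12, 18, 11, 0),
          datetime(2012, 12, 18, 11, 10),
          datetime(2012, 12, 18, 12, 0),   # 50-minute gap: new session
          datetime(2012, 12, 18, 12, 20)]
sessions = [sid for _, sid in sessionize(events)]
print(sessions)
```

Since only the previous timestamp needs to be held, the work per group is O(1) memory in principle; the question is whether Pig materializes the whole bag before handing it to the UDF.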

Thoughts?

Matthew Hayes

Jan 10, 2013, 4:28:00 PM
to dat...@googlegroups.com
Hi Mike, 

What is the failure you are getting? Out of memory? Do you know the upper bound on how many records are added to the bag for each (day, account_id, user_id)?
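If you don't know the group sizes, a `FOREACH grouped GENERATE group, COUNT(trimmed);` will tell you. As a rough back-of-envelope (every number below is a placeholder to swap for your own), spilling kicks in roughly when a group's bag outgrows the slice of heap Pig reserves for bags, which I believe is controlled by pig.cachedbag.memusage:

```python
# Back-of-envelope check on whether a group's bag fits in memory.
# All concrete numbers here are hypothetical assumptions.
records_per_group = 5_000_000   # hypothetical worst-case group size
bytes_per_record = 200          # hypothetical in-memory tuple size
heap_bytes = 2 * 1024**3        # assumed 2 GB reducer heap
bag_fraction = 0.2              # assumed pig.cachedbag.memusage setting

bag_budget = heap_bytes * bag_fraction
bag_size = records_per_group * bytes_per_record
print(bag_size, int(bag_budget), bag_size > bag_budget)
```

With these made-up numbers the bag (about 1 GB) far exceeds a ~400 MB budget, so spilling would be expected; plugging in your real group counts should tell you whether that's what you're seeing.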

-Matt