Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
dealing with heavily skewed data sets
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  12 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Ben Linsay  
View profile  
 More options Apr 14 2012, 2:13 am
From: Ben Linsay <blin...@gmail.com>
Date: Fri, 13 Apr 2012 23:13:04 -0700 (PDT)
Local: Sat, Apr 14 2012 2:13 am
Subject: dealing with heavily skewed data sets

hi all,

what's the standard approach for dealing with massively skewed data sets in
cascading? I'm currently dealing with a situation where the same reduce
task is throwing an OutOfMemory error during tuple deserialization (stack
trace is below) , which I can only imagine is caused by an extremely large
group of tuples with the same key. i have a trap on this branch of my
stream, but  this seems not to trigger it.

i'm in a situation where I can throw this group away as an outlier, so i'm
going to try using a FirstN aggregator to limit the size of each reduce
group and throw away tuples in excess of what's "reasonable". This seems
reasonable, but also rather kludgy. I'd love a more elegant solution.

relevant notes: this may not be a software problem, and I just need more
hardware. the data-set is is about 1T, and I'm running on an EMR cluster of
m1.xlarge boxes configured using amazon's memory-intensive bootstrap
action. i'm considering upgrading to cc1.4xlarge boxes.

--- stack trace ---

cascading.flow.FlowException: internal error during reducer execution
        at cascading.flow.hadoop.FlowReducer.reduce(FlowReducer.java:122)
        at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:527)
...
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.Long.valueOf(Long.java:557)
        at
cascading.tuple.hadoop.HadoopTupleInputStream.readType(HadoopTupleInputStre am.java:93)
        at
cascading.tuple.hadoop.HadoopTupleInputStream.getNextElement(HadoopTupleInp utStream.java:52)
        at
cascading.tuple.hadoop.TupleElementComparator.compare(TupleElementComparato r.java:74)
        at
cascading.tuple.hadoop.TupleElementComparator.compare(TupleElementComparato r.java:32)
        at
cascading.tuple.hadoop.DelegatingTupleElementComparator.compare(DelegatingT upleElementComparator.java:73)
        at
cascading.tuple.hadoop.DelegatingTupleElementComparator.compare(DelegatingT upleElementComparator.java:33)
        at
cascading.tuple.hadoop.DeserializerComparator.compareTuples(DeserializerCom parator.java:145)
        at
cascading.tuple.hadoop.GroupingSortingComparator.compare(GroupingSortingCom parator.java:59)
        at
org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
        at
org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:136)
        at
org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
... (many more) ...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ken Krugler  
View profile  
 More options Apr 14 2012, 11:05 am
From: Ken Krugler <kkrugler_li...@transpac.com>
Date: Sat, 14 Apr 2012 08:05:36 -0700
Local: Sat, Apr 14 2012 11:05 am
Subject: Re: dealing with heavily skewed data sets

From what I know of the 1.2 code, you can't run out of memory in a GroupBy/CoGroup just because there are lots of tuples in a  group.

This assumes that you're not doing anything in a custom Buffer or Aggregator that consumes memory, of course.

So maybe it's a 2.0 issue?

-- Ken

On Apr 13, 2012, at 11:13pm, Ben Linsay wrote:

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ben Linsay  
View profile  
 More options Apr 14 2012, 1:24 pm
From: Ben Linsay <blin...@gmail.com>
Date: Sat, 14 Apr 2012 10:24:41 -0700 (PDT)
Local: Sat, Apr 14 2012 1:24 pm
Subject: Re: dealing with heavily skewed data sets

unfortunately I have to concatenate a few fields of each tuple with a
particular ID to generate "bags" of data, so I'm definitely using memory in
my aggregators. no need to blame chris. :P


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris K Wensel  
View profile  
 More options Apr 14 2012, 1:40 pm
From: Chris K Wensel <ch...@wensel.net>
Date: Sat, 14 Apr 2012 10:40:46 -0700
Local: Sat, Apr 14 2012 1:40 pm
Subject: Re: dealing with heavily skewed data sets

why not group on the id instead of stuffing all the values in a Tuple, then use a Buffer to work against them.

On Apr 14, 2012, at 10:24 AM, Ben Linsay wrote:

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ben Linsay  
View profile  
 More options Apr 14 2012, 4:09 pm
From: Ben Linsay <blin...@gmail.com>
Date: Sat, 14 Apr 2012 13:09:36 -0700 (PDT)
Local: Sat, Apr 14 2012 4:09 pm
Subject: Re: dealing with heavily skewed data sets

yeah, i think that's what i'll end up doing. i'm going to end up
serializing my super-nested tuple bags to JSON, so it felt better to have
smaller operations and compose them instead of trying to group and encode
in one shot.

thanks for pointing me in the right direction.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Maxime Brugidou  
View profile  
 More options Apr 15 2012, 6:09 pm
From: Maxime Brugidou <maxime.brugi...@gmail.com>
Date: Mon, 16 Apr 2012 00:09:06 +0200
Local: Sun, Apr 15 2012 6:09 pm
Subject: Re: dealing with heavily skewed data sets

I also like the Hive way of dealing with skewed data on the reduce side
(for aggregations). They shuffle keys in the first M/R job to do a
non-skewed pre-aggregation and they add an extra M/R job for the final
aggregation. (cf https://issues.apache.org/jira/browse/HIVE-964)

Is that possible in Cascading 2.0? For heavily skewed data it's totally
worth having an extra job.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris K Wensel  
View profile  
 More options Apr 15 2012, 6:58 pm
From: Chris K Wensel <ch...@wensel.net>
Date: Sun, 15 Apr 2012 15:58:16 -0700
Local: Sun, Apr 15 2012 6:58 pm
Subject: Re: dealing with heavily skewed data sets

From what I can tell, that feature is for joins, which is something we should probably look into if I can sort out what its actually doing.

As for aggregations in a GroupBy, see the AggregateBy sub-assembly (and its sub-types). This will allow for partial aggregations map side. Which is similar to what you described but in a single MR job.

http://www.cascading.org/1.2/userguide/html/ch06s09.html

ckw

On Apr 15, 2012, at 3:09 PM, Maxime Brugidou wrote:

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Maxime Brugidou  
View profile  
 More options Apr 15 2012, 7:23 pm
From: Maxime Brugidou <maxime.brugi...@gmail.com>
Date: Mon, 16 Apr 2012 01:23:08 +0200
Local: Sun, Apr 15 2012 7:23 pm
Subject: Re: dealing with heavily skewed data sets

I totally mixed-up the two operations (aggregation & join), the additional
M/R jobs are for skewed joins and map-side pre-aggregation is for skewed
aggregations :) Sorry about that.

On Mon, Apr 16, 2012 at 12:58 AM, Chris K Wensel <ch...@wensel.net> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Robin Kraft  
View profile  
 More options Apr 16 2012, 5:10 pm
From: Robin Kraft <robinkr...@gmail.com>
Date: Mon, 16 Apr 2012 14:10:05 -0700 (PDT)
Local: Mon, Apr 16 2012 5:10 pm
Subject: Re: dealing with heavily skewed data sets

Hey Ben,

I just wanted to let you know that you're not alone! I've seen similar heap
errors using extremely skewed datasets and Cascalog, joining a set of 50
tuples with a set of 100s of thousands or a few million tuples.

In Cascalog our solution was to create a Clojure map in a separate, very
quick map-side job. After that it could be passed into the custom function,
where keyword lookups produced the same result as a join. It's much, much
faster and stabler. I've never used Cascading directly, but since it works
in Cascalog I assume there's an analogue in pure Cascading.

Here's what you'd do in Cascalog (skipping the initial map job that would
otherwise create "other-vals").

(def src
"Our input data"
[[1 20]
 [2 30]])

(defn join-it
  "Join actually happens here with keyword lookup"
  [id other-vals]
  (other-vals (keyword (str id))))

;; the query
(let [other-vals {:1 10 :2 25}]
  (??<- [?id ?val ?val2]
    (src ?id ?val)
    (join-it ?id other-vals :> ?val2)))

Result:
1 20 10
2 30 25

-Robin


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris K Wensel  
View profile  
 More options Apr 16 2012, 5:31 pm
From: Chris K Wensel <ch...@wensel.net>
Date: Mon, 16 Apr 2012 14:31:25 -0700
Local: Mon, Apr 16 2012 5:31 pm
Subject: Re: dealing with heavily skewed data sets

you should't be seeing OOME when doing a CoGroup unless you have "cascading.spill.threshold" set to an unreasonably large number (or you are stuff nearly unreasonable amounts of data into a single Tuple), and you had ordered your tuple streams incorrectly in the CoGroup (large to small)

also, Join (likely renamed to HashJoin) is an effective map side join, and in the case of an inner-join, an effective filter, which would be the best practice if one side fits in memory (assuming you order your tuple streams from large to small, as in CoGroup) and you won't be performing an Aggregation afterwards.

ckw

On Apr 16, 2012, at 2:10 PM, Robin Kraft wrote:

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris K Wensel  
View profile  
 More options Apr 16 2012, 5:34 pm
From: Chris K Wensel <ch...@wensel.net>
Date: Mon, 16 Apr 2012 14:34:40 -0700
Local: Mon, Apr 16 2012 5:34 pm
Subject: Re: dealing with heavily skewed data sets

to be clearer, CoGroup and Join work best when you order your tuple streams from large to small...

when in the case of CoGroup, large and small mean num values per unique key.

and in the case of Join, large and small means doesn't and does fit in memory, respectively

ckw

On Apr 16, 2012, at 2:31 PM, Chris K Wensel wrote:

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Oscar Boykin  
View profile  
 More options Apr 16 2012, 5:42 pm
From: Oscar Boykin <os...@twitter.com>
Date: Mon, 16 Apr 2012 14:42:50 -0700
Local: Mon, Apr 16 2012 5:42 pm
Subject: Re: dealing with heavily skewed data sets
We implemented a replicated join on top of CoGroup for scalding:

https://github.com/twitter/scalding/blob/master/src/main/scala/com/tw...

It helps dealing with skewed data.

Right now, the replication is constant across all keys, but a smarter approach is to only replicate the keys with a lot of values (we have this on our roadmap: do a sampled cogroup + count, compute replication from count, then Join the replication to the non-sampled streams and apply the block join algorithm).

--
Oscar Boykin :: @posco :: https://twitter.com/intent/user?screen_name=posco

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »