Need help with error GC Overhead limit exceeded


srinivas reddy gantla

Dec 12, 2013, 10:54:14 PM
to scoobi...@googlegroups.com
Hi,

I have a DList of type DList[(Key, (Value1, Value2, Value3))].
When I do groupByKey, the values come back as an Iterable[(Value1, Value2, Value3)], and I am trying to convert that to a list with ".toList".
(From that list, I want to compute the distinct count of "value1", the median of "value2" and the sum of "value3".)

But this step fails with the error "GC overhead limit exceeded", because I have more than 30 million different values for each key.

Can anyone please help me with this? Is there another way of doing this without converting the values to a List?

Thanks
Srinivas


Eric Torreborre

Dec 15, 2013, 6:18:33 AM
to scoobi...@googlegroups.com
Hi Srinivas,

If you materialise the values in memory with .toList you are indeed going to exceed your memory, but you don't need to do that. You can still iterate over the values and accumulate what you need (for the count and the sum):

list.mapValues { values =>
  // accumulate the count and the sum in one pass, without building a List
  var (count, sum) = (0, 0)
  values.foreach { case (v1, v2, v3) =>
    count += 1
    sum += v3
  }
  (count, sum)
}

For the median, I don't know what the best approach is. Intuitively I would build a frequency map, Map[V2, Int], in the same loop to count how many times each value v2 has been seen, and deduce the median from that map. There might be a better way, but I don't see one when you have a large number of values.
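
A rough sketch of that idea in plain Scala (the name medianFromCounts and the Int value type are my own assumptions, not Scoobi API):

import scala.collection.mutable

// Count how many times each v2 occurs, then walk the distinct values in
// sorted order until the cumulative count reaches the middle of the data.
def medianFromCounts(values: Iterable[Int]): Int = {
  val counts = mutable.Map[Int, Int]().withDefaultValue(0)
  var total  = 0
  values.foreach { v2 => counts(v2) += 1; total += 1 }

  val half   = (total + 1) / 2
  var seen   = 0
  var median = 0
  for (v <- counts.keys.toSeq.sorted if seen < half) {
    seen += counts(v)
    median = v
  }
  median  // lower median when the number of values is even
}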

E.

srinivas reddy gantla

Dec 15, 2013, 12:45:02 PM
to scoobi...@googlegroups.com
Thanks, Eric, for the reply.

Now my groupByKey also fails on a big data set, with the same error.
Can you please point me to some examples of doing the grouping in some other way?

Thanks
Srini


Eric Torreborre

Dec 17, 2013, 5:30:23 PM
to scoobi...@googlegroups.com
The new failure might be due to the way you compute the median. If you used a Map to count the number of times each value has been seen then, with a large number of distinct values, you might still blow up memory. What you can do instead is create a map whose keys are "buckets": bucket 1 counts how many values fall between 1 and 10, bucket 2 counts the values between 10 and 20, and so on. This will not give you the exact median, but an approximation of it.
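
A rough sketch of that bucketing idea in plain Scala (the bucket width of 10 follows the example above; the name approximateMedian and the Int value type are my own assumptions):

import scala.collection.mutable

// Count values into fixed-width buckets, walk the buckets in order until the
// cumulative count reaches the middle, and return that bucket's midpoint.
def approximateMedian(values: Iterable[Int], bucketWidth: Int = 10): Int = {
  val buckets = mutable.Map[Int, Long]().withDefaultValue(0L)
  var total   = 0L
  values.foreach { v => buckets(v / bucketWidth) += 1; total += 1 }

  val half = (total + 1) / 2
  var seen = 0L
  var medianBucket = 0
  for (b <- buckets.keys.toSeq.sorted if seen < half) {
    seen += buckets(b)
    medianBucket = b
  }
  medianBucket * bucketWidth + bucketWidth / 2
}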

E.

Raghu

Apr 17, 2014, 2:34:06 PM
to scoobi...@googlegroups.com
Hi Eric,

I am trying the same thing and I tried what you suggested: iterate over the Iterable and calculate the count and the sum, but it fails with "Java heap space". I have 36 million records for one key and the Scoobi job is not able to handle it. Is there something similar to blockjoin for handling a large volume of data after grouping?

Thanks,
Raghu

Raghu

May 13, 2014, 2:47:36 PM
to scoobi...@googlegroups.com
Here is the solution. For the counts, add a count column to the AvroObject and set it to 1 before applying the reduction.
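
That initialisation step could look roughly like this (the names input and rows are my own; only the field accessors come from the code below):

// Sketch, assuming `input` is the original DList[AvroObject]:
// setting the count field to 1 on every record lets the reduction below
// sum it into a per-key record count.
val rows: DList[AvroObject] = input.map { row =>
  row.setCountField(1)
  row
}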

def red: Reduction[AvroObject] = Reduction((row1: AvroObject, row2: AvroObject) => {
  val field4Sum = row1.getField4 + row2.getField4
  val field5Sum = row1.getField5 + row2.getField5
  val count = row1.getCountField + row2.getCountField
  row1.setField4(field4Sum)
  row1.setField5(field5Sum)
  row1.setCountField(count)
  row1
})

// group by the three key fields and combine each group with the reduction
val summary = rows.groupBy { row => (row.getField1, row.getField2, row.getField3) }.combine(red)

Thanks,
Raghu