[scalding] preserving group keys after groupBy

Miguel Ping

Apr 14, 2014, 1:56:09 PM
to cascadi...@googlegroups.com
If I have the following: 

  val aggr = filtered.groupBy('a, 'b, 'c) { group =>
    group.min('a -> 'test)
  }

How can I preserve the group fields after the groupBy?
I want to achieve the same as the following in Apache Pig (notice the flatten(group)):

grouped = group filtered by (a, b, c);
aggr = foreach grouped generate
  flatten(group),
  MIN(filtered.a) as test;

Thanks

Oscar Boykin

Apr 14, 2014, 1:59:44 PM
to cascadi...@googlegroups.com
In scalding, groupBy('a, 'b, 'c) { .. } means that for each value of the fields a, b, c, do the given aggregation. In your code, there will only ever be one value of 'a in each group, so why are you taking the minimum of one item?

I don't know Pig well enough to translate the code you wrote into Scalding. Can you describe in more detail the input and output you want?


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/2b6b7e08-645e-42ab-9277-9dbfe47a8200%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Oscar Boykin :: @posco :: http://twitter.com/posco

Jonathan Coveney

Apr 14, 2014, 2:22:55 PM
to cascadi...@googlegroups.com
Unlike Oscar, I know Pig pretty well, and I think that Scalding is doing what you want.

  val aggr = filtered.groupBy('a, 'b, 'c) { group =>
    group.min('a -> 'test)
  }

should result in aggr having four fields: 'a, 'b, 'c, 'test. Oscar is also right in that 'test and 'a are going to be the same, in this case, since you're grouping on it.
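
To see why 'test and 'a come out equal, here is a minimal sketch using plain Scala collections rather than Scalding. The sample rows and the object name `GroupKeysSketch` are made up for illustration; only the shape of the result is the point.

```scala
// Plain Scala collections, not Scalding: a stand-in for a pipe with fields 'a, 'b, 'c.
object GroupKeysSketch {
  // Hypothetical rows (a, b, c); the thread does not show the real data.
  val filtered: List[(Int, String, String)] =
    List((1, "x", "p"), (1, "x", "p"), (2, "y", "q"))

  // Group on the full (a, b, c) key, then take the minimum of a within each group.
  // Every row in a group shares the same a, so the minimum is just that a.
  val aggr: List[(Int, String, String, Int)] =
    filtered
      .groupBy(identity)
      .toList
      .map { case ((a, b, c), rows) => (a, b, c, rows.map(_._1).min) }
      .sortBy(_._1)
}
```

Each output row carries the key fields plus the aggregate, and the fourth column always repeats the first.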


Miguel Ping

Apr 15, 2014, 5:52:30 AM
to cascadi...@googlegroups.com
Sorry for the bad example guys; I got mixed up. I was under the impression that scalding wasn't emitting the group.
Here's a better one:

[name age location]
[a 10 pt]
[a 20 gb]
[b 10 uk]
[b 15 es]

output should be:
[a 10 [pt gb]]
[b 10 [uk es]]

val aggr = filtered.groupBy('name) { group =>
  group.min('age -> 'minAge) // want to emit the rest of the tuples also
}

Thanks!


Oscar Boykin

Apr 15, 2014, 12:15:01 PM
to cascadi...@googlegroups.com
data.groupBy('name) {
  _.min('age)
    .toList('location -> 'locations)
}
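
As a rough illustration of what that aggregation computes, the same result can be reproduced with plain Scala collections on the four sample rows. This is only a sketch of the semantics, not Scalding code; the object name `MinAndListSketch` is hypothetical, and I would not rely on the order of list elements in a real job.

```scala
// Plain Scala collections, not Scalding: mimics min('age) plus toList('location).
object MinAndListSketch {
  // Rows from the example above: (name, age, location).
  val data: List[(String, Int, String)] =
    List(("a", 10, "pt"), ("a", 20, "gb"), ("b", 10, "uk"), ("b", 15, "es"))

  // One output row per name: the minimum age plus every location in the group.
  val aggr: List[(String, Int, List[String])] =
    data
      .groupBy(_._1)
      .toList
      .map { case (name, rows) => (name, rows.map(_._2).min, rows.map(_._3)) }
      .sortBy(_._1)
}
```

With the sample data this gives one row for a with age 10 and locations pt, gb, and one row for b with age 10 and locations uk, es.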

 


Miguel Ping

Apr 17, 2014, 11:01:40 AM
to cascadi...@googlegroups.com
So simple. Not sure if this is common, but I do this in Pig all the time; I couldn't find anything in the docs...
Thanks!
Message has been deleted

Miguel Ping

Apr 21, 2014, 12:22:37 PM
to cascadi...@googlegroups.com
By the way, how is the code supposed to look if I want to generate a list for more than one field? I hit an error on arity:

[name age location country]
[a 10 a pt]
[a 20 b gb]
[b 10 c uk]
[b 15 d es]

data.groupBy('name) {
  _.min('age)
    .toList('location, 'country -> 'locations) //arity check error
}

Caused by: java.lang.AssertionError: assertion failed: Arity of (class com.twitter.scalding.LowPriorityTupleConverters$$anon$5) is 1, which doesn't match: + ('location', 'country')
        at scala.Predef$.assert(Predef.scala:179)
        at com.twitter.scalding.TupleArity$class.assertArityMatches(TupleArity.scala:42)
        at com.twitter.scalding.LowPriorityTupleConverters$$anon$5.assertArityMatches(TupleConverter.scala:47)
        at com.twitter.scalding.GroupBuilder.mapReduceMap(GroupBuilder.scala:188)
        at com.twitter.scalding.GroupBuilder.mapReduceMap(GroupBuilder.scala:37)
        ...

I generated the list beforehand:

data.map(('location, 'country) -> 'aggr) { els: (String, String) =>
    val (loc, c) = els
    List(loc, c)
  }
  .groupBy('name) {
    _.min('age)
      .toList('aggr -> 'list)
  }

I suppose this is the only way? I would prefer to generate the list on the fly within the groupBy.

Oscar Boykin

Apr 21, 2014, 1:27:50 PM
to cascadi...@googlegroups.com
On Mon, Apr 21, 2014 at 8:51 AM, Miguel Ping <migue...@gmail.com> wrote:
> By the way, how is the code supposed to look if I want to generate a list for more than one field? I hit an error on arity:
>
> [name age location country]
> [a 10 a pt]
> [a 20 b gb]
> [b 10 c uk]
> [b 15 d es]
>
> data.groupBy('name) {
>   _.min('age)
>     .toList('location, 'country -> 'locations) //arity check error
> }

You must put the parens around the right fields:

  .toList(('location, 'country) -> 'locations)

> Caused by: java.lang.AssertionError: assertion failed: Arity of (class com.twitter.scalding.LowPriorityTupleConverters$$anon$5) is 1, which doesn't match: + ('siteKey', 'contentHost')
>         at scala.Predef$.assert(Predef.scala:179)
>         at com.twitter.scalding.TupleArity$class.assertArityMatches(TupleArity.scala:42)
>         at com.twitter.scalding.LowPriorityTupleConverters$$anon$5.assertArityMatches(TupleConverter.scala:47)
>         at com.twitter.scalding.GroupBuilder.mapReduceMap(GroupBuilder.scala:188)
>         at com.twitter.scalding.GroupBuilder.mapReduceMap(GroupBuilder.scala:37)
>         ...
>
> I'm guessing I have to use something like mapPlusMap to generate a single field?
> Thanks!


Miguel Ping

Apr 21, 2014, 1:52:41 PM
to cascadi...@googlegroups.com
That was my first mistake.
Even with the parens, I get the same arity check error. I have a workaround (see my previous post), but it would be nice to be able to do this.

Oscar Boykin

Apr 21, 2014, 2:23:23 PM
to cascadi...@googlegroups.com
Woops.

toList actually requires the type of the items you are getting out:

.toList[(String, String)](('location, 'country) -> 'locations)
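
For comparison, here is a small sketch in plain Scala collections (not Scalding) of the two result shapes discussed in this thread: a list of (location, country) pairs, which is what the type parameter `(String, String)` asks for, versus the list of two-element lists produced by mapping each row to a `List` before grouping. The object and value names are hypothetical.

```scala
// Plain Scala collections, not Scalding: contrasts the two list shapes per group.
object ListShapeSketch {
  // Rows from the example above: (name, age, location, country).
  val data: List[(String, Int, String, String)] =
    List(("a", 10, "a", "pt"), ("a", 20, "b", "gb"), ("b", 10, "c", "uk"), ("b", 15, "d", "es"))

  // Typed as pairs: one (location, country) tuple per row in the group.
  val asPairs: Map[String, List[(String, String)]] =
    data.groupBy(_._1).map { case (name, rows) => name -> rows.map(r => (r._3, r._4)) }

  // Pre-mapped workaround: each row becomes its own two-element list.
  val asLists: Map[String, List[List[String]]] =
    data.groupBy(_._1).map { case (name, rows) => name -> rows.map(r => List(r._3, r._4)) }
}
```

The pairs keep the two fields distinct in the type, while the nested lists lose that distinction, which may matter for whatever consumes the grouped output.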




Miguel Ping

Apr 21, 2014, 2:28:34 PM
to cascadi...@googlegroups.com
Thanks. I really should read more about Scala's type system...