reduceByKey is not a member of Array[(java.lang.String, Int)]


exwhyz

Jun 17, 2013, 5:59:05 PM6/17/13
to spark...@googlegroups.com
I am getting this error on Spark 0.7.2 (scala 2.9.3) when using reduceByKey in spark-shell. Here is the command that fails:
 
textFile.take(10).map(line => line.split("\t")).map(line => (line(10), 1)).reduceByKey(_+_, 1).collect

Ian O'Connell

Jun 17, 2013, 6:05:20 PM6/17/13
to spark...@googlegroups.com
take(n): Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements.

You basically called an action whose result is not an RDD but an Array, so nothing from then on is really happening in a Spark way. It's just local operations on the driver node, and Array doesn't have reduceByKey. If you wanted to use this in testing you could modify it to be:
sc.parallelize(textFile.take(10)).map(line => line.split("\t")).map(line => (line(10), 1)).reduceByKey(_+_, 1).collect

which should work (since we re-distribute the results from the take)
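For what it's worth, since take(10) hands back a plain Scala Array, the same aggregation could also be done locally with ordinary collection operations, with no re-parallelizing at all. A minimal sketch, using made-up inline pairs in place of the real tab-separated file:

```scala
// take() returns a local Array[(String, Int)], so plain Scala
// collection ops can stand in for reduceByKey on the driver.
// Hypothetical sample data in place of the real input:
val pairs = Array(("a", 1), ("b", 1), ("a", 1))

// groupBy + sum is the local equivalent of reduceByKey(_ + _)
val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// counts("a") == 2, counts("b") == 1
```

Which of the two makes sense depends on whether you want to exercise the Spark pipeline itself (re-parallelize) or just inspect a handful of results (stay local).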



exwhyz

Jun 17, 2013, 6:19:39 PM6/17/13
to spark...@googlegroups.com, i...@ianoconnell.com
That was it, once I parallelized to RDD it works. Thanks!

Josh Rosen

Jun 17, 2013, 6:29:15 PM6/17/13
to spark...@googlegroups.com
Collecting a result to the driver only to immediately re-distribute it seems like an anti-pattern that will cause problems once you start working with big datasets.

If you can collect results to the driver and they're small enough to fit in memory, it's generally going to make more sense to process them further on the driver rather than redistributing them to Spark.

It looks like the original command was trying to add a "limit" to the text file to control the amount of input that was read.  Maybe there's a better way to do this that doesn't involve a collect() and re-parallelize.

Ian O'Connell

Jun 17, 2013, 6:37:59 PM6/17/13
to spark...@googlegroups.com
Sorry, yes, I should have prefaced that: the solution was really aimed at doing a small-scale test for debugging, i.e. taking the head of the real input to work with. For that use case I've previously used sample(), and it works quite well, though obviously it gives a pseudo-random subset of the input rather than the first n records, so it may not fit all cases. But it can be slotted into any transformation pipeline without introducing the anti-pattern issue.
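For reference, a sketch of what that sample-based pipeline might look like. The fraction and seed are illustrative values, and this needs a live SparkContext, so treat it as a sketch rather than a tested command:

```scala
// sample() draws a pseudo-random subset of the RDD, so it can sit
// inside the pipeline without collecting anything to the driver.
// withReplacement = false, fraction = 0.01, seed = 42 are made-up values.
val sampledCounts = textFile.sample(false, 0.01, 42)
  .map(line => line.split("\t"))
  .map(fields => (fields(10), 1))
  .reduceByKey(_ + _, 1)
  .collect
```

Unlike take(), everything before collect stays distributed, so the same line works unchanged against the full input by dropping the sample() call.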

exwhyz

Jun 22, 2013, 1:43:41 PM6/22/13
to spark...@googlegroups.com
Agreed, it was more of a small-scale test and not something that would be used in live scenarios. I will consider other/better ways to do this, but since this was a quick and compact way of limiting the input to the top n records, it was irresistible for testing purposes.