My default Spark jobs run fine, but I am not seeing a significant speedup compared to Scalding, at least for word count. One reason might be Spark's default serialization.
Scalding is most likely using Kryo.
The following job runs fine:
SPARK_MEM=2g ./run-example org.apache.spark.examples.HdfsWordCount master inputPath outputPath
Next I tried to use the Kryo serializer:
SPARK_JAVA_OPTS="-Dspark.serializer.spark.KryoSerializer" SPARK_MEM=2g ./run-example org.apache.spark.examples.HdfsWordCount master inputPath outputPath
and the job fails.
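One possible cause is the option syntax: JVM system properties are normally passed as -Dkey=value, and the option above has no "=" between the property name and the class. A minimal sketch of the corrected environment follows. It assumes the serializer class is org.apache.spark.serializer.KryoSerializer, which matches the org.apache.spark package used by the example but is not confirmed for this Spark version:

```shell
# Assumption: spark.serializer is the property name, and
# org.apache.spark.serializer.KryoSerializer is the class to use.
# JVM system properties take the form -Dname=value.
export SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer"
export SPARK_MEM=2g
```

With these exported, the same ./run-example command as before can be rerun unchanged to check whether the failure persists.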
What's the recommended serialization for large workloads, based on what you have tested? Avro or Kryo?
Thanks.
Deb