I have an RDD with 10K columns and 70 million rows; the 70 MM rows will be grouped into 2,000-3,000 groups based on a key attribute. I followed the steps below:
1. Julia and PySpark are linked using the pyjulia package.
2. The 70 MM-row RDD is grouped by key with groupByKey.
import julia

def juliaCall(x):
    # after groupBy, each element is a (key, iterable-of-rows) pair
    key, rows = x
    inputdata = [list(r) for r in rows]   # convert the rows to a list of lists for Julia
    j = julia.Julia()
    jcode = """ """                       # the per-group Julia calculation goes here
    calc = j.eval(jcode)
    result = calc(inputdata)
    return key, result

grouped = RDD.groupBy(key).map(juliaCall)
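The transformation above is lazy; nothing actually runs until I call an action on it, for example (the output path is just a placeholder):

grouped.saveAsTextFile("hdfs:///path/to/output")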
It works fine for a key (group) with 50K records, but each of my groups has 100K to 3M records. In such cases the shuffle is much heavier and the job fails. Can anyone guide me on how to overcome this issue?
I have a cluster of 10 nodes, each with 116 GB of memory and 16 cores, running in standalone mode, and I allocated only 10 cores per node.
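For reference, this is roughly how I configure the job; the app name, master URL, and executor memory are placeholders, the only values I have pinned down are the 10 cores per node:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("julia-per-group")            # placeholder app name
        .setMaster("spark://master-host:7077")    # placeholder standalone master URL
        .set("spark.executor.cores", "10")        # 10 of the 16 cores on each node
        .set("spark.cores.max", "100")            # 10 cores x 10 nodes
        .set("spark.executor.memory", "100g"))    # placeholder; each node has 116 GB total
sc = SparkContext(conf=conf)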
Any help?