Calling Julia from PySpark


Harish Kumar

Nov 2, 2016, 9:14:25 PM
to julia-dev
I have an RDD with 10K columns and 70 million rows. The 70 million rows will be grouped into 2000-3000 groups based on a key attribute. I followed the steps below:

1. Julia and PySpark are linked using the pyjulia package.
2. The 70 million row RDD is grouped by key, and each group is passed to Julia:
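    import julia

    def juliaCall(x):
        <<convert x (list of rows) to list of lists inputdata>>
        j = julia.Julia()
        jcode = """     """
        calc = j.eval(jcode)       # jcode is expected to evaluate to a Julia function
        result = calc(inputdata)
        return result              # return the value so that map() receives it

    RDD.groupBy(key).map(lambda x: juliaCall(x))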
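
The steps above can be tried out without a cluster. What follows is only a minimal pure-Python sketch of the same group-then-apply pattern, not the actual job: `group_by_key` is a hypothetical stand-in for `RDD.groupBy`, `fake_factory` is a hypothetical stand-in for `julia.Julia().eval(jcode)`, and `get_runtime` and `process_group` are illustrative names. It also shows one possible variation, creating the runtime once per Python process rather than once per group. Whether that has any effect on the shuffle failure is not established here.

```python
from collections import defaultdict

_runtime = None  # cached callable, created at most once per Python process


def get_runtime(factory):
    """Build the runtime on first use and reuse it on later calls."""
    global _runtime
    if _runtime is None:
        _runtime = factory()
    return _runtime


def process_group(rows, factory):
    """Convert one group's rows to a list of lists and apply the cached callable."""
    inputdata = [list(r) for r in rows]
    calc = get_runtime(factory)
    return calc(inputdata)


def group_by_key(records):
    """Hypothetical local stand-in for RDD.groupBy: collect rows under each key."""
    groups = defaultdict(list)
    for key, row in records:
        groups[key].append(row)
    return groups


def fake_factory():
    """Hypothetical stand-in for julia.Julia().eval(jcode); counts how often it runs."""
    fake_factory.calls += 1
    return lambda data: sum(sum(row) for row in data)


fake_factory.calls = 0

records = [("a", (1, 2)), ("b", (3, 4)), ("a", (5, 6))]
results = {
    key: process_group(rows, fake_factory)
    for key, rows in group_by_key(records).items()
}
```

In a real job the factory would be something like `lambda: julia.Julia().eval(jcode)`, and the caching would apply per Python worker process, assuming the workers are reused between tasks.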

It works fine for a key (or group) with 50K records, but each of my groups has 100K to 3M records. In such cases the shuffle is much larger and the job fails. Can anyone guide me on how to overcome this issue?
I have a cluster of 10 nodes; each node has 116 GB and 16 cores. It runs in standalone mode, and I allocated only 10 cores per node.

Any help?

Stefan Karpinski

Nov 2, 2016, 9:37:13 PM
This question is better suited to the julia-users mailing list.