--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
and if i want to make it a map-side join, i do:

a1 = a.partitionBy(myPartitioner)
b1 = b.partitionBy(myPartitioner)

and now a1.join(b1) is a map-side join?
On Thursday, June 13, 2013 3:04:39 PM UTC-7, Koert Kuipers wrote:
> and if i want to make it a map-side join, i do:
> a1 = a.partitionBy(myPartitioner)
> b1 = b.partitionBy(myPartitioner)
> and now a1.join(b1) is a map-side join?
++ I have this doubt too. How do we make sure that the RDD elements with a particular key of a1 and b1 reside on the same node? Assume a1 and b1 are both hash-partitioned, split by the hash value of the key into 4 partitions.

Also, if a1 and b1 are normal RDDs that are not partitioned, then when those RDDs are computed, what policy decides how many partitions they will be split into? Does it use any hash-partitioning mechanism in this case, or are the RDDs split randomly?
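To illustrate the co-location question above, here is a minimal sketch in plain Python (not actual Spark code; the data and the 4-partition count are made-up assumptions mirroring the example). It shows that when two datasets are bucketed by the same hash partitioner, equal keys always land in the same partition index, which is what allows each partition pair to be joined locally without a shuffle:

```python
# Plain-Python illustration: two key/value datasets partitioned by the
# same hash function put matching keys into the same bucket index.

def hash_partition(pairs, num_partitions):
    """Mimic a hash partitioner: bucket (key, value) pairs by hash(key) % n."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

a = [("apple", 1), ("banana", 2), ("cherry", 3)]
b = [("apple", 10), ("cherry", 30), ("durian", 40)]

a_parts = hash_partition(a, 4)
b_parts = hash_partition(b, 4)

# Because both sides used the same partitioner, a key can only match
# inside the partition with the same index -- so each partition pair
# can be joined locally, with no cross-partition data movement.
joined = []
for a_bucket, b_bucket in zip(a_parts, b_parts):
    lookup = dict(b_bucket)
    for key, value in a_bucket:
        if key in lookup:
            joined.append((key, (value, lookup[key])))
```

Note this only guarantees that matching keys share a partition *index*; whether two partitions with the same index sit on the same physical node is up to the scheduler, which is exactly the concern raised above.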
val distPool = sc.parallelize(poolL)

poolL is a normal Scala collection, not a Spark RDD, and I am not using any partitionBy() call on the RDD distPool.

Case 1) The datastore from which poolL is populated is, let's say, a local file from the local file system (not HDFS). What would be the partitioning mechanism for the distPool RDD? How many partitions would it have? Does it depend on the size of distPool and the available RAM on the worker nodes? When will the partitions (or the number of partitions and the nodes hosting them) be materialized? Will they be materialized only when some computation triggers the calculation of the distPool RDD?

Case 2) Let's say poolL is populated in memory and not from a local file. How does this change the partitioning mechanism for the distPool RDD?
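On the parallelize question above: the collection is split into a number of slices (you can also pass it explicitly as a second argument to parallelize), and the split is positional, not hash-based, so no partitioner is set. A rough plain-Python sketch of that positional slicing (an approximation for illustration, not Spark's actual code):

```python
# Rough sketch: split a local collection into positional slices,
# approximating how a parallelized collection is divided. Elements are
# assigned by position, not by key hash, so no partitioner is involved.

def slice_collection(seq, num_slices):
    n = len(seq)
    return [seq[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

pool = list(range(10))     # stand-in for poolL
parts = slice_collection(pool, 4)
# parts covers every element exactly once, in order, across 4 slices --
# which is why distPool.partitioner would report None.
```

This is also why, in case 2, populating poolL from memory rather than a file changes nothing: by the time parallelize sees it, it is just a collection of a known length.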
--
What if you add a node to the cluster, or a node dies? Or what if a node is busy and can never run anything - are you going to block the entire program forever so you can put a partition on that node?

You can still write your program in a way that does mostly local joins. And even if the joins need to fetch blocks from remote nodes, they still avoid an extra round of shuffle, which is much more expensive than just fetching data.
Look at the PageRank example (Section 3.2.2) in Spark's NSDI paper: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
1) cust1.partitioner shows as 'None'. Is this happening because my HDFS cluster does not share any nodes with the Spark cluster? Does this behaviour change if I have both HDFS and Spark running on the same nodes?

2) Also, partitionBy does not work on cust1. It looks like it always expects the RDD to be in (key, value) format. Each cust1 element represents a tab-delimited text line and has no (key, value) notion, but I wanted to partition it so that the splits on delimiters and the filters on the split fields of cust1 would be parallelized down the flow. Is there any way I can partition and parallelize cust1, based on line numbers etc., into 4 partitions?

3) I do create a pair out of cust1 down the flow and use groupBy. The resulting grouped RDD does show partitioner=HashPartitioner because of the groupBy operation. But I am curious how many partitions it (RDD cust) will create, based on the hash values of the keys, when it gets materialized. Please see below.
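On question 3: the partition count after a hash-partitioned groupBy is not derived from the hash values themselves; it is fixed by the partitioner's number of partitions (the level of parallelism you pass in, or the default if omitted), regardless of how many distinct keys exist. A small plain-Python sketch of that bucketing (made-up data, not Spark code):

```python
# Plain-Python sketch: group values by key inside a fixed number of hash
# buckets. The bucket count is chosen up front and does not depend on
# how many distinct keys the data happens to contain.
from collections import defaultdict

def group_by_key(pairs, num_partitions):
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions][key].append(value)
    return buckets

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
grouped = group_by_key(pairs, 4)

# Still 4 partitions even though there are only 3 distinct keys;
# some partitions may simply end up empty.
```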