It is taking forever to count a billion rows in Cassandra


kant kodali

Nov 23, 2016, 4:16:51 PM
to spark-conn...@lists.datastax.com
Hi Russell,

I have the following code

./spark-shell --conf spark.cassandra.connection.host=172.31.26.108 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864


scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB

df: org.apache.spark.sql.DataFrame = [test: binary]

scala> df.count

[Stage 0:>   (0 + 2) / 13]


I followed the post below, but it still takes forever (I have been waiting for more than 15 mins)

http://stackoverflow.com/questions/31583249/apache-spark-taking-5-to-6-minutes-for-simple-count-of-1-billon-rows-from-cassan

Any ideas?

Thanks,

kant



kant kodali

Nov 23, 2016, 4:20:55 PM
to spark-conn...@lists.datastax.com
I am using Spark 2.0.2 and  spark-cassandra-connector_2.11-2.0.0-M3.jar

If I don't specify spark.cassandra.input.split.size_in_mb=67108864 I get the following

val df = spark.sql("SELECT test from hello") // This has about billion rows

scala> df.count

[Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely

And this runs forever.

Thanks!

 

srinu gajjala

Nov 23, 2016, 4:32:06 PM
to spark-conn...@lists.datastax.com
Make the below changes and try.

The recommended number of cores per executor is 5, and you should increase the number of executors (this is different from the number of nodes present). Allocate memory to each executor based on the calculation below.

For example, with 3 executors per node and 15 GB of RAM per node: 15/3 = 5 GB, then 5 - 5*0.07 ≈ 4 GB (allocate roughly this much per executor). The amount of memory we deduct is the overhead taken by the YARN scheduler.

So a better combination would be:
 --executor-memory 4G --executor-cores 5 --num-executors 11
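As a rough sketch of the arithmetic above (the 15 GB node size and the 7% overhead factor are the assumed values from this example, not universal constants, and the helper name is made up for illustration):

```scala
// Hypothetical helper illustrating the sizing heuristic described above:
// nodeMemoryGb / executorsPerNode, minus ~7% cluster-manager overhead.
object ExecutorSizing {
  def executorMemoryGb(nodeMemoryGb: Double,
                       executorsPerNode: Int,
                       overheadFraction: Double = 0.07): Int = {
    val perExecutor = nodeMemoryGb / executorsPerNode
    // Round down so we never over-allocate the node.
    math.floor(perExecutor * (1 - overheadFraction)).toInt
  }

  def main(args: Array[String]): Unit = {
    // 15 GB node, 3 executors: 15/3 = 5, then 5 - 5*0.07 = 4.65, so roughly 4 GB each
    println(executorMemoryGb(15.0, 3))
  }
}
```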



Kind Regards,

__________________________________________

Srinivas Gajjala



--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-user+unsub...@lists.datastax.com.

kant kodali

Nov 23, 2016, 4:36:08 PM
to spark-conn...@lists.datastax.com
I am using a standalone cluster, no YARN or Mesos.


srinu gajjala

Nov 23, 2016, 4:38:56 PM
to spark-conn...@lists.datastax.com
Even a standalone cluster has a built-in cluster manager, so this is the minimum memory it can occupy.

Kind Regards,

__________________________________________

Srinivas Gajjala



kant kodali

Nov 23, 2016, 4:46:38 PM
to spark-conn...@lists.datastax.com
I have 1 executor JVM per node,

so memory per executor = 15/1 = 15 GB,
and 15 - 15*0.07 ≈ 14, so --executor-memory 14G?
num-executors ??

Do I need to specify  --conf spark.cassandra.input.split.size_in_mb=67108864 ?

srinu gajjala

Nov 24, 2016, 12:30:05 PM
to spark-conn...@lists.datastax.com
No. In this scenario, cores means the number of tasks each executor can process in parallel, and it is handled by the cluster manager. These settings are for optimal utilization of resources. We can partition the data according to our resources for better optimization, so you can use spark.cassandra.input.split.size_in_mb; its default value is 64. Are you really using that big a number?
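One thing worth checking against the original command: this parameter is in megabytes, so `--conf spark.cassandra.input.split.size_in_mb=67108864` asks for splits of roughly 64 TB each (67108864 looks like the 64 MB default expressed in bytes), which collapses the scan into a handful of huge tasks. A hedged sketch of the invocation with default-sized splits (host IP copied from the earlier message):

```shell
./spark-shell --conf spark.cassandra.connection.host=172.31.26.108 \
  --conf spark.cassandra.input.split.size_in_mb=64
```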

Kind Regards,

__________________________________________

Srinivas Gajjala

Russell Spitzer

Nov 28, 2016, 1:58:21 PM
to spark-conn...@lists.datastax.com

[Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely

Means that you are in Stage 0 of the current job:
686 tasks have completed, and only 2 are currently running in parallel.
There are 24686 tasks in total.
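As a sanity check on the task count, the connector sizes input splits by estimated data size. A back-of-the-envelope sketch (the 1 KB row size comes from the first message; the 64 MB split size is the connector default; Cassandra's own size estimates will shift the real number):

```scala
// Rough estimate of how many input splits (tasks) a full table scan produces:
// total bytes divided by split size. All sizes here are assumptions from the thread.
object SplitEstimate {
  def estimatedTasks(rows: Long, bytesPerRow: Long, splitSizeMb: Long): Long = {
    val splitBytes = splitSizeMb * 1024L * 1024L
    (rows * bytesPerRow) / splitBytes
  }

  def main(args: Array[String]): Unit = {
    // ~1 billion rows of ~1 KB each with 64 MB splits: on the order of 15k tasks,
    // the same ballpark as the 24686 shown in the progress bar.
    println(estimatedTasks(1000000000L, 1024L, 64L))
  }
}
```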

This means the max number of cores this application can use is 2. Increasing this will most likely greatly increase throughput.

One little aside: due to the way Catalyst does counts, this will not be a pushed-down count to C*; it will actually require every row to be fully read from C* before it is counted. You can always try the RDD `cassandraCount` function for a pushed-down count.

kant kodali

Nov 28, 2016, 2:13:56 PM
to spark-conn...@lists.datastax.com
Hi Russell,

Thanks for your response. I ran with cassandraCount as well; it took about 1 hr 30 mins.

By cores do you mean spark.executor.cores or spark.cores.max? It looks like the latter will take all the available cores if you don't set it, and I did not set any. I run in standalone mode.


Thanks!



Russell Spitzer

Nov 28, 2016, 2:18:13 PM
to spark-conn...@lists.datastax.com
It could be either of those. Your standalone cluster has a certain number of cores it can give, defined by the number of nodes running workers and how many cores each allows. The job itself has a certain number of cores it is allowed to use (spark.cores.max), and by default this is all of them.

Check your Spark web UI (SparkMasterIp:8080 or 7080) and take a look.


kant kodali

Nov 28, 2016, 6:00:20 PM
to spark-conn...@lists.datastax.com
Looks like I can't change that "2" in [Stage 0:=>  (686 + 2) / 24686]. I tried various options like the one below. I have 2 Spark worker nodes and 1 Spark master, and I thought each worker node runs only one JVM, so I don't know if there is anything I can do besides adding more worker nodes (although that doesn't make 100% sense, since the network utilization of both worker nodes is 2 KB/s, which is super low).

--num-executors 6 --executor-cores 8 --total-executor-cores 8



kant kodali

Nov 28, 2016, 6:03:39 PM
to spark-conn...@lists.datastax.com
Each of my worker nodes has 4 vCPUs and 16 GB RAM (m4.xlarge).



Russell Spitzer

Nov 28, 2016, 6:17:25 PM
to spark-conn...@lists.datastax.com
There is some confusion here. I would recommend you go back to the Spark standalone docs:
http://spark.apache.org/docs/latest/spark-standalone.html

A "core" is just a thread inside an executor JVM, which is spawned by a worker.

Each worker JVM spawns an executor JVM per application, with X cores depending on the application's request, but no more than that worker is authorized to give.

The worker can claim it has any number of cores; this is not tied to the hardware, so you can oversubscribe. These machines are not especially big, but oversubscribing will most likely help your parallelism.


siddharth verma

Nov 28, 2016, 9:24:31 PM
to spark-conn...@lists.datastax.com
Hi,
I was working on a utility for Cassandra full table scans at tremendously high velocity: a Cassandra fast full table scan.
How fast?
The script dumped ~229 million rows in 116 seconds, with a cluster of 6 nodes.
Data transfer rates of up to 25 MB/s were observed on the Cassandra nodes.

For one use case a Spark cluster was required, but for some reason we couldn't create one. Hence, one may use this utility to iterate through an entire table at very high speed.

Now, for any full scan, I use it freely in my ad hoc Java programs to manipulate or aggregate Cassandra data.

You can customize the options, such as fetch size, consistency level, and degree of parallelism (number of threads), according to your needs.

You can visit https://github.com/siddv29/cfs to go through the code, see the logic behind it, or try it in your program.
A sample program is also provided.
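For readers curious about the mechanics, the core idea is partitioning the token ring into contiguous subranges and scanning each on its own thread. A minimal sketch (the object name and the even partitioning are my own illustration, assuming the Murmur3Partitioner's full Long token range; the actual cfs code may differ):

```scala
// Split the Murmur3 token ring [Long.MinValue, Long.MaxValue] into n
// contiguous subranges, one per scanning thread. BigInt avoids Long overflow
// when computing the width of the full 2^64-token ring.
object TokenSplits {
  def splits(n: Int): Seq[(BigInt, BigInt)] = {
    val lo = BigInt(Long.MinValue)
    val hi = BigInt(Long.MaxValue)
    val width = (hi - lo + 1) / n
    (0 until n).map { i =>
      val start = lo + width * i
      // The last range absorbs any remainder so the whole ring is covered.
      val end   = if (i == n - 1) hi else start + width - 1
      (start, end)
    }
  }
}
```

Each thread would then issue range queries of the form `SELECT ... WHERE token(pk) >= start AND token(pk) <= end` over its own subrange.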

I coded this utility in Java.

Bhuvan Rawal(bhu1...@gmail.com) and I worked on this concept. 

Looking forward to your suggestions and comments.

P.S. Give it a try. Trust me, the iteration speed is awesome!!
It is a bare application, built quickly. If you would like to contribute to the Java utility, or add to or build on it, do reach out: sidd.verm...@gmail.com

Thanks and Regards,
Siddharth Verma





kant kodali

Nov 28, 2016, 10:02:11 PM
to spark-conn...@lists.datastax.com
Hi Siddharth,

Great effort!! I suspect the idea of breaking the scan into smaller token ranges and executing each on a separate thread has also been incorporated into spark-cassandra-connector (although I cannot confirm 100%, I was able to see those queries in the logs when the log level is set to TRACE).

Here is the thing: according to @Russell I have two cores, which means 2 threads, while you have about 128 threads, so that explains why mine is slow. I am still trying to see what I need to change to oversubscribe to a higher number of cores.


Thanks!



kant kodali

Nov 28, 2016, 10:03:42 PM
to spark-conn...@lists.datastax.com
I think async should also help, instead of spawning so many threads.



Russell Spitzer

Nov 28, 2016, 10:49:10 PM
to spark-conn...@lists.datastax.com
2 cores means 2 "execution threads"; it doesn't mean there aren't more threads than that :). Anyway, we've been talking about doing some internal parallelism inside the table-scan routine in the SCC (beyond the normal prefetching parallelism), but most of our test cases were maxing out with a decent number of cores, so we stopped. I can write a JIRA for that, as it may help.


kant kodali

Nov 29, 2016, 3:51:40 AM
to spark-conn...@lists.datastax.com
export SPARK_WORKER_CORES=12 in conf/spark-env.sh is the way to oversubscribe the number of cores that can run in parallel.


kant kodali

Nov 29, 2016, 10:13:29 PM
to spark-conn...@lists.datastax.com
With oversubscribing, the count of 750 million rows took 15 mins.
A table scan of 750 million rows took 2 hours 45 mins.

I have two worker nodes, each with 4 vCPUs and 16 GB RAM (m3.xlarge).
The Spark nodes and Cassandra nodes are not colocated.

Though my configuration is not great, it is still hard for me to justify these numbers. I would be very interested to see what can be done from the Cassandra connector side.

Let me know if I can help.

