joinWithCassandraTable executes too many queries


Mikhail Tsaplin

Jan 17, 2017, 10:32:50 AM
to DataStax Spark Connector for Apache Cassandra
Hi,

I have a Cassandra table where one column holds 1-3 KB JSON strings.

The C* table is accessed via joinWithCassandraTable; the JSON strings from the joined RDD are converted to a DataFrame, which is registered as a Spark temp table, and a GROUP BY aggregation is executed against that temp table. The C* table is partitioned so that rows from the same partition all fall into the same GROUP BY group; each group is formed only from rows of a single partition. The C* table has 120 partitions, each holding 800K - 11M rows.
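For reference, here is a minimal sketch of that pipeline (keyspace, table, column, and key names are hypothetical, and the exact calls may differ slightly from my real job):

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-aggregation").getOrCreate()
val sc = spark.sparkContext

// Keys to look up, joined against the C* table that holds the JSON column.
val keys = sc.parallelize(Seq(("device-1", 1), ("device-2", 2)))
val joined = keys.joinWithCassandraTable("my_keyspace", "json_table")

// Extract the JSON strings, parse them into a DataFrame, register a temp table
// and run the GROUP BY aggregation in Spark SQL.
val jsonRdd = joined.map { case (_, row) => row.getString("json_value") }
val jsonDf = spark.read.json(jsonRdd)
jsonDf.createOrReplaceTempView("events")
val aggregated = spark.sql("SELECT group_key, count(*) AS cnt FROM events GROUP BY group_key")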

While the aggregation query is running, Spark starts many concurrent queries reading from the C* table partitions, and because the JSON strings are large, three problems can happen:

1) The C* server can die because of memory problems (I have seen GC overhead limit exceptions and situations where the server died, probably killed by the OOM killer);

2) Some requests can fail with the following exception:
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:50)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:277)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:257)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)

3) Some requests can fail with:
com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:76)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:277)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:257)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)


How can I control the number of concurrent queries issued by CassandraJoinRDD?

I found the spark.cassandra.input.join.throughput_query_per_sec parameter, and setting it to 6 helps. But if the partition sizes grow tomorrow, this could lead to exceptions again, so in my case it would be better to control just the number of concurrent queries. Is that possible?

P.S.
Other connector-related properties are:
spark.cassandra.read.timeout_ms=3600000
spark.cassandra.connection.timeout_ms=60000
spark.cassandra.connection.keep_alive_ms=900000
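
For completeness, the same settings can be passed programmatically; a sketch (the values are the ones above, and they could equally go on spark-submit via --conf):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.input.join.throughput_query_per_sec", "6")
  .set("spark.cassandra.read.timeout_ms", "3600000")
  .set("spark.cassandra.connection.timeout_ms", "60000")
  .set("spark.cassandra.connection.keep_alive_ms", "900000")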

Russell Spitzer

Jan 17, 2017, 7:07:13 PM
to DataStax Spark Connector for Apache Cassandra

spark.cassandra.input.join.throughput_query_per_sec is your concurrency limiter. It is the only threshold on how many queries are executed at a time. I would warn you that if you are seeing errors with this method, it is most likely your server will not be stable under moderate load from outside the cluster either.

These queries are run using executeAsync so the concurrency itself is managed by the underlying Cassandra Java Driver.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraJoinRDD.scala#L129

The parameter above bounds how many executeAsyncs can be running per second per core and limits new executeAsyncs from being run when above the threshold. 
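
Roughly, the throttle behaves like the sketch below (an illustration only, not the connector's actual code; Guava's RateLimiter stands in for the connector's internal limiter):

import com.datastax.driver.core.{ResultSetFuture, Session, Statement}
import com.google.common.util.concurrent.RateLimiter

class ThrottledExecutor(session: Session, queriesPerSec: Double) {
  private val limiter = RateLimiter.create(queriesPerSec)

  def executeAsyncThrottled(stmt: Statement): ResultSetFuture = {
    limiter.acquire()          // block until a permit for this second is available
    session.executeAsync(stmt) // the driver still manages the actual concurrency
  }
}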


Mikhail Tsaplin

Jan 18, 2017, 12:53:37 AM
to DataStax Spark Connector for Apache Cassandra
I checked the executeAsync implementation; if I understand correctly, the implementation at https://github.com/datastax/java-driver/blob/3.x/driver-core/src/main/java/com/datastax/driver/core/SessionManager.java#L143 immediately calls the execute method, which, when the SessionManager is in the initialized state, immediately calls:

new RequestHandler(this, callback, statement).sendRequest();

RequestHandler does not count concurrent calls, and HostConnectionPool is the only limiter of concurrent connections; when the limit is exceeded it fails the future in RequestHandler ( https://github.com/datastax/java-driver/blob/3.x/driver-core/src/main/java/com/datastax/driver/core/RequestHandler.java#L329 ).

So if CassandraJoinRDD issues many long-running queries that take more than 1 s each, the connection pool can eventually be exhausted.

Don't we need to control the number of running queries on the connector's side?

I will try to decrease the number of partitions in the C* table (assuming the connector does not issue more queries than there are partitions).

Another option that worked was moving away from CassandraJoinRDD: storing the data for a single aggregation in a separate table and loading it with cassandraTable.
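
To illustrate the kind of caller-side control I mean, a cap on in-flight executeAsync calls could look roughly like this (a sketch only, not connector code; class and method names are made up):

import java.util.concurrent.Semaphore
import com.datastax.driver.core.{ResultSet, ResultSetFuture, Session, Statement}
import com.google.common.util.concurrent.{FutureCallback, Futures}

class BoundedExecutor(session: Session, maxInFlight: Int) {
  private val permits = new Semaphore(maxInFlight)

  def executeAsyncBounded(stmt: Statement): ResultSetFuture = {
    permits.acquire()                       // block while maxInFlight queries are outstanding
    val future = session.executeAsync(stmt)
    Futures.addCallback(future, new FutureCallback[ResultSet] {
      override def onSuccess(rs: ResultSet): Unit = permits.release()
      override def onFailure(t: Throwable): Unit = permits.release()
    })
    future
  }
}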



Russell Spitzer

Jan 18, 2017, 11:19:56 AM
to DataStax Spark Connector for Apache Cassandra, Jaroslaw Grabowski
How large are these partitions? I just want to know what use case we are trying to optimize here. Also adding Jaroslaw.




Russell Spitzer

Jan 18, 2017, 11:30:04 AM
to DataStax Spark Connector for Apache Cassandra, Jaroslaw Grabowski
OK, I think we understand the problem now. Currently, beyond limiting the connection pool, I'm not sure there is a way to limit the actual number of in-flight requests. I've pinged the driver team to see if they have any ideas, and we'll think about workarounds on our end. Thanks for pointing out the issue.

Mikhail Tsaplin

Jan 18, 2017, 3:25:01 PM
to DataStax Spark Connector for Apache Cassandra, jtgra...@gmail.com
Partition sizes vary from 300K to 11M rows, there are 120 partitions, ~102M rows in total, and the biggest column is roughly 1-3 KB.

Russell Spitzer

Jan 18, 2017, 3:34:50 PM
to DataStax Spark Connector for Apache Cassandra, jtgra...@gmail.com
So we could be looking at 33 GB partitions?
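(Roughly: 11M rows × ~3 KB per row ≈ 33 GB for the largest partition.)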


Mikhail Tsaplin

Jan 19, 2017, 3:43:38 AM
to spark-conn...@lists.datastax.com
Yes, that is possible. But I have seen only one...


Mikhail Tsaplin

Nov 2, 2017, 5:18:41 AM
to spark-conn...@lists.datastax.com
Hi Russell,
do you know if this problem has already been solved in newer versions of the driver/connector?

Russell Spitzer

Nov 2, 2017, 10:34:30 AM
to spark-conn...@lists.datastax.com

Yes, there are Spark 503 and Spark 507; 503 is in now and 507 will be in the next release.


Mikhail Tsaplin

Nov 3, 2017, 1:51:12 AM
to DataStax Spark Connector for Apache Cassandra
What do these numbers (507, 503) mean?

Russell Spitzer

Nov 3, 2017, 9:44:53 AM
to spark-conn...@lists.datastax.com

Those are JIRA tickets in the Spark Cassandra Connector project.


Mikhail Tsaplin

Nov 14, 2017, 1:24:27 AM
to spark-conn...@lists.datastax.com
Hi Russell,

I've tried upgrading the connector to version 2.0.5; the Cassandra version is still the old 2.1.6. Now the Cassandra server dies with an out-of-memory error.

The Spark log contains a lot of messages like:

17/11/14 04:43:27 TRACE RequestHandler: [2051895867] com.datastax.spark.connector.writer.RichBoundStatement@4dce57a4
17/11/14 04:43:27 TRACE RequestHandler: [2051895867-1] Starting
17/11/14 04:43:27 TRACE RequestHandler: [2051895867-1] Querying node /172.31.60.128:9042
17/11/14 04:43:27 TRACE Connection: Connection[/172.31.60.128:9042-1, inFlight=370, closed=false], stream 24128, writing request EXECUTE 0xea9de36ce05edea0f0f4d85451a206d8 ([cl=LOCAL_ONE, positionalVals=[java.nio.HeapByteBuffer[pos=0 lim=9 cap=9], java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]], namedVals={}, skip=true, psize=5000, state=null, serialCl=SERIAL])
17/11/14 04:43:27 TRACE Connection: Connection[/172.31.60.128:9042-1, inFlight=370, closed=false], stream 24128, request sent successfully
17/11/14 04:43:27 TRACE CodecRegistry: Looking for codec [varchar <-> class java.lang.String]
17/11/14 04:43:27 TRACE CodecRegistry: Codec found: VarcharCodec [varchar <-> java.lang.String]
17/11/14 04:43:27 TRACE CodecRegistry: Looking for codec [int <-> class java.lang.Integer]
17/11/14 04:43:27 TRACE CodecRegistry: Codec found: IntCodec [int <-> java.lang.Integer]
17/11/14 04:43:27 TRACE RequestHandler: [376028443] com.datastax.spark.connector.writer.RichBoundStatement@7a4c947
17/11/14 04:43:27 TRACE RequestHandler: [376028443-1] Starting
17/11/14 04:43:27 TRACE RequestHandler: [376028443-1] Querying node /172.31.60.128:9042
17/11/14 04:43:27 TRACE Connection: Connection[/172.31.60.128:9042-1, inFlight=371, closed=false], stream 24192, writing request EXECUTE 0xea9de36ce05edea0f0f4d85451a206d8 ([cl=LOCAL_ONE, positionalVals=[java.nio.HeapByteBuffer[pos=0 lim=9 cap=9], java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]], namedVals={}, skip=true, psize=5000, state=null, serialCl=SERIAL])
17/11/14 04:43:27 TRACE Connection: Connection[/172.31.60.128:9042-1, inFlight=371, closed=false], stream 24192, request sent successfully
17/11/14 04:43:27 TRACE CodecRegistry: Looking for codec [varchar <-> class java.lang.String]
17/11/14 04:43:27 TRACE CodecRegistry: Codec found: VarcharCodec [varchar <-> java.lang.String]
17/11/14 04:43:27 TRACE CodecRegistry: Looking for codec [int <-> class java.lang.Integer]
17/11/14 04:43:27 TRACE CodecRegistry: Codec found: IntCodec [int <-> java.lang.Integer]
17/11/14 04:43:27 TRACE RequestHandler: [242802295] com.datastax.spark.connector.writer.RichBoundStatement@55a23bc6
17/11/14 04:43:27 TRACE RequestHandler: [242802295-1] Starting
17/11/14 04:43:27 TRACE RequestHandler: [242802295-1] Querying node /172.31.60.128:9042
17/11/14 04:43:27 TRACE Connection: Connection[/172.31.60.128:9042-1, inFlight=372, closed=false], stream 24256, writing request EXECUTE 0xea9de36ce05edea0f0f4d85451a206d8 ([cl=LOCAL_ONE, positionalVals=[java.nio.HeapByteBuffer[pos=0 lim=9 cap=9], java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]], namedVals={}, skip=true, psize=5000, state=null, serialCl=SERIAL])
17/11/14 04:43:27 TRACE Connection: Connection[/172.31.60.128:9042-1, inFlight=372, closed=false], stream 24256, request sent successfully
17/11/14 04:43:27 TRACE CodecRegistry: Looking for codec [varchar <-> class java.lang.String]
17/11/14 04:43:27 TRACE CodecRegistry: Codec found: VarcharCodec [varchar <-> java.lang.String]
17/11/14 04:43:27 TRACE CodecRegistry: Looking for codec [int <-> class java.lang.Integer]
17/11/14 04:43:27 TRACE CodecRegistry: Codec found: IntCodec [int <-> java.lang.Integer]
17/11/14 04:43:27 TRACE RequestHandler: [177237732] com.datastax.spark.connector.writer.RichBoundStatement@42d382a7

And finally, it looks like all of these requests fail with messages like:

17/11/14 04:45:26 TRACE RequestHandler: [1152532463-1] Setting final exception
17/11/14 04:45:26 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement@3f0a1608
com.datastax.driver.core.exceptions.OperationTimedOutException: [/172.31.60.128:9042] Timed out waiting for server response
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:772)
        at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:588)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:662)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:385)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
17/11/14 04:45:26 TRACE RequestHandler: [1020921875-1] Setting final exception
17/11/14 04:45:26 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement@69c20520
com.datastax.driver.core.exceptions.OperationTimedOutException: [/172.31.60.128:9042] Timed out waiting for server response
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:772)
        at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:588)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:662)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:385)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)

PS.
I will try to upgrade Cassandra and repeat the test.

Mikhail Tsaplin

Nov 14, 2017, 4:24:33 AM
to spark-conn...@lists.datastax.com
Checked with Cassandra v3: same problem; the only difference is that it is no longer the whole cluster that dies with OutOfMemory.

Russell Spitzer

Nov 14, 2017, 11:04:55 AM
to spark-conn...@lists.datastax.com

There are still no released versions with those patches; you have to build off the master or b2.0 branches to have them included.



Mikhail Tsaplin

Apr 17, 2018, 1:25:11 AM
to spark-conn...@lists.datastax.com
Hi Russell,
Looks like a similar problem exists on the other side. The saveToCassandra method does not support back pressure; com.datastax.spark.connector.writer.TableWriter just throttles the output rate in bytes/sec, and if Cassandra write performance degrades an OOM exception can happen.
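
For reference, the write-side knobs I am aware of are throughput- and batch-based rather than back-pressure-based; a sketch (property names as in the connector docs, values purely illustrative):

import org.apache.spark.SparkConf

val writeConf = new SparkConf()
  .set("spark.cassandra.output.throughput_mb_per_sec", "5") // cap on write throughput per core
  .set("spark.cassandra.output.concurrent.writes", "5")     // max batches in flight per task
  .set("spark.cassandra.output.batch.size.bytes", "1024")   // target batch size in bytes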


Russell Spitzer

Apr 17, 2018, 8:50:49 AM
to spark-conn...@lists.datastax.com
It would block at concurrent writers times the number of cores. The iterators are lazy, so nothing else would be pulled through from previous iterators. OOM would only be possible if an early stage loaded everything into memory.
