Testing / integrating spark applications and cassandra.

Kevin Burton

Jan 13, 2015, 8:26:45 PM
to spark-conn...@lists.datastax.com
I'm curious what you guys are using to unit test spark applications and cassandra?

I'm using cassandra-unit and I'm in dependency hell.

Spark + Cassandra seem to work together easily, but when you throw them into the same VM they have insane dependency conflicts.

One strategy is to break Cassandra out into a separate daemon, but that's easier said than done.

It would be killer to figure out an easy way to run Cassandra as a background daemon within Maven. But until then, plan B is to try to get cassandra-unit to work.
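For what it's worth, the Mojohaus cassandra-maven-plugin can start and stop a Cassandra daemon around the integration-test phase. A hedged sketch of the pom configuration (the version number is an assumption from memory; check Maven Central):

```xml
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>cassandra-maven-plugin</artifactId>
  <!-- version is an assumption; look up the latest on Maven Central -->
  <version>2.0.0-1</version>
  <executions>
    <execution>
      <id>start-cassandra</id>
      <phase>pre-integration-test</phase>
      <goals><goal>start</goal></goals>
    </execution>
    <execution>
      <id>stop-cassandra</id>
      <phase>post-integration-test</phase>
      <goals><goal>stop</goal></goals>
    </execution>
  </executions>
</plugin>
```

With this, `mvn verify` brings Cassandra up before the failsafe-run integration tests and tears it down afterwards, keeping it out of the test JVM's classpath entirely.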

Helena Edelson

Jan 14, 2015, 4:16:33 PM
to spark-conn...@lists.datastax.com
Hi Kevin,

We run our own embedded Cassandra for IT tests, which is available from the spark-cassandra-connector-embedded artifact. We do not use Cassandra for unit tests. I've not encountered any dependency issues.
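For reference, pulling that artifact into a Maven build would look roughly like this (the groupId, Scala-version suffix, and version are assumptions from memory; check the connector's documentation for the exact coordinates):

```xml
<dependency>
  <!-- coordinates are an assumption; only the artifact name comes from the post -->
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector-embedded_2.10</artifactId>
  <version>1.1.1</version>
  <scope>test</scope>
</dependency>
```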

In the distant past I'd done some stuff with cassandra unit but that was also with astyanax vs spark and the connector.

Hopefully others can contribute their unit test successes!

Helena
@helenaedelson

Gerard Maas

Jan 24, 2015, 7:35:26 AM
to spark-conn...@lists.datastax.com
Hi Helena, Kevin,

At Virdata we use a combination of Docker (https://www.docker.com/) and Cucumber (http://cukes.info/) to implement our integration tests. Docker provides the containers for the different components, and Cucumber has allowed us to create a number of reusable steps to orchestrate the tests and express, in a high-level language, what we expect as a result.

Our integration tests look like this (modified to protect the innocent):

Simple example: Test that our cassandra client works as expected.

Scenario: The cassandra client requires a Cassandra instance
Given a docker client instance
Given an instance of 'cassandra:2_0' getting IP 'cassip'
And wait 20 seconds

Scenario: The cassandra client connects and executes statements on a file with replacements
Given a cassandra client connected to '$cassip'
And I execute the resource 'sampleCQL.cql' with replacements:
| variable | value |
| CUSTOMER | cukecustomer |
When I retrieve the list of keyspaces
Then the operation succeeds
And the keyspaces list contains:
| cukecustomer |
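A step like "execute the resource with replacements" boils down to plain placeholder substitution before the CQL script is sent to the client. A minimal sketch in Java (the `${VAR}` placeholder syntax and the class name are assumptions for illustration, not Virdata's actual implementation):

```java
import java.util.Map;

public class CqlTemplate {
    // Replace ${VAR}-style placeholders in a CQL script with the values
    // supplied by the Cucumber data table.
    public static String apply(String script, Map<String, String> replacements) {
        String result = script;
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String cql = "CREATE KEYSPACE ${CUSTOMER} WITH replication = "
                   + "{'class': 'SimpleStrategy', 'replication_factor': 1};";
        // -> CREATE KEYSPACE cukecustomer WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
        System.out.println(apply(cql, Map.of("CUSTOMER", "cukecustomer")));
    }
}
```

The step definition then just reads the classpath resource, applies the table as a `Map`, and executes each resulting statement against the connected session.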


And then a more complex test of a Spark Streaming job that requires several components to work:

Scenario: Setup
Given a docker client instance
Given an instance of 'precise-zookeeper-3.4.6:1.0.1' with name 'zk' getting IP 'zkip'
And I can connect to zookeeper at '$zkip'
Given an instance of 'cassandra:2_0' getting IP 'cassip'
Given an instance of 'precise-kafka-0.8.1:1.0.0' with name 'kafka' linked to 'zk:zk' getting IP 'kafkaip'
Given a cassandra client is connected to '$cassip'
And a cassandra statement from: 'keyspaces.cql' with arguments:
| CUSTOMER |
| customer |
And a cassandra statement from: 'sample_data.cql' with arguments:
| CUSTOMER |
| customer |
Given a kafka producer connected to '$kafkaip'
Given an instance of <spark job being tested> connected to zookeeper on '$zkip' and cassandra on '$cassip' for customer 'customer'
And wait 20 seconds # needed to allow the system to startup


Scenario: Spark Streaming job should accept messages compliant with foobar,
process them and store them in Cassandra
Given 'msg' is a MapMsg with content:
| k1:String | v1 |
| k2:String | v2 |

When '$msg' is sent through the KafkaProducer to topic 'job_topic'
Then the cassandra table 'customer.foobar_table' should contain record:
| field | is_key | value |
| k1 | yes | v1 |
| k2 | no | v2 |


The main advantage of this approach is that we create a setup that closely reflects the target system in topology and component versions.
The container isolation exposes any configuration or serialization issue in our code, helping with early detection not only of logical issues but also of deployment issues. If it works in the IT test, it will work on the actual (cloud-based) cluster.

We obviously have no issues with classloading conflicts. Each process runs in its own container. They communicate using the docker overlay IP network.

The initial investment to set this up is quite high; as you can imagine, there's quite some code behind each step. We also wrote our own Scala Docker client for this purpose, as we adopted Docker early on. Nowadays there are probably more options out there.
Once you're past the effort of the first climb, writing subsequent tests is mostly a matter of reusing the existing steps to create new scenarios.

Not all is bright and shiny, though. There are also some dark corners:

- We cannot use the Cucumber Background to set up the environment, as this would be far too expensive to do for each scenario tested. Instead, we somewhat abuse test ordering and use a scenario as setup (as shown above).
- Given the number of systems required for a test, when something goes wrong it's not easy to figure out what. Log files for Kafka, ZooKeeper, etc. are in their own containers.
- Timing is a recurrent issue. Sometimes systems take longer than expected to start up, potentially failing the test. This happens mostly on a local dev machine (when you have a ton of other processes running). Our Jenkins is usually stable in this regard.
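The fixed "wait 20 seconds" steps in the scenarios above are one source of that timing flakiness; an alternative is to poll a readiness check until it succeeds or a deadline passes. A hedged sketch of such a helper (the names are invented for illustration, not part of the described setup):

```java
import java.util.function.BooleanSupplier;

public class WaitFor {
    // Poll `check` every `pollMillis` until it returns true, or until
    // `timeoutMillis` has elapsed. Returns whether the check ever succeeded.
    public static boolean condition(BooleanSupplier check,
                                    long timeoutMillis,
                                    long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

A "wait" step could then, for example, poll until a TCP connection to Cassandra's native port (9042) on `$cassip` succeeds, so the test proceeds as soon as the container is actually ready instead of after a fixed sleep.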

I hope this helps a bit.

-kr, Gerard.