[JIRA] (MLHR-1951) Gracefully Handle Cassandra Down Time In Cassandra Output Operator


Timothy Farkas (JIRA)

Dec 17, 2015, 12:40:03 PM
to malhar...@googlegroups.com
Timothy Farkas created an issue
Malhar / Improvement MLHR-1951
Gracefully Handle Cassandra Down Time In Cassandra Output Operator
Issue Type: Improvement
Assignee: Timothy Farkas
Created: 17/Dec/15 9:39 AM
Priority: Major
Reporter: Timothy Farkas

One of our users writes to Cassandra and wants the output operator to handle a Cassandra failure or Cassandra downtime gracefully. Currently, many of our database operators simply fail and redeploy repeatedly until the database comes back. This is a bad idea for two reasons:

1 - We rely on buffer server spooling to prevent data loss. The buffer server spools to local disk, and data is purged only after a window is committed, so if the database is down for a long time (several hours or a day), many nodes may run out of spool space.

2 - If another failure occurs further upstream in the DAG, upstream operators will be redeployed to a checkpoint less than or equal to the checkpoint of the database operator. This could mean redoing several hours or a day's worth of computation.

We should support a mechanism that detects when the connection to a database is lost, spools to HDFS using a write-ahead log (WAL), and then writes the contents of the WAL into the database once it comes back online.
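
The spool-and-replay idea could look roughly like the sketch below. It is only an illustration under stated assumptions, not the operator's actual design: SpoolingWriter and FlakyDb are hypothetical names, the sink is assumed to raise ConnectionError when the database is unreachable, and a local file stands in for the WAL on HDFS.

```python
import json
import os
import tempfile


class FlakyDb:
    """Hypothetical stand-in for the database; raises ConnectionError while down."""

    def __init__(self):
        self.up = True
        self.rows = []

    def __call__(self, record):
        if not self.up:
            raise ConnectionError("database unreachable")
        self.rows.append(record)


class SpoolingWriter:
    """Writes to the sink directly; spools to a WAL file while the sink is down."""

    def __init__(self, sink, wal_path):
        self.sink = sink          # assumed callable that raises ConnectionError on outage
        self.wal_path = wal_path  # local file standing in for an HDFS path
        # A non-empty WAL left over from an earlier run means we are still behind.
        self.spooling = os.path.exists(wal_path) and os.path.getsize(wal_path) > 0

    def write(self, record):
        if not self.spooling:
            try:
                self.sink(record)
                return
            except ConnectionError:
                self.spooling = True
        # While spooling, every record goes to the WAL so ordering is preserved.
        self._append(record)

    def _append(self, record):
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())

    def replay(self):
        """Try to drain the WAL into the sink; return True once caught up."""
        if not self.spooling:
            return True
        with open(self.wal_path, encoding="utf-8") as wal:
            pending = [json.loads(line) for line in wal if line.strip()]
        for i, record in enumerate(pending):
            try:
                self.sink(record)
            except ConnectionError:
                self._rewrite(pending[i:])  # keep only what is still unwritten
                return False
        os.remove(self.wal_path)
        self.spooling = False
        return True

    def _rewrite(self, records):
        tmp_path = self.wal_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as wal:
            for record in records:
                wal.write(json.dumps(record) + "\n")
        os.replace(tmp_path, self.wal_path)


if __name__ == "__main__":
    db = FlakyDb()
    writer = SpoolingWriter(db, os.path.join(tempfile.mkdtemp(), "wal.log"))
    writer.write({"id": 1})   # database is up: written directly
    db.up = False
    writer.write({"id": 2})   # database is down: spooled to the WAL
    db.up = True
    writer.replay()           # database is back: WAL drained in order
    print(db.rows)
```

In this sketch a crash between a successful sink call and the WAL rewrite would replay some records twice, so it assumes the database writes are idempotent. A real operator would also need to tie replay and WAL truncation to window checkpoints.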
