One of our users is outputting to Cassandra, but they want to handle a Cassandra failure or Cassandra down time gracefully from an output operator. Currently a lot of our database operators will just fail and redeploy continually until the database comes back. This is a bad idea for a couple of reasons: 1 - We rely on buffer server spooling to prevent data loss. If the database is down for a long time (several hours or a day) we may run out of space to spool for buffer server on many nodes since it spools to local disk, and data is purged only after a window is committed. 2 - If there is another failure further upstream in the dag, upstream operators will be redeployed to a checkpoint less than or equal to the checkpoint of the database operator. This could mean redoing several hours or a day worth of computation. We should support a mechanism to detect when the connection to a database is lost and then spool to hdfs using a WAL, and then write the contents of the WAL into the database once it comes back online. |