Hello Flume-users,
I work on a team that frequently uses data pipelines written in FlumeJava, and sometimes we want to check that our data is correct at the end of the pipeline. Usually we do this by iterating through the output collection with a parallel do, and either incrementing MRCounters or setting a precondition for the data values to verify correctness. The question is: what is the best practice for verifying data correctness at the end of a Flume pipeline; the current way we use flume is slightly awkward for two reasons:
1) Not clear which function to use: DoFn will result in useless emitFn, while CopyFn will result in useless return value and possibly wasted resources.
2) Need to markAsToBeMaterialized to ensure that the verification step is not optimized away.
Thanks,
Ian