There are a lot of ‘depends’ here to work through.
First, I’d see what the gap is in having the existing Cascading app write to the data lake directly, and rule the value of that in or out.
If the lake is really just a function of using Apache Iceberg or a vendor-specific format, we can discuss how to get that support. I plan to add Iceberg this year, but I need to add native Parquet support beforehand (which I’ve been sitting on for months).
As for Spark vs Cascading, that depends on the actual workloads.
If you have complex flows with a lot of forks in the Flow pipeline, you will likely see a slowdown if you ‘lift and shift’ the app over to Spark. If I remember correctly, Spark can’t plan a fork as a single parallel job: by the time the first branch’s action fires, the second branch hasn’t happened yet in your code, so each branch becomes its own job and the shared upstream work gets recomputed unless you cache it yourself. Cascading is declarative, Spark is imperative; you will write a lot of code to manage the difference.
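To make that concrete, here’s a rough sketch of the fork workaround on the Spark side (the paths and column names are made up): you cache the shared branch yourself, and if you want the two writes to actually run in parallel, you have to drive the blocking actions from your own threads.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForkSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("fork-sketch").getOrCreate();

    // the shared upstream work -- in Cascading, the pipe ahead of the fork
    Dataset<Row> events = spark.read().parquet("s3://bucket/events")  // hypothetical input
        .filter(col("ts").isNotNull());

    // without this, each branch below re-reads and re-filters the input from scratch
    events.cache();

    // each write is a blocking action, so the two branches run as separate,
    // sequential jobs unless you submit them from your own threads
    Thread errors = new Thread(() -> events
        .filter(col("level").equalTo("ERROR"))
        .write().parquet("s3://bucket/errors"));
    Thread audits = new Thread(() -> events
        .filter(col("type").equalTo("audit"))
        .write().parquet("s3://bucket/audits"));

    errors.start();
    audits.start();
    errors.join();
    audits.join();

    spark.stop();
  }
}
```

In Cascading the planner sees the whole Flow up front, so the fork is just part of the plan; on Spark, all of that bookkeeping is on you.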
If the Flow is trivial, there’s no harm in porting, I guess. But I’d also look at Flink or other tools, since they have stronger messaging and time semantics; then again, you might lose your ‘data lake’ format.
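To show what I mean by time semantics, here’s a minimal Flink sketch (inline data stands in for a real source like Kafka, just to keep it self-contained): you declare event time and tolerated out-of-orderness once, and windows close on the events’ own timestamps rather than arrival time.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // (key, epochMillis) pairs standing in for a real event source
    DataStream<Tuple2<String, Long>> events = env.fromElements(
        Tuple2.of("a", 1_000L), Tuple2.of("a", 4_000L), Tuple2.of("b", 2_000L));

    events
        // declare event time and tolerated lateness once, up front
        .assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((e, ts) -> e.f1))
        .keyBy(e -> e.f0)
        // windows are computed on the events' own clock, not arrival time
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sum(1)
        .print();

    env.execute("event-time-sketch");
  }
}
```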