The problem with data flow fusion is that it only supports a few combinators, and adding new ones is hard. My new system, machine fusion, solves this, though we're still working out the best way to present it, and I haven't thought about any sort of parallelisation for it yet. The secret sauce is to treat each combinator as its own sequential process and run all the processes concurrently. Then you "just" choose a sequentialisation of the concurrent network, ending up with a single sequential process.
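To make the "sequentialisation" idea concrete, here is a toy Haskell sketch. This is my own throwaway encoding, not the actual machine-fusion representation: each combinator is modelled as a tiny step function that either yields an element or skips, and the "sequentialisation" is just one hand-written loop that drives both processes in lockstep with no intermediate array.

```haskell
-- One step of a combinator process: emit an element, or skip it.
data Step a = Yield a | Skip

-- Two combinator "processes": a map and a filter.
mapProc :: (a -> b) -> a -> Step b
mapProc f x = Yield (f x)

filterProc :: (a -> Bool) -> a -> Step a
filterProc p x = if p x then Yield x else Skip

-- One possible sequentialisation of the two-process network for
-- map (*2) followed by filter (> 5): a single loop driving both.
fusedLoop :: [Int] -> [Int]
fusedLoop = go
  where
    go []       = []
    go (x : xs) =
      case mapProc (* 2) x of
        Skip    -> go xs
        Yield y ->
          case filterProc (> 5) y of
            Skip    -> go xs
            Yield z -> z : go xs

main :: IO ()
main = print (fusedLoop [1 .. 5])   -- [6,8,10]
```

The point is that the fused result is a single sequential process, even though each combinator was specified independently.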
If your use case never consumes an input or intermediate result more than once, and produces only one output array, the staged-computation paper "Stream fusion, to completeness" is worth reading. They claim to fuse "any combination of combinators", but that's a bit of a fib, since using a value multiple times duplicates work.
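A toy illustration of that work-duplication caveat (again my own encoding, not the paper's): in a pull-based stream, each consumer drives the producer itself, so two consumers of the same stream restart the producer from scratch.

```haskell
-- Pull-style stream step, specialised to Int seeds to stay
-- extension-free: either stop, or yield an element and a new seed.
data Pull = Stop | Yield1 Int Int

-- A (pretend expensive) producer: squares of 1..n.
squares :: Int -> (Int -> Pull)
squares n s = if s > n then Stop else Yield1 (s * s) (s + 1)

-- Drive a stream to the end, counting producer steps taken.
drain :: (Int -> Pull) -> Int -> ([Int], Int)
drain step = go 0
  where
    go w s = case step s of
      Stop        -> ([], w + 1)
      Yield1 x s' -> let (xs, w') = go (w + 1) s' in (x : xs, w')

-- Two consumers of the "same" stream each re-run the producer:
-- 12 producer steps in total, where a shared traversal needs 6.
twoUses :: ((Int, Int), Int)
twoUses =
  let (xs, w1) = drain (squares 5) 1   -- consumer 1 computes a sum
      (ys, w2) = drain (squares 5) 1   -- consumer 2 computes a length
  in ((sum xs, length ys), w1 + w2)

main :: IO ()
main = print twoUses
```

So "any combination of combinators" fuses only as long as the dataflow graph is a tree; sharing a node means re-running everything upstream of it.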