Yes, your question about reviewing the assumptions/requirements/designs is entirely valid, but the correct answer is "it depends (on your design goal)".
First, to be precise, Google's MapReduce (MR) was implemented with reliability as its primary design goal. The fact that disk-based persistence is slow was mitigated with techniques such as SSTables. Scalability came next; "real-time" speed of results was never an important goal.
Once you start dividing a job up into small chunks (partitions), you get an execution graph where those chunks can be executed asynchronously, e.g., 1000 chunks processed by 10 executors (more "partitions" than available "mappers" or "reducers"). When this happens, the intermediate results from the first 10, 20, 30, etc. need to be saved somewhere so that they can be properly combined in the next phase. If processing of any partition fails, that partition alone is simply "replayed". So whether your persistence is disk-based or memory-based, these graphs are going to have synchronization barriers, or "distinct phases" as you point out. This is the classic bulk-synchronous parallel (BSP) model.
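To make the shape of this concrete, here is a minimal sketch (all names hypothetical, not any real framework's API) of the bulk-synchronous pattern just described: more partitions than executors, intermediate results materialized per partition, failed partitions replayed individually, and a hard barrier before the reduce phase begins.

```python
# Sketch of BSP-style execution: 1000 partitions, 10 executors,
# per-partition replay on failure, barrier before the reduce phase.
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 1000
NUM_EXECUTORS = 10

def map_partition(pid, fail_once):
    """Process one partition; simulate a one-time failure for some partitions."""
    if pid in fail_once:
        fail_once.discard(pid)           # the replay attempt will succeed
        raise RuntimeError(f"partition {pid} failed")
    return pid * 2                       # stand-in for real map output

def run_map_phase():
    intermediates = {}                   # "persisted" per-partition results
    flaky = {7, 42}                      # partitions that fail on first try
    with ThreadPoolExecutor(max_workers=NUM_EXECUTORS) as pool:
        pending = list(range(NUM_PARTITIONS))
        while pending:                   # replay loop: only failed partitions rerun
            futures = {pid: pool.submit(map_partition, pid, flaky)
                       for pid in pending}
            pending = []
            for pid, fut in futures.items():
                try:
                    intermediates[pid] = fut.result()
                except RuntimeError:
                    pending.append(pid)  # replay just this one partition
    return intermediates                 # barrier: all partitions done before reduce

def reduce_phase(intermediates):
    return sum(intermediates.values())

result = reduce_phase(run_map_phase())
```

The key point is the barrier: `reduce_phase` cannot start until every partition's intermediate result has been materialized, but in exchange a failure costs only one partition's worth of rework.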
If, on the other hand, you have an execution graph where the entire problem (data set) fits simultaneously within your computing resources, and speed matters far more than reliability (e.g., sub-second-latency SQL queries that would simply be retried in their entirety if they failed halfway), then it can make sense to avoid these synchronization barriers altogether and let the entire graph proceed asynchronously, as fast as possible, without waiting. This is closer to the MPI model: "mappers" push their results to the "reducers" memory-to-memory, without waiting on a common sync point or spilling to any disk-based persistence. The disadvantages: you lose a lot of the reliability guarantees; the system state becomes harder to reason about; and for any given amount of resources you face more frequent back pressure (e.g., on memory) as data pipes through the execution graph and different parts are or are not ready to release resources. But all this can be worth it when speed is the primary goal. Indeed, this is the design choice of Cloudera's Impala.
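For contrast with the BSP pattern, here is a minimal sketch (again, hypothetical names, not Impala's actual implementation) of the barrier-free, push-based style: "mappers" stream rows straight into a "reducer's" in-memory queue as they are produced, and the reducer aggregates concurrently instead of waiting for a map phase to finish. A small bounded queue stands in for the back pressure mentioned above; on any failure, the whole query would be retried from scratch.

```python
# Sketch of barrier-free, push-based execution: producers stream results
# memory-to-memory into a consumer via a bounded queue (back pressure).
import threading
import queue

DONE = object()                           # sentinel marking end of a mapper's stream

def mapper(rows, out_queue):
    for row in rows:
        out_queue.put(row * 2)            # push result immediately, no disk spill
    out_queue.put(DONE)

def reducer(in_queue, num_mappers, result):
    finished, total = 0, 0
    while finished < num_mappers:
        item = in_queue.get()             # consume as soon as data arrives
        if item is DONE:
            finished += 1
        else:
            total += item
    result.append(total)

pipe = queue.Queue(maxsize=8)             # tiny buffer: full queue blocks producers
result = []
consumer = threading.Thread(target=reducer, args=(pipe, 4, result))
producers = [threading.Thread(target=mapper, args=(range(i, 100, 4), pipe))
             for i in range(4)]
consumer.start()
for t in producers:
    t.start()
for t in producers:
    t.join()
consumer.join()
```

Note there is no phase boundary anywhere: the reducer is already summing while the mappers are still producing, and the `maxsize=8` bound is what throttles producers when the consumer falls behind.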
To sum up, I for one am very interested in having such a mode in Spark, i.e., where I'm willing to give up some/all resiliency guarantees in exchange for speed.