From my perspective, the primary benefit of using a layered architecture is that it provides a good trade-off between expressiveness and complexity.
The lower layers of the system (e.g., the rendering layer) are very expressive, in the sense that you can write a RenderObject that has pretty much any behavior you can dream up. The cost of that expressiveness is that working with render objects is complex. There are lots of subtle rules you need to follow and there are many ways to shoot yourself in the foot.
By contrast, the higher levels of the system (e.g., the widget layer) are very easy to use. You just plug together the widgets you want, and the framework takes care of diffing the new build against the previous build, along with many other details. The cost of that simplicity is less expressiveness. There might not be a widget that behaves exactly the way you want.
Having a layered architecture lets you achieve a better trade-off between these factors than you would otherwise be able to achieve. When there's a widget that does what you want, you can simply use it. When there isn't a widget that does what you want, you can dig down a layer and express what you want at the lower layer, albeit in a more complex way.
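To make the "use a widget when one exists, dig down when it doesn't" idea concrete, here is a minimal toy sketch in Python. It is not Flutter's API: every name in it (ToyRenderObject, ToyText, ToyPadding, ToyMirror, render) is hypothetical, and "painting" is reduced to producing lines of text. It only illustrates the shape of the trade-off.

```python
from dataclasses import dataclass


# --- Lower layer: toy "render objects" (expressive, but you write the logic) ---
class ToyRenderObject:
    def paint(self):
        """Return the painted output as a list of text lines."""
        raise NotImplementedError


class ToyRenderText(ToyRenderObject):
    def __init__(self, text):
        self.text = text

    def paint(self):
        return [self.text]


class ToyRenderPadding(ToyRenderObject):
    def __init__(self, child, amount):
        self.child = child
        self.amount = amount

    def paint(self):
        # Indent every line the child painted.
        return [" " * self.amount + line for line in self.child.paint()]


# --- Higher layer: toy "widgets" (descriptions you simply plug together) ---
@dataclass(frozen=True)
class ToyText:
    value: str

    def create_render_object(self):
        return ToyRenderText(self.value)


@dataclass(frozen=True)
class ToyPadding:
    child: object
    amount: int

    def create_render_object(self):
        return ToyRenderPadding(self.child.create_render_object(), self.amount)


# --- Digging down: no stock widget mirrors text, so write a render object ---
class ToyRenderMirror(ToyRenderObject):
    def __init__(self, child):
        self.child = child

    def paint(self):
        # Custom behavior expressed directly at the lower layer.
        return [line[::-1] for line in self.child.paint()]


@dataclass(frozen=True)
class ToyMirror:
    child: object

    def create_render_object(self):
        return ToyRenderMirror(self.child.create_render_object())


def render(widget):
    """Turn a widget tree into render objects and paint them."""
    return "\n".join(widget.create_render_object().paint())


# Stock widgets only: nothing to write beyond the composition itself.
print(render(ToyPadding(ToyText("hello"), 2)))
# Custom behavior: more code at the lower layer, but it composes with stock widgets.
print(render(ToyPadding(ToyMirror(ToyText("hello")), 2)))
```

The point of the sketch is that the custom piece costs more code than the stock pieces, but once it exists it plugs into the higher layer like any other widget.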
This pattern repeats itself across all the layers. For example, if you like the abstraction offered by the gesture library, you can use it to easily recognize taps and drags. If you want to express a more elaborate gesture that the library doesn't provide, the code is structured in such a way that you can dig down to that layer of the system and interact with the gesture arena in just the same way as the TapRecognizer.
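The gesture case can be sketched the same way. The toy below is an assumption-laden simplification, not Flutter's gesture library: ToyGestureArena, ToyTapRecognizer, ToyShakeRecognizer, and resolve are all hypothetical names, events are just (kind, x) pairs, and the arena's only rule is "first recognizer to accept wins". The real gesture arena has more rules than this. What the sketch shows is that a custom recognizer talks to the arena through exactly the same interface as the stock one.

```python
class ToyGestureArena:
    """Toy arena: the first member to accept wins and later events are ignored."""

    def __init__(self, members):
        self.members = list(members)
        self.winner = None

    def accept(self, member):
        if self.winner is None:
            self.winner = member

    def dispatch(self, event):
        for member in self.members:
            if self.winner is not None:
                break
            member.handle(event, self)


class ToyTapRecognizer:
    """Stands in for a library-provided recognizer."""

    name = "tap"
    SLOP = 8  # hypothetical movement tolerance

    def __init__(self):
        self.down_x = None

    def handle(self, event, arena):
        kind, x = event
        if kind == "down":
            self.down_x = x
        elif kind == "up" and self.down_x is not None:
            if abs(x - self.down_x) <= self.SLOP:
                arena.accept(self)


class ToyShakeRecognizer:
    """A custom gesture: accept after two reversals of direction."""

    name = "shake"

    def __init__(self):
        self.last_x = None
        self.direction = 0
        self.reversals = 0

    def handle(self, event, arena):
        kind, x = event
        if kind == "down":
            self.last_x = x
        elif kind == "move" and self.last_x is not None:
            step = x - self.last_x
            self.last_x = x
            if step == 0:
                return
            direction = 1 if step > 0 else -1
            if self.direction and direction != self.direction:
                self.reversals += 1
            self.direction = direction
            if self.reversals >= 2:
                arena.accept(self)


def resolve(events):
    """Feed events to an arena holding both recognizers; return the winner's name."""
    arena = ToyGestureArena([ToyTapRecognizer(), ToyShakeRecognizer()])
    for event in events:
        arena.dispatch(event)
    return arena.winner.name if arena.winner else None


print(resolve([("down", 5), ("up", 6)]))
print(resolve([("down", 0), ("move", 10), ("move", -10), ("move", 10), ("up", 10)]))
```

Neither recognizer is special from the arena's point of view, which is the property that lets you add gestures the library never anticipated.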