I've been assuming the problem is that the tests are too noisy, no? When you look at the graph after a few runs, you can see spots that looked like a regression but weren't. Other than that, I haven't noticed concrete problems. I haven't spent much time looking into any specific case, though.
I think getting emails on improvements will be nice eventually. When there is an improvement, we should lower our bar for when the bots turn red to match it. Right now we rarely do that, so performance improvements get lost.
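
To make the "lower our bar" idea concrete, here is a rough sketch of how the threshold could ratchet down after an improvement. Everything in it is an assumption for illustration, not how the bots work today: the names `ratchet_threshold` and `is_red`, the 10% noise margin, and the idea that results are timings where lower is better.

```python
def ratchet_threshold(current_threshold, recent_results, margin=0.10):
    """Return the red-bot threshold, lowered if recent runs allow it.

    Hypothetical helper: assumes lower results are better (e.g. ms)
    and that `margin` is headroom left for test noise.
    """
    if not recent_results:
        return current_threshold
    # Base the new bar on the worst recent run, plus headroom,
    # so ordinary noise does not immediately turn the bot red.
    candidate = max(recent_results) * (1 + margin)
    # Only ever move the bar down; a bad stretch never loosens it.
    return min(current_threshold, candidate)


def is_red(result, threshold):
    """The bot turns red when a result is worse than the threshold."""
    return result > threshold


if __name__ == "__main__":
    threshold = 100.0
    threshold = ratchet_threshold(threshold, [70.0, 72.0, 71.0])
    print(threshold, is_red(95.0, threshold))
```

With the bar left at the old value, a later run of 95 would stay green even though it gives back most of the improvement. After ratcheting, the same run turns the bot red. The margin would need tuning against how noisy the tests actually are.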