Melzzzz,
Executing VZEROALL/VZEROUPPER at the *end* of a section of AVX code allows an x86 core to free up physical (rename) registers to speed up execution of subsequent SSE code ... though I didn't think you needed to execute them *before* AVX usage.
For instance, let's say a microarchitecture maps a 256-bit AVX architectural/logical register (YMM0-YMM15) to 2 x 128-bit physical registers, i.e. the logical register is twice the width of the physical register. During execution of AVX-256 code that touches all 16 YMM regs, a maximum of 16*2=32 physical registers are consumed from the physical register file to map the 16 YMM logical registers. But after execution of VZEROUPPER, the 16 'upper' physical registers can be freed and returned to the rename pool of free registers; after VZEROALL, all 32 can be freed and returned. This can be done with a little additional state per logical register, which we can refer to as zero bits (z-bits), maintained in the speculative register map. A z-bit indicates that the upper 128b (or lower 128b) of a register is known to be zero, so no physical register needs to back that half. Freeing up physical regs can allow a subsequent section of SSE code to execute faster, so it's good practice to insert a VZEROALL/VZEROUPPER at the end of an AVX subroutine as a hint to the hardware to free up resources, if in fact software doesn't intend to execute AVX code for some time.
With regard to mixing SSE and AVX code, I would expect a penalty on both Intel and AMD CPUs when transitioning between SSE <-> AVX-256 code. Are you also seeing a penalty when transitioning between SSE <-> AVX-128 code? If so, on which CPUs, and can you give a rough estimate of its size?