It is possible that all of this information is available in perf, or can be derived from the hardware counters perf exposes, but some of the VTune features that I have found really useful in the past (and haven't easily found in perf) are:
1. Analysis of backend-bound programs. You can check the port utilization of your program. Code that seems to have some ILP can in fact be bottlenecked on one or two execution ports; rewriting it to spread work across the other ports can help.
2. Looking at stalls as opposed to just cache misses. VTune lets you see what portion of cache misses actually cause stalls. On an out-of-order (OoO) machine, not all cache misses are bad: if your program has enough overlap and ILP, miss latency can be hidden to some extent. For example, request a bunch of independent reads and process them all in independent streams. The first few misses may cause stalls while the OoO machinery gets rolling, but the later ones still miss without stalling, provided the processing of data from the earlier reads overlaps with the misses for the later reads.
3. SIMD-to-memory-fetch ratio. People often move from scalar to vector code without changing their memory layout much; you usually see classes like Vec3 (AoS layouts) when this happens. SIMD is of limited use there because the program remains memory bound, and switching to an SoA layout often helps. The SIMD-to-memory-fetch ratio needs to be fairly large for SIMD to pay off, and this metric helps you figure that out.
There are a lot of other esoteric VTune features which I haven't seen in other profilers. As I said before, that may just be my unfamiliarity with those profilers rather than the features actually being absent.