I'm using gem5-gpu to study heterogeneous computing.
I built the gem5-gpu environment and ran the rodinia and rodinia-nocopy benchmarks.
For backprop, my results show that the copy version finishes in 0.024s while the nocopy version needs 0.24s, which is 10x slower.
Gaussian and hotspot also performed worse in the nocopy version.
As I understand it, the nocopy version is similar to the newer "unified memory" technology, so why is it so much slower than using explicit copies? This is not what I expected.
Have you seen similar results?
Best regards,
Boya