Hi Chen,
Trinayan is right that gem5-gpu was designed with physically integrated GPUs in mind. However, there are cache protocols (*_split) that simulate a discrete GPU. There is also a "copy engine" that emulates the DMA engine copying data across the PCIe bus, although the copy engine is not a high-fidelity model.
It could make sense to use gem5-gpu to simulate a discrete GPU if 1) all you care about is the GPU time and/or 2) you're interested in high-fidelity, detailed models of cache coherence.
Cheers,
Jason