Deadlock with garnet network model


Hao Wang

Mar 16, 2014, 7:49:37 PM3/16/14
to gem5-g...@googlegroups.com
Hi,

I'm trying to use the garnet network model, but it hits a deadlock; see the output log below:
GPGPU-Sim API: Stream Manager State
GPGPU-Sim API:    stream 0 has 1 operations
GPGPU-Sim API:       0 :  stream operation memcpy host-to-device
panic: Possible Deadlock detected. Aborting!
version: 12 request.paddr: 0x[0x7e8ec7f0, line 0x7e8ec780] m_readRequestTable: 1 current time: 5437930752500 issue_time: 5437680752500 difference: 250000000
 @ cycle 5437930752500
[wakeup:build/VI_hammer/mem/ruby/system/Sequencer.cc, line 107]
Memory Usage: 2928700 KBytes
Program aborted at cycle 5437930752500

I use VI_hammer to take a checkpoint after Linux boots, and then restore it with the following command:
/research/wangh/noc/codebase/v03/gem5/build/VI_hammer/gem5.opt --outdir=m5out_ckpt /research/wangh/noc/codebase/v03/gem5-gpu/configs/fs_fusion.py --checkpoint-restore=1 --kernel=x86_64-vmlinux-2.6.28.4-smp --script=./backprop.rcS --cpu-type=detailed --restore-with-cpu=timing --num-cpus=4 --clusters=8 --topology=Cluster --garnet-network=fixed

Some more info:
I've tested backprop and srad; both deadlock.
It was running correctly before I added the last option "--garnet-network=fixed".
With a smaller input (e.g., 2048 for backprop), it runs successfully (only this one test; I'm not sure whether the input size matters).
With "--garnet-network=flexible", it also seems to work.

Is there a known issue regarding this?

Thanks.
Hao

Jason Power

Mar 17, 2014, 10:11:54 AM3/17/14
to Hao Wang, gem5-g...@googlegroups.com
Hi Hao,

We haven't seen this problem, but we also haven't tried to use garnet. My guess is that there isn't enough buffering in the fixed garnet network to handle the bandwidth generated by the GPU. It's possible that the programs will finish if you increase the deadlock detection threshold. However, I imagine the performance will be awful if the network buffers are getting saturated.

I haven't used garnet before, but maybe there's an option to increase the size of the buffers? Also, you can increase the deadlock threshold at the RubyController in the python config file for the protocol.
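For what it's worth, the panic in your log comes from the Sequencer, and stock gem5's RubySequencer exposes a deadlock_threshold parameter (the default is 500000 cycles). Here's a minimal sketch of raising it, assuming your protocol config instantiates sequencers the way the stock Ruby configs do; the cache and ruby_system variables below are placeholders from your own script:

    from m5.objects import RubySequencer

    # Sketch only: wherever the config builds each sequencer, pass a larger
    # deadlock_threshold. This only delays the panic; it won't fix real
    # back-pressure in the network.
    cpu_seq = RubySequencer(icache = l1i_cache,            # placeholder
                            dcache = l1d_cache,            # placeholder
                            deadlock_threshold = 5000000,  # 10x the stock default
                            ruby_system = ruby_system)     # placeholder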

If you find that this is a bug in VI_hammer (which is plausible), let us know; we'd love to incorporate a fix.

Jason

Jason Power

Mar 17, 2014, 10:29:23 AM3/17/14
to Hao Wang, gem5-g...@googlegroups.com
Another thought: if you want to know whether it's due to overcommit, check the ruby.stats file. If you can decode the meaning of those stats, you ought to be able to infer whether it's a buffer saturation problem or a bug in the protocol.
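If it helps, here's a quick-and-dirty filter to skim ruby.stats for latency- and buffer-related counters. It's purely a text search; the keywords are just guesses, so adjust them to whatever your stats dump actually contains:

    #!/usr/bin/env python
    # Skim a ruby.stats dump for lines mentioning latency/buffer/stall counters.
    import sys

    keywords = ("latency", "buffer", "stall")
    path = sys.argv[1] if len(sys.argv) > 1 else "m5out/ruby.stats"
    with open(path) as f:
        for line in f:
            if any(k in line.lower() for k in keywords):
                print(line.rstrip())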

Jason

Jason Power

Mar 19, 2014, 9:51:27 AM3/19/14
to Hao Wang, gem5-g...@googlegroups.com
Hi Hao,

I'm pretty surprised you don't see an average latency spike. Did you check on the copy engine controller specifically?

Also, there are some parameters that we played with in the CE controller to get the right bandwidth. It may be that when using garnet you need to change these parameters. I believe the major ones we changed were max_outstanding_requests at the CE sequencer and number_of_TBEs at the CE controller. Both of these are in VI_hammer_fusion.py for the VI_hammer protocol.
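Here's a sketch of the kind of tweak I mean; ce_seq and ce_ctrl are placeholders for whatever VI_hammer_fusion.py actually calls the copy-engine sequencer and controller, and the values are only examples:

    # Sketch only: raise the copy engine's request/transaction limits.
    ce_seq.max_outstanding_requests = 64   # stock RubySequencer default is 16
    ce_ctrl.number_of_TBEs = 1024          # pick something larger than the current setting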

My guess is that if that doesn't work you'll have to dig down into the protocol to try to figure out what's going on.

As far as the translation patches go, it should be any day now. I'm just waiting on one more thing.

Jason


On Tue, Mar 18, 2014 at 10:39 PM, Hao Wang <pkuw...@gmail.com> wrote:
Hi Jason,

Thanks for the suggestions.

I did some tests.
The deadlock comes from the gpu copy engine.
By increasing the deadlock detection threshold for that sequencer, I was able to get past the cudaMemcpy operation that was failing.
But it still deadlocked during a later memcpy operation.

Checking the stats file, I don't see a huge average network latency during memcpy.
But the memcpy operation takes much longer than I expected.
For example, transferring a 1008K-entry float array takes 24000us; assuming 50% utilization of the memory link, it should take ~1000us.
Even considering the driver delay (5us), the transfer time still seems too high.
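For reference, here is the arithmetic behind that estimate; the ~8 GB/s peak link bandwidth is just my assumption, picked to match the ~1000us figure:

    # Back-of-envelope check of the expected memcpy time (assumed peak bandwidth).
    entries   = 1008 * 1024                   # 1008K float entries
    bytes_tot = entries * 4                   # 4 bytes per float -> ~4.1 MB
    peak_bw   = 8e9                           # assumed peak link bandwidth, bytes/s
    expected  = bytes_tot / (0.5 * peak_bw)   # 50% link utilization
    print("expected transfer time: %.0f us" % (expected * 1e6))   # ~1030 us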

BTW,
I remember you said there would be an upcoming patch to handle the page faults for the rodinia-nocopy benchmarks.
I wonder if that is coming soon or will need some more time.

Thanks!
Hao


------------------------------------------------------
Wang, Hao
Ph.D. candidate
ECE, University of Wisconsin-Madison




Joel Hestness

Mar 19, 2014, 12:15:07 PM3/19/14
to Jason Power, Hao Wang, gem5-g...@googlegroups.com
Hey guys,
  There are a couple major reasons that memory copy latency is quite long in FS mode: (1) the copy engine accepts virtual addresses for data, so it currently has to do address translations to move data, and (2) we avoid having to deal with copy engine page faults by touching all memory pages in cudaMemcpy.  In FS mode, the address translations are not instantaneous like they are in SE mode, and they can even require page walks when the TLB doesn't contain the translation.  This negatively impacts bandwidth the most.
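Just to make the cost concrete, here is a toy model of the translation overhead alone; every latency below is a made-up placeholder, not a gem5-gpu number, and it ignores the serialization of outstanding copy requests:

    # Toy model: in FS mode every 4KB page of the copy needs a translation,
    # and TLB misses trigger page walks. All latencies are illustrative.
    copy_bytes = 4 * 1024 * 1024          # ~4 MB transfer
    pages      = copy_bytes // 4096       # 1024 pages
    miss_rate  = 0.5                      # assumed TLB miss rate
    walk_ns    = 500.0                    # assumed page-walk latency (ns)
    hit_ns     = 20.0                     # assumed TLB-hit latency (ns)

    overhead_ns = pages * (miss_rate * walk_ns + (1 - miss_rate) * hit_ns)
    print("translation overhead: %.0f us" % (overhead_ns / 1000.0))   # ~266 us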

  Currently, we don't really have a good solution to avoid the poor memory copy modeling in FS mode.  However, it would be pretty surprising if that is causing the deadlock issues.

  If I can be of assistance, please let me know,
  Joel

--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/