GPGPU-sim GTO warp scheduler


Jounghoo Lee (graduate student, Graduate School, Department of Computer Science)

<jounghoolee@yonsei.ac.kr>
Apr 19, 2021, 9:53:05 PM
to accel-sim
Hello,

As described in the GPGPU-Sim tested-cfg files, most boards use the GTO warp scheduler. However, after reading the code carefully, I found that its GTO scheduling does not work the way we normally think of it in the computer architecture field (e.g., as described in [1]).

Normally, GTO scheduling picks the 'oldest' instruction when the previous warp encounters a stall. The scheduler should therefore search for the warp that is ready to issue and has gone the longest time without issuing an instruction.

However, the GPGPU-Sim 4.0.1 GTO scheduler picks the ready warp with the smallest warp ID (wid), that is, the warp that was assigned to the scheduler earliest. This does not guarantee that the next instruction issued is the oldest, and it always prioritizes warps with a small wid.

Is there a reason the GTO scheduler is implemented this way? Does it match the default warp scheduling policy of real HW, especially for RTX2060, TitanV, QV100 and TitanX? It would be really nice if you could share what you learned while testing with microbenchmarks. I really appreciate the Accel-Sim paper.

Thank you.

Jounghoo Lee
HPCP lab. @ Yonsei University

[1]: Huang, Jen-Cheng, et al. "GPUMech: GPU performance modeling technique based on interval analysis." 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014.

Mahmoud Khairy

<khairy2011@gmail.com>
Apr 22, 2021, 5:04:38 PM
to accel-sim
Hello Jounghoo,

The warp ID that GTO uses is the dynamic warp ID, which is assigned at warp launch time, not the static warp ID. You will find a static warp counter initialized to zero; every time a new warp is launched, its dynamic ID is assigned the current value of that counter, and the counter is then incremented by one. So the oldest warp has the smallest dynamic warp ID. Please take a close look at the warp scheduling code and you will see what I mean. So GTO is implemented correctly, as described in the literature.
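The counter scheme described above can be sketched roughly as follows. This is only an illustration: `Warp`, `WarpLauncher`, and their members are hypothetical names, not GPGPU-Sim's actual classes.

```cpp
#include <cassert>

// Hypothetical types for illustration; not GPGPU-Sim's real data structures.
struct Warp {
    unsigned static_id;   // hardware warp slot, reused when a warp finishes
    unsigned dynamic_id;  // launch order, never reused
};

class WarpLauncher {
public:
    explicit WarpLauncher(unsigned num_slots) : num_slots_(num_slots) {}

    // Launch a new warp into `slot`: it takes the current counter value as
    // its dynamic ID, and the counter is then incremented by one.
    Warp launch(unsigned slot) {
        assert(slot < num_slots_);
        return Warp{slot, next_dynamic_id_++};
    }

private:
    unsigned num_slots_;
    unsigned next_dynamic_id_ = 0;  // the counter, initialized to zero
};
```

Under this scheme a warp launched later into a reused slot keeps the small static ID of that slot but gets a larger dynamic ID, so ordering by dynamic ID orders warps by launch time.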

For real HW, it is very hard to know the exact warp scheduling policy. We did observe that changing the scheduler from GTO to RR in the simulator slightly reduced the correlation, although the overall correlation of both is still not perfect, so we kept GTO. It is worth mentioning that figuring out the exact warp scheduling is something Accel-Sim does not do perfectly yet; it needs more investigation and is on our TODO list. We are also going to release the microbenchmark suite that we used in our paper in our next release next month.
If you are able to figure out the exact warp scheduling in HW, you are more than welcome to let us know and we will update our simulator, or just open a pull request with your code changes and microbenchmarks and we will accept it. Accel-Sim is open source, and we need the whole community to work on it and improve it further.

Jounghoo Lee (graduate student, Graduate School, Department of Computer Science)

<jounghoolee@yonsei.ac.kr>
Apr 22, 2021, 9:01:57 PM
to accel-sim
Hello Mahmoud,

Thank you very much for your detailed reply, but let me clarify my concern once again. Maybe the GTO we each have in mind is a little different. I did notice the difference between the warp ID and dynamic_warp_id thanks to your well-written in-code comment.

I think the problem is that dynamic_warp_id only changes when a new warp is launched to the scheduler. Let me restate how the current code operates with an example.

@ cycle 0 : 32 warps are launched to a scheduler and get dynamic_warp_id 0 ~ 31.
@ cycle 1~20 : warp0 issues instructions and then stalls.
@ cycle 21 : the scheduler checks ready state starting from warp0 (** notice it is not from warp1 **), and warp1 issues instructions.
@ cycle 25 : warp0 becomes ready again. warp1 is still issuing instructions.
@ cycle 30 : warp1 stalls, so warp0 starts to issue instructions again (** warp2~31 did not get a chance at all **)

As a result, I observed that warps with a large dynamic_warp_id tend not to make progress at all until warps with a small dynamic_warp_id finish, even though all warps were launched in the same cycle.
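The behavior in this example can be reproduced with a small toy model. The sketch below is not GPGPU-Sim code: `simulate_gto` and its parameters are hypothetical, and it assumes every warp issues a fixed burst of instructions and then stalls for a fixed number of cycles, with the vector index standing in for dynamic_warp_id.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (all parameters are assumptions): a warp issues `burst`
// instructions back to back, then stalls for `stall` cycles.
// Returns the number of instructions each warp issued.
std::vector<unsigned> simulate_gto(std::size_t num_warps, unsigned burst,
                                   unsigned stall, unsigned cycles) {
    assert(num_warps > 0 && burst > 0);
    std::vector<unsigned> issued(num_warps, 0);    // instructions issued per warp
    std::vector<unsigned> in_burst(num_warps, 0);  // progress in the current burst
    std::vector<unsigned> ready_at(num_warps, 0);  // first cycle the warp is ready
    std::size_t last = num_warps;                  // sentinel: nothing issued yet

    for (unsigned cycle = 0; cycle < cycles; ++cycle) {
        std::size_t pick = num_warps;
        if (last < num_warps && ready_at[last] <= cycle) {
            pick = last;  // greedy: keep the last warp while it is ready
        } else {
            // oldest: scan from the smallest dynamic ID upward
            for (std::size_t w = 0; w < num_warps; ++w) {
                if (ready_at[w] <= cycle) { pick = w; break; }
            }
        }
        if (pick == num_warps) continue;  // every warp is stalled this cycle
        ++issued[pick];
        if (++in_burst[pick] == burst) {  // burst finished: stall the warp
            in_burst[pick] = 0;
            ready_at[pick] = cycle + 1 + stall;
        }
        last = pick;
    }
    return issued;
}
```

With a stall shorter than a burst, for example `simulate_gto(4, 10, 5, 100)`, warp0 and warp1 alternate and warp2 and warp3 never issue in this model. With a stall longer than the combined bursts of the older warps, for example `simulate_gto(4, 10, 25, 100)`, the younger warps do get issue slots.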

Additionally, even if warp32 is launched at cycle 31 (let's assume warp0 has finished), it seems unfair that warp32 waits until warp1~31 finish. If warp1~31 each issue an instruction and stall at cycles 31~62 respectively, shouldn't it be warp32's turn to issue an instruction? However, in the current code, if warp1 has become ready again at cycle 63, it is warp1's turn. Let me know if I have misunderstood the current GPGPU-Sim code or GTO scheduling.

Since you mentioned that the scheduling policy is still under development, I will let you know as soon as I find some useful results from real HW validation.

Thanks a lot
Jounghoo Lee

Mahmoud Khairy

<khairy2011@gmail.com>
Apr 27, 2021, 3:49:09 PM
to accel-sim
Hello:

GTO stands for "greedy then oldest first". Greedy means that once you issue an instruction from one warp, you keep prioritizing that warp as long as it has ready instructions to issue. You do not switch away from this warp until it stalls, even if other ready warps were launched in the same cycle. Once this warp stalls, you select the oldest warp (based on dynamic warp ID). This is how GTO is described in the literature by the author of the CCWS paper, who is, by the way, my Ph.D. advisor. What you described is something like RRO, round-robin then oldest first.
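As a rough sketch of the selection rule just described (not the actual GPGPU-Sim implementation; `WarpState` and `pick_gto` are hypothetical names), the per-cycle choice could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical warp record; field names are illustrative, not GPGPU-Sim's.
struct WarpState {
    unsigned dynamic_id;  // launch order: smaller means older
    bool ready;           // has an instruction that can issue this cycle
};

// Returns the index of the warp to issue from, or nullopt if none is ready.
// `last_issued` is the index of the warp that issued most recently, if any.
std::optional<std::size_t> pick_gto(const std::vector<WarpState>& warps,
                                    std::optional<std::size_t> last_issued) {
    // Greedy: stay on the same warp while it remains ready.
    if (last_issued) {
        assert(*last_issued < warps.size());
        if (warps[*last_issued].ready) return last_issued;
    }
    // Oldest: otherwise take the ready warp with the smallest dynamic ID.
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < warps.size(); ++i) {
        if (!warps[i].ready) continue;
        if (!best || warps[i].dynamic_id < warps[*best].dynamic_id) best = i;
    }
    return best;
}
```

Note that the fallback step compares launch order only. It does not track how long each warp has gone without issuing, which is the difference from the policy described earlier in the thread.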
Please read this paper for further details:

Jounghoo Lee (graduate student, Graduate School, Department of Computer Science)

<jounghoolee@yonsei.ac.kr>
Apr 28, 2021, 10:23:43 PM
to accel-sim
Hello Mahmoud,

Thanks a lot for the very kind clarification. I misunderstood the scheme.

I really look forward to writing a great ISCA paper like Accel-Sim someday. I also found your ISPASS paper interesting.

Thank you again
Jounghoo Lee