Hi Yunqing,
For the caches, I know there are quite a few: an L1 for each SM, and an L2 for the whole device. In principle, I want to know all (or most, if all is difficult) of the information about the caches so that I can explain the performance differences. For example, is there any data sharing between threads within one block via the L1 or L2 cache? If there is, how is synchronization maintained? When using the 'volatile' qualifier, do accesses still use the cache (if so, do they introduce extra overhead compared with normal cached accesses)?
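As a side note on the `volatile` question, here is a minimal CUDA sketch (my own illustration, not something from this thread or from asfermi) of the common understanding: `volatile` mainly stops the compiler from keeping a value in a register, forcing a fresh memory instruction for every access; those accesses still go through the L2 cache, so the cost is extra instructions rather than a full cache bypass.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: one block sets a flag, another spins on it.
// Without `volatile`, the compiler could hoist the load of *flag out of
// the loop and spin forever on a stale register copy.
__global__ void spin_on_flag(volatile int *flag, int *out)
{
    if (blockIdx.x == 0) {
        *flag = 1;                 // producer block publishes the flag
    } else {
        while (*flag == 0)
            ;                      // re-reads *flag from memory each time
        *out = 1;                  // observed the update
    }
}
```

This is only a sketch of the compiler-level effect; whether a given architecture also bypasses L1 for volatile loads is a separate hardware question.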
P.S. I don't mind the mail being forwarded.
Regards,
Jianbin
From: Hou Yunqing [hyq.n...@gmail.com]
Sent: Wednesday, April 11, 2012 4:22 AM
To: Jianbin Fang - EWI
Subject: Re: about cache strategies
Hi Jianbin,
Sure if you want I can put you as a committer, so you may contribute to the wiki pages of Stage 2.
As for the caches, which ones are you interested in? Is the L1 data cache for local memory the only one you care about? There are quite a few caches and I haven't done any work on any of them. Also, what are the things you want to know? Cache policy? Associativity? Bandwidth? Organisation?
BTW do you mind if I forward this conversation to asfermi's mailing group (asf...@googlegroups.com)?
Thanks,
Yunqing
On Wed, Apr 11, 2012 at 5:38 AM, Jianbin Fang - EWI <J.F...@tudelft.nl> wrote:
Again, I searched asfermi, and I found that you were looking for people interested in this project. But that was last July. Anyway, I am wondering whether you are still looking for someone interested in this project? I am very interested in it, especially the cache and local memory part. Is it already clear/finished? Is there anything still imperfect that I could join in on?
Regards,
Jianbin
From: Jianbin Fang - EWI
Sent: Tuesday, April 10, 2012 10:25 PM
To: hyq.n...@gmail.com
Subject: about cache strategies
Dear Yunqing Hou,
This is Jianbin Fang, a PhD student from Delft University of Technology, the Netherlands. These days I am trying to understand the cache strategies of the Fermi architecture (for performance optimization using local memory or cache). I am wondering how I could use asfermi to get this information? I looked into the wiki, but found no clue. Could you please give me some suggestions? Also, I am using OpenCL now. Does asfermi support OpenCL at the moment?
Regards,
Jianbin
Actually I don't know much about the cache.
I know L2 serves all kinds of memory operations.
L2 is the so-called unified cache: it acts as the L2 texture cache and the
L2 instruction cache, and L2 even serves host<->device transfers...
L1 is dedicated to local memory and the L1 data cache (and of course shared
memory).
When working as a cache, L1 behaves differently from shared memory.
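To make the L1/shared-memory split concrete, here is a short sketch (my own addition, not from this thread) using the CUDA runtime call that, on Fermi, selects how the 64 KB on-chip array is divided between L1 cache and shared memory for a given kernel; the kernel name is a placeholder.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; imagine it is cache-heavy rather than
// shared-memory-heavy.
__global__ void my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main(void)
{
    // On Fermi, prefer the 48 KB L1 / 16 KB shared split for this kernel.
    // (cudaFuncCachePreferShared would choose the opposite split.)
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    return 0;
}
```

The split is a per-kernel preference, not a guarantee; the driver may ignore it if the kernel's shared-memory usage does not fit the requested configuration.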
The throughput for L1 is 128 B/cycle (at SM frequency), or 64 B/cycle (at
SP frequency, for Fermi).
The throughput for L2 is 32 B/cycle per port (at SM frequency). Your card
may have 8 or fewer (for example, 4) ports, so not all the SMs can be
accessing the L2 at the same time.
The constant cache can be accessed by the SPs and the load/store units, while
the other caches can only be accessed by the load/store units.
The constant cache throughput is 4N bytes per cycle (according to NV), or 4*4N
(testing on sm_21 confirms this, contradicting the NV document), where N is
in [1,32].
And the constant cache MAY have very low latency, even compared to the
L1 cache (not confirmed).
That's all I know.
For cache replacement policies, check the PTX manual and/or do some
tests yourself.
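For anyone who wants to follow that suggestion, the usual starting point is a pointer-chasing microbenchmark. The sketch below is my own illustration (not from asfermi): each load depends on the previous one, so the average cycles per iteration approximate the latency of whichever cache level the working set fits in; sweeping the array size and stride exposes the cache boundaries.

```cuda
#include <cuda_runtime.h>

// Hypothetical latency probe: `next` holds a precomputed chain of indices
// (e.g. next[i] = (i + stride) % len, set up on the host). A single thread
// walks the chain so the loads are fully serialized and latency-bound.
__global__ void chase(const unsigned int *next, int iters,
                      unsigned int *out, long long *cycles_per_load)
{
    unsigned int j = 0;
    long long t0 = clock64();            // SM cycle counter (sm_20+)
    for (int i = 0; i < iters; ++i)
        j = next[j];                     // each load depends on the last
    long long t1 = clock64();
    *out = j;                            // keep the chain from being optimized away
    *cycles_per_load = (t1 - t0) / iters;
}
```

Run with one thread in one block; array sizes below 16 KB should stay in L1 (when it is enabled), sizes up to the L2 size should show L2 latency, and larger sizes should show DRAM latency. Loop overhead inflates the raw numbers slightly, so per-load latency is best read as a relative measure across sizes.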
Bark! Bark!
HuanHuan the big huge dog.
On 4/11/2012 4:23 PM, Hou Yunqing wrote:
> Hi Jianbin,
>
> Please join the asfermi mailing
> group<https://groups.google.com/forum/?fromgroups#!forum/asfermi>.