Hi Yunqing,
For the caches, I know there are quite a few: an L1 for each SM, and an L2 for the whole device. In principle, I want to know all (or most, if all is difficult) of the information about the caches so that I can explain the performance differences. For example, is there any data sharing between threads within one block via the L1 or L2 cache? If there is, how is synchronization maintained? When using the 'volatile' qualifier, do accesses still use the cache (if so, do they introduce extra overhead compared with normal cached accesses)?
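As a side note on the `volatile` question, here is a minimal CUDA sketch (my own illustration, not something from this thread or from asfermi) of the common understanding: `volatile` mainly stops the compiler from keeping a value in a register, forcing a fresh memory instruction for every access; those accesses still go through the L2 cache, so the cost is extra instructions rather than a full cache bypass.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: one block sets a flag, another spins on it.
// Without `volatile`, the compiler could hoist the load of *flag out of
// the loop and spin forever on a stale register copy.
__global__ void spin_on_flag(volatile int *flag, int *out)
{
    if (blockIdx.x == 0) {
        *flag = 1;                 // producer block publishes the flag
    } else {
        while (*flag == 0)
            ;                      // re-reads *flag from memory each time
        *out = 1;                  // observed the update
    }
}
```

This is only a sketch of the compiler-level effect; whether a given architecture also bypasses L1 for volatile loads is a separate hardware question.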
P.S. I don't mind the mail being forwarded.
Regards,
Jianbin
From: Hou Yunqing [hyq.n...@gmail.com]
Sent: Wednesday, April 11, 2012 4:22 AM
To: Jianbin Fang - EWI
Subject: Re: about cache strategies
Hi Jianbin,
Sure if you want I can put you as a committer, so you may contribute to the wiki pages of Stage 2.
As for the caches, which ones are you interested in? Is the L1 data cache for local memory the only one you care about? There are quite a few caches and I haven't done any work on any of them. Also, what are the things you want to know? Cache policy? Associativity? Bandwidth? Organisation?
BTW do you mind if I forward this conversation to asfermi's mailing group (asf...@googlegroups.com)?
Thanks,
Yunqing
On Wed, Apr 11, 2012 at 5:38 AM, Jianbin Fang - EWI <J.F...@tudelft.nl> wrote:
Again, I searched asfermi, and I found that you were looking for people interested in this project. But that was last July. Anyway, I am wondering whether you are still looking for someone interested in this project? I am very interested in it, especially the cache and local memory part. Is it already clear/finished? Is there anything still imperfect that I could join in on?
Regards,
Jianbin
From: Jianbin Fang - EWI
Sent: Tuesday, April 10, 2012 10:25 PM
To: hyq.n...@gmail.com
Subject: about cache strategies
Dear Yunqing Hou,
This is Jianbin Fang, a PhD student from Delft University of Technology, the Netherlands. These days I am trying to understand the cache strategies of the Fermi architecture (for performance optimization using local memory or cache). I am wondering how I could use asfermi to get this information? I looked into the wiki, but found no clue. Could you please give me some suggestions? Also, I am using OpenCL now. Does asfermi support OpenCL at the moment?
Regards,
Jianbin
Actually I don't know much about the cache.
I know L2 serves all kinds of memory operations.
L2 is the so-called unified cache: it acts as the L2 texture cache and the
L2 instruction cache, and L2 even serves host<->device transfers...
L1 is dedicated to local memory and the L1 data cache (and of course shared
memory).
When working as a cache, L1 behaves differently from shared memory.
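To make the L1/shared-memory split concrete, here is a short sketch (my own addition, not from this thread) using the CUDA runtime call that, on Fermi, selects how the 64 KB on-chip array is divided between L1 cache and shared memory for a given kernel; the kernel name is a placeholder.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; imagine it is cache-heavy rather than
// shared-memory-heavy.
__global__ void my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main(void)
{
    // On Fermi, prefer the 48 KB L1 / 16 KB shared split for this kernel.
    // (cudaFuncCachePreferShared would choose the opposite split.)
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    return 0;
}
```

The split is a per-kernel preference, not a guarantee; the driver may ignore it if the kernel's shared-memory usage does not fit the requested configuration.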
The throughput for L1 is 128 B/cycle (at SM frequency), or 64 B/cycle (at
SP frequency, for Fermi).
The throughput for L2 is 32 B/cycle per port (at SM frequency). Your card
may have 8 or fewer (for example, 4) ports, so not all the SMs can be
accessing the L2 at the same time.
The constant cache can be accessed by the SPs and the load/store units, while
the other caches can only be accessed by the load/store units.
The constant cache throughput is 4N bytes per cycle (according to NV), or 4*4N
(testing on sm_21 confirms this, contradicting the NV document), where N is
in [1,32].
And the constant cache MAY have very low latency, even compared to the
L1 cache (not confirmed).
That's all I know.
For cache replacement policies, check the PTX manual and/or do some
tests yourself.
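For anyone who wants to follow that suggestion, the usual starting point is a pointer-chasing microbenchmark. The sketch below is my own illustration (not from asfermi): each load depends on the previous one, so the average cycles per iteration approximate the latency of whichever cache level the working set fits in; sweeping the array size and stride exposes the cache boundaries.

```cuda
#include <cuda_runtime.h>

// Hypothetical latency probe: `next` holds a precomputed chain of indices
// (e.g. next[i] = (i + stride) % len, set up on the host). A single thread
// walks the chain so the loads are fully serialized and latency-bound.
__global__ void chase(const unsigned int *next, int iters,
                      unsigned int *out, long long *cycles_per_load)
{
    unsigned int j = 0;
    long long t0 = clock64();            // SM cycle counter (sm_20+)
    for (int i = 0; i < iters; ++i)
        j = next[j];                     // each load depends on the last
    long long t1 = clock64();
    *out = j;                            // keep the chain from being optimized away
    *cycles_per_load = (t1 - t0) / iters;
}
```

Run with one thread in one block; array sizes below 16 KB should stay in L1 (when it is enabled), sizes up to the L2 size should show L2 latency, and larger sizes should show DRAM latency. Loop overhead inflates the raw numbers slightly, so per-load latency is best read as a relative measure across sizes.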
Bark! Bark!
HuanHuan the big huge dog.
On 4/11/2012 4:23 PM, Hou Yunqing wrote:
> Hi Jianbin,
>
> Please join the asfermi mailing
> group<https://groups.google.com/forum/?fromgroups#!forum/asfermi>.