Regarding locks in shared caches


sparsh mittal

Mar 27, 2012, 6:53:00 PM
to snip...@googlegroups.com
Hello


I was seeing this code in memory_manager.cc:

   if (m_core_id_master == getCore()->getId()) {
      m_cache_cntlrs[(UInt32)m_last_level_cache]->createSetLocks(
         getCacheBlockSize(),
         k_KILO * cache_parameters[MemComponent::L1_DCACHE].size / (cache_parameters[MemComponent::L1_DCACHE].associativity * getCacheBlockSize()),
         m_core_id_master,
         cache_parameters[m_last_level_cache].shared_cores
      );
   }

What I could not understand is that the last-level cache is the L2, which has a different number of sets than the L1D, yet the number of locks is computed from the L1D parameters. What does this accomplish? In my understanding, for a shared L2 the number of locks should equal the number of L2 sets, but maybe I am wrong.


Thanks and Regards
Sparsh Mittal


Wim Heirman

Mar 28, 2012, 10:08:00 AM
to snip...@googlegroups.com
Hi Sparsh,

The locking scheme in the shared caches is quite complex, but the main
goal is to allow as much independent progress while still making sure
there are no race conditions and no possibility for deadlock. To this
end, there is a shared/exclusive lock protecting the shared caches.
Multiple L1 accesses can go on at the same time, so they take a
non-exclusive lock. Once you need to access the shared L2, an
exclusive lock is requested.
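
The shared/exclusive behaviour described above can be sketched as follows. This is only an illustration built on the C++ standard library, not Sniper's actual lock implementation; the class name `SharedExclusiveLock` and all of its methods are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical shared/exclusive lock: any number of holders in shared mode
// (L1 accesses), or exactly one holder in exclusive mode (an L2 access).
class SharedExclusiveLock {
public:
    // Shared mode: blocks only while an exclusive holder is active.
    void acquireShared() {
        std::unique_lock<std::mutex> guard(m_mutex);
        m_cv.wait(guard, [this] { return !m_exclusive; });
        ++m_shared_holders;
    }
    // Non-blocking variant: returns false instead of waiting.
    bool tryAcquireShared() {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_exclusive) return false;
        ++m_shared_holders;
        return true;
    }
    void releaseShared() {
        std::lock_guard<std::mutex> guard(m_mutex);
        assert(m_shared_holders > 0);
        if (--m_shared_holders == 0) m_cv.notify_all();
    }
    // Exclusive mode: waits until nobody holds the lock in either mode.
    void acquireExclusive() {
        std::unique_lock<std::mutex> guard(m_mutex);
        m_cv.wait(guard, [this] { return !m_exclusive && m_shared_holders == 0; });
        m_exclusive = true;
    }
    // Non-blocking variant: returns false instead of waiting.
    bool tryAcquireExclusive() {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_exclusive || m_shared_holders != 0) return false;
        m_exclusive = true;
        return true;
    }
    void releaseExclusive() {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_exclusive = false;
        m_cv.notify_all();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_shared_holders = 0;
    bool m_exclusive = false;
};
```

In this sketch, two L1 accesses can both call `acquireShared()` and proceed, while an L2 access calling `acquireExclusive()` waits until they have released. The sketch makes no attempt to prevent starvation of exclusive requesters.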

Operations in the L2 can affect other L1s than your own. For example,
if your L2 access causes an eviction of a line out of L2 that's also
in someone else's L1, because the L2 is inclusive, this causes an
invalidation in that other L1. Therefore, all L2 accesses need to lock
all L1s sharing that L2 cache. (That's why the shared/exclusive lock
is used). Moreover, to avoid deadlocks you cannot just grab locks as
you need them (That's why an L2 access doesn't simply lock each L1
it wants to touch, but immediately locks all of them).

The next optimization was to not have a single lock for the whole L2,
but one per set. Because the reason for the lock is to - in the
scenario above - avoid concurrent access to the line that will be
invalidated in the other L1, those locks need to be per *L1* set
rather than per L2 set.
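
As a worked example of the set arithmetic, the sketch below mirrors the expression passed to createSetLocks() in the original question. The cache sizes used afterwards, the value of `k_KILO` (taken to be 1024), and the simple modulo set-index function are assumptions made for illustration, not values or code taken from Sniper.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value of k_KILO; cache sizes are given in KB.
constexpr std::uint64_t k_KILO = 1024;

// Number of sets, following the expression in the createSetLocks() call:
// size in KB, block size in bytes.
constexpr std::uint64_t numSets(std::uint64_t size_kb,
                                std::uint64_t associativity,
                                std::uint64_t block_size) {
    return k_KILO * size_kb / (associativity * block_size);
}

// Assumed index function: block number modulo the number of sets.
constexpr std::uint64_t setIndex(std::uint64_t address,
                                 std::uint64_t block_size,
                                 std::uint64_t num_sets) {
    return (address / block_size) % num_sets;
}
```

With a hypothetical 32 KB, 8-way L1D and a 512 KB, 8-way L2, both with 64-byte blocks, `numSets` gives 64 L1 sets and 1024 L2 sets. Addresses `0x12340` and `0x12340 + 1024 * 64` both fall in L2 set 141 and both fall in L1 set 13. Under modulo indexing this holds whenever the number of L2 sets is a multiple of the number of L1 sets, so a lock chosen by L1 set index also covers every line of the corresponding L2 set.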

This is one of the most complicated parts of the simulator; it took a
while to get those shared caches working correctly (and quickly). Once
you get through that, everything else in Sniper will be a breeze to
understand ;-)

Regards,
Wim

> --
> You received this message because you are subscribed to the Google
> Groups "Sniper simulator" group.
> To post to this group, send email to snip...@googlegroups.com
> To unsubscribe from this group, send email to
> snipersim+...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/snipersim?hl=en

sparsh mittal

Mar 28, 2012, 10:33:54 AM
to snip...@googlegroups.com
Thanks a lot, Wim, for this insightful answer, which is very helpful.


Thanks and Regards
Sparsh Mittal




sparsh mittal

Mar 29, 2012, 7:00:46 PM
to snip...@googlegroups.com
Hello Wim

Could you give some pointers, or just a reference to look up, to understand how synchronization is used in Sniper? I know that Sniper uses pthreads, but I thought I would ask to get a specific answer. My apologies for taking your time, but that would greatly help me and others as well.

Thanks for replying.


Thanks and Regards
Sparsh Mittal




Wim Heirman

Mar 30, 2012, 4:18:06 AM
to snip...@googlegroups.com
Hi Sparsh,

I'm not sure where you're going but here are a few remarks that might
be interesting:

* There are three types of threads. First, for each thread spawned by
the application, Pin starts a thread; these run both the application
(the functional part of the simulation) and all Pin instrumentation
functions (which in turn call most of the timing models). Additionally,
using PIN_SpawnInternalThread, more threads (the "network" threads) are
created (one per core) to handle coherency messages (incoming
invalidations/writeback requests, so the application threads don't need
to check for this). Finally, there is a set of MCP (master controller)
threads that handle application synchronization (emulation of SYS_futex
system calls and pthread_mutex/cond/barrier) and simulator control
messages.

* You're not allowed to use pthreads inside the simulator (this is a
Pin constraint). But PIN_SpawnInternalThread works well. Note that
it's encapsulated in a virtual Thread class and a weak linking trick.
PthreadThread implements a version of Thread that uses pthreads, but
is marked with __attribute__((weak)). PinThread implements the same
interface but uses PIN_SpawnInternalThread. If you're building the Pin
version of the simulator, PinThread takes precedence, but in the
standalone version (for replaying traces), the pthreads-based version
is used. A similar setup is used for Lock, which either resolves to
PthreadLock (pthread_mutex) or PinLock (PIN_InitLock).

* All data structures potentially accessed concurrently by two threads
need to be protected by locks. The core performance models are pretty
much private to the application thread for that core, so you won't
find locks in there. But the shared caches are shared by multiple
application threads, and multiple network threads, so they use a more
complex locking scheme (see my other mail).

* You'll also find a ScopedLock class. It takes a lock in its
constructor, and releases it in the destructor. The cool thing is that
you only need to define the variable and the locking happens
implicitly. So these two pieces of code are equivalent, given that
lock already exists and is of type Lock:

   {
      lock.acquire();
      // do stuff
      lock.release();
   }

   {
      ScopedLock sl(lock);
      // do stuff
   }
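
For readers unfamiliar with the pattern, here is a minimal sketch of how such a scoped lock can be written. The `Lock` class below is a hypothetical stand-in backed by `std::mutex` (Sniper's own Lock resolves to PthreadLock or PinLock), and `isHeld()` and `lookup()` exist only to make the example observable.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the simulator's Lock type.
class Lock {
public:
    void acquire() { m_mutex.lock(); m_held = true; }
    void release() { m_held = false; m_mutex.unlock(); }
    // Observation helper for this example only.
    bool isHeld() const { return m_held; }

private:
    std::mutex m_mutex;
    std::atomic<bool> m_held{false};
};

// Acquires the lock on construction and releases it on destruction.
class ScopedLock {
public:
    explicit ScopedLock(Lock& lock) : m_lock(lock) { m_lock.acquire(); }
    ~ScopedLock() { m_lock.release(); }
    // Copying would release the same lock twice, so forbid it.
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    Lock& m_lock;
};

// The lock is released on every path out of the function,
// including the early return.
int lookup(Lock& lock, bool hit) {
    ScopedLock sl(lock);
    if (hit) return 1;
    return 0;
}
```

One practical difference from the explicit acquire()/release() form is that the scoped version also releases the lock on early returns and exceptions, where a manual release() is easy to forget.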


Hope this helps, if you have further or more specific questions I'll
be glad to try and help out.

Regards,
Wim

sparsh mittal

Mar 30, 2012, 8:34:12 AM
to snip...@googlegroups.com
Wim and Trevor

Firstly, my best wishes for the ISPASS tutorial presentation. Also, thank you for your time and detailed explanation.

Regarding ScopedLock, I understand now (I also see that it can be used for locking IO). Great.

My experimental need is the following. I need to implement an algorithm that runs at the L2 at regular intervals (e.g. every 15M cycles). With a cycle-accurate simulator, a 15M-cycle interval is easy to implement. With the interval core model and multi-core simulation, I was wondering how to flag that it is time for the algorithm to run. It does not have to be exactly 15M cycles; within a few hundred cycles is also OK. For example, once all cores have reached nearly 15M cycles, a flag should be set. In my opinion, the best time to set the flag is the next time the cores are synchronized after 15M cycles.

The second need is to lock/disable the whole L2 for some time, until the algorithm is done. During this time, the cores, L1s, etc. can keep working. If we cannot lock the L2 alone, disabling the cores as well is OK, if that is possible.

Can you tell me how I should do this? I would be grateful.

Thanks and Regards
Sparsh Mittal




Wim Heirman

Mar 30, 2012, 8:48:22 AM
to snip...@googlegroups.com
Sparsh,

Using the hooks system, you can request a callback which will happen
at every barrier synchronization (100 ns by default). Include this in
the constructor of your cache:

   Sim()->getHooksManager()->registerHook(HookType::HOOK_PERIODIC,
      (HooksManager::HookCallbackFunc)myFunction, (void*)arg);

With myFunction being of this signature:

   void myFunction(void* arg, subsecond_time_t simtime);

You'll probably want to register one of these from every L2 object you
have, passing the this pointer as the argument. Note though that there
is a CacheCntlr object for each core - each of these has its own set
of access/hit/miss counters but all of these that correspond to the
same shared L2 point to a shared CacheMasterCntlr which contains all
simulated tag data.

This function will be called *inside* the barrier sync. At this point
it's guaranteed that no core is running and you should be free to
touch all of your L2's state (it'll actually run in the MCP thread,
not the application thread which usually runs your L2, but that should
be fine). In your function, use "simtime" to figure out whether it is
time to run the algorithm; don't just change the barrier interval to
15M cycles, or you'll start to see inaccuracies. As long as myFunction
doesn't return, no core will make further progress, so simulated time
is effectively stopped.
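
The "use simtime to decide" step could look roughly like the sketch below. It is self-contained rather than tied to Sniper's types: simulated time is represented as plain nanoseconds in a `std::uint64_t` instead of subsecond_time_t, and `IntervalTrigger` and `myFunctionSketch` are hypothetical names. Converting a cycle count such as 15M into a time interval depends on the core frequency and is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper owned by the L2 object: decides at each periodic
// callback whether the configured interval has elapsed.
class IntervalTrigger {
public:
    explicit IntervalTrigger(std::uint64_t interval_ns)
        : m_interval_ns(interval_ns), m_next_ns(interval_ns) {
        assert(interval_ns > 0);
    }

    // Returns true at the first callback at or after each interval boundary.
    bool shouldRun(std::uint64_t now_ns) {
        if (now_ns < m_next_ns) return false;
        // Schedule the next boundary strictly after the current time, so
        // boundaries that were skipped do not trigger repeated runs.
        m_next_ns = (now_ns / m_interval_ns + 1) * m_interval_ns;
        return true;
    }

private:
    std::uint64_t m_interval_ns;
    std::uint64_t m_next_ns;
};

// Shape of the periodic callback, with the trigger passed as the argument.
void myFunctionSketch(void* arg, std::uint64_t now_ns) {
    IntervalTrigger* trigger = static_cast<IntervalTrigger*>(arg);
    if (trigger->shouldRun(now_ns)) {
        // Run the L2 algorithm here; no core is running at this point.
    }
}
```

Because the check only fires at barrier boundaries, the algorithm runs within one barrier quantum after the interval boundary rather than exactly on it, which matches the "does not have to be precise" requirement.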

I don't know if you want to have your algorithm take up any simulated
time or whether it can be assumed to complete instantaneously, but if
it does take time, you can just store its completion time somewhere
in the L2, and if an L1 access comes in with a timestamp before your
thing is supposed to be done, just return some extra latency.
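
One possible reading of "return some extra latency" is to charge the access the time remaining until the algorithm completes. That choice, the function name `extraLatencyNs`, and the use of plain nanoseconds are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Extra latency for an access arriving at access_time_ns while the L2 is
// busy until busy_until_ns: the remaining busy time, or zero if the
// algorithm has already completed.
std::uint64_t extraLatencyNs(std::uint64_t access_time_ns,
                             std::uint64_t busy_until_ns) {
    return access_time_ns < busy_until_ns ? busy_until_ns - access_time_ns : 0;
}
```

For example, with a completion time of 1000 ns, an access stamped 900 ns would be delayed by 100 ns, while an access stamped 1500 ns would see no extra delay.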

Regards,
Wim

sparsh mittal

Apr 3, 2012, 4:40:49 PM
to snip...@googlegroups.com
On Wed, Mar 28, 2012 at 9:08 AM, Wim Heirman <w...@heirman.net> wrote:
> Operations in the L2 can affect other L1s than your own. For example,
> if your L2 access causes an eviction of a line out of L2 that's also
> in someone else's L1, because the L2 is inclusive, this causes an
> invalidation in that other L1. Therefore, all L2 accesses need to lock
> all L1s sharing that L2 cache. (That's why the shared/exclusive lock
> is used). Moreover, to avoid deadlocks you cannot just grab locks as
> you need them (That's why an L2 access doesn't simply lock each L1
> it wants to touch, but immediately locks all of them).
>
> The next optimization was to not have a single lock for the whole L2,
> but one per set. Because the reason for the lock is to - in the
> scenario above - avoid concurrent access to the line that will be
> invalidated in the other L1, those locks need to be per *L1* set
> rather than per L2 set.

Hello Wim

Regarding this locking system, I have a few questions. I am using private L1s and a shared L2, and I am working on implementing 'thread-aware' policies in the L2. This locking is causing a problem. When an access misses in both the L1 and the L2, cache_cntlr.cc performs the step "have the next cache levels fill themselves with the new data" and then fails with "Tried to read in next-level cache, but data is already gone".

This suggests that the locking also has to be thread-aware. But since locking is per L1 set, and the L1s are thread-aware even in the current implementation, I could not understand what is causing this problem. If locking were per L2 set, I could imagine that a particular L2 set has to be locked based on the thread.

Can you explain or give some pointers?

Thanks a lot.
Sparsh

Wim Heirman

Apr 3, 2012, 5:17:02 PM
to snip...@googlegroups.com
Hi Sparsh,

> Regarding this locking system, I have few questions to clarify. I am using
> private L1 and shared L2 and I am working to implement 'thread-aware'
> policies in L2. So this locking is causing problem. When an access misses in
> L1 and also in L2, then cache_cntlr.cc performs, "have the next cache levels
> fill themselves with the new data" and then it finds "Tried to read in
> next-level cache, but data is already gone".
>
> This means that locking also has to be thread-aware. But since locking is
> per L1 set, and L1 are thread-aware even in current implementation, I could
> not understand what is causing this problem.  If locking were to be per L2
> set, I could think that, based on a thread, a particular L2set has to be
> locked.

Since an eviction from the shared L2 can potentially require a line in
a different core's *L1* to be invalidated, all accesses to all
potentially affected L1 lines need to be locked. Usually this means
that all lines in the L2 set you're working in are also locked, unless
the associativity in the L1 is larger than that in the L2. If this is
the case, or if you have made any other source modifications, this
assumption may have been broken.

There's an easy change to make the locking more conservative. In
CacheMasterCntlr::getSetLock (cache_cntlr.cc line 87), always return
the same lock by changing line 89 to
   return &m_setlocks.at(0);
This way, only one access *in total* (in contrast to one access per
set as in the default case) can be going on at the same time for your
complete L2. It might slow down simulation if you have a lot of L1
misses, but at least it should be a safe starting point.
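
To make the trade-off concrete, here is a self-contained sketch of a per-set lock table with a conservative single-lock mode. It is not the Sniper source: the class `SetLockTable`, the use of `std::mutex` as the lock type, the `conservative` flag, and the modulo index function are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical table of locks, one per L1 set.
class SetLockTable {
public:
    SetLockTable(std::uint64_t block_size, std::size_t num_l1_sets, bool conservative)
        : m_block_size(block_size),
          m_setlocks(num_l1_sets),
          m_conservative(conservative) {
        assert(block_size > 0 && num_l1_sets > 0);
    }

    // Default mode: pick the lock for the L1 set the address maps to.
    // Conservative mode: always return lock 0, which serializes all
    // accesses to the shared cache.
    std::mutex* getSetLock(std::uint64_t address) {
        if (m_conservative) return &m_setlocks.at(0);
        return &m_setlocks.at((address / m_block_size) % m_setlocks.size());
    }

private:
    std::uint64_t m_block_size;
    std::vector<std::mutex> m_setlocks;
    bool m_conservative;
};
```

In the default mode, accesses to different sets get different locks and can proceed concurrently; in the conservative mode every address yields the same lock, which is what the one-line change achieves.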

Regards,
Wim

sparsh mittal

Apr 4, 2012, 4:32:00 PM
to snip...@googlegroups.com
Thanks. It worked without the change; there was some other error in my code. Thanks for your explanation.

Info: On changing to &m_setlocks.at(0), we lose ~200 KIPS for a 4-core simulation with a 200 ns quantum.



Thanks and Regards
Sparsh Mittal




sparsh mittal

Apr 10, 2012, 5:47:24 PM
to snip...@googlegroups.com
Hello Wim
Greetings.

As mentioned above, no core makes progress while myFunction is running. I have some questions (assuming myFunction completes instantaneously).
1. Inside this function, I need to send a flush message to DRAM. Should I do this outside myFunction (i.e. once the sync is done)? I was thinking so because, since the cores are not running, the DRAM should also not be running.
2. In the ISPASS tutorial, I saw real time vs. simulated time. On pthread_cond_wait, exact sync does not happen. Can this pose a limitation?
3. When I try to call sendMsg inside myFunction with PrL1PrL2DramDirectoryMSI::ShmemMsg::FLUSH_REP and ShmemPerfModel::_USER_THREAD, this line fails:
   assert(! this->m_sharers[sharer_id]);
   (file: directory_entry_limited_no_broadcast.h)

Can you help in this regard? Thanks a lot.

Wim Heirman

Apr 11, 2012, 11:22:16 AM
to snip...@googlegroups.com
Hi Sparsh,

> As said above, that during the working of myFunction, no core will make
> progress, I have some questions (assuming myFunction completes
> instantaneously).
> 1. Inside this function, I need to send flush message to dram. Should I do
> this outside myFunction (i.e. when sync has been done). I was thinking so,
> because, since cores are not running, dram should also not be running.

You can still access the DRAM controllers, they run in a different thread.

> 2. In ISPASS tutorial, I saw real-time v/s simulated time. On
> pthread_cond_wait, exact sync does not happen. Can it pose a limitation?

I'm not sure what you mean with exact sync. From a functional point of
view, there will be no error as the thread cannot continue. Also note
the time scales: barrier synchronization happens every 100ns by
default, pthread_cond_* synchronization typically happens at much
larger time scales (microseconds up to seconds).

> 3. When I try to do sendMsg inside myFunction with
> PrL1PrL2DramDirectoryMSI::ShmemMsg::FLUSH_REP and
> ShmemPerfModel::_USER_THREAD
> this line causes failure:
> assert(! this->m_sharers[sharer_id]);
> file: directory_entry_limited_no_broadcast.h

It looks like you're sending this message for a line that you're not
supposed to have, or maybe to the wrong DRAM controller. What you're
doing (if I understand it correctly at least) is the same as when a
dirty line is evicted, take a look at cache_cntlr.cc starting at line
938 which does exactly that.

Regards,
Wim

sparsh mittal

Apr 11, 2012, 2:36:53 PM
to snip...@googlegroups.com
Thanks for your explanations and answers. I have one clarification to seek (inline)


> It looks like you're sending this message for a line that you're not
> supposed to have, or maybe to the wrong DRAM controller. What you're
> doing (if I understand it correctly at least) is the same as when a
> dirty line is evicted, take a look at cache_cntlr.cc starting at line
> 938 which does exactly that.

 
Yes, I am trying to write back dirty and clean (shared) lines. For dirty lines, the code I already mentioned (line 938) is used. For clean lines, the else branch of 938 (at line 952) is used. However, I am surprised: with a shared L2, I should have only one DRAM controller, so how can it be the 'wrong' DRAM controller?
Also, while removing lines from the L2 cache, I take care of updating the previous levels (inside the locking), cloning the block info, and updating m_cache_sharers; still, the error "Assertion `! this->m_sharers[sharer_id]' failed" occurs.
Can you please give some more directions? Thanks a lot for your time.
 
Sparsh

Wim Heirman

Apr 12, 2012, 5:24:14 AM
to snip...@googlegroups.com
Sparsh,

If you can send me a diff of the changes you made (you can mail it to
me directly), I can take a look at what might be going wrong.

Regards,
Wim
