The locking scheme in the shared caches is quite complex, but the main
goal is to allow as much independent progress as possible while still
making sure there are no race conditions and no possibility of
deadlock. To this end, there is a shared/exclusive lock protecting the
shared caches.
Multiple L1 accesses can go on at the same time, so they take a
non-exclusive lock. Once you need to access the shared L2, an
exclusive lock is requested.
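As a rough illustration of the shared/exclusive idea (this is not
Sniper's actual code), the sketch below uses C++17's std::shared_mutex.
CacheGroup, accessL1 and accessL2 are hypothetical names, and the
per-core counter is only a stand-in for real L1 state:

```cpp
#include <array>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

constexpr int kNumL1 = 4;  // hypothetical: four private L1s share one L2

// Hypothetical stand-in for a group of L1s in front of one shared L2.
struct CacheGroup {
    std::shared_mutex lock;                          // shared/exclusive lock
    std::array<uint32_t, kNumL1> l1_valid_lines{};   // stand-in for L1 state
    uint64_t l2_accesses = 0;                        // stand-in for L2 state
};

// An L1 access only touches its own L1, so a shared (non-exclusive)
// lock is enough and several L1 accesses can proceed concurrently.
void accessL1(CacheGroup& g, int core) {
    std::shared_lock<std::shared_mutex> guard(g.lock);
    ++g.l1_valid_lines[core];
}

// An L2 access may invalidate a line in another core's L1, so it takes
// the lock exclusively, which keeps every L1 access out at once.
void accessL2(CacheGroup& g, int other_core) {
    std::unique_lock<std::shared_mutex> guard(g.lock);
    ++g.l2_accesses;
    if (g.l1_valid_lines[other_core] > 0)
        --g.l1_valid_lines[other_core];  // models an inclusive-L2 invalidation
}
```

Because the exclusive side is a single lock, an L2 access never has to
pick up per-L1 locks one by one, which is what rules out deadlock in
this sketch.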
Operations in the L2 can affect L1s other than your own. For example,
if your L2 access causes an eviction of a line out of L2 that's also
in someone else's L1, then because the L2 is inclusive, this causes an
invalidation in that other L1. Therefore, all L2 accesses need to lock
all L1s sharing that L2 cache. (That's why the shared/exclusive lock
is used.) Moreover, to avoid deadlocks you cannot just grab locks as
you need them. (That's why an L2 access doesn't simply lock each L1 it
wants to touch, but immediately locks all of them.)
The next optimization was to not have a single lock for the whole L2,
but one per set. Because the purpose of the lock is, in the scenario
above, to avoid concurrent access to the line that will be invalidated
in the other L1, those locks need to be per *L1* set rather than per
L2 set.
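To make the "per L1 set" point concrete, here is a minimal sketch of a
lock table indexed by L1 set. SetLocks, l1SetIndex and lockFor are
hypothetical names, and the simple modulo indexing is an assumption for
illustration, not necessarily how Sniper computes its set index:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical lock table with one lock per *L1* set.
struct SetLocks {
    uint64_t line_size;
    uint64_t num_l1_sets;
    std::vector<std::mutex> locks;

    SetLocks(uint64_t line_size, uint64_t num_l1_sets)
        : line_size(line_size), num_l1_sets(num_l1_sets), locks(num_l1_sets) {}

    // Assumed indexing: line number modulo the number of L1 sets.
    uint64_t l1SetIndex(uint64_t addr) const {
        return (addr / line_size) % num_l1_sets;
    }

    // Any two addresses that can collide in an L1 set share this lock,
    // so an invalidation cannot race with an access to the same L1 set.
    std::mutex& lockFor(uint64_t addr) {
        return locks[l1SetIndex(addr)];
    }
};
```

With 64-byte lines and 64 L1 sets, addresses 0x0000 and 0x1000 land in
the same L1 set and therefore share a lock, while 0x0040 maps to the
next set and gets a different one.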
This is one of the most complicated parts of the simulator; it took a
while to get those shared caches working correctly (and quickly). Once
you get through that, everything else in Sniper will be a breeze to
understand ;-)
Regards,
Wim
I'm not sure where you're going but here are a few remarks that might
be interesting:
* There are several types of threads. First, there are the threads
spawned by the application: Pin starts a thread for each of these, and
they run both the application (the functional part of the simulation)
and all Pin instrumentation functions (which in turn call most of the
timing models). Additionally, using PIN_SpawnInternalThread, more
threads (the "network" threads) are created (one per core) to handle
coherency messages (incoming invalidations/writeback requests, so the
application threads don't need to check for these). Finally, there is
a set of MCP (master controller) threads that handle application
synchronization (emulation of SYS_futex system calls and
pthread_mutex/cond/barrier) and simulator control messages.
* You're not allowed to use pthreads inside the simulator (this is a
Pin constraint). But PIN_SpawnInternalThread works well. Note that
it's encapsulated in a virtual Thread class and a weak linking trick.
PthreadThread implements a version of Thread that uses pthreads, but
is marked with __attribute__((weak)). PinThread implements the same
interface but uses PIN_SpawnInternalThread. If you're building the Pin
version of the simulator, PinThread takes precedence, but in the
standalone version (for replaying traces), the pthreads-based version
is used. A similar setup is used for Lock, which either resolves to
PthreadLock (pthread_mutex) or PinLock (PIN_InitLock).
* All data structures potentially accessed concurrently by two threads
need to be protected by locks. The core performance models are pretty
much private to the application thread for that core, so you won't
find locks in there. But the shared caches are shared by multiple
application threads, and multiple network threads, so they use a more
complex locking scheme (see my other mail).
* You'll also find a ScopedLock class. It takes a lock in its
constructor, and releases it in the destructor. The cool thing is that
you only need to define the variable and the locking happens
implicitly. So these two pieces of code are equivalent, given that
lock already exists and is of type Lock:
{
    lock.acquire();
    // do stuff
    lock.release();
}

{
    ScopedLock sl(lock);
    // do stuff
}
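For readers unfamiliar with the pattern, a minimal sketch of how such a
scoped lock can be built is shown here. The Lock class below is a
hypothetical stand-in based on std::mutex (inside Pin you would use the
simulator's own Lock instead), and held() and updateCounter exist only
to make the behaviour observable in a single-threaded example:

```cpp
#include <mutex>

// Hypothetical stand-in for the simulator's Lock class.
class Lock {
public:
    void acquire() { m_mutex.lock(); m_held = true; }
    void release() { m_held = false; m_mutex.unlock(); }
    bool held() const { return m_held; }  // illustration only, not thread-safe
private:
    std::mutex m_mutex;
    bool m_held = false;
};

// Acquires in the constructor, releases in the destructor (RAII).
class ScopedLock {
public:
    explicit ScopedLock(Lock& lock) : m_lock(lock) { m_lock.acquire(); }
    ~ScopedLock() { m_lock.release(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
private:
    Lock& m_lock;
};

// The lock is released on every path out of the function, including
// the early return, without an explicit release() call.
int updateCounter(Lock& lock, int& counter, bool bail_out) {
    ScopedLock sl(lock);
    if (bail_out)
        return -1;
    return ++counter;
}
```

The main practical benefit over explicit acquire()/release() pairs is
that early returns and exceptions cannot leave the lock held.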
Hope this helps, if you have further or more specific questions I'll
be glad to try and help out.
Regards,
Wim
Using the hooks system, you can request a callback which will happen
at every barrier synchronization (100 ns by default). Include this in
the constructor of your cache:
Sim()->getHooksManager()->registerHook(HookType::HOOK_PERIODIC,
(HooksManager::HookCallbackFunc)myFunction, (void*)arg);
With myFunction being of this signature:
void myFunction(void* arg, subsecond_time_t simtime);
You'll probably want to register one of these from every L2 object you
have, passing the this pointer as the argument. Note though that there
is a CacheCntlr object for each core - each of these has its own set
of access/hit/miss counters but all of these that correspond to the
same shared L2 point to a shared CacheMasterCntlr which contains all
simulated tag data.
This function will be called *inside* the barrier sync. At this point
it's guaranteed that no core is running and you should be free to
touch all of your L2's state (it'll actually run in the MCP thread,
not the application thread which usually runs your L2, but that should
be fine). In your function, use "simtime" to figure out whether it is
time to run your algorithm. Don't just change the barrier interval to
15M cycles, or you'll start to see inaccuracies. As long as myFunction
doesn't return, no core will make further progress, so simulated time
is effectively stopped.
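A minimal sketch of such a callback follows. It is not tied to Sniper's
headers: sim_time_fs_t is a plain integer standing in for
subsecond_time_t, and MyL2Policy, interval, next_run and repartition()
are hypothetical names for whatever state and algorithm your L2 uses:

```cpp
#include <cstdint>

// Stand-in for Sniper's subsecond_time_t (assumption: a plain count of
// time units is enough for this illustration).
typedef uint64_t sim_time_fs_t;

// Hypothetical per-L2 state for a periodically running algorithm.
struct MyL2Policy {
    sim_time_fs_t interval;   // how often the algorithm should run
    sim_time_fs_t next_run;   // simulated time of the next run
    unsigned runs;            // number of times the algorithm has run

    explicit MyL2Policy(sim_time_fs_t interval)
        : interval(interval), next_run(interval), runs(0) {}

    void repartition() { ++runs; }  // placeholder for the real algorithm
};

// Same shape as the callback described above: called at every barrier,
// but it only does work once enough simulated time has passed.
void myFunction(void* arg, sim_time_fs_t simtime) {
    MyL2Policy* policy = static_cast<MyL2Policy*>(arg);
    if (simtime < policy->next_run)
        return;                      // not yet time, keep the barrier cheap
    policy->repartition();
    policy->next_run = simtime + policy->interval;
}
```

This keeps the barrier interval at its default and lets the callback
itself decide, from simulated time, when the expensive work happens.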
I don't know if you want to have your algorithm take up any simulated
time or whether it can be assumed to complete instantaneously, but if
it does take time, you can just store its completion time somewhere
in the L2, and if an L1 access comes in with a timestamp before your
algorithm is supposed to be done, just return some extra latency.
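The completion-time idea can be sketched as follows. L2Busy, busy_until,
startAlgorithm and extraLatency are hypothetical names, and sim_time_t
is a plain integer stand-in for the simulator's time type:

```cpp
#include <cstdint>

typedef uint64_t sim_time_t;  // stand-in for the simulator's time type

// Hypothetical bookkeeping kept alongside the L2.
struct L2Busy {
    sim_time_t busy_until = 0;  // simulated time at which the algorithm finishes

    // Record when the algorithm, started at 'now', will be done.
    void startAlgorithm(sim_time_t now, sim_time_t duration) {
        busy_until = now + duration;
    }

    // Extra latency to add to an access arriving at 'access_time':
    // the remaining busy time, or zero if the algorithm is already done.
    sim_time_t extraLatency(sim_time_t access_time) const {
        return access_time < busy_until ? busy_until - access_time : 0;
    }
};
```

For example, if the algorithm starts at time 1000 and takes 500, an
access stamped 1200 would be charged 300 extra, and an access stamped
1500 or later would be charged nothing.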
Regards,
Wim
> Regarding this locking system, I have a few questions to clarify. I am using
> private L1 and shared L2 and I am working to implement 'thread-aware'
> policies in L2. So this locking is causing a problem. When an access misses in
> L1 and also in L2, then cache_cntlr.cc performs, "have the next cache levels
> fill themselves with the new data" and then it finds "Tried to read in
> next-level cache, but data is already gone".
>
> This means that locking also has to be thread-aware. But since locking is
> per L1 set, and L1 are thread-aware even in current implementation, I could
> not understand what is causing this problem. If locking were to be per L2
> set, I could think that, based on a thread, a particular L2set has to be
> locked.
Since an eviction from the shared L2 can potentially require a line in
a different core's *L1* to be invalidated, all accesses to all
potentially affected L1 lines need to be locked. Usually this means
that all lines in the L2 set you're working in are also locked, unless
the associativity in the L1 is larger than that in the L2. If this is
the case, or if you have made any other source modifications, this
assumption may have been broken.
There's an easy change to make the locking more conservative. In
CacheMasterCntlr::getSetLock (cache_cntlr.cc line 87), always return
the same lock by changing line 89 to
return &m_setlocks.at(0);
This way, only one access *in total* (in contrast to one access per
set as in the default case) can be going on at the same time for your
complete L2. It might slow down simulation if you have a lot of L1
misses, but at least it should be a safe starting point.
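The effect of that one-line change can be illustrated with a small
stand-alone sketch. SetLockTable and its conservative flag are
hypothetical; only the `&m_setlocks.at(0)` expression comes from the
change described above, and the per-set branch is an assumed default:

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical lock table mimicking the shape of getSetLock.
class SetLockTable {
public:
    SetLockTable(std::size_t num_sets, bool conservative)
        : m_setlocks(num_sets), m_conservative(conservative) {}

    std::mutex* getSetLock(uint64_t set_index) {
        if (m_conservative)
            return &m_setlocks.at(0);   // one lock for the complete L2
        // Assumed default: one lock per set.
        return &m_setlocks.at(set_index % m_setlocks.size());
    }

private:
    std::vector<std::mutex> m_setlocks;
    bool m_conservative;
};
```

In conservative mode every set index resolves to the same lock, so all
accesses serialize; in the default mode accesses to different sets can
still overlap.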
Regards,
Wim
> As said above, during the execution of myFunction no core will make
> progress. I have some questions (assuming myFunction completes
> instantaneously).
> 1. Inside this function, I need to send a flush message to DRAM. Should I do
> this outside myFunction (i.e. when the sync has been done)? I was thinking so
> because, since the cores are not running, DRAM should also not be running.
You can still access the DRAM controllers, they run in a different thread.
> 2. In ISPASS tutorial, I saw real-time v/s simulated time. On
> pthread_cond_wait, exact sync does not happen. Can it pose a limitation?
I'm not sure what you mean by exact sync. From a functional point of
view, there will be no error as the thread cannot continue. Also note
the time scales: barrier synchronization happens every 100ns by
default, whereas pthread_cond_* synchronization typically happens at
much larger time scales (microseconds up to seconds).
> 3. When I try to do sendMsg inside myFunction with
> PrL1PrL2DramDirectoryMSI::ShmemMsg::FLUSH_REP and
> ShmemPerfModel::_USER_THREAD
> this line causes failure:
> assert(! this->m_sharers[sharer_id]);
> file: directory_entry_limited_no_broadcast.h
It looks like you're sending this message for a line that you're not
supposed to have, or maybe to the wrong DRAM controller. What you're
doing (if I understand it correctly at least) is the same as when a
dirty line is evicted. Take a look at cache_cntlr.cc starting at line
938, which does exactly that.
Regards,
Wim
If you can send me a diff of the changes you made (you can mail it to
me directly), I can take a look at what might be going wrong.
Regards,
Wim