Hey guys,
Hopefully, I can offer a much larger context for all of this. @Konstantinos: You might find the answer to your question in part (II) toward the end.
TL;DR: Much info about Ruby's functional access implementation, then tips for working with and around it.
First, to be clear, technically speaking there is nothing "broken" about functional access support in Ruby. In fact, Ruby's implementation ensures that the simulated system does not enter an invalid state. It calls a fatal() if it is possible to enter an invalid state. Here's why:
Functional accesses can change simulated system state, which means that the simulator needs to make sure that a functional access does not leave the simulated system in an invalid state. An example invalid state might be allowing two shared copies of data, but only functionally writing to one of them, so they end up being different while still being shared. This should not happen in a real system, so we're going to try to avoid situations like this. There are two options for ensuring the system doesn't enter an invalid state:
1) The (VERY) hard way: Track down all versions of the piece of data currently held in the system, and update them as appropriate. This requires looking for the data in caches, MSHRs, interconnect buffers and off-chip memory. One must also be mindful of the current state of the data in all locations, so it also requires inspecting the coherence protocol for each location where state is tracked (i.e. potentially including directories). You can imagine how exceptionally difficult it would be to work out how the data must be updated under multiple concurrent protocol activities, especially if Ruby were to try to support functional accesses in all of the available coherence protocols.
2) The existing way: Track down all versions of the piece of data that are currently held in state-bearing locations (i.e. caches and memory), and only update them if they are all in non-transient states. If the data is in a non-transient state in each of these locations, it is a strong indicator that the data does not exist in MSHRs or buffered in the interconnect (note: It's only a "strong indicator", because this is dependent on the coherence protocol, which could do crazy things). In this case, updating the data can be greatly simplified. However, if the data is in a transient state in some controller, then deciding how to handle the situation is hard. Currently, Ruby triggers a fatal() with the "Ruby functional X failed..." message when it finds itself in this situation. The fatal() call is a correct way of enforcing that the system not enter an invalid state. However, it is frustrating that the execution is unable to continue. You can try to work around this a bit, more below...
Ruby implements option 2. To do so, it tracks an access permission with each state-bearing version of data, and these permissions are annotated on the protocol's state declarations. I recommend trying to get your access permissions correct before working around the remaining functional access problems, because the number of cases you'll have to deal with should be far smaller. The different access permissions (defined in gem5/src/mem/protocol/RubySlicc_Exports.sm) are as follows:
-----------------------------------------------------------------------------------------------------------------------
// AccessPermission
// The following five states define the access permission of all memory blocks.
// These permissions have multiple uses. They coordinate locking and
// synchronization primitives, as well as enable functional accesses.
// One should not need to add any additional permission values and it is very
// risky to do so.
enumeration(AccessPermission, desc="...", default="AccessPermission_NotPresent") {
// Valid data
Read_Only, desc="block is Read Only (modulo functional writes)";
Read_Write, desc="block is Read/Write";
// Possibly Invalid data
// The maybe stale permission indicates that accordingly to the protocol,
// there is no guarantee the block contains valid data. However, functional
// writes should update the block because a dataless PUT request may
// revalidate the block's data.
Maybe_Stale, desc="block can be stale or revalidated by a dataless PUT";
// In Broadcast/Snoop protocols, memory has no idea if it is exclusive owner
// or not of a block, making it hard to make the logic of having only one
// read_write block in the system impossible. This is to allow the memory to
// say, "I have the block" and for the RubyPort logic to know that this is a
// last-resort block if there are no writable copies in the caching hierarchy.
// This is not supposed to be used in directory or token protocols where
// memory/NB has an idea of what is going on in the whole system.
Backing_Store, desc="for memory in Broadcast/Snoop protocols";
// Invalid data
Invalid, desc="block is in an Invalid base state";
NotPresent, desc="block is NotPresent";
Busy, desc="block is in a transient state, currently invalid";
}
-----------------------------------------------------------------------------------------------------------------------
If you want to try to get your protocol to work within Ruby's existing functional access structure, you'll need to set these access permissions correctly on the states of all controllers in a protocol, and this can be tricky. First, Invalid and NotPresent cache lines don't affect functional accesses because the data does not exist in caches in that state. These are associated with invalid states in a protocol, so should be easy to get right. The important (and trickiest) ones are Busy, Read_Only, Read_Write and Maybe_Stale:
Busy access permission indicates that a cache line is in a transient state, and thus, the data block may occupy an MSHR or controller queue, or be in transit to another cache. In this case, it would be very hard to figure out whether a line can and should be updated, so if Ruby finds any copies of the data to be Busy, it currently calls the fatal().
Read_Only indicates that, under normal coherence protocol activity, a data copy cannot be written to by a PUT request to the cache before using the coherence protocol to upgrade the state. Read_Only is generally associated with lines that are in shared or owned states. This is nuanced, but important: PUT requests, being actual coherence protocol requests (i.e. NOT functional), trigger an upgrade request to the appropriate controller (e.g. directory), and this will/may cause coherence traffic to invalidate other copies before actually writing data into the line. Note that this cannot happen functionally, because the state updates must occur before the data is written, and these updates cannot occur instantaneously.
However, assuming that no other copies of the data are Busy, you can still perform both functional reads and writes on Read_Only data without updating the coherence state. The functional read reasoning should be obvious: a data requester has permission to read the data, so it can functionally read it as well. The write side is a little confusing: To a data requester, a functional write appears as though the data was magically changed to the appropriate functional value. As long as all copies of the data are updated in the same way, any data requester will see the functional write as though that was the data to begin with, so coherence state does not need to be changed.
Read_Write indicates that a data copy can be written to by a PUT request without any intervening coherence activity. Read_Write is generally associated with lines in the exclusive or modified states. Given that a requester can read or write, it's clear that (assuming there are no Busy copies of the data) functional reads and writes are allowed to the line.
Maybe_Stale indicates that a line may actually be stale relative to the current version of the data held elsewhere. Maybe_Stale is used when a higher level of the cache hierarchy holds the current version of the data, which may have been modified but has not yet been written back. In this case, (assuming there are no Busy copies of the data) functional writes should still update the line, because it may still be a current and valid version of the data.
Like I said, if you get all your protocol states' access permissions set correctly, this should eliminate a lot of functional access issues.
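To make the rules above concrete, here is a minimal Python sketch of how the permissions could gate a functional access. It is a toy model of the description in this post, not Ruby's actual code: the Copy class and the two functions are hypothetical names, and updating Backing_Store copies on a functional write is an assumption.

```python
from enum import Enum, auto


class AccessPermission(Enum):
    Read_Only = auto()
    Read_Write = auto()
    Maybe_Stale = auto()
    Backing_Store = auto()
    Invalid = auto()
    NotPresent = auto()
    Busy = auto()


P = AccessPermission
# Permissions whose data is known to be valid for a functional read.
VALID = {P.Read_Only, P.Read_Write}
# Permissions whose data is updated by a functional write (Backing_Store is
# an assumption of this sketch).
UPDATE_ON_WRITE = {P.Read_Only, P.Read_Write, P.Maybe_Stale, P.Backing_Store}


class Copy:
    """Hypothetical stand-in for one state-bearing copy of a block."""

    def __init__(self, permission, data=None):
        self.permission = permission
        self.data = data


def functional_write(copies, value):
    """Update every eligible copy; return False if any copy is Busy."""
    if any(c.permission is P.Busy for c in copies):
        return False  # this is where Ruby would call fatal()
    for c in copies:
        if c.permission in UPDATE_ON_WRITE:
            c.data = value
    return True


def functional_read(copies):
    """Return valid data, or None if the read cannot be satisfied."""
    if any(c.permission is P.Busy for c in copies):
        return None  # this is where Ruby would call fatal()
    for c in copies:
        if c.permission in VALID:
            return c.data
    for c in copies:
        if c.permission is P.Backing_Store:
            return c.data  # last resort in broadcast/snoop protocols
    return None
```

Note that a functional write to two Read_Only copies and one Maybe_Stale copy updates all three without touching any coherence state, while a single Busy copy anywhere blocks the whole access.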
Alright, so how do we work around the Ruby functional access fatal issue?
Unfortunately, while often improbable, transient states are bound to happen when the simulator wants to do a functional access, resulting in the fatal(). There are a couple ways to work around the issue:
I) Case-by-case: If the functional access failure is resulting from "magic" simulator behavior, there are a couple ways to try to avoid the fatal():
For these, I'll describe using the following example: in our gem5-gpu CUDA library, a benchmark running in the simulated system will read its full binary to ensure that the operating system loads the whole binary into memory. This happens in __cudaRegisterFatBinary2() in benchmarks/libcuda/cuda_runtime_api.cc. We do this in order to ensure that the simulator can get at the PTX code within the binary: it functionally reads the appropriate portion of the binary from simulated system memory, which is only guaranteed to be mapped because of the reads in the CUDA library. It is still possible (though unlikely) that the binary mapped into memory will be in a transient state between caches when the functional read is performed. This would result in the fatal().
A) One way to avoid the fatal() is to decrease its likelihood by perturbing the system just before the functional access: By touching the binary in __cudaRegisterFatBinary2(), it is very likely that lines to be functionally read will either be in a shared coherence protocol state in a cache, or they will have been evicted from cache and only be available in the backing store. This is why it is VERY rare for __cudaRegisterFatBinary2() to fail with a "Ruby functional read failed" fatal(). In cases like this, reading the data from within the simulated system is likely to put the data into a shared or evicted (i.e. non-transient) state before the functional read.
B) Another way to avoid the situation is to eliminate the functional accesses either by implementing the desired "magic" behavior in a real way, or by side-stepping it: Instead of using the simulator to magically pull the PTX code from the simulated system, we could reimplement our GPGPU-Sim initialization by giving it access to the benchmark binary outside of the simulated system. For example, in SE mode, the benchmark running in the system is often passed to the simulator on the command line and points to the benchmark on the host system. We could just have GPGPU-Sim grab the PTX from that binary (note: we opted against this, because this will not work with gem5's full-system mode, in which the binary lives on a disk image rather than in the host's file system).
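The "touch the data first" idea from option (A) boils down to reading at least one byte from every cache line of the region before the functional access happens. A minimal sketch of that access pattern follows. The real library code is C++ and may well differ; touch_every_line and the 64-byte line size are assumptions made only for illustration.

```python
def touch_every_line(buf, line_bytes=64):
    """Read one byte from each cache-line-sized chunk of buf.

    Returns an XOR of the bytes read so the reads have a visible result.
    """
    checksum = 0
    for offset in range(0, len(buf), line_bytes):
        checksum ^= buf[offset]
    return checksum
```

When this kind of loop runs inside the simulated system, each read is a normal timing access, so by the time the loop finishes the lines are likely to sit in a shared or evicted (non-transient) state.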
II) Special broad case: If all RubySequencers/RubyPorts in the simulated system access the backing store for their accesses (i.e. access_phys_mem is set to True), then the backing store always holds the current version of the data and can always be functionally accessed (we *think*). This is currently the case for VI_hammer in gem5-gpu. Namely, in the files gem5-gpu/configs/VI_hammer.py, gem5-gpu/configs/VI_hammer_fusion.py, and gem5-gpu/configs/VI_hammer_split.py, every RubySequencer is instantiated with the access_phys_mem parameter set to True. Therefore, all memory accesses that go through these sequencers will, at the very end of the memory access, get or set the appropriate data from the (single) backing store (see gem5/src/mem/ruby/system/RubyPort.cc RubyPort::MemSlavePort::recvFunctional() and RubyPort::MemSlavePort::hitCallback()). In other words, the backing store always serves as the functional current version of the data, and all accesses functionally handle data in the hitCallback() at the end of the memory access. The Ruby controllers of the protocol may still handle data, but in reality, they are only performing the timing portion of the memory access.
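A toy model may help show why the backing store is always functionally current in this setup. The sketch below is not gem5 code: ToyBackingStore, ToySequencer and their methods are hypothetical names, and the protocol's timing work is reduced to a queue of in-flight requests.

```python
class ToyBackingStore:
    """Hypothetical single flat memory shared by every sequencer."""

    def __init__(self):
        self.data = {}

    def read(self, addr):
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value


class ToySequencer:
    """Hypothetical sequencer behaving as if access_phys_mem were True."""

    def __init__(self, store):
        self.store = store
        self.in_flight = []  # requests still being timed by the protocol

    def issue_write(self, addr, value):
        # The timing request enters the protocol; no data moves yet.
        self.in_flight.append((addr, value))

    def hit_callback(self):
        # End of the timing access: only now is the data applied.
        addr, value = self.in_flight.pop(0)
        self.store.write(addr, value)


def functional_read(store, addr):
    # Functional accesses go straight to the backing store.
    return store.read(addr)
```

While a write is still in flight (the line would be in a transient state in the protocol), a functional read still returns a consistent value from the backing store, and it sees the new value as soon as the hit callback has run.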
Due to this structure, we *think* it is valid for Ruby to skip over the check that results in the fatal(). I recently added a patch to our gem5-patches queue that facilitates this skip. We're not sure if this is always a correct way to handle functional accesses, so the patch is guarded by the Mercurial guard 'func_access'. You can apply the patch by first qselecting this guard:
% hg qselect func_access
number of unguarded, unapplied patches has changed from 19 to 20
% hg qpush -a
(working directory not at a head)
applying common/flush_response
[...]
applying joel/hack_ruby_func_access
applying personal_patches_delimiter
patch personal_patches_delimiter is empty
now at: personal_patches_delimiter
I've also copied the patch here in case it would be cumbersome to update to our latest gem5-patches:
-----------------------------------------------------------------------------------------------------------------------
diff --git a/src/mem/ruby/system/RubyPort.cc b/src/mem/ruby/system/RubyPort.cc
--- a/src/mem/ruby/system/RubyPort.cc
+++ b/src/mem/ruby/system/RubyPort.cc
@@ -304,7 +304,7 @@
// Unless the requester explicitly said otherwise, generate an error if
// the functional request failed
- if (!accessSucceeded && !pkt->suppressFuncError()) {
+ if (!accessSucceeded && !pkt->suppressFuncError() && !access_phys_mem) {
fatal("Ruby functional %s failed for address %#x\n",
pkt->isWrite() ? "write" : "read", pkt->getAddr());
}
-----------------------------------------------------------------------------------------------------------------------
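For reference, the patched condition can be read as the following predicate. This is a Python restatement of the C++ check in the diff above, written only to spell out the cases; should_fatal is a hypothetical name.

```python
def should_fatal(access_succeeded, suppress_func_error, access_phys_mem):
    """Mirror of the patched check: fatal only if nothing excuses the failure."""
    return (not access_succeeded
            and not suppress_func_error
            and not access_phys_mem)
```

With the patch applied, a failed functional access through a port that has access_phys_mem set no longer triggers the fatal(); the other cases behave as before.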
Hopefully this is complete enough to help you figure out your situation. I would appreciate any feedback on whether this description is adequate or helpful, since this functionality will likely be confusing for others as well.
Joel