hello, about "Ruby functional read failed for address xxxxx" arising from Rubyport


argonZ argonZ

Sep 26, 2013, 3:22:05 AM
to gem5-g...@googlegroups.com
Hello guys,
We have encountered a "Ruby functional read failed for address xxxxx"
issue arising from RubyPort when we run the Rodinia benchmarks on
gem5-gpu in FS mode with the VI_hammer protocol. We notice that there is
a similar question on this, titled "Errors running rodinia benchmarks in
FS mode"; however, it seems the issue was not fixed. We wonder if there
is a solution for it.

Thank you very much!

Best,
Yuang

Jason Power

Sep 26, 2013, 11:16:49 AM
to argonZ argonZ, gem5-g...@googlegroups.com
There is not a simple solution to fix this. Unfortunately it is a limitation of Ruby (in gem5) right now. 

Are you experiencing the problem with all of the Rodinia benchmarks? We can run most, if not all, of the benchmarks in FS mode without problems. 

Jason

Anilkumar L R

May 1, 2014, 11:03:20 PM
to gem5-g...@googlegroups.com, argonZ argonZ
Hello Jason,

The same issue is happening with us. Is there any workaround for this?

Regards,
Anil

Marc Orr

May 2, 2014, 9:34:37 AM
to Anilkumar L R, gem5-g...@googlegroups.com, argonZ argonZ
If gem5-gpu still uses the ruby backing store (as I believe it does), you can comment out the bodies of functionalRead/functionalWrite (in src/mem/ruby/system/System.cc last time I looked) such that they always return true. The value they return gets overwritten with a value from the backing store anyways.
Thanks,
Marc

Konstantinos Koukos

Oct 17, 2014, 4:19:21 PM
to gem5-g...@googlegroups.com
Hello,

I am also facing some problems with functionalRead.

I am assuming release consistency in my study, so it is allowed to have READ_WRITE permissions for some of
the stable states (e.g., the Shared state in a write-through protocol). The block can be accessed by many, but it will
be potentially modified (the application guarantees that) by only one. Anyone should be able to access or modify
it. The problem occurs when more than one cache has the same block:

gem5.opt: build/VIPS/mem/ruby/system/System.cc:460: bool RubySystem::functionalRead(Packet*): Assertion `num_rw <= 1' failed.

I am using the following topology in Ruby:
Separate L1/L2 caches for the CPU and separate L1/L2 caches for the GPU. All caches have READ_WRITE permissions for the stable states.
Unified memory with backing_store for the valid blocks. The block probably belongs to more than one cache (which is valid for my protocol)
and triggers the assertion.

Any ideas on how to bypass this exclusivity check?

Best Regards,
Konstantinos.

Jason Power

Oct 17, 2014, 5:17:11 PM
to Konstantinos Koukos, gem5-g...@googlegroups.com
Hi Konstantinos,

You could try commenting out the assertion and see if your application works then. The failed functional access is probably getting the right data from the backing store anyway. Overall the functional access support in Ruby is pretty broken.

Also, if there are two caches with RW permission, how would you know which one has the "correct" data? I would assume that the data from either is fine, so again, just commenting out the assertion and using the data from, say, the first cache with RW permission should work.

Jason

Konstantinos Koukos

Oct 17, 2014, 5:41:42 PM
to gem5-g...@googlegroups.com, koukos.ko...@gmail.com
Hi Jason, thanks for the answer.

It seems that it is indeed pretty hard to support release consistency. Yes, I tried that already and it breaks. I tried a different approach
instead: I set the conflicting states to read_only. It seems that the simulation continues with this change... Nothing forbids having
multiple valid read copies. How does this affect the stores/writes? Does this just dump all the data directly to the memory (directory)?

About release consistency: we assume data-race-free (DRF) applications. If the application has a write-after-write from different
cores, or a read-after-write without synchronization, the execution will be incorrect, but the protocol (simulator) shouldn't crash. In this
case we just throw some warnings (for us to fix the application) but should continue the execution.

Thanks once again,
K.

Konstantinos Koukos

Oct 17, 2014, 5:57:55 PM
to gem5-g...@googlegroups.com, koukos.ko...@gmail.com
Sorry the question was a bit confusing.

Assume the following example, where I have all the GPU L1 caches in a stable state S (Shared). This means that
they may all share the same cache line for reads. What I also want is for one of them to be able to write the block. Since
there is a single state which should allow writes, I initially set it to read_write (someone, but only one, can write / no owner).

The problem with the functional reads forces me to switch to read_only, and the question is:
how does this affect the stores that come from the sequencer for this module (L1 cache machine)? Is there any difference
for the protocol states, etc.? If it just affects where the data are written, and uses the backing_store to send the data directly to
memory, then it is just what I want.

Best Regards,
Konstantinos.

Joel Hestness

Oct 18, 2014, 1:32:44 PM
to Konstantinos Koukos, gem5-gpu developers
Hey guys,
  Hopefully, I can offer a much larger context for all of this. @Konstantinos: You might find the answer to your question in part (II) toward the end.


  TL;DR: Much info about Ruby's functional access implementation, then tips for working with and around them.


  First, to be clear, technically speaking there is nothing "broken" about functional access support in Ruby. In fact, Ruby's implementation ensures that the simulated system does not enter an invalid state. It calls a fatal() if it is possible to enter an invalid state. Here's why:
  Functional accesses can change simulated system state, which means that the simulator needs to make sure that a functional access does not leave the simulated system in an invalid state. An example invalid state might be allowing two shared copies of data, but only functionally writing to one of them, so they end up being different while still being shared. This should not happen in a real system, so we're going to try to avoid situations like this. There are two options for ensuring the system doesn't enter an invalid state:
  1) The (VERY) hard way: Track down all versions of the piece of data currently held in the system, and update them as appropriate. This requires looking for the data in caches, MSHRs, buffers in interconnects, and off-chip memory. One must also be mindful of the current state of the data in all locations, so it also requires inspecting the coherence protocol for each location where state is tracked (i.e. potentially including directories). You might imagine how exceptionally difficult it could be to work out how the data must be updated under multiple concurrent protocol activities, especially if Ruby were to try to support functional accesses in all the different coherence protocols that are available.
  2) The existing way: Track down all versions of the piece of data that are currently held in state-bearing locations (i.e. caches and memory), and only update if they are all in non-transient states. If the data is in a non-transient state in each of these locations, it is a strong indicator that the data does not exist in MSHRs or buffered in the interconnect (note: It's only a "strong indicator", because this is dependent on the coherence protocol, which could do crazy things). In this case, updating the data can be greatly simplified. However, if the data is in a transient state in some controller, then deciding how to handle the situation is hard. Currently, Ruby triggers a fatal() with the "Ruby functional X failed..." when it finds itself in this situation. The fatal() call is a correct way of enforcing that the system not enter an invalid state. However, it is frustrating that the execution is unable to continue. You can try to work around this a bit, more below...

  Ruby implements option 2. To do so, it tracks access permissions with each state-bearing version of data, and these permissions are annotated on the protocol state declarations. I recommend trying to get your access permissions correct before working around the remaining functional access problems, because the number of cases you'll have to deal with should be far smaller. The different access permissions (defined in gem5/src/mem/protocol/RubySlicc_Exports.sm) are as follows:

-----------------------------------------------------------------------------------------------------------------------
// AccessPermission
// The following five states define the access permission of all memory blocks.
// These permissions have multiple uses.  They coordinate locking and 
// synchronization primitives, as well as enable functional accesses.
// One should not need to add any additional permission values and it is very
// risky to do so.
enumeration(AccessPermission, desc="...", default="AccessPermission_NotPresent") {
  // Valid data
  Read_Only,  desc="block is Read Only (modulo functional writes)";
  Read_Write, desc="block is Read/Write";

  // Possibly Invalid data
  // The maybe stale permission indicates that accordingly to the protocol, 
  // there is no guarantee the block contains valid data.  However, functional
  // writes should update the block because a dataless PUT request may
  // revalidate the block's data.
  Maybe_Stale, desc="block can be stale or revalidated by a dataless PUT";
  // In Broadcast/Snoop protocols, memory has no idea if it is exclusive owner
  // or not of a block, making it hard to make the logic of having only one
  // read_write block in the system impossible. This is to allow the memory to
  // say, "I have the block" and for the RubyPort logic to know that this is a
  // last-resort block if there are no writable copies in the caching hierarchy.
  // This is not supposed to be used in directory or token protocols where
  // memory/NB has an idea of what is going on in the whole system.
  Backing_Store, desc="for memory in Broadcast/Snoop protocols";

  // Invalid data
  Invalid,    desc="block is in an Invalid base state";
  NotPresent, desc="block is NotPresent";
  Busy,       desc="block is in a transient state, currently invalid";
}
-----------------------------------------------------------------------------------------------------------------------

  If you want to try to get your protocol to work within Ruby's existing functional access structure, you'll need to set these access permissions correctly on the states of all controllers in a protocol, and this can be tricky. First, Invalid and NotPresent cache lines don't affect functional accesses, because the data does not exist in caches in those states. These are associated with invalid states in a protocol, so they should be easy to get right. The important (and trickiest) ones are Busy, Read_Only, Read_Write and Maybe_Stale:

  Busy access permission indicates that a cache line is in a transient state, and thus, the data block may occupy an MSHR or controller queue, or be in transit to another cache. In this case, it would be very hard to figure out whether a line can and should be updated, so if Ruby finds any copies of the data to be Busy, it currently calls the fatal().

  Read_Only indicates that, under normal coherence protocol activity, a data copy cannot be written to by a PUT request to the cache before using the coherence protocol to upgrade the state. Read_Only is generally associated with lines that are in shared or owned states. This is nuanced, but important: PUT requests, being actual coherence protocol requests (i.e. NOT functional), trigger an upgrade request to the appropriate controller (e.g. directory), and this will/may cause coherence traffic to invalidate other copies before actually writing data into the line. Note that this cannot happen functionally, because the data's state updates must occur before the data, and these updates cannot occur instantaneously.
  However, assuming that no other copies of the data are Busy, you can still perform both functional reads and writes on Read_Only data without updating the coherence state. The functional read reasoning should be obvious: a data requester has permission to read the data, so it can functionally read it as well. The write side is a little confusing: To a data requester, a functional write appears as though the data was magically changed to the appropriate functional value. As long as all copies of the data are updated in the same way, any data requester will see the functional write as though that was the data to begin with, so coherence state does not need to be changed.

  Read_Write indicates that a data copy can be written to by a PUT request without any intervening coherence activity. Read_Write is generally associated with lines in the exclusive or modified states. Given that a requester can read or write, it's clear that (assuming there are no Busy copies of the data) functional reads and writes are allowed to the line.

  Maybe_Stale indicates that a line may actually be stale from the current version of the data held elsewhere. Maybe_Stale is used when a higher level of the cache hierarchy holds the current version of the data, which may have been modified but has not yet been written back. In this case, (assuming there are no Busy copies of the data) functional writes should still update the line, because it may still be a current and valid version of the data.

  Like I said, if you get all your protocol states' access permissions set correctly, this should eliminate a lot of functional access issues.


  Alright, so how do we work around the Ruby functional access fatal issue?

  Unfortunately, while often improbable, transient states are bound to happen when the simulator wants to do a functional access, resulting in the fatal(). There are a couple ways to work around the issue:

  I) Case-by-case: If the functional access failure is resulting from "magic" simulator behavior, there are a couple ways to try to avoid the fatal():
    For these, I'll describe using the following example: in our gem5-gpu CUDA library, a benchmark running in the simulated system will read its full binary to ensure that the operating system loads the whole binary into memory. This happens in __cudaRegisterFatBinary2() in benchmarks/libcuda/cuda_runtime_api.cc. We do this in order to ensure that the simulator can get at the PTX code within the binary: it functionally reads the appropriate portion of the binary from simulated system memory, which is only guaranteed to be mapped because of the reads in the CUDA library. It is still possible (though unlikely) that the binary mapped into memory will be in a transient state between caches when the functional read is performed. This would result in the fatal().
    A) One way to avoid the fatal() is to decrease its likelihood by perturbing the system just before the functional access: By touching the binary in __cudaRegisterFatBinary2(), it is very likely that lines to be functionally read will either be in a shared coherence protocol state in a cache, or they will have been evicted from cache and only be available in the backing store. This is why it is VERY rare for __cudaRegisterFatBinary2() to fail with a "Ruby functional read failed" fatal(). In cases like this, reading the data from within the simulated system is likely to put the data into a shared or evicted (i.e. non-transient) state before the functional read.
    B) Another way to avoid the situation is to eliminate the functional accesses either by implementing the desired "magic" behavior in a real way, or by side-stepping it: Instead of using the simulator to magically pull the PTX code from the simulated system, we could reimplement our GPGPU-Sim initialization by giving it access to the benchmark binary outside of the simulated system. For example, in SE mode, the benchmark running in the system is often passed to the simulator on the command line and points to the benchmark on the host system. We could just have GPGPU-Sim grab the PTX from that binary (note: we opted against this, because this will not work with gem5's full-system mode, in which the binary lives on a disk image rather than in the host's file system).

  II) Special broad case: If all RubySequencers/RubyPorts in the simulated system access the backing store for their accesses (i.e. access_phys_mem is set to True), then the backing store always holds the current version of the data and can always be functionally accessed (we *think*). This is currently the case for VI_hammer in gem5-gpu. Namely, in the files gem5-gpu/configs/VI_hammer.py, gem5-gpu/configs/VI_hammer_fusion.py, and gem5-gpu/configs/VI_hammer_split.py, every RubySequencer is instantiated with the access_phys_mem parameter set to True. Therefore, all memory accesses that go through these sequencers will, at the very end of the memory access, get or set the appropriate data from the (single) backing store (see gem5/src/mem/ruby/system/RubyPort.cc, RubyPort::MemSlavePort::recvFunctional() and RubyPort::MemSlavePort::hitCallback()). In other words, the backing store always serves as the functional current version of the data, and all accesses functionally handle data in the hitCallback() at the end of the memory access. The Ruby controllers of the protocol may still handle data, but in reality they are only performing the timing portion of the memory access.
  Due to this structure, we *think* it is valid for Ruby to skip over the check that results in the fatal(). I recently added a patch to our gem5-patches queue that facilitates this skip. We're not sure if this is always a correct way to handle functional accesses, so the patch is guarded by the Mercurial guard 'func_access'. You can apply the patch by first qselecting this guard:

    % hg qselect func_access
    number of unguarded, unapplied patches has changed from 19 to 20
    % hg qpush -a
    (working directory not at a head)
    applying common/flush_response
    [...]
    applying joel/hack_ruby_func_access
    applying personal_patches_delimiter
    patch personal_patches_delimiter is empty
    now at: personal_patches_delimiter


  I've also copied the patch here in case it would be cumbersome to update to our latest gem5-patches:

-----------------------------------------------------------------------------------------------------------------------
diff --git a/src/mem/ruby/system/RubyPort.cc b/src/mem/ruby/system/RubyPort.cc
--- a/src/mem/ruby/system/RubyPort.cc
+++ b/src/mem/ruby/system/RubyPort.cc
@@ -304,7 +304,7 @@
 
     // Unless the requester explicitly said otherwise, generate an error if
     // the functional request failed
-    if (!accessSucceeded && !pkt->suppressFuncError()) {
+    if (!accessSucceeded && !pkt->suppressFuncError() && !access_phys_mem) {
         fatal("Ruby functional %s failed for address %#x\n",
               pkt->isWrite() ? "write" : "read", pkt->getAddr());
     }
-----------------------------------------------------------------------------------------------------------------------


  Hopefully this is complete enough to help you figure out your situation. I would appreciate any feedback on whether this description is adequate or helpful, since this functionality will likely be confusing for others as well.

  Joel  


--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/

Konstantinos Koukos

Oct 19, 2014, 7:59:13 AM
to gem5-g...@googlegroups.com, koukos.ko...@gmail.com
Thanks a lot, Joel, for explaining how functional reads/writes work in such detail. This is an excellent tutorial on how
to set protocol access permissions, and I am sure it will help many users set them properly.
I tested the patch and it works fine. I already had access_phys_mem = True in all sequencers in my configuration files.
The problem was with setting read_write permissions on blocks that could not be (by protocol definition) exclusive. Now I
understand how to make things right.
Thanks once again,

Best Regards,
Konstantinos.

"Stamatis Kavvadias (Σταμάτης Καββαδίας)"

Dec 1, 2014, 4:14:48 PM
to gem5-g...@googlegroups.com
Hello all,

    I have the following problem. Trying to provide support for overlapped CE/GPU operations and multiple streams, I find that, if the CPU is not blocked, I immediately run into a "functional read failed" (I am running rodinia backprop, which is a very simple benchmark, running only two kernels on the GPU). Tracing the situation reveals that the IOMMU utilizes functional reads (*and* writes, depending on PTE flags). This happens on the very first access of the CE, which triggers an anticipated page table walk, but seems to find a PTE with inappropriate access permissions (according to your explanation in previous emails on the "functional read failed" issue). This is happening when I remove the CPU blocking implemented in the cudaMemcpy system call. I do this both because I will need to utilize cudaMemcpyAsync (which I implemented and which should not block) in order to use specific streams, and also because the CUDA documentation specifies that host <--> device transfers do not block for transfers less than 64KB.

    Now, my understanding of the situation is that the CPU cache is probably accessing the same PTE (maybe just evicting it). But the real problem is that the MMUs (both of the CPU and of the GPU) may write the PTE, and *the OS is unaware of an IOMMU*. So, one question is: how does the OS handle the case when two threads of the same process run on two CPUs, and does this imply some particular setup for the CPU MMUs, in case you know? Also, I understand that the current IOMMU functionality does not support the CPU not blocking, in the general case. Thus, your assumption is not just that the process issuing an operation to the GPU stays on the CPU, but that it also remains blocked until the operation completes! This is very restricting for what I want to do, and I might have to abandon full-system gem5-gpu altogether if this is the case.

    Another thing is that this situation reveals that your expectation, that access_phys_mem=true might be enough to allow all functional accesses, is not correct. That is because functional accesses *are not atomic*, but go through ports and retries; if there is a data race, an intervening functional access may get a transient value (for functional reads), or cause an invalid state or even an error. In other words, because functional accesses are to simulated data (usually OS data), when the simulated devices do not know what the software is doing, intervention can lead to an invalid state. This can be the case for the IOMMU doing functional reads: it may read the physical address of a page frame that the OS is about to swap out (a transient value for a functional read), or it may update a PTE while another MMU is also updating it.

    But it seems you have implemented bypassing the L1 caches for VI_hammer, which does not make sense if page table walks utilize functional accesses in the end, and not even the coherent L2 caches (which would not be enough to solve the problem, I believe). Can you elaborate on this?

    Any feedback on this issue will be greatly appreciated! IMO the most promising approach is to exploit the case of SMP kernels and multiple CPUs, in which case the MMUs should be triggering interrupts(?) when there might be a race, leaving PTE updates to the OS.

Thanks,

Stamatis
-- 


Stamatis Kavvadias, PhD

Research Associate
TEI Crete, Greece

"Stamatis Kavvadias (Σταμάτης Καββαδίας)"

Dec 2, 2014, 10:53:09 AM
to gem5-g...@googlegroups.com
Hello again,

    I should probably apologize... It is not the IOMMU doing the functional access. My trace shows initiation of a timing translation from system.gpu.shader_mmu, and then there is a bunch of functional accesses from system.cpu.dtb.walker!

    I think it must be the GPUSyscallHelper used in the implementation of the cudaMemcpy system call
(still verifying this). I am still not clear whether this is OK to allow, since GPUSyscallHelper is also doing writes, and vtophys() (initiating the functional accesses) may be updating PTEs (e.g., as modified) at the same time that the ShaderMMU may be doing the same thing... I'll check that after I am sure it is GPUSyscallHelper.

    Is there some way in place to get the name of the module initiating a functional access?
I mean beyond the MachineID from the request; a name like that printed by DPRINTF.

Anyway, this mail was just to tell you I was wrong. I have to debug it now...

Stamatis


"Stamatis Kavvadias (Σταμάτης Καββαδίας)"

Dec 3, 2014, 9:16:39 AM
to gem5-g...@googlegroups.com
Hello again,

    I believe I have worked around the problem (tested only backprop at the moment). Instead of 'new',
I used 'posix_memalign' to allocate call_params.arg_lengths, call_params.args, and call_params.ret
in the CUDA system call invocation inside libcuda/cuda_runtime_api.cc. The different allocator does not
provide addresses in the same page as malloc has used to allocate source data regions in the benchmark.
I could also require the addresses of the call_params fields to be page-aligned to be sure... Anyway, it
worked for now.

Regards,

Stamatis

