
AMD Cache speed funny


Vir Campestris

Jan 30, 2024, 11:36:21 AM
I've knocked up a little utility program to try to work out some
performance figures for my CPU.

It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
4MB L3 cache
2MB L2 cache
384kb L1 cache

What I do is XOR a location in an array in memory many times over.
The size of the area I XOR over is set by a mask on the store index.
The words in the store are 64-bit.

A C++ fragment is this. I can post the whole thing if it would help.

// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;

Stopwatch s;
s.start();
while (1) // until break when mask runs out
{
    for (size_t index = 0; index < storeWordCount; ++index)
    {
        // read and write a word in store.
        Raw[index & mask] ^= index;
    }
    s.lap(mask); // records the current time

    if (mask == 0) break; // Stop if we've run out of mask

    mask >>= 1; // shrink the mask
}

As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.
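
In case it helps, here is a minimal self-contained sketch of the same idea
- not the actual program; Word, the std::vector store and the chrono timing
below are stand-ins for my Raw array and Stopwatch:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    using Word = std::uint64_t;
    const std::size_t storeWordCount = std::size_t(1) << 27; // 1 GB of 64-bit words
    std::vector<Word> raw(storeWordCount, 0);

    for (Word mask = storeWordCount - 1; ; mask >>= 1)
    {
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t index = 0; index < storeWordCount; ++index)
            raw[index & mask] ^= index;                  // the read-modify-write under test
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("mask %#18llx  %6.1f GB/s\n",
                    (unsigned long long)mask,
                    storeWordCount * sizeof(Word) / secs / 1e9);

        if (mask == 0) break;                            // stop after the single-word case
    }
    std::printf("checksum %llu\n", (unsigned long long)raw[0]); // keep the stores observable
}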

All looks fine at first. I get about 8GB per second with a large mask,
at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
gets smaller. No apparent effect when it gets under the L1 cache size.

But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the speed I get with a large block.

Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the L1
data cache.

A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this odd
slow down with small masks.
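
For reference, the read-only and write-only variants look roughly like this
(same Raw/mask/storeWordCount names as the fragment above; sink is just
there to keep the loads alive, and the exact code I used may differ):

Word sink = 0;   // stops the compiler discarding the loads
for (size_t index = 0; index < storeWordCount; ++index)
    sink ^= Raw[index & mask];      // read-only: no store feeding a later load

for (size_t index = 0; index < storeWordCount; ++index)
    Raw[index & mask] = index;      // write-only: no load at all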

What am I missing?

Thanks
Andy

Anton Ertl

Jan 30, 2024, 12:37:06 PM
Vir Campestris <vir.cam...@invalid.invalid> writes:
> for (size_t index = 0; index < storeWordCount; ++index)
> {
> // read and write a word in store.
> Raw[index & mask] ^= index;
> }
...
>When the mask is very small (3) it slows to 18GB/s. With 1 it halves
>again, and with zero (so it only operates on the same word over and
>over) it's half again. A fifth of the speed I get with a large block.
>
>Something odd is happening here when I hammer the same location (32
>bytes and on down) so that it's slower. Yet this ought to be in the L1
>data cache.
>
>A late thought was to replace that ^= index with something that reads
>the memory only, or that writes it only, instead of doing a
>read-modify-write cycle. That gives me much faster performance with
>writes than reads. And neither read only, nor write only, show this odd
>slow down with small masks.
>
>What am I missing?

When you do

raw[0] ^= index;

in every step you read the result of the previous iteration, xor it,
and store it again. This means that you have one chain of RMW data
dependences, with one RMW per iteration. On the Zen2 (which your
3400G has), this requires 8 cycles (see column H of
<http://www.complang.tuwien.ac.at/anton/memdep/>). With mask=1, you
get 2 chains, each with one 8-cycle RMW every second iteration, so you
need 4 cycles per iteration (see my column C). With mask=3, you get 4
chains and 2 cycles per iteration. Looking at my results, I would
expect another doubling with mask=7, but maybe your loop is running
into resource limits at that point (mine does 4 RMWs per iteration).
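
A rough sanity check, using the per-iteration cycle counts above and an
assumed ~4.2 GHz clock (the clock is an assumption, not a measurement):

#include <cstdio>

int main()
{
    const double ghz = 4.2;                         // assumed clock
    const struct { int mask; double cycles; } rows[] = {
        { 0, 8.0 },                                 // one serial RMW chain
        { 1, 4.0 },                                 // two chains
        { 3, 2.0 },                                 // four chains
    };
    for (auto r : rows)                             // 8 bytes per iteration
        std::printf("mask=%d: ~%.1f GB/s\n", r.mask, ghz * 8.0 / r.cycles);
    // prints roughly 4.2, 8.4 and 16.8 GB/s - close to the measured
    // "half again" steps and the ~18 GB/s at mask=3.
}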

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Michael S

Jan 30, 2024, 12:38:20 PM
On Tue, 30 Jan 2024 16:36:17 +0000
Vir Campestris <vir.cam...@invalid.invalid> wrote:

> I've knocked up a little utility program to try to work out some
> performance figures for my CPU.
>
> It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
> 4MB L3 cache
> 2MB L2 cache
> 384kb L1 cache
>

That's for the whole chip and it includes the L1I caches.
For an individual core, and excluding L1I, the numbers are:
4 MB L3 cache
512 KB L2 cache
32 KB L1D cache

First, I'd look at the generated asm.
If the compiler is doing a good job then at mask <= 4095 (32 KB) you should
see slightly less than 1 iteration of the loop per cycle, i.e. assuming a
4.2 GHz clock, approximately 30 GB/s.
Since you see less, it's a sign that the compiler did a less than perfect
job. Try to help it with manual loop unrolling.
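
A sketch of the kind of unrolling I mean (same Raw/mask/storeWordCount
names as the original fragment; the exact unroll factor is up to you):

// One possible 4-way unroll of the inner loop: the independent RMWs give
// the core several dependence chains to overlap.
for (size_t index = 0; index < storeWordCount; index += 4)
{
    Raw[(index    ) & mask] ^= index;
    Raw[(index + 1) & mask] ^= index + 1;
    Raw[(index + 2) & mask] ^= index + 2;
    Raw[(index + 3) & mask] ^= index + 3;
}
// assumes storeWordCount is a multiple of 4, which holds for power-of-two sizes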

As to the problem with lower performance at very small masks, it's
expected. The CPU tries to execute loads speculatively out of order under
the assumption that they don't alias with preceding stores. So the actual
loads run a few loop iterations ahead of the stores. We can't say for sure
how many iterations ahead, but 7 to 10 iterations sounds like a good guess.
When your mask=7 (32 bytes) then aliasing starts to happen. On old,
primitive CPUs like the Pentium 4 it causes a massive slowdown, because
those early loads have to be replayed after a rather significant delay
of about 20 cycles (the length of the pipeline). Your Zen1+ CPU is much
smarter; it detects that things are no good and stops wild speculation. So
you don't see a huge slowdown. But without speculation every load starts
only after all stores that preceded it in program order were either
committed into the L1D cache or their address was checked against the
speculative load address and no aliasing was found. Since you see only a
mild slowdown, it seems that the latter is done rather effectively and
your CPU is still able to run loads speculatively, but now only 2 or 3
steps ahead, which is not enough to get the same performance as before.

MitchAlsup1

Jan 30, 2024, 3:15:48 PM
Vir Campestris wrote:

> As you can see it starts with a large mask (in fact for a whole GB) and
> halves it as it goes around.

> All looks fine at first. I get about 8GB per second with a large mask,
> at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
> gets smaller. No apparent effect when it gets under the L1 cache size.

The execution window is apparently able to absorb the latency of L3 miss,
and stream L3->L1 accesses.

Anton answered the question regarding small masks.

Michael S

Jan 30, 2024, 3:37:12 PM
On Tue, 30 Jan 2024 20:11:42 +0000
mitch...@aol.com (MitchAlsup1) wrote:

> Vir Campestris wrote:
>
> > As you can see it starts with a large mask (in fact for a whole GB)
> > and halves it as it goes around.
>
> > All looks fine at first. I get about 8GB per second with a large
> > mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
> > as the mask gets smaller. No apparent effect when it gets under the
> > L1 cache size.
>
> The execution window is apparently able to absorb the latency of L3
> miss, and stream L3->L1 accesses.
>

That sounds unlikely. L3 latency is too big to be covered by the execution
window. Much more likely they have adequate HW prefetch from L3 to L2
and maybe (less likely) even to L1D.

Terje Mathisen

Jan 31, 2024, 1:59:46 AM
Vir Campestris wrote:
> I've knocked up a little utility program to try to work out some
> performance figures for my CPU.
>
> It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
Mitch, Anton and Michael have already answered, I just want to add that
we have one additional potential factor:

Rowhammer protection:

It is possible that the pattern of re-XORing the same or a small number
of locations over and over could trigger a pattern detector which was
designed to mitigate against Rowhammer.

OTOH, this would much more easily be handled with memory range based
coalescing of write operations in the last level cache, right?

I.e. for normal (write combining) memory, it would (afaik) be legal to
delay the actual writes to RAM for a significant time, long enough to
merge multiple memory writes.

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl

Jan 31, 2024, 3:36:22 AM
Terje Mathisen <terje.m...@tmsw.no> writes:
>Rowhammer protection:
>
>It is possible that the pattern of re-XORing the same or a small number
>of locations over and over could trigger a pattern detector which was
>designed to mitigate against Rowhammer.

I don't think that memory controller designers have actually
implemented Rowhammer protection: I would expect that the processor
manufacturers would have bragged about that if they had. They have
not. And even RAM manufacturers have stopped mentioning anything
about Rowhammer in their specs. It seems that all hardware
manufacturers have decided that Rowhammer is something that will just
disappear from public knowledge (and therefore from what they have to
deal with) if they just ignore it long enough. It appears that they
are right.

They seem to take the same approach wrt Spectre-family attacks. In
that case, however, new variants appear all the time, so maybe the
approach won't work here.

However, in the present case "the same small number of locations" is
not hammered, because a small number of memory locations fits into the
cache in the adjacent access pattern that this test uses, and all
writes will just be to the cache.

>OTOH, this would much more easily be handled with memory range based
>coalescing of write operations in the last level cache, right?

We have had write-back caches (at the L2 or L1 level, and certainly at
the LLC level) since the later 486 years.

>I.e. for normal (write combining) memory

Normal memory is write-back. AFAIK write combining is for stuff like
graphics card memory.

>it would (afaik) be legal to
>delay the actual writes to RAM for a significant time, long enough to
>merge multiple memory writes.

And this is what actually happens, through the magic of write-back
caches.

Michael S

Jan 31, 2024, 6:13:59 AM
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
on CLFLUSH. That's what the manual says about it:
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line.1 They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."

By now, it seems obvious that making the CLFLUSH instruction non-privileged
and pretty much non-restricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH has existed since the very early 2000s, it is
understandable.
IIRC, ARMv8 made the same mistake a decade later. It is less
understandable.



Scott Lurndal

Jan 31, 2024, 10:04:54 AM
ARMv8 has a control bit that can be set to allow EL0 access
to the DC system instructions. By default it is a privileged
instruction. It is up to the operating software to enable
it for user-mode code.

Anton Ertl

Jan 31, 2024, 1:24:02 PM
Michael S <already...@yahoo.com> writes:
>I have very little to add to very good response by Anton.
>That little addition is: the most if not all Rowhammer POC examples rely
>on CLFLUSH. That's what the manual says about it:
>"Executions of the CLFLUSH instruction are ordered with respect to each
>other and with respect to writes, locked read-modify-write
>instructions, fence instructions, and executions of CLFLUSHOPT to the
>same cache line.1 They are not ordered with respect to executions of
>CLFLUSHOPT to different cache lines."
>
>By now, it seems obvious that making CLFLUSH instruction non-privilaged
>and pretty much non-restricted by memory range/page attributes was a
>mistake, but that mistake can't be fixed without breaking things.
>Considering that CLFLUSH exists since very early 2000s, it is
>understandable.
>IIRC, ARMv8 did the same mistake a decade later. It is less
>understandable.

Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).

However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
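
As a rough illustration of that access pattern (the 32 KB / 8-way /
64-byte-line geometry in the comment is just an example, not a claim about
any particular CPU):

#include <cstddef>
#include <cstdint>

// Touch 17 lines that alias in one set of a 16-way cache, so each pass is
// guaranteed to evict at least one of them towards the next level.
void hammer_one_set(std::uint8_t* buf, std::size_t set_stride, int rounds)
{
    // set_stride = (number of sets) * 64-byte line, e.g. 4096 for a
    // 32 KB, 8-way, 64-byte-line L1 (64 sets). buf must cover 17*set_stride.
    for (int r = 0; r < rounds; ++r)
        for (std::size_t i = 0; i < 17; ++i)
            buf[i * set_stride] ^= 1;   // 17 conflicting lines > 16 ways
}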

The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go. With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.

MitchAlsup

Jan 31, 2024, 3:16:55 PM
Anton Ertl wrote:

> Michael S <already...@yahoo.com> writes:
>>I have very little to add to very good response by Anton.
>>That little addition is: the most if not all Rowhammer POC examples rely
>>on CLFLUSH. That's what the manual says about it:
>>"Executions of the CLFLUSH instruction are ordered with respect to each
>>other and with respect to writes, locked read-modify-write
>>instructions, fence instructions, and executions of CLFLUSHOPT to the
>>same cache line.1 They are not ordered with respect to executions of
>>CLFLUSHOPT to different cache lines."
>>
>>By now, it seems obvious that making CLFLUSH instruction non-privilaged
>>and pretty much non-restricted by memory range/page attributes was a
>>mistake, but that mistake can't be fixed without breaking things.
>>Considering that CLFLUSH exists since very early 2000s, it is
>>understandable.
>>IIRC, ARMv8 did the same mistake a decade later. It is less
>>understandable.

> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices.

I have wondered for a while about why device access is not to coherent
space. If it were so, then no CFLUSH functionality is needed, I/O can
just read/write an address and always get the freshest copy. {{Maybe
not the device itself, but the PCIe Root could translate from device
access space to memory access space (coherent).}}

> An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol. This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).

> However, AFAIK this is insufficient for fixing Rowhammer.

If L3 (LLC) is not a processor cache but a great big read/write buffer
for DRAM, then Rowhammering is significantly harder to accomplish.

> Caches have
> relatively limited associativity, up to something like 16-way
> set-associativity, so if you write to the same set 17 times, you are
> guaranteed to miss the cache. With 3 levels of cache you may need 49
> accesses (probably less), but I expect that the resulting DRAM
> accesses to a cache line are still not rare enough that Rowhammer
> cannot happen.

Rowhammer happens when you beat on the same cache line multiple times
{causing a charge sharing problem on the word lines}. Every time you cause
the DRAM to precharge (deActivate) you lose the count of how many times
you have to bang on the same word line to disrupt the stored cells.

So, the trick is to detect the RowHammering and insert refresh commands.

Michael S

Jan 31, 2024, 3:49:23 PM
On Wed, 31 Jan 2024 17:17:21 GMT
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S <already...@yahoo.com> writes:
> >I have very little to add to very good response by Anton.
> >That little addition is: the most if not all Rowhammer POC examples
> >rely on CLFLUSH. That's what the manual says about it:
> >"Executions of the CLFLUSH instruction are ordered with respect to
> >each other and with respect to writes, locked read-modify-write
> >instructions, fence instructions, and executions of CLFLUSHOPT to the
> >same cache line.1 They are not ordered with respect to executions of
> >CLFLUSHOPT to different cache lines."
> >
> >By now, it seems obvious that making CLFLUSH instruction
> >non-privilaged and pretty much non-restricted by memory range/page
> >attributes was a mistake, but that mistake can't be fixed without
> >breaking things. Considering that CLFLUSH exists since very early
> >2000s, it is understandable.
> >IIRC, ARMv8 did the same mistake a decade later. It is less
> >understandable.
>
> Ideally caches are fully transparent microarchitecture, then you don't
> need stuff like CLFLUSH. My guess is that CLFLUSH is there for
> getting DRAM up-to-date for DMA from I/O devices. An alternative
> would be to let the memory controller remember which lines are
> modified, and if the I/O device asks for that line, get the up-to-date
> data from the cache line using the cache-consistency protocol.

Considering that CLFLUSH was introduced by Intel in 2000 or 2001,
and that at that time all of Intel's PCI/AGP root hubs had already been
fully I/O-coherent for several years, I find your theory unlikely.

Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly
- entering a deep sleep state in which the CPU caches are powered down and
DRAM is put into self-refresh mode.

Of course, this particular use case does not require a *non-privileged*
CLFLUSH, so obviously Intel had a different reason.


> This
> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
> concerned, the ordering constraints may still be relevant), so there
> is a way to fix this mistake (if it is one).
>
> However, AFAIK this is insufficient for fixing Rowhammer. Caches have
> relatively limited associativity, up to something like 16-way
> set-associativity, so if you write to the same set 17 times, you are
> guaranteed to miss the cache. With 3 levels of cache you may need 49
> accesses (probably less), but I expect that the resulting DRAM
> accesses to a cache line are still not rare enough that Rowhammer
> cannot happen.
>

Original RH required a very high hammering rate that certainly can't be
achieved by playing with the associativity of the L3 cache.

Newer multi-sided hammering can probably do it in theory, but it would be
very difficult in practice.

Today we have yet another variant called RowPress that bypasses TRR
mitigation more reliably than multi-sided RH. I think this one would be
practically impossible without CLFLUSH, esp. when the system under attack
carries out other DRAM accesses in parallel with the attacker's code.


> The first paper on Rowhammer already outlined how the memory
> controller could count how often adjacent DRAM rows are accessed and
> thus weaken the row under consideration. This approach needs a little
> adjustment for Double Rowhammer and not immediately neighbouring rows,
> but otherwise seems to me to be the way to go.

IMHO, all these solutions are pure fantasy, because the memory controller
does not even know which rows are physically adjacent. POC authors
typically run lengthy tests in order to figure it out.


> With autorefresh in
> the DRAM devices these days, the DRAM manufacturers could implement
> this on their own, without needing to coordinate with memory
> controller designers. But apparently they think that the customers
> don't care, so they can save the expense.
>
> - anton


They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.



MitchAlsup

Jan 31, 2024, 6:25:54 PM
There was no assumption that this could result in a side channel or
attack vector at the time of its non-privileged inclusion. Afterwards
there was no reason to make it privileged until 2017, and by then the
ability to do anything about it had vanished.

Me, personally, I see this as a violation of the principle that the cache
is there to reduce memory latency and thereby improve performance.

>> This
>> would turn CLFLUSH into a noop (at least as far as writing to DRAM is
>> concerned, the ordering constraints may still be relevant), so there
>> is a way to fix this mistake (if it is one).
>>
>> However, AFAIK this is insufficient for fixing Rowhammer. Caches have
>> relatively limited associativity, up to something like 16-way
>> set-associativity, so if you write to the same set 17 times, you are
>> guaranteed to miss the cache. With 3 levels of cache you may need 49
>> accesses (probably less), but I expect that the resulting DRAM
>> accesses to a cache line are still not rare enough that Rowhammer
>> cannot happen.
>>

> Original RH required very high hammering rate that certainly can't be
> achieved by playing with associativity of L3 cache.

> Newer multiside hammering probably can do it in theory, but it would be
> very difficult in practice.

The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily share paired word lines.
The word lines could be as far as ½ the block away from each other.

The DRAM decoders are faster and smaller when there is a grey-like code
imposed on the logical-address-to-physical-word-line mapping. This also
happens in SRAM decoders. Going back and looking at the most used logical
to physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 word lines in a row.
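
As a toy illustration only (a plain binary-reflected Gray code here -
real parts use their own, undocumented mappings):

#include <cstdio>

// Binary-reflected Gray code: consecutive logical rows differ in one bit
// of the physical word-line select.
unsigned to_gray(unsigned x) { return x ^ (x >> 1); }

int main()
{
    for (unsigned row = 0; row < 8; ++row)
        std::printf("logical row %u -> physical word line %u\n", row, to_gray(row));
    // 0 1 3 2 6 7 5 4: logically consecutive rows are not, in general,
    // physically consecutive.
}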

> Today we have yet another variant called RowPress that bypasses TRR
> mitigation more reliably than mult-rate RH. I think this one would be
> practically impossible without CLFLUSH., esp. when system under attack
> carries other DRAM accesses in parallel with attackers code.

>> The first paper on Rowhammer already outlined how the memory
>> controller could count how often adjacent DRAM rows are accessed and
>> thus weaken the row under consideration. This approach needs a little
>> adjustment for Double Rowhammer and not immediately neighbouring rows,
>> but otherwise seems to me to be the way to go.

> IMHO, all thise solutions are pure fantasy, because memory controller
> does not even know which rows are physically adjacent.

Different DIMMs and even different DRAMs on the same DIMM may not
share that correspondence. {There is a lot of bit line and a little
word line repair done at the tester.}

a...@littlepinkcloud.invalid

Feb 1, 2024, 4:39:27 AM
Michael S <already...@yahoo.com> wrote:
>
> By now, it seems obvious that making CLFLUSH instruction non-privilaged
> and pretty much non-restricted by memory range/page attributes was a
> mistake, but that mistake can't be fixed without breaking things.
> Considering that CLFLUSH exists since very early 2000s, it is
> understandable.
> IIRC, ARMv8 did the same mistake a decade later. It is less
> understandable.

For Arm, with its non-coherent data and instruction caches, you need
some way to flush dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence, when
using non-volatile memory you need an efficient way to flush dcache to
the point of persistence. You need that in order to make sure that a
transaction has been written to a log.

With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent. Which is a good idea, but not everyone does it.

Andrew.

Michael S

Feb 1, 2024, 8:36:53 AM
For the latter, a privileged flush instruction sounds sufficient.

For the former, ARMv8 appears to have a special instruction (or you can
call it a special variant of the DC instruction) - Clean by virtual address
to point of unification (DC CVAU). This instruction alone would not
make an RH attack much easier. The problem is that the privilege level of
this instruction is controlled by the same bit as that of two much more
dangerous variations of DC (DC CVAC and DC CIVAC).

> Which is a good idea, but not everyone does it.
>
> Andrew.

Neoverse N1 had it. I don't know about the rest of Neoverse series.


EricP

Feb 1, 2024, 9:05:46 AM
The text in the Intel Vol 1 Architecture manual indicates they viewed all
these cache control instructions - PREFETCH, CLFLUSH, and CLFLUSHOPT -
as part of SSE, for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.

Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.
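
A sketch of those intrinsics for reference (x86-specific, needs SSE2; the
streaming store requires a 16-byte-aligned destination):

#include <immintrin.h>   // _mm_stream_si128, _mm_clflush, _mm_sfence
#include <cstdint>

// Non-temporal (streaming) store: writes around the cache hierarchy.
void stream_store(std::uint8_t* dst, __m128i value)
{
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst), value);  // MOVNTDQ
    _mm_sfence();                                               // order the NT store
}

// CLFLUSH: write back (if dirty) and invalidate the line containing p.
void flush_line(const void* p)
{
    _mm_clflush(p);
}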



EricP

Feb 1, 2024, 9:20:36 AM
CLFLUSH wouldn't be useful for that, as it flushes by virtual address.
It also allows all sorts of reorderings that you don't want to think about
during a (possibly emergency) cache sync.

The privileged WBINVD and WBNOINVD instructions are intended for that.
It sounds like they basically halt the core for the duration of the
write back of all modified lines.



Michael S

Feb 1, 2024, 9:30:32 AM
According to Wikipedia, CLFLUSH was not introduced with SSE.
It was introduced together with SSE2, but formally is not part of it.
CLFLUSHOPT came much, much, much later and was likely related to the
Optane DIMM aspirations of the late 2010s.





Chris M. Thomasson

Feb 1, 2024, 3:49:04 PM
Then there are the LFENCE, SFENCE and MFENCE for write back memory.
Non-temporal stores, iirc.

Chris M. Thomasson

Feb 1, 2024, 3:49:51 PM
Oops, non-write back memory! IIRC. Sorry.

a...@littlepinkcloud.invalid

Feb 2, 2024, 5:20:32 AM
Michael S <already...@yahoo.com> wrote:
> On Thu, 01 Feb 2024 09:39:13 +0000
> a...@littlepinkcloud.invalid wrote:
>
>> Michael S <already...@yahoo.com> wrote:
>> >
>> > By now, it seems obvious that making CLFLUSH instruction
>> > non-privilaged and pretty much non-restricted by memory range/page
>> > attributes was a mistake, but that mistake can't be fixed without
>> > breaking things. Considering that CLFLUSH exists since very early
>> > 2000s, it is understandable.
>> > IIRC, ARMv8 did the same mistake a decade later. It is less
>> > understandable.
>>
>> For Arm, with its non-coherent data and instruction caches, you need
>> some way to flush dcache to the point of unification in order to make
>> instruction changes visible. Also, regardless of icache coherence,
>> when using non-volatile memory you need an efficient way to flush
>> dcache to the point of peristence. You need that in order to make
>> sure that a transaction has been written to a log.
>>
>> With the latter, you could restrict dcache flushes to pages with a
>> particular non-volatile attribute. I don't think there's anything you
>> can do about the former, short of simply making i- and d-cache
>> coherent.
>
> For the later, privileged flush instruction sounds sufficient.

Does it? You're trying for high throughput, and a full system call
wouldn't help with that. And besides, if userspace can ask the kernel to
do something on its behalf, you haven't added any security by making
it privileged.

> For the former, ARMv8 appears to have a special instruction (or you can
> call it a special variant of DC instruction) - Clean by virtual address
> to point of unification (DC CVAU). This instruction alone would not
> make RH attack much easier. The problem is that privilagability of this
> instruction controlled by the same bit as privilagability of two much
> more dangerous variations of DC (DC CVAC and DC CIVAC).

Ah, thanks.

Andrew.

EricP

Feb 2, 2024, 12:04:06 PM
MitchAlsup wrote:
> Anton Ertl wrote:
>
>
> Rowhammer happens when you beat on the same cache line multiple times
> {causing a charge sharing problem on the word lines. Every time you cause
> the DRAM to precharge (deActivate) you lose the count on how many times
> you have to bang on the same word line to disrupt the stored cells.
>
> So, the trick is to detect the RowHammering and insert refresh commands.

It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.

Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.

And the threshold when it triggers has been changing as drams become more
dense. In 2014 when this was first encountered it took 139K activations.
By 2020 that was down to 4.8K.

So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.



EricP

Feb 2, 2024, 12:15:56 PM
MitchAlsup wrote:
> Michael S wrote:
>
>> Original RH required very high hammering rate that certainly can't be
>> achieved by playing with associativity of L3 cache.
>
>> Newer multiside hammering probably can do it in theory, but it would be
>> very difficult in practice.
>
> The problem here is the fact that DRAMs do not use linear decoders, so
> address X and address X+1 do not necessarily shared paired word lines.
> The word lines could be as far as ½ the block away from each other.
>
> The DRAM decoders are faster and smaller when there is a grey-like-code
> imposed on the logical-address to physical-word-line. This also happens
> in SRAM decoders. Going back and looking at the most used logical to
> physical mapping shows that while X and X+1 can (occasionally) be side
> by side, X, X+1 and X+2 should never be 3 words lines in a row.

A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.

I was wondering if each row could have "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.

Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.



Thomas Koenig

Feb 2, 2024, 1:09:18 PM
EricP <ThatWould...@thevillage.com> schrieb:

> Then a giant 2 million input OR gate would tell us if any row's
> canary had flipped.

That would look... interesting.

How are large OR gates actually constructed? I would assume that an
eight-input OR gate could look something like

nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

which would reduce the number of inputs by a factor of 2^3, so
seven layers of these OR gates would be needed.

Wiring would be interesting as well...

MitchAlsup

Feb 2, 2024, 2:20:56 PM
Thomas Koenig wrote:

> EricP <ThatWould...@thevillage.com> schrieb:

>> Then a giant 2 million input OR gate would tell us if any row's
>> canary had flipped.

> That would look... interesting.

> How are large OR gates actually constructed? I would assume that an
> eight-input OR gate could look something like

> nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

Close, but NANDs come with 4-inputs and NORs come with 3*, so you get
a 3×4 = 12:1 reduction per pair of stages.

2985984->248832->20736->1728->144->12->1

> which would reduce the number of inputs by a factor of 2^3, so
> seven layers of these OR gates would be needed.

6 not 7

> Wiring would be interesting as well...

That is why we have 10 layers of metal--oh wait DRAMs don't have that
much metal.....

(*) NANDs having 4 inputs while NORs only have 3 is a consequence of
P-channel transistors having lower transconductance and higher body
effects, and there are differences between planar transistors and
finFETs here, too.

MitchAlsup

Feb 2, 2024, 2:36:00 PM
EricP wrote:

> MitchAlsup wrote:
>> Anton Ertl wrote:
>>
>>
>> Rowhammer happens when you beat on the same cache line multiple times
>> {causing a charge sharing problem on the word lines. Every time you cause
>> the DRAM to precharge (deActivate) you lose the count on how many times
>> you have to bang on the same word line to disrupt the stored cells.
>>
>> So, the trick is to detect the RowHammering and insert refresh commands.

> It's not just the immediately physically adjacent rows -
> I think I read that the effect falls off for up to +-3 rows away.

My understanding is that RowHammer has to access the same row multiple times
to disrupt bits in an adjacent row. This sounds like a charge sharing problem.
A long time ago we found a problem with one manufacturer's SRAM: when the same
row was hit >6,000 times, there was enough charge sharing that the adjacent
dynamic word decoder also fired, so we had 2 or 3 word lines active at the
same time. We encountered this when a LD missed the cache and was sent down
through the NorthBridge, SouthBridge, onto another bus, finally out to the
device and back, while the CPU was continuing to read the ICache every cycle.

My limited understanding of RowPress is that you should not keep the Row open
for more than a page of data transfer (about ¼ of the 7.8µs DDR4 limit). My
bet is that this is a leakage issue on the bit line made sensitive by the
word line.

> Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.

DRAMs are funny like this. Adjacent bit lines store data differently. Even
cells store 0 as 0 and 1 as 1 while odd cells store 0 as 1 and 1 as 0. They
do this so the sense amplifier has a differential to sense: either the even
cell or the odd cell is asserted on the bit line pair, and one line goes up
a little or down a little while the other bit line stays where it is.

EricP

Feb 2, 2024, 5:22:40 PM
MitchAlsup wrote:
> EricP wrote:
>
>> MitchAlsup wrote:
>>> Anton Ertl wrote:
>>>
>>>
>>> Rowhammer happens when you beat on the same cache line multiple times
>>> {causing a charge sharing problem on the word lines. Every time you
>>> cause
>>> the DRAM to precharge (deActivate) you lose the count on how many times
>>> you have to bang on the same word line to disrupt the stored cells.
>>>
>>> So, the trick is to detect the RowHammering and insert refresh commands.
>
>> It's not just the immediately physically adjacent rows -
>> I think I read that the effect falls off for up to +-3 rows away.
>
> My understanding is that RowHammer has to access the same row multiple
> times
> to disrupt bits in an adjacent row. This sounds like a charge sharing
> problem.

Yes, as I understand it charge migration.
I had a nice document on the root cause of Rowhammer but I can't seem to
find it again. This one is a little heavy on the semiconductor physics:

On dram rowhammer and the physics of insecurity, 2020
https://ieeexplore.ieee.org/iel7/16/9385809/09366976.pdf

"Experimental evidence points to two mechanisms for the RH disturb,
namely cell transistor subthreshold leakage and electron injection
into the p-well of the DRAM array from the hammered cell transistors
and their subsequent capture by the storage node (SN) junctions [13].

Regarding the subthreshold leakage, lower cell transistor threshold
voltages have been shown to correlate with higher susceptibility to RH.
This is consistent with crosstalk between the switching aggressor wordline
and the victim wordlines pulling up the latter sufficiently in the
potential to drain away some of the victim cell’s stored charge [14], [15].

Regarding the injected electrons from the hammered cell transistors,
the blame for these has been placed on two different origins.
The first describes a collapsing inversion layer associated with the
hammered cell transistor where a population of electrons is injected
into the p-well as the transistor’s gate turns off [16]. The second
describes electron injection from charge traps near the silicon/gate
dielectric interface of the cell select transistor [13], [17].
Several studies look into techniques for hampering the migration of
these injected electrons."

> A long time ago We found a problem with one manufactures SRAM when the same
> row was hit >6,000 times, there was enough charge sharing that the
> adjacent dynamic word decoder also fired so we had 2 or 3 word lines
> active at the same time. We encountered this when a LD missed the cache
> and was sent down
> through NorthBridge, SouthBridge, onto another bus, finally out to the
> device
> and back, while the CPU was continuing to read the ICache every cycle.

I think of this as aging: each activation ages the rows up to some distance
by amounts depending on the distance due to charge migration.

Originally it was found by activating rows immediately adjacent to the
victim but then they looked and found it further out to +-4 rows.
This effect appears to be called the Rowhammer "blast radius".

This paper is from 2023 but I'm sure I've seen mention of this effect
before but not called blast radius.

BLASTER: Characterizing the Blast Radius of Rowhammer, 2023
https://www.research-collection.ethz.ch/handle/20.500.11850/617284
https://dramsec.ethz.ch/papers/blaster.pdf

"In particular, we show for the first time that BLASTER significantly
reduces the number of necessary activations to the victim-adjacent
aggressors using other aggressor rows that are up to four rows away
from the victim."

> My limited understanding of RowPress is that you should not keep the Row
> open
> for more than a page of data transfer (about ¼ of 7.8µs DDR4 limit). My
> bet is
> that this is a leakage issue on the bit line made sensitive by the word
> line.

Yes, from what I read the factors affecting Rowhammer vulnerability are:

1) DRAM chip temperature, 2) aggressor row active time,
and 3) victim DRAM cell’s physical location.

EricP

Feb 2, 2024, 6:24:44 PM
EricP wrote:
> MitchAlsup wrote:
>
>> A long time ago We found a problem with one manufactures SRAM when the
>> same
>> row was hit >6,000 times, there was enough charge sharing that the
>> adjacent dynamic word decoder also fired so we had 2 or 3 word lines
>> active at the same time. We encountered this when a LD missed the cache
>> and was sent down
>> through NorthBridge, SouthBridge, onto another bus, finally out to the
>> device
>> and back, while the CPU was continuing to read the ICache every cycle.
>
> I think of this as aging: each activation ages the rows up to some distance
> by amounts depending on the distance due to charge migration.
>
> Originally it was found by activating rows immediately adjacent to the
> victim but then they looked and found it further out to +-4 rows.
> This effect appears to be called the Rowhammer "blast radius".
>
> This paper is from 2023 but I'm sure I've seen mention of this effect
> before but not called blast radius.
>
> BLASTER: Characterizing the Blast Radius of Rowhammer, 2023
> https://www.research-collection.ethz.ch/handle/20.500.11850/617284
> https://dramsec.ethz.ch/papers/blaster.pdf
>
> "In particular, we show for the first time that BLASTER significantly
> reduces the number of necessary activations to the victim-adjacent
> aggressors using other aggressor rows that are up to four rows away
> from the victim."

To elaborate a bit, as I understand it this means that if a dram
has a blast radius of +-3 and we take 7 rows A B C D E F G,
and assuming the aging factor is linear, then any read or refresh
of row D resets its age to 0 but ages C&E by 3, B&F by 2, A&G by 1.
If any row age total hits 15,000 its data dies.

This is why I thought canary bits might work, because they integrate the
sum of all adjacent activates while taking blast distance into account.
As long as the canary _reliably_ dies at age 12,000 and the data at 15,000
then the dram could transparently refresh the aged-out rows.
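
A toy simulation of that aging model, using the same made-up numbers
(blast radius +-3, aging 3/2/1 with distance, 15,000 threshold):

#include <cstdio>
#include <vector>

int main()
{
    const int kRows = 16;
    const long kThreshold = 15000;                 // "data dies" level from the example
    const int kBlast[3] = { 3, 2, 1 };             // aging at distance 1, 2, 3
    std::vector<long> age(kRows, 0);

    auto activate = [&](int row) {
        age[row] = 0;                              // the activated row is effectively refreshed
        for (int d = 1; d <= 3; ++d) {
            if (row - d >= 0)    age[row - d] += kBlast[d - 1];
            if (row + d < kRows) age[row + d] += kBlast[d - 1];
        }
    };

    for (int i = 0; i < 6000; ++i)
        activate(7);                               // hammer row 7

    for (int r = 4; r <= 10; ++r)
        std::printf("row %2d  age %6ld%s\n", r, age[r],
                    age[r] >= kThreshold ? "  <- would have flipped" : "");
}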



Anton Ertl

Feb 3, 2024, 4:25:24 AM
Michael S <already...@yahoo.com> writes:
>On Wed, 31 Jan 2024 17:17:21 GMT
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> The first paper on Rowhammer already outlined how the memory
>> controller could count how often adjacent DRAM rows are accessed and
>> thus weaken the row under consideration. This approach needs a little
>> adjustment for Double Rowhammer and not immediately neighbouring rows,
>> but otherwise seems to me to be the way to go.
>
>IMHO, all thise solutions are pure fantasy, because memory controller
>does not even know which rows are physically adjacent. POC authors
>typically run lengthy tests in order to figure it out.

Given that the attackers can find out, it is just a lack of
communication between DRAM manufacturers and memory controller
manufacturers that results in that ignorance. Not a valid excuse.

There is a standardization committee (JEDEC) that documents how
various DRAM types are accessed, refreshed etc. They put information
about that (and about RAM overclocking (XMP, Expo)) in the SPD ROMs of
the DIMMs, so they can also put information about line adjacency
there.

>> With autorefresh in
>> the DRAM devices these days, the DRAM manufacturers could implement
>> this on their own, without needing to coordinate with memory
>> controller designers. But apparently they think that the customers
>> don't care, so they can save the expense.
...
>They cared enough to implement the simplest of proposed solutions - TRR.
>Yes, it was quickly found insufficient, but at least there was a
>demonstration of good intentions.

Yes. However, looking at Table III of
<https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
seems to be significant differences between manufacturers A and D on
one hand, and B and C on the other, with exploits taking much longer
for B and C, and failing in some cases.

One may wonder if the DRAM manufacturers could have put their
physicists to the task of identifying the conditions under which bit
flips can occur, and identifying the refreshes that are at least
necessary to prevent these conditions from occurring. If they have not
done so, or if they have not implemented the resulting recommendations
(or passed them to the memory controller people), a certain amount of
blame rests on them.

Anyway, never mind the blame, looking into the future, I find it
worrying that I did not find any mention of Rowhammer protection in
the specs of DIMMs when I last looked.

Anton Ertl

Feb 3, 2024, 5:07:40 AM
EricP <ThatWould...@thevillage.com> writes:
>MitchAlsup wrote:
>> Michael S wrote:
>>
>>> Original RH required very high hammering rate that certainly can't be
>>> achieved by playing with associativity of L3 cache.
>>
>>> Newer multiside hammering probably can do it in theory, but it would be
>>> very difficult in practice.
>>
>> The problem here is the fact that DRAMs do not use linear decoders, so
>> address X and address X+1 do not necessarily shared paired word lines.
>> The word lines could be as far as ½ the block away from each other.
>>
>> The DRAM decoders are faster and smaller when there is a grey-like-code
>> imposed on the logical-address to physical-word-line. This also happens
>> in SRAM decoders. Going back and looking at the most used logical to
>> physical mapping shows that while X and X+1 can (occasionally) be side
>> by side, X, X+1 and X+2 should never be 3 words lines in a row.
>
>A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
>So having a counter for each row is impractical.

A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and the
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).

Alternatively, a kind of cache could be used. Keep counts of N most
recently accessed rows, remove the row on refresh; when accessing a
row that has not been in the cache, evict the entry for the row with
the lowest count C, and set the count of the loaded row to C+1. When
a count (or ensemble of counts) reaches the limit, refresh every row.

This would take much less memory, but require finding the entry with
the lowest count. By dividing the cache into sets, this becomes more
realistic; upon reaching a limit, only the rows in the blast radius of
the lines in a set need to be refreshed.
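
A sketch of that bookkeeping (N and the limit are placeholders; the
"remove the row on refresh" step is simplified to zeroing its count):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RowCounter { std::uint32_t row; std::uint32_t count; };

class ActivationTracker {
    std::vector<RowCounter> entries_;
    std::size_t capacity_;
    std::uint32_t limit_;
public:
    ActivationTracker(std::size_t n, std::uint32_t limit)
        : capacity_(n), limit_(limit) {}

    // Called on each row activation; returns true when the rows in the
    // blast radius of `row` should be refreshed.
    bool on_activate(std::uint32_t row)
    {
        auto it = std::find_if(entries_.begin(), entries_.end(),
                               [&](const RowCounter& e) { return e.row == row; });
        if (it == entries_.end()) {
            std::uint32_t start = 1;
            if (entries_.size() >= capacity_) {
                // evict the entry with the lowest count C ...
                auto victim = std::min_element(
                    entries_.begin(), entries_.end(),
                    [](const RowCounter& a, const RowCounter& b) {
                        return a.count < b.count;
                    });
                start = victim->count + 1;   // ... and start the new row at C+1
                entries_.erase(victim);
            }
            entries_.push_back({ row, start });
            it = entries_.end() - 1;
        } else {
            ++it->count;
        }
        if (it->count >= limit_) { it->count = 0; return true; }
        return false;
    }
};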

>I was wondering if each row could have "canary" bit,
>a specially weakened bit that always flips early.
>This would also intrinsically handle the cases of effects
>falling off over the +-3 adjacent rows.
>
>Then a giant 2 million input OR gate would tell us if any row's
>canary had flipped.

Yes, doing it in analog has its charms. However, I see the following
difficulties:

* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?

* To flip a bit in one direction, AFAIK the hammering rows have to
have a specific content. I guess with a blast radius of 4 rows on
each side, you could have 4 columns. Each row has a canary in one
of these columns and the three adjacent bits in this column are
attacker bits that have the value that is useful for effecting a bit
flip in a canary. Probably a more refined variant of this idea
would be necessary to deal with diagonal influence and
the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
in this thread.

MitchAlsup

Feb 3, 2024, 12:10:58 PM
> ....
>>They cared enough to implement the simplest of proposed solutions - TRR.
>>Yes, it was quickly found insufficient, but at least there was a
>>demonstration of good intentions.

> Yes. However, looking at Table III of
> <https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
> seems to be significant differences between manufacturers A and D on
> one hand, and B and C on the other, with exploits taking much longer
> for B and C, and failing in some cases.

> One may wonder if the DRAM manufacturers could have put their
> physicists to the task of identifying the conditions under which bit
> flips can occur, and identify the refreshes that are at least
> necessary to prevent these conditions from occuring. If they have not
> done so, or if they have not implemented the resulting recommendations
> (or passed them to the memory controller people), a certain amount of
> blame rests on them.

> Anyway, never mind the blame, looking into the future, I find it
> worrying that I did not find any mention of Rowhammer protection in
> the specs of DIMMs when I last looked.

My information is that they (DRAM mfgs) looked and said they could not
fix a problem that emanated from the DRAM controller.

> - anton

EricP

Feb 3, 2024, 12:12:41 PM
Ah, I see from the RowPress paper that it is different from RowHammer.
RowHammer is based on activation counts and RowPress on activation time.
Previously, papers had just said that activation time correlated with
bit flips, and I guess everyone just assumed it was the same mechanism.
But the RowPress paper shows it affects different bits from RowHammer.
Also, RowPress and RowHammer tend to flip in different directions:
RowHammer flips 0 to 1 and RowPress 1 to 0 (taking the true and anti
cell logic states into account). Possibly one is doing electron injection
and the other hole injection.


MitchAlsup

Feb 3, 2024, 12:16:56 PM
Anton Ertl wrote:

>
>>Then a giant 2 million input OR gate would tell us if any row's
>>canary had flipped.

> Yes, doing it in analog has its charms. However, I see the following
> difficulties:

> * How do you measure whether a bit has flipped without refreshing it
> and thus resetting the canary?

You know what its value should be and you raise hell when it is not as
expected. This may require 2 canary bits.

Anton Ertl

Feb 3, 2024, 12:53:19 PM
mitch...@aol.com (MitchAlsup) writes:
>Anton Ertl wrote:
>
>>
>>>Then a giant 2 million input OR gate would tell us if any row's
>>>canary had flipped.
>
>> Yes, doing it in analog has its charms. However, I see the following
>> difficulties:
>
>> * How do you measure whether a bit has flipped without refreshing it
>> and thus resetting the canary?
>
>You know what its value should be and you raise hell when it is not as
>expected.

So that is about detecting Rowhammer after the fact. Yes, you could
do that when the row is refreshed. The only problem is that by then
the attacker could have extracted the secret(s) with the
Rowhammer-based attack. Better than nothing, but still not a very
attractive approach.

I prefer a solution that detects that a row might suffer a bit flip
after several more accesses, and refreshes the row before that happens.
And I don't think that this can be implemented with an analog canary
that works like a DRAM cell; but I am not a solid-state physicist,
maybe there is a way.

MitchAlsup

Feb 3, 2024, 2:11:22 PM
Anton Ertl wrote:

> mitch...@aol.com (MitchAlsup) writes:
>>Anton Ertl wrote:
>>
>>>
>>>>Then a giant 2 million input OR gate would tell us if any row's
>>>>canary had flipped.
>>
>>> Yes, doing it in analog has its charms. However, I see the following
>>> difficulties:
>>
>>> * How do you measure whether a bit has flipped without refreshing it
>>> and thus resetting the canary?
>>
>>You know what its value should be and you raise hell when it is not as
>>expected.

> So that is about detecting Rowhammer after the fact. Yes, you could
> do that when the row is refreshed. The only problem is that by then
> the attacker could have extracted the secret(s) with the
> Rowhammer-based attack. Better than nothing, but still not a very
> attractive approach.

> I prefer a solution that detects that a row might suffer a bit flip
> after several more accesses, and refreshes the row before that happens.
> And I don't think that this can be implemented with an analog canary
> that works like a DRAM cell; but I am not a solid-state physicist,
> maybe there is a way.

Sooner or later, designers will have to come to the realization that
an external DRAM controller can never guarantee everything every DRAM
actually needs to retain data under all conditions, and the DRAMs
are going to have to change the interface such that requests flow
in and results flow out based on the DRAM internal controller--much
like that of a SATA disk drive.

Let us face it, the DDR-6 interface model is based on the 16K-bit
DRAM chips from about 1979: RAS and CAS. It got sped up, pipelined,
double data rated, and each step added address bits to RAS and CAS.

I suspect when this happens, the DRAMs will partition the inbound
address into 3 or 4 sections, and use each section independently:
Bank-Row-Column or block-bank-row-column.

In addition each building block will be internally self-timed, with no
external need to refresh the bank-row, and the only non-access
commands in the arsenal are power-down and power-up.

You can only put so much lipstick on a pig.

> - anton

EricP

Feb 4, 2024, 12:00:49 AM
They said that the current threshold for causing flips in an immediate
neighbor is 4800 activations, but with a blast radius of +-4 that
can be in any of the 8 neighbors, so your counter threshold will have
to trigger refresh at 1/8 of that level or every 600 activations.

And as the dram features get smaller that threshold number will go down
and probably the blast radius will go up. So this could have scaling
issues in the future.

> Alternatively, a kind of cache could be used. Keep counts of N most
> recently accessed rows, remove the row on refresh; when accessing a
> row that has not been in the cache, evict the entry for the row with
> the lowest count C, and set the count of the loaded row to C+1. When
> a count (or ensemble of counts) reaches the limit, refresh every row.

That would be a CAM or assoc sram and would have to hold a large
number of entries. This would have to be in the memory controller.

> This would take much less memory, but require finding the entry with
> the lowest count. By dividing the cache into sets, this becomes more
> realistic; upon reaching a limit, only the rows in the blast radius of
> the lines in a set need to be refreshed.
>
>> I was wondering if each row could have "canary" bit,
>> a specially weakened bit that always flips early.
>> This would also intrinsically handle the cases of effects
>> falling off over the +-3 adjacent rows.
>>
>> Then a giant 2 million input OR gate would tell us if any row's
>> canary had flipped.
>
> Yes, doing it in analog has its charms. However, I see the following
> difficulties:
>
> * How do you measure whether a bit has flipped without refreshing it
> and thus resetting the canary?

The canary would have to be a little more complicated than a standard
storage cell because it has to compare the cell to the expected value
and then drive an output transistor to pull down a dynamic bit line
for a wired-OR of all the canaries in a bank.
Hopefully that would isolate the canary from its read bit line changes.

Fitting this into a dram row could be a problem.
This would all have the same height as a normal row to fit horizontally
along a dram row so it didn't bugger up the row spacing.

> * To flip a bit in one direction, AFAIK the hammering rows have to
> have a specific content. I guess with a blast radius of 4 rows on
> each side, you could have 4 columns. Each row has a canary in one
> of these columns and the three adjacent bits in this column are
> attacker bits that have the value that is useful for effecting a bit
> flip in a canary. Probably a more refined variant of this idea
> would be necessary is necessary to deal with diagonal influence and
> the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
> in this thread.
>
> - anton

Each canary might be 3 cells with alternating patterns,
even row numbers are inited to 010 and odd rows to 101,
positioned in vertical columns. Presumably this would put
the maximum and a predictable stress on the center bit.
Since the expected value for each row is hard wired it is
easy to test if it changes.

MitchAlsup1

Feb 4, 2024, 4:15:45 PM
Anton Ertl wrote:

> EricP <ThatWould...@thevillage.com> writes:
>>MitchAlsup wrote:
>>> Michael S wrote:
>>>
>>>> Original RH required very high hammering rate that certainly can't be
>>>> achieved by playing with associativity of L3 cache.
>>>
>>>> Newer multiside hammering probably can do it in theory, but it would be
>>>> very difficult in practice.
>>>
>>> The problem here is the fact that DRAMs do not use linear decoders, so
>>> address X and address X+1 do not necessarily shared paired word lines.
>>> The word lines could be as far as ½ the block away from each other.
>>>
>>> The DRAM decoders are faster and smaller when there is a grey-like-code
>>> imposed on the logical-address to physical-word-line. This also happens
>>> in SRAM decoders. Going back and looking at the most used logical to
>>> physical mapping shows that while X and X+1 can (occasionally) be side
>>> by side, X, X+1 and X+2 should never be 3 words lines in a row.
>>
>>A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
>>So having a counter for each row is impractical.

> A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.

You are comparing a 16-bit incrementor and its associated flip-flop
with a single transistor divided by the number of them in a word. My
guess is that you are off by 20× (should be close to 4%)

Anton Ertl

Feb 5, 2024, 4:34:12 AM
I was thinking about counting each access only when the cache line is
accessed. Then there needs to be only one incrementor per bank, and
the counter can be stored in DRAM like the payload data.

But thinking about it again, I wonder how counters would be reset.
Maybe, when the counter reaches the limit, all lines in its blast
radius are refreshed, and the counter of the present line is reset to
0.

Another disadvantage would be that we have to make decisions about
possible rowhammering only based on one counter, and have to trigger
refreshes of all lines in the blast radius based on worst-case
scenarios (i.e., assuming that other rows in the blast radius have any
count up to the limit).

Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.

Alternatively, if you want to invest more, one could follow your idea
and have counter SRAM (maybe including counting circuitry) for each
row; each refresh of a line would increment the counters in the blast
radius by an appropriate amount, and when a counter reaches its limit,
it would trigger a refresh of that row.
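
To make the bookkeeping concrete, a minimal software sketch of that per-row
counter scheme (the +-4 radius and the 4800 limit are the numbers used in
this thread; the names and the reset-on-refresh policy are just my
assumptions):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: in hardware these counters would live in SRAM (or in the
// row itself) next to the bank, not in software.
class RowCounters
{
    std::vector<uint16_t> count;
    static constexpr int kRadius = 4;        // blast radius, rows on each side
    static constexpr uint16_t kLimit = 4800; // per-victim activation threshold

public:
    explicit RowCounters(size_t rows) : count(rows, 0) {}

    // Called on every ACTivate of 'row': charge every neighbour in the
    // blast radius, refresh any neighbour that reaches the limit, and
    // reset the refreshed neighbour's counter.
    template <class RefreshFn>
    void onActivate(size_t row, RefreshFn refreshRow)
    {
        for (int d = -kRadius; d <= kRadius; ++d)
        {
            if (d == 0) continue;
            long v = (long)row + d;
            if (v < 0 || v >= (long)count.size()) continue;
            if (++count[v] >= kLimit)
            {
                refreshRow((size_t)v);
                count[v] = 0;
            }
        }
    }
};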

>My guess is that you are off by 20× (should be close to 4%)

Even 4% is not "impractical".

Anton Ertl

unread,
Feb 5, 2024, 4:47:56 AMFeb 5
to
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
> Admittedly, if you just update the counter for a specific row and then
>> refresh all rows in the blast radius when a limit is reached, you
>> may get many more refreshes than the minimum necessary, but given that
>> normal programs usually do not hammer specific row ranges, the
>> additional refreshes may still be relatively few in non-attack
>> situations (and when being attacked, you prefer lower DRAM performance
>> to a successful attack).
>
>They said that the current threshold for causing flips in an immediate
>neighbor is 4800 activations, but with a blast radius of +-4 that
>can be in any of the 8 neighbors, so your counter threshold will have
>to trigger refresh at 1/8 of that level or every 600 activations.

So only 10 bits of counter are necessary, reducing the overhead to
0.125%:-).

>And as the dram features get smaller that threshold number will go down
>and probably the blast radius will go up. So this could have scaling
>issues in the future.

Yes.

>> Alternatively, a kind of cache could be used. Keep counts of N most
>> recently accessed rows, remove the row on refresh; when accessing a
>> row that has not been in the cache, evict the entry for the row with
>> the lowest count C, and set the count of the loaded row to C+1. When
>> a count (or ensemble of counts) reaches the limit, refresh every row.
>
>That would be a CAM or assoc sram and would have to hold a large
>number of entries. This would have to be in the memory controller.

Possibly. Recent DRAMs also support self-refresh (to allow powering
down the connection to the memory controller); this kind of stuff
could also be on the DRAM device, avoiding all the problems that
memory controllers have with knowing the characteristics of the DRAM
device.
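
For what it's worth, the counter cache quoted above boils down to roughly
this (the entry count and the 600 limit are placeholders; the linear scan
here is exactly the CAM/associative lookup you'd need in hardware):

#include <cstdint>
#include <unordered_map>

// Sketch of the quoted scheme: track the N most recently active rows;
// a row not present evicts the entry with the lowest count C and starts
// at C+1; a refresh removes the row from the cache.
class RowActivityCache
{
    std::unordered_map<uint32_t, uint32_t> counts;   // row -> count
    static constexpr size_t kEntries = 64;
    static constexpr uint32_t kLimit = 600;

public:
    // Returns true when the caller should trigger the chosen refresh action.
    bool onActivate(uint32_t row)
    {
        auto it = counts.find(row);
        if (it == counts.end())
        {
            uint32_t start = 0;
            if (counts.size() >= kEntries)
            {
                auto victim = counts.begin();
                for (auto i = counts.begin(); i != counts.end(); ++i)
                    if (i->second < victim->second) victim = i;
                start = victim->second;              // inherit the lowest count C
                counts.erase(victim);
            }
            it = counts.emplace(row, start).first;
        }
        return ++it->second >= kLimit;               // the new row ends up at C+1
    }

    void onRefresh(uint32_t row) { counts.erase(row); }
};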

>> * How do you measure whether a bit has flipped without refreshing it
>> and thus resetting the canary?
>
>The canary would have to be a little more complicated than a standard
>storage cell because it has to compare the cell to the expected value

Maybe capacitative coupling (as used for flash AFAIK) could be used to
measure the contents of the canary without discharging it. There
still would be tunneling, as in Rowhammer itself, but I guess one
could account for that.

Anton Ertl

unread,
Feb 5, 2024, 4:59:35 AMFeb 5
to
mitch...@aol.com (MitchAlsup) writes:
>Sooner or later, designers will have to come to the realization that
>an external DRAM controller can never guarantee everything every DRAM
>actually needs to retain data under all conditions, and the DRAMs
>are going to have to change the interface such that requests flow
>in and results flow out based on the DRAM internal controller--much
>like that of a SATA disk drive.
>
>Let us face it, the DDR-6 interface model is based on the 16K-bit
>DRAM chips from about 1979: RAS and CAS, it got speed up, pipelined,
>double data rated, and each step added address bits to RAS and CAS.

I don't know about DDR6, but the DDR5 command interface is
significantly more complex
<https://en.wikipedia.org/wiki/DDR5#Command_encoding> than early
asynchronous DRAM.

>I suspect when this happens, the DRAMs will partition the inbound
>address into 3 or 4 sections, and use each section independently
>Bank-Row-Column or block-bank-row-column.

Looking at the commands from the link above, Activate already
transfers the row in two pieces, and the read and write are also
transferred in two pieces.

>In addition each building block will be internally self timed, no
>external need to refresh the bank-row, and the only non access
>command in the arsenal is power-down and power-up.

Self-refresh is already there, but AFAIK only used when processing is
suspended.

However, there are many commands, many more than in the 16kx1 DRAMs of
old. What would make them go in the direction of simplifying the
interface? The hardest part these days seems to be getting the high
transfer rates to work, the rest of the interface is probably
comparatively easy.

MitchAlsup1

unread,
Feb 5, 2024, 5:30:56 PMFeb 5
to
My DRAM controller (AMD Opteron rev G) used ACTivate commands instead of
refresh commands to refresh rows in DDR2 DRAM. The timings were better.
It just did not come back and ask for data from the RASed row.

> However, there are many commands, many more than in the 16kx1 DRAMs of
> old. What would make them go in the direction of simplifying the
> interface?

Pins that are less expensive.

> The hardest part these days seems to be getting the high
> transfer rates to work, the rest of the interface is probably
> comparatively easy.

This is from DDR4 and onward where one has to control drive strength
and clock edge offsets (with a DLL) to transfer data that fast.

> - anton

EricP

unread,
Feb 6, 2024, 4:41:28 PMFeb 6
to
Anton Ertl wrote:
> mitch...@aol.com (MitchAlsup1) writes:
>> Anton Ertl wrote:
>>
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
>>>> So having a counter for each row is impractical.
>>> A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
>> You are comparing a 16-bit incrementor and its associated flip-flop
>> with a single transistor divided by the number of them in a word.
>
> I was thinking about counting each access only when the cache line is
> accessed. Then there needs to be only one incrementor per bank, and
> the counter can be stored in DRAM like the payload data.

Dram row reads are destructive so a single row activate command
internally has three cycles: read, sense and redrive, restore.

The counter could be stored in the dram cells and the
N-bit incrementer integrated into the bit line sense amp latches,
such that when the activate command does its restore cycle
it writes back the incremented counter.
The incremented counter would also be available in the row buffer.

Since the next precharge can't happen for 40-50 ns we have some
time to decide what to do next.

> But thinking about it again, I wonder how counters would be reset.
> Maybe, when the counter reaches the limit, all lines in its blast
> radius are refreshed, and the counter of the present line is reset to
> 0.

On a row read if the counter hits its threshold limit the restore
cycle writes back a count of 0, otherwise the incremented counter.

The problem is with the +-4 blast radius refreshes. Each of those refreshes
ages its neighbors which we need to track, so we can't reset those counters.
This could cause a write amplification where each refresh repeatedly
triggers 4 more refreshes.

It is possible to use the counter as a state machine.
Something like...
1) For normal, periodic refreshes set count to some initial value.
2) For reads increment count and if carry-out then reset to initial value
and schedule immediate blast refresh of +-4 neighbor rows.
3) For blast row refresh increment count but don't check for overflow.
If there is a count overflow it gets detected on its next row read.
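
In software terms that state machine per row looks something like this
(10-bit counter and a ~600-read budget, i.e. the numbers from earlier in
the thread; the encoding of the initial value is my guess):

#include <cstdint>

enum class RowEvent { PeriodicRefresh, Read, BlastRefresh };

// Sketch only: in the DRAM the counter sits in the row itself and the
// increment happens in the sense-amp latches during the restore cycle.
struct RowCounterFSM
{
    static constexpr unsigned kBits = 10;
    static constexpr uint16_t kInit = (1u << kBits) - 600;  // carry out after ~600 reads

    // Returns true if a blast refresh of the +-4 neighbours should be scheduled.
    static bool step(uint16_t &count, RowEvent ev)
    {
        switch (ev)
        {
        case RowEvent::PeriodicRefresh:         // 1) reset on the normal refresh
            count = kInit;
            return false;
        case RowEvent::Read:                    // 2) count and detect carry-out
            if (++count >= (1u << kBits))
            {
                count = kInit;
                return true;                    // schedule the blast refresh
            }
            return false;
        case RowEvent::BlastRefresh:            // 3) count but don't check overflow;
            ++count;                            //    caught on the next Read
            return false;
        }
        return false;
    }
};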

> Another disadvantage would be that we have to make decisions about
> possible rowhammering only based on one counter, and have to trigger
> refreshes of all lines in the blast radius based on worst-case
> scenarios (i.e., assuming that other rows in the blast radius have any
> count up to the limit).

Yes, unless there is a way to infer the total counts for the neighbors.
Bloom filter?
But see below.

> Both disadvantages lead to far more refreshes than necessary to
> prevent Rowhammer, but that approach may still be good enough.

Let's see how bad this is.

The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms resetting all the counters
so the counts are not cumulative.

That overhead is only going to grow as dram density increases.



MitchAlsup1

unread,
Feb 10, 2024, 6:20:47 PMFeb 10
to
EricP wrote:

> Anton Ertl wrote:
>> mitch...@aol.com (MitchAlsup1) writes:
>>> Anton Ertl wrote:

>> Both disadvantages lead to far more refreshes than necessary to
>> prevent Rowhammer, but that approach may still be good enough.

Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?

> Let's see how bad this is.

> The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
> That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
> And the whole dram is refreshed every 64 ms resetting all the counters
> so the counts are not cumulative.

I think what RowPress tells us is that waiting ~60 ms and then refreshing every row
is worse for data retention than spreading the refreshes out rather evenly over the
64 ms max interval.

> That overhead is only going to grow as dram density increases.

So are all the attack vectors.

MitchAlsup1

unread,
Feb 10, 2024, 6:25:48 PMFeb 10
to
Anton Ertl wrote:

> EricP <ThatWould...@thevillage.com> writes:
>>Anton Ertl wrote:
>>> A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
>>> Admittedly, if you just update the counter for a specific row and then
>>> refresh all rows in the blast radius when a limit is reached, you
>>> may get many more refreshes than the minimum necessary, but given that
>>> normal programs usually do not hammer specific row ranges, the
>>> additional refreshes may still be relatively few in non-attack
>>> situations (and when being attacked, you prefer lower DRAM performance
>>> to a successful attack).
>>
>>They said that the current threshold for causing flips in an immediate
>>neighbor is 4800 activations, but with a blast radius of +-4 that
>>can be in any of the 8 neighbors, so your counter threshold will have
>>to trigger refresh at 1/8 of that level or every 600 activations.

> So only 10 bits of counter are necessary, reducing the overhead to
> 0.125%:-).

>>And as the dram features get smaller that threshold number will go down
>>and probably the blast radius will go up. So this could have scaling
>>issues in the future.

> Yes.

If the DRAM manufacturers placed a Faraday shield over the DRAM arrays
{a ground plane} the blast radius goes from a linear charge sharing issue
to a quadratic charge sharing issue. Such a ground plane is a layer of
metal with a single <never changing> voltage on it. This might change the
blast radius from 8 to 2.

{{We did this kind of thing for SRAM so we could run large signal count
busses over the SRAM arrays.}}

> - anton

Anton Ertl

unread,
Feb 11, 2024, 8:34:08 AMFeb 11
to
mitch...@aol.com (MitchAlsup1) writes:
>EricP wrote:
>
>> Anton Ertl wrote:
>>> mitch...@aol.com (MitchAlsup1) writes:
>>>> Anton Ertl wrote:
>
>>> Both disadvantages lead to far more refreshes than necessary to
>>> prevent Rowhammer, but that approach may still be good enough.
>
>Would you rather have a few more refreshes or a few more ECC repairs ?!?
>with the potential for a few ECC repair fails ?!!?

That's not the issue at hand here. The issue at hand here is whether
the relatively cheap mechanism I described has an acceptable number of
additional refreshes during normal operation, or whether a more
expensive (in terms of area) mechanism is needed to fix Rowhammer.

Concerning ECC, many computers do not have ECC memory, and for those
that have it, ECC does not reliably fix Rowhammer; if it did, the fix
would be simple: Use ECC, which is a good idea anyway, even if it
costs 25% more chips in case of DDR5 DIMMs.

EricP

unread,
Feb 11, 2024, 10:46:30 AMFeb 11
to
MitchAlsup1 wrote:
> EricP wrote:
>
>> Anton Ertl wrote:
>>> mitch...@aol.com (MitchAlsup1) writes:
>>>> Anton Ertl wrote:
>
>>> Both disadvantages lead to far more refreshes than necessary to
>>> prevent Rowhammer, but that approach may still be good enough.
>
> Would you rather have a few more refreshes or a few more ECC repairs ?!?
> with the potential for a few ECC repair fails ?!!?

I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.

>> Let's see how bad this is.
>
>> The single line threshold of 4800 and blast radius of 8 = 600 trigger
>> count.
>> That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
>> And the whole dram is refreshed every 64 ms resetting all the counters
>> so the counts are not cumulative.
>
> I think what RowPress tells us is that waiting ~60 ms and then refreshing
> every row is worse for data retention than spreading the refreshes out
> rather evenly over the 64 ms max interval.

Would any memory controller actually do that, refreshing the whole dram
in one big burst instead of periodically by row?
I would expect doing so would introduce big stalls into memory access.

64 ms / 8192 rows per block = 7.8125 us row interval.
Let's say 50 ns row refresh time.
So that's either 50 ns every 7.8 us
versus 8192*50 ns = 409.6 us memory stall every 64 ms.
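
Same arithmetic as a throwaway program, nothing measured, just the numbers above:

#include <cstdio>

// Distributed refresh (one row every interval) versus refreshing the
// whole 8192-row block in one burst every 64 ms, at 50 ns per row.
int main()
{
    const double window_ms = 64.0;
    const int    rows      = 8192;
    const double row_ns    = 50.0;

    double interval_us = window_ms * 1000.0 / rows;   // 7.8125 us between rows
    double burst_us    = rows * row_ns / 1000.0;      // 409.6 us in one stall

    printf("distributed: %.0f ns stall every %.4f us\n", row_ns, interval_us);
    printf("burst:       %.1f us stall every %.0f ms\n", burst_us, window_ms);
    return 0;
}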



MitchAlsup1

unread,
Feb 11, 2024, 3:00:49 PMFeb 11
to
My DRAM controller (Opteron RevF) had a timer set to about 7µs and if the
bank was active it would allow REF to slip. But on a second timer event
it would interrupt data transfer and induce 2 refreshes to catch up. In
general, this worked well as it almost never happened.

> Let's say 50 ns row refresh time.
> So that's either 50 ns every 7.8 us

A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.

When one changes page boundaries the HoB address bits are essentially
randomized by the TLB:: why not just close the row at that point ?

Michael S

unread,
Feb 12, 2024, 10:28:12 AMFeb 12
to
On Sun, 11 Feb 2024 19:57:34 +0000
mitch...@aol.com (MitchAlsup1) wrote:

A DDR5 channel is 32 bits wide.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or 0.204 usec at a more realistic rate of 5e9 T/s.

> When one changes page boundaries the HoB address bits are essentially
> randomized by the TLB:: why not just close the row at that point ?
>

Because the memory controller is not aware of CPU page boundaries.
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.

Scott Lurndal

unread,
Feb 12, 2024, 3:27:18 PMFeb 12
to
AArch64 supports translation granules of 4k, 16k and 64k. 4K
and 64K are the most common. While the architecture defines
16k, an implementation is free to not support it and I'm not aware of any
widespread usage.

Michael S

unread,
Feb 12, 2024, 4:12:55 PMFeb 12
to
I think, 16KB is the main page size on Apple. Android is trying the
same, but so far has problems.
Apple+Android == approximately 101% of AArch64 total.

MitchAlsup1

unread,
Feb 12, 2024, 5:45:45 PMFeb 12
to
Bits<19:12> changed. How hard is that to detect ??

> Besides, in aarch64 world 16KB pages are rather common. And in x86
> world "transparent huge pages" are rather common.

Neither of which prevent closing the row to avoid memory retention
issues.

Michael S

unread,
Feb 12, 2024, 6:15:35 PMFeb 12
to
Do you always answer one statement before reading the next statement?

> > Besides, in aarch64 world 16KB pages are rather common. And in x86
> > world "transparent huge pages" are rather common.
>
> Neither of which prevent closing the row to avoid memory retention
> issues.
>

What scenario of attack do you have in mind?
I would think that neither in "classic" multi-sided RowHammer nor in
RowPress does the attacker have to cross CPU page boundaries. If the
attacker happens to know that the memory controller likes to close DRAM
rows on a particular address boundary, then he can easily avoid accessing
the last cache line before that boundary.

BTW, all these attacks (or should I say, all these PoCs, because I don't
think anybody has ever caught a real RH/RP attack launched by a real bad
guy) rather heavily depend on big or huge pages. They are close to
impossible with small pages, even when "small" means 16 KB rather than
4 KB.

MitchAlsup1

unread,
Feb 12, 2024, 7:20:46 PMFeb 12
to
I actually wrote the above after writing the below.

>> > Besides, in aarch64 world 16KB pages are rather common. And in x86
>> > world "transparent huge pages" are rather common.
>>
>> Neither of which prevent closing the row to avoid memory retention
>> issues.
>>

> What scenario of attack do you have in mind?

RowPress depends on keeping the row open too long--clearly evident in the
charts in the document.

> I would think that neither in "classic" multi-sided RowHammer nor in
> RowPress does the attacker have to cross CPU page boundaries. If the
> attacker happens to know that the memory controller likes to close DRAM
> rows on a particular address boundary, then he can easily avoid accessing
> the last cache line before that boundary.

RowHammer depends on closing the row too often.

Performance (single CPU) depends on allowing the open row to service
several pending requests streaming data at CAS access speeds.

There is a balance to be found between preventing RowHammer from opening
nearby rows too often and preventing RowPress from holding them
open for too long.

I happen to think (without evidence beyond that of the RowPress document)
that the balance is distributing refreshes evenly across the refresh
interval (as evidenced in the charts in the RowPress document). It ends up
that with modern DDR this enables about 4096 bytes to be read/written
to a row before closing it (within a factor of 2-4).

Michael S

unread,
Feb 13, 2024, 10:19:29 AMFeb 13
to
On Tue, 13 Feb 2024 00:19:18 +0000
mitch...@aol.com (MitchAlsup1) wrote:
Clarification for casual observers that didn't bother to read Row Press
paper: RowPress attack does not depend on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of attack
significantly increasing BER for a given duration of attack. After
all, RowPress *is* a variant of RowHammer.
For a given interruption rate, longer interruptions reduce effectiveness
of attack, but not dramatically so. For example, for most practically
important interruption rate of 128 KHz (period=7.81 usec) increasing
duration of off interval from absolute minimum allowed by protocol
(~50ns) to 2 usec reduces efficiency of attack only by a factor of 2 or 3.


> > I would think that neither in "classic" multi-sided RowHammer nor in
> > RowPress does the attacker have to cross CPU page boundaries. If the
> > attacker happens to know that the memory controller likes to close
> > DRAM rows on a particular address boundary, then he can easily avoid
> > accessing the last cache line before that boundary.
>
> RowHammer depends on closing the row too often.
>

Yes, except that it is unknown whether the major RH impact is caused by
closing the row or by opening it. The latter is more likely. But since
the rate of opening and closing is the same, this finer difference is
not important.

> Performance (single CPU) depends on allowing the open row to service
> several pending requests streaming data at CAS access speeds.
>
> There is a balance to be found between preventing RowHammer from opening
> nearby rows too often and preventing RowPress from holding them
> open for too long.
>

There is no balance. Opening nearby rows too often helps both variants
of attack.

> I happen to think (without evidence beyond that of the RowPress
> document) that the balance is distributing refreshes evenly across
> the refresh interval (as evidenced in the charts in the RowPress
> document). It ends up that with modern DDR this enables about 4096
> bytes to be read/written to a row before closing it (within a factor
> of 2-4).
>

Huh?
A DDR4-3200 channel transfers data at a rate approaching 25.6 GB/s. DDR5
will be the same when it reaches its projected maximum speed of 6400.
25.6 GB/s * 7.81 usec = 200,000 bytes. That's a factor of 49 rather than
2-4.







EricP

unread,
Feb 13, 2024, 11:24:57 AMFeb 13
to
Michael S wrote:
> On Tue, 13 Feb 2024 00:19:18 +0000
> mitch...@aol.com (MitchAlsup1) wrote:
>> RowPress depends on keeping the row open too long--clearly evident in
>> the charts in the document.
>>
>
> Clarification for casual observers that didn't bother to read Row Press
> paper: RowPress attack does not depend on keeping row open
> continuously.
> Short interruptions actually greatly improve effectiveness of attack
> significantly increasing BER for a given duration of attack. After
> all, RowPress *is* a variant of RowHammer.

RowPress documents that keeping the aggressor row open longer lowers
the number of opens (RowHammers) needed before adjacent rows suffer bit flips.
Also the paper notes that DRAM manufacturers, eg Micron and Samsung,
already document that keeping a row open longer can cause read-disturbance.
What's new is the paper documents the interaction between row activation
time and the subsequent number of opens (RowHammers) needed to flip a bit.

Also note that different bits are susceptible to RowPress and RowHammer.
See section 4.3

RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

"RowPress breaks memory isolation by keeping a DRAM row open for a long
period of time, which disturbs physically nearby rows enough to cause
bitflips. We show that RowPress amplifies DRAM’s vulnerability to
read-disturb attacks by significantly reducing the number of row
activations needed to induce a bitflip by one to two orders of
magnitude under realistic conditions. In extreme cases, RowPress induces
bitflips in a DRAM row when an adjacent row is activated only once."

"We show that keeping a DRAM row (i.e., aggressor row) open for a long
period of time (i.e., a large aggressor row ON time, tAggON) disturbs
physically nearby DRAM rows. Doing so induces bitflips in the victim row
without requiring (tens of) thousands of activations to the aggressor row."

> For a given interruption rate, longer interruptions reduce effectiveness
> of attack, but not dramatically so. For example, for most practically
> important interruption rate of 128 KHz (period=7.81 usec) increasing
> duration of off interval from absolute minimum allowed by protocol
> (~50ns) to 2 usec reduces efficiency of attack only by a factor of 2 or 3.

Reduced by a factor of up to 363. Under figure 1.

"We observe that as tAggON increases, compared to the most effective
RowHammer pattern, the most effective Row-Press pattern reduces ACmin
1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).

Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON increases."

Michael S

unread,
Feb 13, 2024, 12:00:44 PMFeb 13
to
On Tue, 13 Feb 2024 11:24:10 -0500
EricP <ThatWould...@thevillage.com> wrote:

> Michael S wrote:
> > On Tue, 13 Feb 2024 00:19:18 +0000
> > mitch...@aol.com (MitchAlsup1) wrote:
> >> RowPress depends on keeping the row open too long--clearly evident
> >> in the charts in the document.
> >>
> >
> > Clarification for casual observers that didn't bother to read Row
> > Press paper: RowPress attack does not depend on keeping row open
> > continuously.
> > Short interruptions actually greatly improve effectiveness of attack
> > significantly increasing BER for a given duration of attack. After
> > all, RowPress *is* a variant of RowHammer.
>
> RowPress documents that keeping the aggressor row open longer lowers
> the number of opens (RowHammers) needed before adjacent rows suffer bit
> flips.

Correct, but irrelevant.

> Also the paper notes that DRAM manufacturers, eg Micron and
> Samsung, already document that keeping a row open longer can cause
> read-disturbance. What's new is the paper documents the interaction
> between row activation time and the subsequent number of opens
> (RowHammers) needed to flip a bit.
>

Correct and relevant, but not to the issue at hand which is criticism
of Mitch's ideas of mitigation.
ACmin by itself is a wrong measure of efficiency of attack.
The right measure is the reciprocal of the total duration of the attack.
At any given duty cycle, the reciprocal of the total duration of the attack
grows with an increased rate of interruptions (a.k.a. hammering rate).
The general trend is the same as for all other RH variants, the only
difference being that the dependency on hammering rate is somewhat weaker.

Relatively weak influence of duty cycle itself is shown in figure 22.

The practical significance of RowPress is due to two factors.
(1) is the one you mentioned above - it can flip
different bits from those flippable by other RH variants.
(2) is that it is not affected at all by the DDR4 TRR
mitigation attempt.

The third, less important factor is that RowPress appears quite robust
to differences between major manufacturers.

However, one should not overlook that the efficiency of RowPress attacks,
when measured by the most important criterion of BER per duration of
attack, is many times lower than that of the earlier double-sided and
multi-sided hammering techniques.


EricP

unread,
Feb 13, 2024, 12:05:57 PMFeb 13
to
Michael S wrote:
> On Tue, 13 Feb 2024 00:19:18 +0000
> mitch...@aol.com (MitchAlsup1) wrote:
>> RowHammer depends on closing the row too often.
>
> Yes, except that it is unknown whether the major RH impact is caused by
> closing the row or by opening it. The latter is more likely. But since
> the rate of opening and closing is the same, this finer difference is
> not important.

A Deeper Look into RowHammer's Sensitivities: Experimental Analysis
of Real DRAM Chips and Implications on Future Attacks and Defenses, 2021
https://arxiv.org/pdf/2110.10291

That paper pre-dates the RowPress one and notes:

"6.1 Impact of Aggressor Row’s On-Time

Obsv. 8. As the aggressor row stays active longer (i.e., tAggON increases),
more DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at lower hammer counts."

Obsv. 9. RowHammer vulnerability consistently worsens as tAggON
increases in DRAM chips from all four manufacturers.

6.2 Impact of Aggressor Row’s Off-Time

Obsv. 10. As the bank stays precharged longer (i.e., tAggOFF increases),
fewer DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at higher hammer counts.

Obsv. 11. RowHammer vulnerability consistently reduces as
tAggOFF increases in DRAM chips from all four manufacturers."






Michael S

unread,
Feb 14, 2024, 3:50:40 AMFeb 14
to
novaBBS is not updating since yesterday, so Mitch is not aware of
our latest posts.

EricP

unread,
Feb 14, 2024, 10:53:07 AMFeb 14
to
Michael S wrote:
> On Tue, 13 Feb 2024 11:24:10 -0500
> EricP <ThatWould...@thevillage.com> wrote:
>> Michael S wrote:
>>> On Tue, 13 Feb 2024 00:19:18 +0000
>>> mitch...@aol.com (MitchAlsup1) wrote:
>>>> RowPress depends on keeping the row open too long--clearly evident
>>>> in the charts in the document.
>>>>
>>> Clarification for casual observers that didn't bother to read Row
>>> Press paper: RowPress attack does not depend on keeping row open
>>> continuously.
>>> Short interruptions actually greatly improve effectiveness of attack
>>> significantly increasing BER for a given duration of attack. After
>>> all, RowPress *is* a variant of RowHammer.
>> RowPress documents that keeping the aggressor row open longer lowers
>> the number of opens (RowHammers) needed before adjacent rows suffer bit
>> flips.
>
> Correct, but irrelevant.

It was kinda the whole point of the RowPress paper.

>> Also the paper notes that DRAM manufacturers, eg Micron and
>> Samsung, already document that keeping a row open longer can cause
>> read-disturbance. What's new is the paper documents the interaction
>> between row activation time and the subsequent number of opens
>> (RowHammers) needed to flip a bit.
>>
>
> Correct and relevant, but not to the issue at hand which is criticism
> of Mitch's ideas of mitigation.
>
>> Also note that different bits are susceptible to RowPress and
>> RowHammer. See section 4.3
>>
>> RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
>> https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

I just found out that there are two different versions of the RowPress
paper and I was looking at the older one. The updated version is:

RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://arxiv.org/pdf/2306.17061.pdf

>>> For a given interruption rate, longer interruptions reduce
>>> effectiveness of attack, but not dramatically so. For example, for
>>> most practically important interruption rate of 128 KHz
>>> (period=7.81 usec) increasing duration of off interval from
>>> absolute minimum allowed by protocol (~50ns) to 2 usec reduces
>>> efficiency of attack only by a factor of 2 or 3.
>> Reduced by a factor of up to 363. Under figure 1.
>>
>> "We observe that as tAggON increases, compared to the most effective
>> RowHammer pattern, the most effective Row-Press pattern reduces ACmin
>> 1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
>> refresh interval (7.8 μs),
>> 2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
>> the maximum allowed tAggON, and
>> 3) down to only one activation for an extreme tAggON of 30 ms
>> (highlighted by dashed red boxes).
>>
>> Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
>> increases."
>>
>
> ACmin by itself is a wrong measure of efficiency of attack.

I'm not interested in the efficiency of the attack.
ACmin, the minimum absolute count of opens above which we lose data
is the number I'm interested in.

> The right measure is the reciprocal of the total duration of the attack.
> At any given duty cycle, the reciprocal of the total duration of the attack
> grows with an increased rate of interruptions (a.k.a. hammering rate).
> The general trend is the same as for all other RH variants, the only
> difference being that the dependency on hammering rate is somewhat weaker.
>
> Relatively weak influence of duty cycle itself is shown in figure 22.

Looking at figure 22 on the arxiv version of the paper,
this is a completely different test. This test was to explain the
discrepancy between the RowPress results and the earlier cited papers.

BER is the fraction of DRAM cells in a DRAM row that experience bitflips.
It's a different measure because RowPress detects when ANY data loss begins,
not the fraction of lost data bits (efficiency) after it kicks in.

Obsv 16 explains it: the BER for the bottom two lines,
which are the ones with a long total tA2A, goes up in all graphs
by between a factor of 10 and about 500, which is the RowPress effect.

To my eye what this test shows is the PRE phase may *heal* some of the
damaging effects that the ACT phase causes, but only to a certain point.
Possibly the PRE phase scavenges the ACT hot injection carriers.

> The practical significance of RowPress is due to two factors.
> (1) is the one you mentioned above - it can flip
> different bits from those flippable by other RH variants.
> (2) is that it is not affected at all by the DDR4 TRR
> mitigation attempt.

I take away something completely different: there are multiple interacting
error mechanisms at work here. RowHammer and RowPress are likely
completely different physics and fixing one won't fix the other.

It also suggests there may be other similar mechanisms waiting to be found.

> The third, less important factor is that RowPress appears quite robust
> to differences between major manufacturers.
>
> However, one should not overlook that the efficiency of RowPress attacks,
> when measured by the most important criterion of BER per duration of
> attack, is many times lower than that of the earlier double-sided and
> multi-sided hammering techniques.

For me the BER is irrelevant if it is above 0.0.
I want to know where the errors start which is ACmin.


Michael S

unread,
Feb 14, 2024, 12:46:50 PMFeb 14
to
On Wed, 14 Feb 2024 10:51:47 -0500
EricP <ThatWould...@thevillage.com> wrote:
You may be interested, but I don't understand why.
For me, the important thing is how much time it takes until the probability
of a flip becomes significant.
Suppose, attack (A) hammers at 5 MHz and has ACmin=5e4. Attack (B)
hammers at 0.13 MHz (typical for RP in real-world setup) and has
ACmin=3e3.
Then I'd say that attack (A) is 2.3 times more dangerous.
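
Spelled out, with the same made-up numbers:

#include <cstdio>

// Time to accumulate ACmin activations at a given hammering rate,
// i.e. how long until the first flip becomes likely.
static double timeToACminMs(double acmin, double rateHz)
{
    return acmin / rateHz * 1e3;
}

int main()
{
    double a = timeToACminMs(5e4, 5.0e6);    // attack (A):  10 ms
    double b = timeToACminMs(3e3, 0.13e6);   // attack (B): ~23 ms
    printf("A: %.1f ms, B: %.1f ms, ratio %.1f\n", a, b, b / a);
    return 0;
}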

Back in the real world, researchers demonstrated that multi-sided
hammering can have an ACmin that is significantly lower than our imaginary
attack (A), so the only remaining question is how fast we can hammer
without triggering TRR. My 5 MHz number is probably hard to achieve for an
attacker, but 2-3 MHz sounds doable.
Different, like coupling in different frequency bands - yes.
But both are caused by insufficient isolation.

> It also suggests there may be other similar mechanisms waiting to be
> found.
>
> > The third, less important factor is that RowPress appears quite
> > robust to differences between major manufacturers.
> >
> > However, one should not overlook that efficiency of RowPress attacks
> > when measured by the most important criterion of BER per duration of
> > attack is many times lower than earlier techniques of double-sided
> > and multi-sided hammering.
>
> For me the BER is irrelevant if it is above 0.0.
> I want to know where the errors start which is ACmin.
>

So, call it time to first flip. The principle is the same.
Still, MSRH causes harm faster than RP.



EricP

unread,
Feb 15, 2024, 6:27:46 PMFeb 15
to
Michael S wrote:
> On Wed, 14 Feb 2024 10:51:47 -0500
> EricP <ThatWould...@thevillage.com> wrote:
>> Michael S wrote:
>>> On Tue, 13 Feb 2024 11:24:10 -0500
>>> EricP <ThatWould...@thevillage.com> wrote:
>>>>
>>>> "We observe that as tAggON increases, compared to the most
>>>> effective RowHammer pattern, the most effective Row-Press pattern
>>>> reduces ACmin 1) by 17.6× on average (up to 40.7×) when tAggON is
>>>> as large as the refresh interval (7.8 μs),
>>>> 2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
>>>> the maximum allowed tAggON, and
>>>> 3) down to only one activation for an extreme tAggON of 30 ms
>>>> (highlighted by dashed red boxes).
>>>>
>>>> Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
>>>> increases."
>>>>
>>> ACmin by itself is a wrong measure of efficiency of attack.
>> I'm not interested in the efficiency of the attack.
>> ACmin, the minimum absolute count of opens above which we lose data
>> is the number I'm interested in.
>
> You may be interested, but I don't understand why.
> For me, the important thing is how much time it takes until the probability
> of a flip becomes significant.

Because in terms of designing a memory controller
*any* bit flip due to RH or RP is unacceptable.
After RH/RP starts to inject errors the rate it does so
doesn't matter because the memory bank is corrupt.

> Suppose, attack (A) hammers at 5 MHz and has ACmin=5e4. Attack (B)
> hammers at 0.13 MHz (typical for RP in real-world setup) and has
> ACmin=3e3.
> Then I'd say that attack (A) is 2.3 times more dangerous.

That can tell you that dram A is more susceptible to a RH attack than B.

But what matters to a dram controller is whether ACmin opens can be
reached inside the refresh interval of 64 ms. After that minimum is reached,
how fast it corrupts memory in flips/sec is irrelevant since the number
of corrupted bits is more than are correctable by SECDED.

> Back in the real world, researchers demonstrated that multi-sided
> hammering can have an ACmin that is significantly lower than our imaginary
> attack (A), so the only remaining question is how fast we can hammer
> without triggering TRR. My 5 MHz number is probably hard to achieve for an
> attacker, but 2-3 MHz sounds doable.

The RowHammer fix, Target Row Refresh (TRR), is triggered when the Maximum
Activate Count (MAC) is reached within the Maximum Activate Window time (tMAW).
It doesn't matter how opens are distributed in time within tMAW.
It looks like tMAW is the chip refresh interval of 64 ms.
When MAC is reached TRR must immediately refresh the two adjacent rows.

The problem with TRR is that the controller (presumably) reads the
MAC and tMAW values from the DRAM configuration registers.
However RowPress shows that holding a row open greatly lowers the MAC
trigger level, bypassing the TRR fix.

Also as Blaster shows, TRR refreshing the two adjacent rows is not enough.
It would need to refresh +-4 rows, and that would also further divide
the MAC trigger limit by 8.
I'm just guessing, based on the reports that different bits are affected
for RH and RP, and that RH flips 0's to 1's while RP flips 1's to 0's.
But I don't think they have had time to look at the details for RP yet.
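
As a point of reference, the TRR rule as described above amounts to roughly
this; the MAC value and the end-of-window reset are my reading of it, not
text from any spec:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral sketch: count ACTs per row, and when a row reaches MAC
// refresh its two immediate neighbours and restart its count. As noted
// above, this says nothing about RowPress or a wider blast radius.
class TrrModel
{
    std::vector<uint32_t> acts;
    uint32_t mac;

public:
    TrrModel(size_t rows, uint32_t macLimit) : acts(rows, 0), mac(macLimit) {}

    // Called on every ACTivate; returns the rows TRR would refresh now.
    std::vector<size_t> onActivate(size_t row)
    {
        std::vector<size_t> refresh;
        if (++acts[row] >= mac)
        {
            if (row > 0)               refresh.push_back(row - 1);
            if (row + 1 < acts.size()) refresh.push_back(row + 1);
            acts[row] = 0;
        }
        return refresh;
    }

    // Called at the end of each tMAW window (e.g. every 64 ms).
    void onWindowEnd() { std::fill(acts.begin(), acts.end(), 0); }
};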