FENCE.I Implementation


Bill Huffman

Jul 13, 2018, 4:23:07 PM7/13/18
to RISC-V ISA Dev
I'm thinking about possible implementations of FENCE.I. In an
implementation with a non-coherent I-Cache and where interrupt latency
matters, it seems that to implement FENCE.I, one either needs to put the
I-Cache valid bits in flops for instant clear or one needs to restart
the instruction after an interrupt. The second choice could become
arbitrarily slow if interrupt density is high. Of course the first
choice is somewhat costly in power and area.

I realize that better cacheops can be defined, but that doesn't relieve
the designer of supporting FENCE.I and it doesn't keep it from being
used in code. Does anyone have suggestions?

Bill


Andrew Waterman

Jul 13, 2018, 6:11:41 PM7/13/18
to Bill Huffman, RISC-V ISA Dev
The designs I've worked on that have incoherent I$ have gone the route
of storing the valid bits in flops and flash-clearing them.

Cesar Eduardo Barros

Jul 13, 2018, 7:02:01 PM7/13/18
to Bill Huffman, RISC-V ISA Dev
Do not change the valid bit; change the meaning of the valid bit. The
value in each cache entry does not have to change for it to no longer be
considered valid. It's similar to tightening a screw by keeping it fixed
with a screwdriver and rotating the whole world around it.

The way I would do something like this in software would be to have a
global sequence number. Every time I wanted to invalidate the whole
cache, I would increment the sequence number. Every cache entry would
get a copy of this global sequence number, and would be considered valid
only if its copy of the sequence number still matches the global
sequence number.

Of course, there's the issue of wraparound, especially in hardware where
the size of the sequence number must be limited. The trick to avoid it
is to consider that there's no problem with wraparound as long as the
copy of the sequence number has been overwritten with a newer value by
the time the global sequence number wraps around. So as long as the
sequence number has more possible values than there are cache lines, the
following pseudocode should work:

invalidate_cache:
    cache_line[invalidate_row].counter := global_counter
    increment(invalidate_row)
    increment(global_counter)

is_cache_valid:
    cache_line[index].counter == global_counter

(And invalidate_row might just be the lower bits of global_counter, to
save space.)

This overwrites the sequence number of a single cache line on every
step, while guaranteeing that the sequence numbers on the whole cache
have already been incremented by the time it wraps around.
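The scheme above can be tried out in software. The following Python sketch is illustrative only (class and method names such as `GenCache`, `fill`, and `invalidate_all` are my own, and `NUM_LINES`/`COUNTER_MOD` are arbitrary parameters); the one invariant it relies on from the post is that the counter has more possible values than there are cache lines:

```python
NUM_LINES = 8     # illustrative cache size
COUNTER_MOD = 16  # must exceed NUM_LINES so every stale stamp is
                  # rewritten before the counter wraps back to it

class GenCache:
    """Software sketch of the generation-counter invalidation scheme."""

    def __init__(self):
        self.global_counter = 0
        self.invalidate_row = 0
        # Start every line invalid: stamp a generation != global_counter.
        self.line_counter = [COUNTER_MOD - 1] * NUM_LINES

    def fill(self, index):
        # A refill marks the line valid by stamping the current generation.
        self.line_counter[index] = self.global_counter

    def is_valid(self, index):
        return self.line_counter[index] == self.global_counter

    def invalidate_all(self):
        # Constant-time whole-cache invalidation: bump the generation,
        # and re-stamp one line round-robin so every stale stamp is
        # refreshed within NUM_LINES calls -- before the counter wraps.
        self.line_counter[self.invalidate_row] = self.global_counter
        self.invalidate_row = (self.invalidate_row + 1) % NUM_LINES
        self.global_counter = (self.global_counter + 1) % COUNTER_MOD
```

A quick stress of the wraparound case: fill a line, then invalidate many more times than `COUNTER_MOD`; the line should never spuriously read valid again, because the round-robin re-stamp rewrites its counter before the generation can wrap back to its old value.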

You could make variations on the theme, like this one with a separate
valid bit and global_counter the size of the cache index (normal
operations on the cache only have to set the valid bit, which should be
simpler to implement):

invalidate_cache:
    cache_line[global_counter].valid := false
    cache_line[global_counter].counter := global_counter
    increment(global_counter)

is_cache_valid:
    cache_line[index].valid &&
    cache_line[index].counter == global_counter

set_cache_valid(valid):
    cache_line[index].valid := valid

Proving that this scheme works is left as an exercise for the reader.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Bill Huffman

Jul 13, 2018, 7:29:03 PM7/13/18
to Cesar Eduardo Barros, RISC-V ISA Dev
Thank you, Cesar. I thought about this as I've done it before for other
structures. But it seems more complex and I was hoping someone might
have something completely different to say, like that I misunderstood
the spec requirements. ;-)

Your suggestion is as fast as could be desired and requires a bit more
memory but not many extra gates. Tag memory probably only increases by
a single-digit percentage.

An in-between solution would be a flop per index, or per group of 4 or
8 indexes, with a per-line valid bit in memory in addition. All of a
group's memory valid bits are cleared at the time its flop is set.
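One way to read that in-between scheme: FENCE.I flash-sets one "pending clear" flop per group in a single cycle, and each group's memory valid bits are swept lazily the first time that group is touched afterward. A hedged Python sketch under that reading (all names, the group size, and the lazy-sweep placement are my own assumptions, not anything specified in the thread):

```python
NUM_LINES = 16
GROUP = 4  # lines covered by each "pending clear" flop (illustrative)

class HybridCache:
    """Sketch: per-group flash-clear flops plus per-line valid bits in RAM."""

    def __init__(self):
        self.group_pending = [False] * (NUM_LINES // GROUP)
        self.valid = [False] * NUM_LINES

    def _sweep(self, index):
        # Lazily apply a pending flash-clear to this line's group:
        # clear the group's RAM valid bits, then retire the flop.
        g = index // GROUP
        if self.group_pending[g]:
            for i in range(g * GROUP, (g + 1) * GROUP):
                self.valid[i] = False
            self.group_pending[g] = False

    def fence_i(self):
        # Instant whole-cache invalidation: just set every group flop.
        self.group_pending = [True] * (NUM_LINES // GROUP)

    def fill(self, index):
        self._sweep(index)
        self.valid[index] = True

    def is_valid(self, index):
        # Valid only if no clear is pending for the group AND the
        # line's own RAM valid bit is set.
        return (not self.group_pending[index // GROUP]) and self.valid[index]
```

The attraction is the same as in hardware: only `NUM_LINES / GROUP` flops need a single-cycle clear, while the per-line bits stay in dense RAM and are cleaned up a group at a time.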

Bill