[PATCH] aarch64: invalidate instruction cache after loading code into memory


Waldemar Kozaczuk

Dec 19, 2020, 11:43:03 PM
to osv...@googlegroups.com, Waldemar Kozaczuk
This is a quite small and simple patch, but it has taken me almost two months
of researching and understanding the problem and finding the right solution. It
has involved reading the ARMv8 programmer's guide, posting questions to ARM forums,
as well as trying to debug the problem mostly in trial-and-error fashion, as somewhat
documented in issue #1100. Special credit goes to Claudio
Fontana, who helped me tremendously by explaining and suggesting
many valuable ideas.

As issue #1100 explains, OSv would crash, occasionally or quite repeatedly
depending on the application, due to an unexpected Unknown Reason
class synchronous exception (EC=0). This would never happen in emulated
mode (QEMU with TCG), but it happened quite frequently on real ARM hardware like the RPI 4,
on QEMU with KVM or on Firecracker. Per the ARM documentation -
https://developer.arm.com/docs/ddi0595/h/aarch64-system-registers/esr_el1#ISS_exceptionswithanunknownreason
- there are many potential causes of an EC=0 exception, including "attempted execution
of an instruction bit pattern that has no allocated instruction", which
means trying to execute garbage.

None of those potential causes - which I quite meticulously researched,
examined, and discussed some of with Claudio - seemed to apply or make
much sense in the OSv context. Until one of them did, when I stumbled
across this article - https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/caches-and-self-modifying-code
- about "self-modifying code". Initially the article seemed to apply only to JIT-type scenarios,
but then I eventually noticed this small-font annotation:
"A more common (though less obvious) example is that of an operating
system kernel: from the point of view of the processor, some code in the
system is modifying some other code in the system every time a process
is swapped in or out." - and it made me realize that what the OSv
dynamic linker does is quite similar.

Then I eventually found this paragraph in the ARMv8 programmer's guide,
in chapter 11.5 "Cache maintenance":
"It is sometimes necessary for software to clean or invalidate a cache.
This might be required when the contents of external memory have been
changed and it is necessary to remove stale data from the cache. It can
also be required after MMU-related activity such as changing access
permissions, cache policies, or virtual to Physical Address mappings, or
when I and D-caches must be synchronized for dynamically generated code
such as JIT-compilers and dynamic library loaders."

In essence, the aarch64 architecture (Modified Harvard) defines separate instruction and data caches -
the I-cache and D-cache - so it is sometimes necessary to invalidate the instruction
cache after loading code into memory, which is exactly what the article
about self-modifying code explains. How does it apply to OSv?
Well, the OSv dynamic linker, being part of the kernel (code A), loads
application code (B) into memory. That by itself does not mean OSv
modifies its own kernel code, but it does dynamically load other code
and execute it in the same memory space.

Making this long story short, this patch modifies a critical part
of the OSv memory management code - populate_vma() - which gets called any time
a portion (page) of a vma is filled, whether due to a page fault or eagerly. It changes
populate_vma() to invalidate the instruction cache if the vma is executable
per its permissions - in essence, any time any code is loaded into
memory. To achieve this it delegates to a somewhat obscure built-in -
__builtin___clear_cache(). This logic is a no-op in the x86-64 port,
as that architecture has very strong automatic instruction/data cache
consistency and there is no need to do anything special as there is on aarch64.
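
For reference only (this is not part of the patch): on aarch64, __builtin___clear_cache()
expands to roughly the following maintenance sequence. This is a simplified sketch along
the lines of libgcc's __aarch64_sync_cache_range(); the real builtin additionally honors
the CTR_EL0.IDC/DIC bits that allow skipping parts of the sequence, so the builtin should
be used rather than hand-rolling this:

#include <cstddef>
#include <cstdint>

// Illustrative sketch only - clean D-cache lines to the Point of Unification,
// then invalidate the corresponding I-cache lines, with barriers in between.
static void sync_icache_range(char* begin, char* end)
{
    uint64_t ctr;
    asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
    size_t dline = 4 << ((ctr >> 16) & 0xf); // D-cache minimum line size in bytes
    size_t iline = 4 << (ctr & 0xf);         // I-cache minimum line size in bytes

    // Push the newly written code far enough down the cache hierarchy that an
    // instruction fetch missing in the I-cache is guaranteed to see it.
    for (uintptr_t p = (uintptr_t)begin & ~(uintptr_t)(dline - 1); p < (uintptr_t)end; p += dline)
        asm volatile("dc cvau, %0" :: "r"(p) : "memory");
    asm volatile("dsb ish" ::: "memory"); // wait for the cleans to complete

    // Invalidate the stale I-cache lines by VA to the Point of Unification.
    for (uintptr_t p = (uintptr_t)begin & ~(uintptr_t)(iline - 1); p < (uintptr_t)end; p += iline)
        asm volatile("ic ivau, %0" :: "r"(p) : "memory");
    asm volatile("dsb ish" ::: "memory"); // wait for the invalidations to complete
    asm volatile("isb" ::: "memory");     // re-synchronize this core's instruction stream
}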

Fixes #1100

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
arch/aarch64/mmu.cc | 19 +++++++++++++++++++
arch/x64/mmu.cc | 3 +++
core/mmu.cc | 4 ++++
include/osv/mmu.hh | 2 ++
4 files changed, 28 insertions(+)

diff --git a/arch/aarch64/mmu.cc b/arch/aarch64/mmu.cc
index dd8ef850..bc89701d 100644
--- a/arch/aarch64/mmu.cc
+++ b/arch/aarch64/mmu.cc
@@ -97,4 +97,23 @@ bool is_page_fault_write_exclusive(unsigned int esr) {
bool fast_sigsegv_check(uintptr_t addr, exception_frame* ef) {
return false;
}
+
+void vma_invalidate_cache(vma *vma, void *v, size_t size) {
+ // As aarch64 architecture defines separate instruction and data caches -
+ // I-cache and D-cache, it is sometimes necessary to invalidate instruction
+ // cache after loading code into memory. For more details of why and when
+ // it is necessary please read this excellent article -
+ // https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/caches-and-self-modifying-code.
+ //
+ // So when OSv dynamic linker, being part of kernel code, loads pages
+ // of executable sections of ELF segments into memory, we need to invalidate
+ // the I-cache area of that memory right before it gets executed.
+ // In essence any time part of vma with executable permission
+ // gets populated this function gets called from mmu.cc:populate_vma().
+ if (vma->perm() & perm_exec) {
+ // For more details about what this built-in does, please read this gcc documentation -
+ // https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
+ __builtin___clear_cache((char*)v, (char*)(v + size));
+ }
+}
}
diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 24da5caa..c923e4c0 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -191,4 +191,7 @@ bool fast_sigsegv_check(uintptr_t addr, exception_frame* ef)

return false;
}
+
+void vma_invalidate_cache(vma *vma, void *v, size_t size) {
+}
}
diff --git a/core/mmu.cc b/core/mmu.cc
index ff3fab47..10dd35e4 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -1206,6 +1206,10 @@ ulong populate_vma(vma *vma, void *v, size_t size, bool write = false)
vma->operate_range(populate_small<Account>(map, vma->perm(), write, vma->map_dirty()), v, size) :
vma->operate_range(populate<Account>(map, vma->perm(), write, vma->map_dirty()), v, size);

+ // On some architectures it might be necessary to invalidate CPU caches
+ // after the vma memory is populated with code
+ vma_invalidate_cache(vma, v, size);
+
return total;
}

diff --git a/include/osv/mmu.hh b/include/osv/mmu.hh
index 1830048c..87b83526 100644
--- a/include/osv/mmu.hh
+++ b/include/osv/mmu.hh
@@ -319,6 +319,8 @@ std::string procfs_maps();

unsigned long all_vmas_size();

+void vma_invalidate_cache(vma *vma, void *v, size_t size);
+
}

#endif /* MMU_HH */
--
2.28.0

Nadav Har'El

Dec 20, 2020, 11:44:48 AM
to Waldemar Kozaczuk, Osv Dev
Hi, very good job finding and solving this problem. I have some comments below, but also a question about whether this works correctly on SMP:
I don't think I understand how this is different from x86, which also has an instruction cache and a data cache...

Maybe the point is (and I'm just guessing, I haven't looked into these details in a very long time...) that on x86
these caches work on physical addresses, while on aarch64 they work on virtual addresses? Is this true?

Please look inline below for more comments.
I see a problem here, but maybe you can tell me why I'm wrong.
Presumably, this "cache invalidation" thing needs to happen on ALL cpus, not just this one. Because one CPU might have loaded the code, and a different CPU ran it.
We have exactly the same issue with the TLB, so we have an elaborate mechanism to ensure a TLB flush on all CPUs - and to do it "lazily" (see commit 7e38453 and my latest fix to this area, c9e5af6deef6ef9420bdf6437f8c34371c8df4e9 which might explain some things). Don't we need similar mechanisms also for these instruction cache flushes to run them on all processors?
 
+
     return total;
 }

diff --git a/include/osv/mmu.hh b/include/osv/mmu.hh
index 1830048c..87b83526 100644
--- a/include/osv/mmu.hh
+++ b/include/osv/mmu.hh
@@ -319,6 +319,8 @@ std::string procfs_maps();

 unsigned long all_vmas_size();
 
+void vma_invalidate_cache(vma *vma, void *v, size_t size);
+


Maybe you can add a comment about what "cache" this is supposed to invalidate. Maybe you should explain
(see above - I'm not sure my guess is really correct) that if the architecture caches something (instructions or
data) using virtual addresses, this function should invalidate the old cached information for this area.
 
 }

 #endif /* MMU_HH */
--
2.28.0


Waldek Kozaczuk

Dec 20, 2020, 5:12:44 PM
to OSv Development
Hi,

I do not think physical vs virtual is the difference. There are ARM instructions that let one invalidate or clean cache lines by specifying
a virtual address, though.

The real difference, I think (though I am not very familiar with x64), is that the x64 architecture guarantees that the data at address V, whether found in the instruction cache or the data cache (we mean here the L1/L2 and possibly L3 caches), is always consistent. In other words, it is impossible to have a different value at V in the data cache and the instruction cache. The data and instruction caches are strongly consistent - or at least this is transparent to the programmer. The TLB synchronization and all its side effects are a different story.

Now on ARM, it is possible that the value at V found in the data cache may be different (inconsistent) from what is (if anything) in the instruction cache, even at the L1 level, and programmers need to deal with that in some cases, such as self-modifying code or a dynamic linker. So on ARM, the data and instruction caches are NOT strongly consistent.

This is a very good point. But I think we do not need to run it on all cores, because the translation table entries specify cache shareability as the Inner Shareable domain. In other words, for all memory we map we set the shareability bits of the translation entries (see https://github.com/cloudius-systems/osv/blob/21440360f41803fc54460ae78b3d928837621e5d/arch/aarch64/arch-mmu.hh#L35-L41 and https://github.com/cloudius-systems/osv/blob/master/arch/aarch64/arch-mmu.hh#L160 in make_pte()).
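
To illustrate what "setting the shareability bits" means, here is a sketch of the relevant
stage-1 page descriptor fields per the ARMv8-A VMSA - the names are made up for this sketch
and are not the actual definitions in arch/aarch64/arch-mmu.hh:

#include <cstdint>

constexpr uint64_t PTE_VALID    = 1ull << 0;
constexpr uint64_t PTE_PAGE     = 1ull << 1;    // level-3 page descriptor
constexpr uint64_t PTE_SH_INNER = 0b11ull << 8; // SH[1:0] = Inner Shareable
constexpr uint64_t PTE_AF       = 1ull << 10;   // access flag

// With SH set to Inner Shareable, DC CVAU / IC IVAU by VA issued on one core
// are broadcast to the other cores of the same Inner Shareable domain.
inline uint64_t make_leaf_pte(uint64_t phys_page, unsigned mair_index)
{
    return phys_page
           | (uint64_t(mair_index) << 2) // AttrIndx[4:2] selects the MAIR_EL1 attribute
           | PTE_SH_INNER | PTE_AF | PTE_PAGE | PTE_VALID;
}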

The __clear_cache() builtin uses the following instructions for each cache line:

dc      cvau, x2 // Clean data cache line by VA in x2 to Point of Unification

ic      ivau, x0 // Invalidate instruction cache line by VA in x0 to Point of Unification


There is this table in ARM Cortex-A Series Programmer’s Guide for ARMv8-A

Table 14-1 Instructions with broadcast

IC IVAU, Xt      |    I-cache invalidate by address to Point of Unification | Maybe (b)

DC CVAU, Xt   |    D-cache clean by address to Point of Unification | Maybe (b)

b) Broadcast determined by shareability of memory region 

It is my understanding that both instructions, dc cvau and ic ivau, are broadcast between cores in the same Inner Shareable domain (same cluster) as long as the memory translation entries are set accordingly.

 
+
     return total;
 }

diff --git a/include/osv/mmu.hh b/include/osv/mmu.hh
index 1830048c..87b83526 100644
--- a/include/osv/mmu.hh
+++ b/include/osv/mmu.hh
@@ -319,6 +319,8 @@ std::string procfs_maps();

 unsigned long all_vmas_size();
 
+void vma_invalidate_cache(vma *vma, void *v, size_t size);
+


Maybe you can add a comment about what "cache" this is supposed to invalidate. Maybe you should explain
(see above - I'm not sure my guess is really correct) that if the architecture caches something (instructions or
data) using virtual addresses, this function should invalidate the old cached information for this area.
I will add some comments.

Nadav Har'El

Dec 21, 2020, 7:03:18 AM
to Waldek Kozaczuk, OSv Development
This is interesting. As I said, I'm not an expert (not even close) on these matters, but from the few minutes of reading I did, I think that maybe both issues are relevant:
  1. If you write to some memory containing code, your modification will appear in the dcache, but not in the icache, and execution won't see it.
  2. Additionally, the icache is virtually indexed (it seems the technical term is VIVT, or VIPT in newer ARM), so if a mapping maps new code into an old virtual address, the icache will not notice it, and a TLB flush is not enough (the icache is virtually indexed to avoid TLB lookups).
Maybe you're right that #1 is the more likely reason to see problems, but I think #2 is also possible, e.g., if you run one object, unmap it, and then map a different object at the same address.


+    // On some architectures it might be necessary to invalidate CPU caches
+    // after the vma memory is populated with code
+    vma_invalidate_cache(vma, v, size);

I see a problem here, but maybe you can tell me why I'm wrong.
Presumably, this "cache invalidation" thing needs to happen on ALL cpus, not just this one. Because one CPU might have loaded the code, and a different CPU ran it.
We have exactly the same issue with the TLB, so we have an elaborate mechanism to ensure a TLB flush on all CPUs - and to do it "lazily" (see commit 7e38453 and my latest fix to this area, c9e5af6deef6ef9420bdf6437f8c34371c8df4e9 which might explain some things). Don't we need similar mechanisms also for these instruction cache flushes to run them on all processors?
This is a very good point. But I think we do not need to run it on all cores, because the translation table entries specify cache shareability as the Inner Shareable domain. In other words, for all memory we map we set the shareability bits of the translation entries (see https://github.com/cloudius-systems/osv/blob/21440360f41803fc54460ae78b3d928837621e5d/arch/aarch64/arch-mmu.hh#L35-L41 and https://github.com/cloudius-systems/osv/blob/master/arch/aarch64/arch-mmu.hh#L160 in make_pte()).

The __clear_cache() builtin uses the following instructions for each cache line:

dc      cvau, x2 // Clean data cache line by VA in x2 to Point of Unification


By the way, if the point was to invalidate the *instruction* cache, why do you want to call this instruction which clears the data cache?
Wouldn't it be better not to rely on this obscure gcc builtin, and instead just call the "ic" instruction you mentioned below?

I see that there's apparently some need to call both, and also, call them in a loop (?), so maybe we shouldn't
reinvent clear_cache() unless we're sure what it does.

ic      ivau, x0 // Invalidate instruction cache line by VA in x0 to Point of Unification


There is this table in ARM Cortex-A Series Programmer’s Guide for ARMv8-A

Table 14-1 Instructions with broadcast

IC IVAU, Xt      |    I-cache invalidate by address to Point of Unification | Maybe (b)

DC CVAU, Xt   |    D-cache clean by address to Point of Unification | Maybe (b)

b) Broadcast determined by shareability of memory region 

It is my understanding that both instructions, dc cvau and ic ivau, are broadcast between cores in the same Inner Shareable domain (same cluster) as long as the memory translation entries are set accordingly.

Unfortunately, I'm not familiar enough (or at all...) with this instruction set to have any idea whether this makes sense.
It is very possible you are correct... If you know how to reproduce the bug on a single core, I wonder how difficult it is to reproduce on a multi-core, with one core loading the code and another core running it.

Waldek Kozaczuk

Dec 22, 2020, 10:44:19 AM
to OSv Development
Right.
  2. Additionally, the icache is virtually indexed (it seems the technical term is VIVT, or VIPT in newer ARM), so if a mapping maps new code into an old virtual address, the icache will not notice it, and a TLB flush is not enough (the icache is virtually indexed to avoid TLB lookups).
Yes, but calling vma_invalidate_cache() from populate_vma() should take care of this issue as well, because __clear_cache() would clean (flush) the D-cache and push the new code into memory, and invalidate the I-cache, thus forcing a fetch of the new code for those old virtual addresses before it gets executed.
Maybe you're right that #1 is the more likely reason to see problems, but I think #2 is also possible, e.g., if you run one object, unmap it, and then map a different object at the same address.
See above.


+    // On some architectures it might be necessary to invalidate CPU caches
+    // after the vma memory is populated with code
+    vma_invalidate_cache(vma, v, size);

I see a problem here, but maybe you can tell me why I'm wrong.
Presumably, this "cache invalidation" thing needs to happen on ALL cpus, not just this one. Because one CPU might have loaded the code, and a different CPU ran it.
We have exactly the same issue with the TLB, so we have an elaborate mechanism to ensure a TLB flush on all CPUs - and to do it "lazily" (see commit 7e38453 and my latest fix to this area, c9e5af6deef6ef9420bdf6437f8c34371c8df4e9 which might explain some things). Don't we need similar mechanisms also for these instruction cache flushes to run them on all processors?
This is a very good point. But I think we do not need to run it on all cores, because the translation table entries specify cache shareability as the Inner Shareable domain. In other words, for all memory we map we set the shareability bits of the translation entries (see https://github.com/cloudius-systems/osv/blob/21440360f41803fc54460ae78b3d928837621e5d/arch/aarch64/arch-mmu.hh#L35-L41 and https://github.com/cloudius-systems/osv/blob/master/arch/aarch64/arch-mmu.hh#L160 in make_pte()).

The __clear_cache() builtin uses the following instructions for each cache line:

dc      cvau, x2 // Clean data cache line by VA in x2 to Point of Unification


By the way, if the point was to invalidate the *instruction* cache, why do you want to call this instruction which clears the data cache?
Wouldn't it be better not to rely on this obscure gcc builtin, and instead just call the "ic" instruction you mentioned below?

I see that there's apparently some need to call both, and also, call them in a loop (?), so maybe we shouldn't
reinvent clear_cache() unless we're sure what it does.
I think both are relevant and necessary, as we need to force pushing the code (as data) from the D-cache into main memory and invalidate the I-cache so that it freshly fetches the loaded code from main memory.

ic      ivau, x0 // Invalidate instruction cache line by VA in x0 to Point of Unification


There is this table in ARM Cortex-A Series Programmer’s Guide for ARMv8-A

Table 14-1 Instructions with broadcast

IC IVAU, Xt      |    I-cache invalidate by address to Point of Unification | Maybe (b)

DC CVAU, Xt   |    D-cache clean by address to Point of Unification | Maybe (b)

b) Broadcast determined by shareability of memory region 

It is my understanding that both instructions, dc cvau and ic ivau, are broadcast between cores in the same Inner Shareable domain (same cluster) as long as the memory translation entries are set accordingly.

Unfortunately, I'm not familiar enough (or at all...) with this instruction set to have any idea whether this makes sense.
It is very possible you are correct... If you know how to reproduce the bug on a single core, I wonder how difficult it is to reproduce on a multi-core, with one core loading the code and another core running it.
So here is what I did, which is closest to your suggestions:

First I tried to artificially change elf.cc to force mmapping files with mmu::mmap_populate so that all code gets loaded eagerly on one cpu. And then I changed the app.cc code to pin the executing thread to the other cpu like this:

 {
+    auto next_cpu = (sched::cpu::current()->id + 1) % sched::cpus.size();
+    printf("^^^ pinning to cpu: %d\n", next_cpu);
+    sched::thread::pin(sched::cpus[next_cpu]);
     __libc_stack_end = __builtin_frame_address(0);
     //
     // Explicitly initialize the application ELF object which would have been

I could not reproduce the issue at all. BTW even when I removed the __clear_cache() call it would still run fine. It could be that eagerly loading all the code may by itself trigger cleaning the d-cache and invalidating the i-cache. Please note that normally most code is loaded lazily by page fault, and that is when the issue happens most of the time.

So I concluded that maybe this is a flawed experiment, as making the code execute on the other cpu (which presumably does not have anything in its i-cache for those virtual addresses) will force it to fetch the code into the i-cache from main memory. Or maybe the cpu cache interconnect broadcasting mechanism helps here.

So then I changed my experiment a bit. I still left the pinning code in place but reverted back to page-fault based populating of code, the reasoning being that one cpu (A) will page fault and load some code into memory, and possibly the same code will be executed by the other cpu (B).

Again I could not reproduce the issue.

I am not sure how conclusive these experiments are. Do you have other suggestions? 
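
One more idea, in case it helps: a hypothetical userspace reproducer (not something from the
experiments above) along the lines of "one cpu writes the code, another cpu runs it". It assumes
Linux-style mmap and pthread_setaffinity_np, so it would need adapting to run inside OSv:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

static uint32_t* code;

static void pin_self(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* writer(void*)
{
    pin_self(0); // write the code on cpu 0
    const uint32_t insns[] = { 0x52800540,   // mov w0, #42
                               0xd65f03c0 }; // ret
    memcpy(code, insns, sizeof(insns));
#ifdef WITH_FIX
    // With the cache maintenance, cpu 1 is guaranteed to fetch the new code;
    // without it, it may (rarely) execute stale instructions.
    __builtin___clear_cache((char*)code, (char*)code + sizeof(insns));
#endif
    return nullptr;
}

int main()
{
    pin_self(1); // execute the code on cpu 1
    code = (uint32_t*)mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pthread_t t;
    pthread_create(&t, nullptr, writer, nullptr);
    pthread_join(t, nullptr);
    printf("returned %d\n", ((int (*)())code)());
    return 0;
}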

Waldek Kozaczuk

Dec 22, 2020, 10:49:59 AM
to OSv Development
I could not reproduce the issue at all. BTW even when I removed the __clear_cache() call it would still run fine. It could be that eagerly loading all the code may by itself trigger cleaning the d-cache and invalidating the i-cache. Please note that normally most code is loaded lazily by page fault, and that is when the issue happens most of the time.
I take back the part about not being able to reproduce with mmu::mmap_populate - I could reproduce it, but it was much rarer.

Nadav Har'El

Dec 22, 2020, 3:22:38 PM
to Waldek Kozaczuk, OSv Development
On Tue, Dec 22, 2020 at 5:44 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
On Monday, December 21, 2020 at 7:03:18 AM UTC-5 Nadav Har'El wrote:

This is interesting. As I said, I'm not an expert (not even close) on these matters, but from the few minutes of reading I did, I think that maybe both issues are relevant:
  1. If you write to some memory containing code, your modification will appear in the dcache, but not in the icache, and execution won't see it.
Right.
  2. Additionally, the icache is virtually indexed (it seems the technical term is VIVT, or VIPT in newer ARM), so if a mapping maps new code into an old virtual address, the icache will not notice it, and a TLB flush is not enough (the icache is virtually indexed to avoid TLB lookups).
Yes, but calling vma_invalidate_cache() from populate_vma() should take care of this issue as well, because __clear_cache() would clean (flush) the D-cache and push the new code into memory, and invalidate the I-cache, thus forcing a fetch of the new code for those old virtual addresses before it gets executed.

Oh, right. I forgot that flushing the dcache is not just about *clearing* it, it's also about *writing* the dirty cache lines back into memory!
Now I understand better.

So I guess this __clear_cache() function does everything those docs say you should be doing, and is exactly the right function to use.
It's surprising that there is a gcc builtin that does such an ARM-specific thing. On x86, usually when there's something very x86-specific we need to do, we need to have some inline assembly.


Unfortunately, I'm not familiar enough (or at all...) with this instruction set to have any idea whether this makes sense.
It is very possible you are correct... If you know how to reproduce the bug on a single core, I wonder how difficult it is to reproduce on a multi-core, with one core loading the code and another core running it.

gives some sort of presentation of how these caches work, which suggests that perhaps (?) the flushed cache lines get written to the L2 cache, but not necessarily higher up, so processors in different NUMA nodes (or whatever they call them in ARM) may still have the wrong things in their icache.
 
So here is what I did which is closest to your suggestions:

Hmm, maybe this bug can't exist at all:
First, you need to have two cores, A and B, which do not share an L2 cache. You load code on processor A and then ask processor B to run it.
The code can't just be missing from processor B's view (we have memory ordering for that).
Moreover, if B can see the new data coming from main memory, it should (?) invalidate the icache itself. No?

So maybe we don't need to worry about remote-cpu icache flushes at all, and your code is 100% fine.
I also can't find anyone mentioning such a thing ("icache invalidation on all cpus") in any search I tried.
 
Unfortunately, I can't say I'm very confident in my conclusions here; I really have no experience with ARM :-(


First I tried to artificially change elf.cc to force mmapping files with mmu::mmap_populate so that all code gets loaded eagerly on one cpu. And then I changed the app.cc code to pin the executing thread to the other cpu like this:

Do you know if you have CPUs which do *not* share a PoU (L2 cache)? If they do, then probably the invalidation works on all of them anyway.
 

Waldek Kozaczuk

Dec 27, 2020, 7:21:27 PM
to OSv Development
So here are a couple of other links I have found:

https://chromium.googlesource.com/v8/v8/+/refs/heads/roll/src/arm64/cpu-arm64.cc#40, which is similar to the __clear_cache() builtin. This part is most interesting:

"dc civac, %[dline] \n\t"
"add %[dline], %[dline], %[dsize] \n\t"
"cmp %[dline], %[end] \n\t"
"b.lt 0b \n\t"
// Barrier to make sure the effect of the code above is visible to the rest
// of the world.
// dsb : Data Synchronisation Barrier
// ish : Inner SHareable domain
// The point of unification for an Inner Shareable shareability domain is
// the point by which the instruction and data caches of all the processors
// in that Inner Shareable shareability domain are guaranteed to see the
// same copy of a memory location.
// See ARM DDI 0406B page B2-12 for more
// information.
"dsb ish \n\t"
// Invalidate every line of the I cache containing the target data.
"1: \n\t"
// ic : instruction cache maintenance
// i : invalidate
// va : by address
// u : to the point of unification
"ic ivau, %[iline] \n\t"
"add %[iline], %[iline], %[isize] \n\t"
"cmp %[iline], %[end] \n\t"
"b.lt 1b \n\t"
// Barrier to make sure the effect of the code above is visible to the rest
// of the world.
"dsb ish \n\t"
// Barrier to ensure any prefetching which happened before this code is
// discarded.
// isb : Instruction Synchronisation Barrier
"isb \n\t"

I have also found an even better paper, focusing exactly on our problem - "ARMv8-A system semantics: instruction fetch in
relaxed architectures" - https://hal.inria.fr/hal-02509910/document.

And here are the best parts:

"The ability to execute code that has previously been written to data memory is fundamental to computing: finegrained self-modifying code is now rare, and (rightly) deprecated, but program loading, dynamic linking, JIT compilation, debugging, and OS configuration all rely on executing code from data writes. However, because these are relatively infrequent operations, hardware designers have been able to optimise by partially separating the instruction and data paths, e.g. with distinct instruction caching, which by default may not be coherent with data accesses. This can introduce programmer-visible behaviour analogous to that of user-mode relaxed-memory concurrency, and require specific additional synchronisation to correctly pick up code modifications. Exactly what these are is not entirely clear in the current ARMv8-A architecture text, just as pre-2018 user-mode concurrency was not."

This seems to explain that the problem we are trying to solve does not apply only to the esoteric self-modifying code case; in reality, it is more prevalent.

The most interesting sentence below is the one about IC IVAU, which seems to suggest that it invalidates entries in the instruction caches of all cores. I think in reality it means all cores in the given shareability domain specified by the memory translation entries. OSv (and other OSes) uses the Inner Shareable domain.

"If software requires coherency between instruction execution and memory, it must manage this coherency using Context 6 B. Simner et al. synchronization events and cache maintenance instructions. The following code sequence can be used to allow a processing element (PE) to execute code that the same PE has written.” 

; Coherency example for data and instruction accesses [...]
; Enter this code with containing a new 32-bit instruction, 
; to be held in Cacheable space at a location pointed to by Xn. 

STR Wt, [Xn]    ; Store new instruction 
DC CVAU, Xn  ; Clean data cache by virtual address (VA) to PoU 
DSB ISH          ; Ensure visibility of the data cleaned from cache 
IC IVAU, Xn     ; Invalidate instruction cache by VA to PoU 
DSB ISH          ; Ensure completion of the invalidations
ISB                   ; Synchronize the fetched instruction stream 

At first sight, this may be entirely mysterious. The remainder of the paper establishes precise semantics for each instruction, explaining why each is required, but as a rough intuition: 

1. The DC CVAU,Xn cleans this core’s data cache for address Xn, pushing the new write far enough down the hierarchy for an instruction fetch that misses in the instruction cache to be guaranteed to see the new value. This point is the Point of Unification (PoU) and is usually the point where the instruction and data caches become unified (L2 for most modern devices). 

2. The DSB ISH waits for the clean to have happened before letting the later instructions execute (without this, the sequence itself can execute out-of order, and the clean might not have pushed the write down far enough before the instruction cache is updated). The ISH makes this specific to the Inner Shareable Domain: the processor itself, not the system-on-chip. We do not model shareability domains in this paper, so this is equivalent to a DSB SY. 

3. The IC IVAU,Xn invalidates any entry for that address in the instruction caches for all cores, forcing any future fetch to miss in the instruction cache, and instead, read the new value from the data memory hierarchy; it also touches some fetch queue machinery.

4. The second DSB ISH ensures the invalidation completes. 

5. The final ISB flushes this core’s pipeline, forcing a re-fetch of all program order-later instructions"

I will send a newer patch with better comments explaining all this.