+
return total;
}
diff --git a/include/osv/mmu.hh b/include/osv/mmu.hh
index 1830048c..87b83526 100644
--- a/include/osv/mmu.hh
+++ b/include/osv/mmu.hh
@@ -319,6 +319,8 @@ std::string procfs_maps();
unsigned long all_vmas_size();
+void vma_invalidate_cache(vma *vma, void *v, size_t size);
+
}
#endif /* MMU_HH */
--
2.28.0
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20201220044251.3577-1-jwkozaczuk%40gmail.com.
unsigned long all_vmas_size();
+void vma_invalidate_cache(vma *vma, void *v, size_t size);
Maybe you can add a comment about what "cache" this is supposed to invalidate. Maybe you should explain (see above - I'm not sure my guess is really correct) that if the architecture caches something (instructions or data) using virtual addresses, this function should invalidate the old cached information for this area.
+ // On some architectures it might be necessary to invalidate CPU caches
+ // after the vma memory is populated with code
+ vma_invalidate_cache(vma, v, size);

I see a problem here, but maybe you can tell me why I'm wrong. Presumably, this "cache invalidation" thing needs to happen on ALL CPUs, not just this one, because one CPU might have loaded the code, and a different CPU ran it. We have exactly the same issue with the TLB, so we have an elaborate mechanism to ensure a TLB flush on all CPUs - and to do it "lazily" (see commit 7e38453 and my latest fix to this area, c9e5af6deef6ef9420bdf6437f8c34371c8df4e9, which might explain some things). Don't we need similar mechanisms also for these instruction cache flushes, to run them on all processors?

This is a very good point. But I think we do not need to run it on all cores, because the translation table entries specify cache shareability as the Inner Shareable domain. In other words, for all memory we map, we set the shareability bits of the translation entries (see https://github.com/cloudius-systems/osv/blob/21440360f41803fc54460ae78b3d928837621e5d/arch/aarch64/arch-mmu.hh#L35-L41 and https://github.com/cloudius-systems/osv/blob/master/arch/aarch64/arch-mmu.hh#L160 in make_pte()). The __clear_cache() builtin uses the following instructions for each cache line:

dc cvau, x2 // Clean data cache by VA in x2 to Point of Unification
ic ivau, x0 // Invalidate instruction cache by VA in x0 to Point of Unification

There is this table in the ARM Cortex-A Series Programmer's Guide for ARMv8-A (Table 14-1, "Instructions with broadcast"):

IC IVAU, Xt | I-cache invalidate by address to Point of Unification | Maybe (b)
DC CVAU, Xt | D-cache clean by address to Point of Unification | Maybe (b)
(b) Broadcast determined by shareability of memory region

It is my understanding that both instructions, dc cvau and ic ivau, are broadcast between cores in the same Inner Shareable domain (same cluster) as long as the memory translation entries are set so.
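The shareability attribute this argument relies on can be sketched as follows. This is an illustrative fragment, not OSv's actual arch-mmu.hh code: the constant and function names here are assumed, and only the SH field layout (bits [9:8] of a stage-1 block/page descriptor, where 0b11 selects Inner Shareable) comes from the ARMv8-A descriptor format.

```cpp
#include <cstdint>

// In an ARMv8-A stage-1 block/page descriptor, bits [9:8] form the SH
// (shareability) field; 0b11 means Inner Shareable. With this attribute
// set, DC CVAU / IC IVAU issued on one core are broadcast to the other
// cores in the same Inner Shareable domain.
constexpr uint64_t PTE_SH_SHIFT = 8;
constexpr uint64_t PTE_SH_INNER_SHAREABLE = 0b11ull << PTE_SH_SHIFT; // 0x300

// Hypothetical helper (name assumed): mark a descriptor Inner Shareable.
constexpr uint64_t pte_mark_inner_shareable(uint64_t pte)
{
    return pte | PTE_SH_INNER_SHAREABLE;
}
```

If the mapping were instead Non-shareable (SH = 0b00), the "Maybe (b)" in the table above would resolve to "no broadcast", and per-CPU maintenance would indeed be required.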
- Additionally, the icache is virtual-indexed (it seems the technical term is VIVT, or VIPT in newer ARM designs), so if a mapping puts new code at an old virtual address, the icache will not notice it, and a TLB flush is not enough (the icache is virtual-indexed to avoid TLB lookups).

Maybe you're right that #1 is the more likely reason to see problems, but I think #2 is also possible, e.g., if you run one object, unmap it, and then map a different object at the same address.
The __clear_cache() builtin uses the following instructions for each cache line:

dc cvau, x2 // Clean data cache by VA in x2 to Point of Unification
By the way, if the point was to invalidate the *instruction* cache, why do you want to call this instruction which clears the data cache? Wouldn't it be better not to rely on this obscure gcc builtin, and instead just call the "ic" instruction you mentioned below? On the other hand, in https://developer.arm.com/documentation/den0024/a/Caches/Cache-maintenance I see that there's apparently some need to call both, and also to call them in a loop (?), so maybe we shouldn't reinvent __clear_cache() unless we're sure what it does.
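For reference, invoking the builtin rather than reinventing it could look like the following sketch (the wrapper name is assumed for illustration; only __builtin___clear_cache() itself is a real GCC/Clang builtin):

```cpp
#include <cstddef>

// Thin wrapper over the GCC/Clang builtin. On aarch64 it roughly expands
// to: a dc cvau loop over the D-cache lines in the range, a dsb barrier,
// an ic ivau loop over the I-cache lines, another dsb, and an isb. On x86
// it compiles to essentially nothing, because the I-cache there snoops
// data writes and stays coherent on its own.
void flush_icache_range(void* start, size_t size)
{
    char* p = static_cast<char*>(start);
    __builtin___clear_cache(p, p + size);
}
```

The loop-and-barriers detail is exactly why relying on the builtin (or libgcc's __clear_cache) is safer than hand-rolling the sequence: it also reads CTR_EL0 to discover the cache line sizes.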
Unfortunately, I'm not familiar enough (or at all...) with this instruction set to have any idea if this makes sense. It is very possible you are correct... If you know how to reproduce the bug on a single core, I wonder how difficult it is to reproduce on a multi-core, with one core loading code and the other core running it.
I could not reproduce the issue at all. BTW, even when I removed the call to __clear_cache() it would still run fine. It could be that eagerly loading all code may by itself trigger cleaning the d-cache and invalidating the i-cache. Please note that normally most code is loaded lazily by page fault, and that is when the issue happens most of the time. I thought that ...
On Monday, December 21, 2020 at 7:03:18 AM UTC-5 Nadav Har'El wrote:

This is interesting. As I said, I'm not an expert (not even close) on these matters, but from a few minutes of reading, I think that maybe both issues are relevant:
- If you write to some memory containing code, your modification will appear in the dcache, but not in the icache, and execution won't see it.
Right.
- Additionally, the icache is virtual-indexed (it seems the technical term is VIVT, or VIPT in newer ARM designs), so if a mapping puts new code at an old virtual address, the icache will not notice it, and a TLB flush is not enough (the icache is virtual-indexed to avoid TLB lookups).
Yes, but calling vma_invalidate_cache() from populate_vma() should take care of this issue as well, because __clear_cache() would clean (flush) the D-cache, pushing the new code out to memory, and invalidate the I-cache, thus forcing a fetch of the new code for those old virtual addresses before it gets executed.
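This write-then-clean-then-invalidate ordering can be shown with a tiny userspace sketch (plain Linux, not OSv code; the function name is made up for illustration). New instruction bytes are written through the D-cache, then __builtin___clear_cache() makes them visible to instruction fetch before the first call:

```cpp
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

using fn42_t = int (*)();

// Emit a tiny function that returns 42, taking care to clean the D-cache
// and invalidate the I-cache for the freshly written range before anyone
// executes it (the exact hazard discussed above on aarch64).
fn42_t emit_return_42()
{
    void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return nullptr;
    }
#if defined(__aarch64__)
    const uint32_t code[] = { 0x52800540,   // mov w0, #42
                              0xd65f03c0 }; // ret
#elif defined(__x86_64__)
    const uint8_t code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00,  // mov eax, 42
                             0xc3 };                        // ret
#else
#error "sketch supports only x86-64 and aarch64"
#endif
    std::memcpy(page, code, sizeof(code));
    // On aarch64: dc cvau / ic ivau loops plus barriers; on x86-64 this is
    // effectively a no-op since the I-cache is coherent with data writes.
    char* p = static_cast<char*>(page);
    __builtin___clear_cache(p, p + sizeof(code));
    return reinterpret_cast<fn42_t>(reinterpret_cast<uintptr_t>(page));
}
```

On aarch64, skipping the __builtin___clear_cache() call here can leave stale instructions visible at a reused virtual address, which is the failure mode being debugged in this thread.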
Unfortunately, I'm not familiar enough (or at all...) with this instruction set to have any idea if this makes sense. It is very possible you are correct... If you know how to reproduce the bug on a single core, I wonder how difficult it is to reproduce on a multi-core, with one core loading code and the other core running it.
So here is what I did, which is closest to your suggestions:
First I tried to artificially change elf.cc to force mmapping files with mmu::mmap_populate so that all code gets loaded eagerly on one cpu. And then I changed the app.cc code to pin the executing thread to another cpu like this: