ARM: Managing stage-2 page table updates

Jan Kiszka

unread,

Jul 20, 2016, 2:09:26 AM7/20/16

to Jailhouse, Marc Zyngier, Mark Rutland, Jean-Philippe Brucker, Antonios Motakis

Hi all,

I'd like to have some word by an expert on what we have to do in
Jailhouse when changing the stage-2 page tables. This is what currently
happens for 32-bit ARM (maybe Tony can add a word on potential
deviations for ARM64):

- all affected CPUs, except the one executing the changes, are put in
an busy-wait loop
- we modify the page tables, i.e. we directly write the desired state
on the executing CPU
- we resume the CPUs, and that triggers
- TLBs flushes (TLBIALL) on CPUs affected by the changes
- in some cases also dcache flushes

I've two questions for a while now:

- Are dcache flushes consistently required, or does the MMU snoop
caches when doing stage-2 translations? We could easily add a
DCCIMVAC for the modified entries, but I saw KVM flushing even
page-wise.
- Under which conditions do we need to apply [1] ("break before make")
in our setup? Is it really required to first write an invalid
PTE/PMD/etc., then flush TLBs on all affected CPUs, and only then
write the new entries?

Any insights welcome (including potential references to ARM ARM sections
so that we can document this in the code)!

Thanks,
Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.arm.devel/6244

--
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

Marc Zyngier

unread,

Jul 20, 2016, 5:05:05 AM7/20/16

to Jan Kiszka, Jailhouse, Mark Rutland, Jean-Philippe Brucker, Antonios Motakis

Hey Jan,

On 20/07/16 07:09, Jan Kiszka wrote:
> Hi all,
>
> I'd like to have some word by an expert on what we have to do in

OK, I'll leave... ;-)

> Jailhouse when changing the stage-2 page tables. This is what currently
> happens for 32-bit ARM (maybe Tony can add a word on potential
> deviations for ARM64):
>
> - all affected CPUs, except the one executing the changes, are put in
> an busy-wait loop
> - we modify the page tables, i.e. we directly write the desired state
> on the executing CPU
> - we resume the CPUs, and that triggers
> - TLBs flushes (TLBIALL) on CPUs affected by the changes
> - in some cases also dcache flushes

It would be interesting to find out in which context you're doing that
because the "busy-wait" loop seems a bit odd. What I'd expect would be to:

- take the page-table lock
- clear the entry
- DCache clean for that line (only on ARMv7)
- TBLIALLIS (ARMv8 can do better by only invalidating the S2 VA)
- Set the entry to the new value
- release the lock

There is no need to do a second TLBI, as invalid entries are guaranteed
not to be cached in the TLBs. You do not need to to TBLIALL on all CPUs,
and can rely on a TLBIALLIS that will broadcast it for you.

>
> I've two questions for a while now:
>
> - Are dcache flushes consistently required, or does the MMU snoop
> caches when doing stage-2 translations? We could easily add a
> DCCIMVAC for the modified entries, but I saw KVM flushing even
> page-wise.

Two things:
- ARMv8 guarantees that the page table walker can snoop the caches,
while we don't have that guarantee on ARMv7, so you need to clean the
cache to the PoC (if memory serves well).
- KVM cleans the cache to the PoC when mapping a page if the guest
doesn't have its own caches enabled. Otherwise you will happily run with
junk while the data sits pretty in the cache. (see my presentation at
KF'15).

> - Under which conditions do we need to apply [1] ("break before make")
> in our setup? Is it really required to first write an invalid
> PTE/PMD/etc., then flush TLBs on all affected CPUs, and only then
> write the new entries?

BBM is always required if you're changing the output address of a page
table entry. No ifs, no buts. For a very long time, we thought we could
get away without it, but as CPUs get much more aggressive at partial TLB
caching, we've seen all kind of weird issues, and Mark can rant about it
for hours (I'm sure he will once he reads this email).

You don't have to BBM if you're changing the permissions (or going from
an invalid entry to a valid one, obviously).

> Any insights welcome (including potential references to ARM ARM sections
> so that we can document this in the code)!

Now you're asking too much. I've only had one coffee, and the idea of
trying to navigate the ARM ARM that early makes me feel slightly
nauseous. Or maybe that the beer from last night. Or the weather.

I'll blame the weather.

M.
--
Jazz is not dead. It just smells funny...

Jan Kiszka

unread,

Jul 20, 2016, 5:25:05 AM7/20/16

to Marc Zyngier, Jailhouse, Mark Rutland, Jean-Philippe Brucker, Antonios Motakis

On 2016-07-20 11:05, Marc Zyngier wrote:
> Hey Jan,
>
> On 20/07/16 07:09, Jan Kiszka wrote:
>> Hi all,
>>
>> I'd like to have some word by an expert on what we have to do in
>
> OK, I'll leave... ;-)

Don't make me nervous...! ;)

>
>> Jailhouse when changing the stage-2 page tables. This is what currently
>> happens for 32-bit ARM (maybe Tony can add a word on potential
>> deviations for ARM64):
>>
>> - all affected CPUs, except the one executing the changes, are put in
>> an busy-wait loop
>> - we modify the page tables, i.e. we directly write the desired state
>> on the executing CPU
>> - we resume the CPUs, and that triggers
>> - TLBs flushes (TLBIALL) on CPUs affected by the changes
>> - in some cases also dcache flushes
>
> It would be interesting to find out in which context you're doing that
> because the "busy-wait" loop seems a bit odd. What I'd expect would be to:
>
> - take the page-table lock

Consider that busy-wait as a "big-hypervisor lock", in replacement of
fine-grained locking to coordinate potential concurrent modifications.
Jailhouse only reconfigures those mappings at well defined points, and
that with that "big hammer".

> - clear the entry
> - DCache clean for that line (only on ARMv7)
> - TBLIALLIS (ARMv8 can do better by only invalidating the S2 VA)
> - Set the entry to the new value

And here we need not additional DCache flush?

> - release the lock
>
> There is no need to do a second TLBI, as invalid entries are guaranteed
> not to be cached in the TLBs. You do not need to to TBLIALL on all CPUs,
> and can rely on a TLBIALLIS that will broadcast it for you.

Ah, that would simplify things!

>
>>
>> I've two questions for a while now:
>>
>> - Are dcache flushes consistently required, or does the MMU snoop
>> caches when doing stage-2 translations? We could easily add a
>> DCCIMVAC for the modified entries, but I saw KVM flushing even
>> page-wise.
>
> Two things:
> - ARMv8 guarantees that the page table walker can snoop the caches,
> while we don't have that guarantee on ARMv7, so you need to clean the
> cache to the PoC (if memory serves well).

OK.

> - KVM cleans the cache to the PoC when mapping a page if the guest
> doesn't have its own caches enabled. Otherwise you will happily run with
> junk while the data sits pretty in the cache. (see my presentation at
> KF'15).

Caching/cache flushing vs. guest caches disabled is another topic which
caused a lot of headache to me in the past weeks (because it seems
fairly broken in Jailhouse, up to causing hypervisor crashes). As you
already brought it to that point:

We currently flush the DCache completely before starting a new guest
(with new code etc.), and also after resetting a guest CPU into
caches-disabled state. The flushing takes place in the middle of the
vmexit. Once in a while, I got nasty crashes due to inconsistent states
of the hypervisor (states that are shared between the CPUs).

My theory for this: After the flush, the CPU picks up some state
variable again into its local cache, then enters the guest which has
caches off. While we are in the guest, another CPU changes that shared
state, but the change is not propagated to the other CPU cache because
that is off. Now that first CPU leaves the guest, uses its stale cache
content and ruins our day. Following that theory, I moved the flush
right before the vmentry, ie. at a point where no more shared state is
touched. That made the crashes go away, apparently.

However, if that theory is true, I think we would have to flush caches
before *every* vmentry if the guest has caches off (we don't to that
right now). Now, what are the real requirements when dealing with guests
in non-cached mode?

>
>> - Under which conditions do we need to apply [1] ("break before make")
>> in our setup? Is it really required to first write an invalid
>> PTE/PMD/etc., then flush TLBs on all affected CPUs, and only then
>> write the new entries?
>
> BBM is always required if you're changing the output address of a page
> table entry. No ifs, no buts. For a very long time, we thought we could
> get away without it, but as CPUs get much more aggressive at partial TLB
> caching, we've seen all kind of weird issues, and Mark can rant about it
> for hours (I'm sure he will once he reads this email).
>
> You don't have to BBM if you're changing the permissions (or going from
> an invalid entry to a valid one, obviously).
>
>> Any insights welcome (including potential references to ARM ARM sections
>> so that we can document this in the code)!
>
> Now you're asking too much. I've only had one coffee, and the idea of
> trying to navigate the ARM ARM that early makes me feel slightly
> nauseous. Or maybe that the beer from last night. Or the weather.
>
> I'll blame the weather.
>
> M.
>

Hehe, fully understand. I will reference this discussion. ;)

Thanks!
Jan

Mark Rutland

unread,

Jul 20, 2016, 5:47:11 AM7/20/16

to Marc Zyngier, Jan Kiszka, Jailhouse, Jean-Philippe Brucker, Antonios Motakis

Hi Jan,

On Wed, Jul 20, 2016 at 10:05:01AM +0100, Marc Zyngier wrote:
> On 20/07/16 07:09, Jan Kiszka wrote:
> > I've two questions for a while now:
> >
> > - Are dcache flushes consistently required, or does the MMU snoop
> > caches when doing stage-2 translations? We could easily add a
> > DCCIMVAC for the modified entries, but I saw KVM flushing even
> > page-wise.
>
> Two things:
> - ARMv8 guarantees that the page table walker can snoop the caches,
> while we don't have that guarantee on ARMv7, so you need to clean the
> cache to the PoC (if memory serves well).

For ARMv8-A, as Marc said, walks are guaranteed to be coherent, for both
AArch64 and AArch32. For ARMv7-A, walks are guaranteed to be coherent in
the presence of the virtualization extensions.

In general for ARMv7-A, walks are only coherent (i.e. snoop all
data/unified caches) if ID_MMFR3[23:20] ("Coherent Walk") contains
0b0001. See the latest ARM ARM, ARM DDI 0406C.c, section B4.1.92.
In other cases, it is necessary to clean to the PoU.

The multiprocessing extensions mandate this coherency, per ARM DDI
0406C.c, section B3.3.1 and ARM DDI 0406C.c, section B3.10.1. The
virtualization extensions mandate the multiprocessing extensions, per
ARM DDI 0406C.c, section B1.7. Thus, in any ARMv7-A implementation with
virtualization, page table walks must be coherent.

In the appendices there are example sequences with further
clarification. For ARMv8-A See ARM DDI 0487A.j, section K10.5.3. For
ARMv7-A see ARM DDI 0406C.c, section D7.5.3.

> - KVM cleans the cache to the PoC when mapping a page if the guest
> doesn't have its own caches enabled. Otherwise you will happily run with
> junk while the data sits pretty in the cache. (see my presentation at
> KF'15).
>
> > - Under which conditions do we need to apply [1] ("break before make")
> > in our setup? Is it really required to first write an invalid
> > PTE/PMD/etc., then flush TLBs on all affected CPUs, and only then
> > write the new entries?
>
> BBM is always required if you're changing the output address of a page
> table entry. No ifs, no buts. For a very long time, we thought we could
> get away without it, but as CPUs get much more aggressive at partial TLB
> caching, we've seen all kind of weird issues, and Mark can rant about it
> for hours (I'm sure he will once he reads this email).

I'll try to resist ;)

There are a couple of other cases (e.g. memory type and cacheability)
too, so it's best to consult the ARM ARM.

For ARMv8-A see ARM DDI 0487A.j, section D4.7.1. For ARMv7-A see ARM DDI
0406C.c, section B3.10.1. In both cases scroll down to "Using
break-before-make when updating translation table entries".

To the best of my knowledge, an "output address" can be the value of a
page, block/section, or table entry, so changing intermediate levels of
table requires BBM, even if the end-to-end translation would happen to
remain equivalent.

> > so that we can document this in the code)!
>
> Now you're asking too much. I've only had one coffee, and the idea of
> trying to navigate the ARM ARM that early makes me feel slightly
> nauseous. Or maybe that the beer from last night. Or the weather.

Luckily I remember a few key phrases to grep for. ;)

Hopefully the references above are of use!

Thanks,
Mark.

Marc Zyngier

unread,

Jul 20, 2016, 6:05:22 AM7/20/16

to Jan Kiszka, Jailhouse, Mark Rutland, Jean-Philippe Brucker, Antonios Motakis

The real requirement is "don't do it". I mean, really. The architecture
doesn't give us any reasonable way to deal with that, hence the horrible
"trap all the VM regs until M+C are set" code we have on KVM.

So if you do share state between your guest and your hypervisor, you
have to perform cache maintenance when the guest has its caches off.

Marc Zyngier

unread,

Jul 20, 2016, 6:07:45 AM7/20/16

to Mark Rutland, Jan Kiszka, Jailhouse, Jean-Philippe Brucker, Antonios Motakis

Ah, I never realized this (3 dependencies to follow is too much for me,
apparently). I guess I'll go and clean some KVm code now! ;-)

Thanks,

Jan Kiszka

unread,

Jul 20, 2016, 6:19:24 AM7/20/16

to Marc Zyngier, Jailhouse, Mark Rutland, Jean-Philippe Brucker, Antonios Motakis

The crash was not over state shared between guest and host (though we
have that as well - to be addressed as well...), this was about side
effects on state shared between CPUs while in HYP mode and trying to
perform some cache maintenance.

Again my question: What are the requirements regarding cache maintenance
when allowing a guest to run with caches off? Jean-Philippe tried to
address that in [1], but it's not complete or not fully correct or even
both.

Jan

[1]
https://github.com/siemens/jailhouse/commit/add44a7a8431058ec9acb3db328166f8a8c34dcb

Jan Kiszka

unread,

Jul 20, 2016, 6:21:55 AM7/20/16

to Marc Zyngier, Mark Rutland, Jailhouse, Jean-Philippe Brucker, Antonios Motakis

Nice side effect :). Good, then we can change mappings without dcache
flushes as well (minus anything that disabling caches in guest mode
implies). Will focus on BBM now.

Thanks,
Jan

Mark Rutland

unread,

Jul 20, 2016, 7:17:11 AM7/20/16

to Jan Kiszka, Marc Zyngier, Jailhouse, Jean-Philippe Brucker, Antonios Motakis

On Wed, Jul 20, 2016 at 12:19:20PM +0200, Jan Kiszka wrote:
> Again my question: What are the requirements regarding cache maintenance
> when allowing a guest to run with caches off?

a) There must be no cacheable alias for the relevant addresses present
in TLBs or page tables for any CPU, for the executing exception level
or higher. Note that this includes hyp on the CPU the guest is
executing on, even during the execution of the guest.

b) All caches to the PoC must not contain entries for any address the
guest will access. i.e. first you must invalidate, or
clean+invalidate the address range to the PoC. This must be done by
VA, and broadcast so as to affect all relevant caches.

If those are not strictly followed, the usual issues resulting from
mismatched attributes, or from unexepcted data cache hit apply.

I believe that KVM is on dodgy ground due to the kernel linear mapping
violating (a), and hence (b) also.

> Jean-Philippe tried to address that in [1], but it's not complete or
> not fully correct or even both.

It looks like that's using Set/Way ops, so that's not correct in all
cases. That does not guarantee the state of shared levels of cache, nor
of system caches.

It doesn't affect other agents and is also incomplete.

Thanks,
Mark.

> [1]
> https://github.com/siemens/jailhouse/commit/add44a7a8431058ec9acb3db328166f8a8c34dcb

Jan Kiszka

unread,

Jul 20, 2016, 7:31:12 AM7/20/16

to Mark Rutland, Marc Zyngier, Jailhouse, Jean-Philippe Brucker, Antonios Motakis

The reasoning behind set/way (thus attempted full) flush is that the
hypervisor does not know which VAs the root cell (a guest) has used in
order to prepare the setup (code & data) of a non-root cell (a second
guest). However, I think we better resolve that inside the non-root cell
itself, over the target range that it actually touched. Tony added
something to the Linux driver anyway, as ARM64 needed that.

Then - IIUC now - we should only have to deal with flushing the stage-2
page tables prior to entering the guest, but then on a VA-basis.

In addition, I will have to look into our communication page, data
shared between hypervisor and cell. But that is less urgent.

Thanks,
Jan

>
> Thanks,
> Mark.
>
>> [1]
>> https://github.com/siemens/jailhouse/commit/add44a7a8431058ec9acb3db328166f8a8c34dcb

Antonios Motakis

unread,

Aug 1, 2016, 1:24:14 PM8/1/16

to Jan Kiszka, Mark Rutland, Marc Zyngier, Jailhouse, Jean-Philippe Brucker

Hello,

Late to the party (response time may be inversely proportional to expert rank).

Yeah, we added an icache flush when loading to cell memory from the driver side, so instruction caches are already consistent for us. We found that simpler than whatever the ARMv7 port is doing :)

Regarding break before make, since Mark confirmed that this applies also when changing intermediate levels of a mapping; we will need to take this into account also in the code paths in Jailhouse that split/merge hugepages as needed.

Rather than a reordering of the cache flush functions called from the core, maybe the most straightforward way to do it is via a check in each set_pte call?

Jan, if I understood right you are already looking into implementing this on ARMv7?

Best regards
Tony

>
> Then - IIUC now - we should only have to deal with flushing the stage-2
> page tables prior to entering the guest, but then on a VA-basis.
>
> In addition, I will have to look into our communication page, data
> shared between hypervisor and cell. But that is less urgent.
>
> Thanks,
> Jan
>
>>
>> Thanks,
>> Mark.
>>
>>> [1]
>>> https://github.com/siemens/jailhouse/commit/add44a7a8431058ec9acb3db328166f8a8c34dcb
>

--

Antonios Motakis
Virtualization Engineer
Huawei Technologies Duesseldorf GmbH
European Research Center
Riesstrasse 25, 80992 München

Jan Kiszka

unread,

Aug 1, 2016, 3:20:24 PM8/1/16

to Antonios Motakis, Mark Rutland, Marc Zyngier, Jailhouse, Jean-Philippe Brucker

Yes, definitely. For ARMv7, I've already a patch here that also flushes
the dcache from within the driver so that we can stop worrying about
this aspect in the hypervisor.

>
> Regarding break before make, since Mark confirmed that this applies also when changing intermediate levels of a mapping; we will need to take this into account also in the code paths in Jailhouse that split/merge hugepages as needed.
>
> Rather than a reordering of the cache flush functions called from the core, maybe the most straightforward way to do it is via a check in each set_pte call?

I started to implement BBM, but then I read the spec again and mapped
this on our model of modifications. To me it makes no sense anymore to
implement this complication, simply because we are not in the situation
that the spec describes: At the time of changes, from the first write to
the final TLB flush, there are no "multiple threads of execution" that
"can use the same translation tables".

Jailhouse pulls all CPUs that will be affected by the stage-2 changes
out of PL1/0 into HYP mode, only starts the modifications when they
arrived there, ensures the flush (which is nicely broadcast to all CPUs)
and only then releases the CPUs again into guest mode.

However, what we need to change according to my current understanding -
at least of ARMv7 - is making all stage-2 changes coherent, thus
enforcing dcache flushes for the affected table entries. That is needed
when we run the guest later on with caches off. IIUC, that also includes
stage-2 translations. That change is trivial, though.

>
> Jan, if I understood right you are already looking into implementing this on ARMv7?

Yes, I'm (slowly) on it. I will share first code soon and would welcome
any reviews, tests and also help/feedback with mapping things on ARM64
properly.

Thanks,
Jan

Reply all

Reply to author

Forward