v0.9 vs v0.11 interrupt latency increase


Chung-Fan Yang

Oct 24, 2019, 4:17:06 AM
to Jailhouse
Hello,

I observed that the interrupt latency rose from 20us to 50us (measured in an RTOS) after I upgraded from v0.9 to v0.11.

I am working on x86_64, so I suspect CPU bug mitigations.
Are there any CPU bug mitigations in effect?

However, I found that in the root Linux the latency is almost the same between the two versions.

Is there anything I should adjust in my RTOS to adapt?

Does anyone have ideas about the source of this difference?
Comments are welcome.

Yang

Jan Kiszka

Oct 24, 2019, 6:07:07 AM
to Chung-Fan Yang, Jailhouse
On 24.10.19 10:17, Chung-Fan Yang wrote:
> Hello,
>
> I observed that the interrupt latency rose from 20us to 50us (measured
> in an RTOS) after I upgraded from v0.9 to v0.11.

Do you mean upgrading Jailhouse (and only that) from 0.9 to 0.11?

>
> I am working on x86_64, so I suspect CPU bug mitigations.
> Are there any CPU bug mitigations in effect?
>
> However, I found that in the root Linux the latency is almost the
> same between the two versions.
>
> Is there anything I should adjust in my RTOS to adapt?
>
> Does anyone have ideas about the source of this difference?
> Comments are welcome.

We refactored quite a bit between the two releases. Namely, 0.10 brought
per-cpu page tables.

You could first of all try to narrow the reason down a bit: Do the exit
statistics look different for both versions? With Intel x86, you should
normally have no exit for an external or timer interrupt injection. From
that perspective, even 20 µs is too high. Try to identify the path that
causes the latency.
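In case it helps with comparing the two versions: the driver exports per-cell exit counters via sysfs (and, if I remember correctly, "jailhouse cell stats" shows the same numbers). A minimal sketch that dumps them, assuming a cell named "rtos" and the sysfs layout of recent driver versions (both are assumptions, adjust to your setup):

/*
 * Minimal sketch: dump the per-cell VM-exit counters exported by the
 * Jailhouse driver via sysfs. CELL_NAME and the base path are assumptions;
 * counter names differ between architectures and versions.
 */
#include <dirent.h>
#include <stdio.h>

#define CELL_NAME "rtos"   /* hypothetical cell name */
#define STATS_DIR "/sys/devices/jailhouse/cells/" CELL_NAME "/statistics"

int main(void)
{
	char path[512], buf[64];
	struct dirent *e;
	DIR *d = opendir(STATS_DIR);
	FILE *f;

	if (!d) {
		perror(STATS_DIR);
		return 1;
	}
	while ((e = readdir(d))) {
		if (e->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s", STATS_DIR, e->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("%-24s %s", e->d_name, buf); /* e.g. vmexits_total <n> */
		fclose(f);
	}
	closedir(d);
	return 0;
}

Running that before and after a measurement run on both versions should show whether any exit reason grows along with the latency.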

You could also try to bisect Jailhouse between the two versions, in
order to identify the offending commit. But that is only plan B, I would say.

Jan

>
> Yang
>
> --
> You received this message because you are subscribed to the Google
> Groups "Jailhouse" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to jailhouse-de...@googlegroups.com
> <mailto:jailhouse-de...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/jailhouse-dev/a54a651c-13de-4aa1-9c32-475ebddc4e6f%40googlegroups.com
> <https://groups.google.com/d/msgid/jailhouse-dev/a54a651c-13de-4aa1-9c32-475ebddc4e6f%40googlegroups.com?utm_medium=email&utm_source=footer>.

Chung-Fan Yang

Oct 24, 2019, 9:17:01 PM
to Jailhouse


> Do you mean upgrading Jailhouse (and only that) from 0.9 to 0.11?

Yes.

> We refactored quite a bit between the two releases. Namely, 0.10 brought
> per-cpu page tables.

The page tables might be the cause, but I am not sure.
I did notice that the code updating the page tables during a context switch (CTX) in my RTOS has slowed down dramatically after the upgrade.
It was < 10us before, but 250us after the upgrade.
I did some optimization by caching the level-2 and level-3 page directory entries and managed to reduce it to 25us.
 

> You could first of all try to narrow the reason down a bit: Do the exit
> statistics look different for both versions?

I don't see a large difference.

> With Intel x86, you should
> normally have no exit for an external or timer interrupt injection.

I did notice that in both versions I have a large number of exits due to MSR accesses.
Doesn't the x2APIC EOI cause an exit?
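For context, a sketch of the two EOI paths (not Jailhouse code; the register numbers are the architectural ones from the SDM, and the identity-mapped xAPIC base is an assumption about the guest): in xAPIC mode the EOI is an MMIO store to the intercepted APIC page, so every EOI traps, while in x2APIC mode it is a plain WRMSR that the hypervisor can pass through via the MSR bitmap.

/*
 * Sketch of the two EOI paths on x86. In xAPIC mode the EOI is an MMIO
 * write to the APIC page (offset 0xb0), which traps if the hypervisor
 * intercepts that page; in x2APIC mode it is a WRMSR to 0x80b, which can
 * be passed through. Assumes the xAPIC page is identity-mapped at the
 * usual default base in the guest.
 */
#include <stdint.h>

#define XAPIC_DEFAULT_BASE	0xfee00000UL
#define XAPIC_REG_EOI		0xb0
#define MSR_X2APIC_EOI		0x80b

static inline void wrmsr(uint32_t msr, uint64_t val)
{
	asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val),
		     "d"((uint32_t)(val >> 32)) : "memory");
}

/* xAPIC: MMIO write -> VM exit if the APIC page is intercepted */
static inline void eoi_xapic(void)
{
	*(volatile uint32_t *)(XAPIC_DEFAULT_BASE + XAPIC_REG_EOI) = 0;
}

/* x2APIC: WRMSR -> no exit if the MSR range is passed through */
static inline void eoi_x2apic(void)
{
	wrmsr(MSR_X2APIC_EOI, 0);
}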
 
> From
> that perspective, even 20 µs is too high. Try to identify the path that
> causes the latency.
>
> You could also try to bisect Jailhouse between the two versions, in
> order to identify the offending commit. But that is only plan B, I would say.

I will work on this, but bisecting Jailhouse can be difficult considering that the ABI and config format change from time to time.

 Yang

Chung-Fan Yang

Oct 25, 2019, 3:04:36 AM
to Jailhouse
Alright, I have tested the latency from HW IRQ to application response.

I found out that there aren't any additional VM exits or IRQs, nor any RTOS scheduling or housekeeping.

It feels like the processor is generally slower, as everything takes longer to run.

The IRQ epilogue takes ~7.8us and iretq ~2.x us. In addition, the libc and syscall interfaces have also slowed down a bit.

I do notice that after upgrading, even with CAT, my RTOS latency is prone to being influenced by the Linux-side applications.
This was not observed with v0.9.1.

It's strange.


Yang.

Henning Schild

Oct 25, 2019, 9:53:00 AM
to Chung-Fan Yang, Jailhouse
Well, you only have so many shared resources, and if it is not
additional exits/interrupts then it is contention on shared resources.

We are probably talking about caches, TLBs and buses.

You should be able to use e.g. "perf" on Linux to read out hardware
performance counters. There you might want to look for TLB and
cache misses.
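Something like "perf stat -a -C <cpu> -e cache-misses,dTLB-load-misses sleep 1" on the Linux side, or programmatically via perf_event_open if you want to wrap a specific code section. A rough sketch (generic events only; which counters are actually meaningful depends on the CPU):

/*
 * Rough sketch: count LLC misses and dTLB load misses around a code
 * section using perf_event_open. The section being measured is a
 * placeholder; error handling is minimal.
 */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint32_t type, uint64_t config)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = config;
	attr.disabled = 1;

	/* this thread, any CPU, no group, no flags */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	int fd_llc = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
	int fd_tlb = open_counter(PERF_TYPE_HW_CACHE,
				  PERF_COUNT_HW_CACHE_DTLB |
				  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
				  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
	uint64_t llc, tlb;

	ioctl(fd_llc, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd_tlb, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd_llc, PERF_EVENT_IOC_ENABLE, 0);
	ioctl(fd_tlb, PERF_EVENT_IOC_ENABLE, 0);

	/* ... run the code you want to measure here ... */

	ioctl(fd_llc, PERF_EVENT_IOC_DISABLE, 0);
	ioctl(fd_tlb, PERF_EVENT_IOC_DISABLE, 0);
	read(fd_llc, &llc, sizeof(llc));
	read(fd_tlb, &tlb, sizeof(tlb));
	printf("LLC misses: %llu, dTLB load misses: %llu\n",
	       (unsigned long long)llc, (unsigned long long)tlb);
	return 0;
}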

But the bisecting might be the better idea. Jan already mentioned the
"features" that could be responsible. With a bit of educated guessing
you will get away with just a few tries.

Henning

On Fri, 25 Oct 2019 00:04:36 -0700,
Chung-Fan Yang <sonic...@gmail.com> wrote:
--
Siemens AG
Corporate Technology
CT RDA IOT SES-DE
Otto-Hahn-Ring 6
81739 Muenchen, Germany
Mobile: +49 172 8378927
mailto: henning...@siemens.com

Jan Kiszka

Oct 25, 2019, 1:09:30 PM
to Henning Schild, Chung-Fan Yang, Jailhouse
On 25.10.19 15:52, Henning Schild wrote:
> Well, you only have so many shared resources, and if it is not
> additional exits/interrupts then it is contention on shared resources.
>
> We are probably talking about caches, TLBs and buses.
>
> You should be able to use e.g. "perf" on Linux to read out hardware
> performance counters. There you might want to look for TLB and
> cache misses.
>
> But the bisecting might be the better idea. Jan already mentioned the
> "features" that could be responsible. With a bit of educated guessing
> you will get away with just a few tries.

BTW, does your RTOS happen to use anything of the inmate bootstrap code
to start in Jailhouse? That also changed.

Jan

>
> Henning
>
> On Fri, 25 Oct 2019 00:04:36 -0700,
> Chung-Fan Yang <sonic...@gmail.com> wrote:
>
>> Alright, I have tested the latency from HW IRQ to application response.
>>
>> I found out that there aren't any additional VM exits or IRQs, nor any
>> RTOS scheduling or housekeeping.
>>
>> It feels like the processor is generally slower, as everything takes
>> longer to run.
>>
>> The IRQ epilogue takes ~7.8us and iretq ~2.x us. In addition, the
>> libc and syscall interfaces have also slowed down a bit.
>>
>> I do notice that after upgrading, even with CAT, my RTOS latency is
>> prone to being influenced by the Linux-side applications.
>> This was not observed with v0.9.1.
>>
>> It's strange.
>>
>>
>> Yang.
>>
>
>
>

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

michael....@gmail.com

Oct 26, 2019, 5:26:57 AM
to Jailhouse
Hi all,

I'm not sure if this is relevant in this case, but I have noticed that on Intel x86-64, if hardware p-states (HWP) are enabled in the CPU (which they are by default if the CPU supports it), this introduces frequency scaling coupling between cores, even when the cores are isolated in separate cells. So you get this unexpected behavior where an inmate with 1 core will run faster if the other cores are very busy, and run a lot slower if the other cores are all idle. This is because the CPU itself does the frequency scaling automatically in hardware and doesn't know anything about what Jailhouse is doing.

Passing intel_pstate=no_hwp to the Linux kernel command line disables hardware p-states and gets rid of this coupling as far as I can tell. It appears that HWP is a relatively new feature (last couple of years) in Intel CPUs.
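If you want to double-check on a given box, HWP status is visible in IA32_PM_ENABLE (MSR 0x770, bit 0), e.g. "rdmsr 0x770" from msr-tools, or a small reader like the sketch below (assumes the msr kernel module is loaded and root access). If I recall correctly the bit is sticky until reset, so intel_pstate=no_hwp prevents Linux from enabling HWP rather than turning it off afterwards.

/*
 * Sketch: report whether hardware P-states (HWP) are enabled by reading
 * IA32_PM_ENABLE (MSR 0x770, bit 0) through /dev/cpu/0/msr.
 * Assumes "modprobe msr" has been done and the program runs as root.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_PM_ENABLE 0x770

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	if (pread(fd, &val, sizeof(val), MSR_IA32_PM_ENABLE) != sizeof(val)) {
		perror("read IA32_PM_ENABLE");
		close(fd);
		return 1;
	}
	printf("HWP %s\n", (val & 1) ? "enabled" : "disabled");
	close(fd);
	return 0;
}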

-Michael

Chung-Fan Yang

Oct 29, 2019, 8:42:39 PM
to Jailhouse


On Saturday, October 26, 2019 at 18:26:57 UTC+9, michael...@gmail.com wrote:
> Hi all,
>
> I'm not sure if this is relevant in this case, but I have noticed that on Intel x86-64, if hardware p-states (HWP) are enabled in the CPU (which they are by default if the CPU supports it), this introduces frequency scaling coupling between cores, even when the cores are isolated in separate cells. So you get this unexpected behavior where an inmate with 1 core will run faster if the other cores are very busy, and run a lot slower if the other cores are all idle. This is because the CPU itself does the frequency scaling automatically in hardware and doesn't know anything about what Jailhouse is doing.
>
> Passing intel_pstate=no_hwp to the Linux kernel command line disables hardware p-states and gets rid of this coupling as far as I can tell. It appears that HWP is a relatively new feature (last couple of years) in Intel CPUs.

This looks like a possibility.

I did discover that cores changing C-states in the root cell can cause the RTOS to become slower.
We had some discussion a couple of months ago, I think.

Currently, I have disabled all the C-states, but not P-states.
Maybe the driver kicked in after I got a newer kernel?

I will check.
 

> -Michael

Chung-Fan Yang

Oct 29, 2019, 8:45:24 PM
to Jailhouse


On Saturday, October 26, 2019 at 2:09:30 AM UTC+9, Jan Kiszka wrote:
> On 25.10.19 15:52, Henning Schild wrote:
>> Well, you only have so many shared resources, and if it is not
>> additional exits/interrupts then it is contention on shared resources.
>>
>> We are probably talking about caches, TLBs and buses.
>>
>> You should be able to use e.g. "perf" on Linux to read out hardware
>> performance counters. There you might want to look for TLB and
>> cache misses.
>>
>> But the bisecting might be the better idea. Jan already mentioned the
>> "features" that could be responsible. With a bit of educated guessing
>> you will get away with just a few tries.
>
> BTW, does your RTOS happen to use anything of the inmate bootstrap code
> to start in Jailhouse? That also changed.
>
> Jan

Henning,

I do think there is contention related to memory, either the TLB or the bus (assuming CAT is working).
I will do the bisect this month; I've got urgent things to do first.


Jan,

Yes, I did steal some code.
I will check if they fit.


Yang
 

>>
>> Henning
>>
>> On Fri, 25 Oct 2019 00:04:36 -0700,
>> Chung-Fan Yang <soni...@gmail.com> wrote:

Chung-Fan Yang

Oct 30, 2019, 4:37:13 AM
to Jailhouse
 
Alright, I did the bisect and checked the inmate library and PM.

I will summarize first: the fault seems to be in wip/ivshmem2.
P-states were already disabled (silly me, I forgot what I had already done).
The code taken from the inmate library seems fine.

Let me describe my setup first.
I am using the wip/ivshmem2 branch, because my application favors unidirectional pipes and multiple interrupts between cells.
I have been using v0.9.1 with an old version of ivshmem2 (15ee5278), which has lstate/rstate, etc.

When I needed to test with a non-root Linux, I upgraded to v0.11 for the new MMIO decoder.
Along the way, I rebased to the wip/ivshmem2 on top of v0.11, which is the new, multi-peer ivshmem2 (5c90e846).
I rewrote the RTOS drivers, too.

Also, I changed the root cell Linux kernel version from 4.9 to 4.19 (both with the PREEMPT_RT patch applied).

So I changed:
 * Linux kernel version
 * Jailhouse version
 * ivshmem2 version

Today, I cherry-picked the new multi-peer ivshmem2 onto:
 * v0.11
 * v0.10
 * v0.9.1
and tested with Linux 4.19.
All of them have a ~25.8us latency.

The baseline, Linux 4.9 w/ v0.9.1 w/ the old ivshmem2, is 10.87us.

Then I tested Linux 4.9 w/ v0.9.1 w/ the new multi-peer ivshmem2. It has a latency of 28.62us.

It seems like using the new ivshmem2 mechanism causes the execution to slow down.
I didn't find a specific hotspot, so some resource contention is likely the cause.

BTW, I have code that "executes" from the ivshmem region, but I don't think this should be a problem, should it?

------
Yang

Jan Kiszka

Oct 30, 2019, 7:41:13 AM
to Chung-Fan Yang, Jailhouse
Interesting findings already, but I'm afraid we will need to dig deeper:
Can you describe everything that is part of your measured latency path? Do you
just run code in guest mode or do you also trigger VM exits, e.g. to
issue ivshmem interrupts to a remote side? Maybe you can sample some
latencies along the critical path so that we have a better picture about
where we lose time, overall or rather on specific actions.
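In case it helps with the sampling, something as simple as TSC timestamps at a few points along the path (IRQ entry, semaphore release, scheduler, application wakeup) would already tell us a lot. A minimal sketch; TSC_KHZ is a placeholder you would have to replace with the real TSC frequency of your machine:

/*
 * Minimal sketch: grab TSC timestamps at a few points along the critical
 * path and dump the deltas later. TSC_KHZ must be set to the real TSC
 * frequency of the machine (placeholder value below).
 */
#include <stdint.h>

#define TSC_KHZ		2400000ULL	/* placeholder: 2.4 GHz */
#define MAX_SAMPLES	16

static uint64_t samples[MAX_SAMPLES];
static unsigned int n_samples;

static inline uint64_t rdtscp(void)
{
	uint32_t lo, hi, aux;

	asm volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
	return ((uint64_t)hi << 32) | lo;
}

/* call this at the interesting points: IRQ entry, semaphore release,
 * scheduler entry, application wakeup, ... */
static inline void trace_point(void)
{
	if (n_samples < MAX_SAMPLES)
		samples[n_samples++] = rdtscp();
}

/* delta between two trace points, in microseconds */
static inline uint64_t delta_us(unsigned int a, unsigned int b)
{
	return (samples[b] - samples[a]) * 1000ULL / TSC_KHZ;
}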

>
> BTW, I have code that "executes" from the ivshmem region, but I don't think
> this should be a problem, should it?

It wasn't designed for it but it should work.

Jan

Chung-Fan Yang

Oct 31, 2019, 1:57:31 AM
to Jailhouse

> Interesting findings already, but I'm afraid we will need to dig deeper:
> Can you describe everything that is part of your measured latency path?

I measured using an oscilloscope and a function generator.
I am using MMIO GPIOs. The application makes a system call and waits for an interrupt on a certain GPIO.
When I send a pulse to the GPIO, the IRQ handler releases a semaphore, which in turn triggers the scheduler and wakes up the application, which sends another pulse to another GPIO using MMIO.

FG -> Serial -> APIC -> RTOS IRQ Hnd -> Scheduler -> Application -> Serial -> OSC

The timing difference between these 2 pulses is measured.

Because of the waiting mechanism used, receiving the pulse involves the system call / semaphore / interrupt handling of the RTOS.
On the other hand, sending doesn't use any of the RTOS features.

> Do you just run code in guest mode or do you also trigger VM exits, e.g. to
> issue ivshmem interrupts to a remote side?

I tried to instrument the system.
So far, there are no additional interrupts sent or received during the whole process.
VM exits do exist for EOIs (systick and serial IRQ) and when I fiddle with the TSC-deadline timer enable/disable bit of the APIC MSR.
The whole process is not related to any ivshmem operations.

> Maybe you can sample some latencies along the critical path so that we have
> a better picture about where we lose time, overall or rather on specific
> actions.

Basically, it is an overall slowdown.
But code in the scheduler and the application slows down more than in other places.

BTW, I tested again with a partially working setup of <kernel 4.19 / Jailhouse v0.11 / old ivshmem2>.
Currently, I cannot get my application running due to some mystery, but I am observing some slowdown.
Pinging the RTOS using ivshmem-net, the RTT shows about 2x latency:
 * <kernel 4.19/Jailhouse v0.11/old ivshmem2>: ~0.060ms
 * <kernel 4.19/Jailhouse v0.11/new ivshmem2>: ~0.130ms

----
Yang

Jan Kiszka

Nov 1, 2019, 3:36:16 AM
to Chung-Fan Yang, Jailhouse
Use x2APIC in your guest, and you will get rid of those VMexits (due to
xAPIC MMIO interception). But that's an unrelated optimization.
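For reference, switching the guest's local APIC to x2APIC mode is just setting the EXTD bit in IA32_APIC_BASE before the APIC is used; afterwards all APIC registers are MSRs (0x800-0x8ff). A sketch with the architectural MSR numbers, assuming CPUID reports x2APIC support to the cell:

/*
 * Sketch: enable x2APIC mode in the guest by setting the EXTD bit in
 * IA32_APIC_BASE. After this, all APIC registers are MSRs (0x800-0x8ff),
 * e.g. EOI becomes a write to MSR 0x80b instead of an MMIO access.
 */
#include <stdint.h>

#define MSR_IA32_APIC_BASE	0x1b
#define APIC_BASE_EXTD		(1ULL << 10)	/* x2APIC mode enable */

static inline uint64_t rdmsr(uint32_t msr)
{
	uint32_t lo, hi;

	asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
	return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
	asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val),
		     "d"((uint32_t)(val >> 32)) : "memory");
}

static void enable_x2apic(void)
{
	uint64_t base = rdmsr(MSR_IA32_APIC_BASE);

	/* assumes CPUID reports x2APIC support and the APIC is enabled */
	wrmsr(MSR_IA32_APIC_BASE, base | APIC_BASE_EXTD);
}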

>
>> Maybe you can sample some latencies along the critical path so that we
>> have a better picture about where we lose time, overall or rather on
>> specific actions.
>
> Basically, it is an overall slowdown.
> But code in the scheduler and the application slows down more than in
> other places.
>
> BTW, I tested again with a partially working setup of <kernel
> 4.19 / Jailhouse v0.11 / old ivshmem2>.
> Currently, I cannot get my application running due to some mystery, but
> I am observing some slowdown.
> Pinging the RTOS using ivshmem-net, the RTT shows about 2x latency:
>  * <kernel 4.19/Jailhouse v0.11/old ivshmem2>: ~0.060ms
>  * <kernel 4.19/Jailhouse v0.11/new ivshmem2>: ~0.130ms
>

Sounds as if we have some caching-related problem. You could enable
access to the perf MSRs (small patch to the MSR bitmap in vmx.c) and
check if you see excessive cache misses in the counters.
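In case someone wants to try this: conceptually the patch just clears the read/write intercept bits for the PMU MSRs in the VMX MSR bitmap. The bitmap layout below is the architectural one from the SDM; the array and helper names are hypothetical, the actual code in hypervisor/arch/x86/vmx.c is structured differently, so treat this as a sketch of the idea rather than a ready patch.

/*
 * Conceptual sketch: clear the read/write intercept bits for the perf
 * counter MSRs in a VMX MSR bitmap. Layout (4 KiB page): read-low @0x000,
 * read-high @0x400, write-low @0x800, write-high @0xC00; a set bit means
 * "cause a VM exit". The msr_bitmap array and helpers are stand-ins.
 */
#include <stdint.h>

#define MSR_IA32_PMC0			0x0c1	/* ..PMC7: 0xc1-0xc8 */
#define MSR_IA32_PERFEVTSEL0		0x186	/* ..PERFEVTSEL7: 0x186-0x18d */
#define MSR_IA32_FIXED_CTR0		0x309	/* ..FIXED_CTR2: 0x309-0x30b */
#define MSR_IA32_FIXED_CTR_CTRL		0x38d
#define MSR_IA32_PERF_GLOBAL_CTRL	0x38f

static uint8_t msr_bitmap[4096];	/* hypothetical stand-in */

static void allow_msr(uint32_t msr)
{
	/* low range: 0x0-0x1fff, high range: 0xc0000000-0xc0001fff */
	unsigned int base = (msr >= 0xc0000000) ? 0x400 : 0;
	uint32_t idx = msr & 0x1fff;

	msr_bitmap[base + idx / 8] &= ~(1 << (idx % 8));		/* read  */
	msr_bitmap[0x800 + base + idx / 8] &= ~(1 << (idx % 8));	/* write */
}

static void allow_perf_msrs(void)
{
	uint32_t msr;

	for (msr = MSR_IA32_PMC0; msr <= MSR_IA32_PMC0 + 7; msr++)
		allow_msr(msr);
	for (msr = MSR_IA32_PERFEVTSEL0; msr <= MSR_IA32_PERFEVTSEL0 + 7; msr++)
		allow_msr(msr);
	for (msr = MSR_IA32_FIXED_CTR0; msr <= MSR_IA32_FIXED_CTR0 + 2; msr++)
		allow_msr(msr);
	allow_msr(MSR_IA32_FIXED_CTR_CTRL);
	allow_msr(MSR_IA32_PERF_GLOBAL_CTRL);
}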

Chung-Fan Yang

Nov 14, 2019, 1:25:35 AM
to Jailhouse


On Friday, November 1, 2019 at 16:36:16 UTC+9, Jan Kiszka wrote:
Thanks for the hint.
I overrode the BIOS setting and enabled it.
There are no VM exits now.
 
>>
>>> Maybe you can sample some latencies along the critical path so that we
>>> have a better picture about where we lose time, overall or rather on
>>> specific actions.
>>
>> Basically, it is an overall slowdown.
>> But code in the scheduler and the application slows down more than in
>> other places.
>>
>> BTW, I tested again with a partially working setup of <kernel
>> 4.19 / Jailhouse v0.11 / old ivshmem2>.
>> Currently, I cannot get my application running due to some mystery, but
>> I am observing some slowdown.
>> Pinging the RTOS using ivshmem-net, the RTT shows about 2x latency:
>>  * <kernel 4.19/Jailhouse v0.11/old ivshmem2>: ~0.060ms
>>  * <kernel 4.19/Jailhouse v0.11/new ivshmem2>: ~0.130ms
>
> Sounds as if we have some caching-related problem. You could enable
> access to the perf MSRs (small patch to the MSR bitmap in vmx.c) and
> check if you see excessive cache misses in the counters.


I have been quite busy lately, so I might leave this problem as is and revisit it later.

I will update the thread when I make new discoveries.

Yang.