I was trying to chase down a theory that my desktop machine (a core i7)
is running warm (the fan sounds like it's at full speed all the time,
and I think it's not always acted this way -- hence the theory).
powertop is never showing it spending any time in C3...
I compiled a kernel without USB/sound/radeon, and ran without X. I was
able to get the wakeups/sec down below 20, but no time is spent in C3.
sysfs looks to agree with powertop here (time = 0 on C3):
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc: CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency: 0
/sys/devices/system/cpu/cpu0/cpuidle/state0/name: C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power: 4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/time: 457
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage: 59
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc: ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency: 1
/sys/devices/system/cpu/cpu0/cpuidle/state1/name: C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/power: 1000
/sys/devices/system/cpu/cpu0/cpuidle/state1/time: 308177
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage: 3975
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc: ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency: 17
/sys/devices/system/cpu/cpu0/cpuidle/state2/name: C2
/sys/devices/system/cpu/cpu0/cpuidle/state2/power: 500
/sys/devices/system/cpu/cpu0/cpuidle/state2/time: 873440787
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage: 239038
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc: ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency: 17
/sys/devices/system/cpu/cpu0/cpuidle/state3/name: C3
/sys/devices/system/cpu/cpu0/cpuidle/state3/power: 350
/sys/devices/system/cpu/cpu0/cpuidle/state3/time: 0
/sys/devices/system/cpu/cpu0/cpuidle/state3/usage: 0
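As a rough cross-check of the numbers above, the per-state `time` counters (cumulative microseconds, per the cpuidle sysfs documentation) can be turned into a share per state. This is only a sketch: `cpuidle_share_percent` is a made-up helper name, and reading the values out of sysfs is left to the caller.

```c
#include <stddef.h>

/*
 * Share (0..100) of the accumulated idle time that was spent in state
 * `idx`, given the cumulative `time` counters of all states.
 * Returns 0.0 when idx is out of range or nothing has been accumulated.
 */
double cpuidle_share_percent(const unsigned long long *time_us,
                             size_t n, size_t idx)
{
    unsigned long long total = 0;

    for (size_t i = 0; i < n; i++)
        total += time_us[i];

    if (idx >= n || total == 0)
        return 0.0;

    return 100.0 * (double)time_us[idx] / (double)total;
}
```

Fed the cpu0 values above (457, 308177, 873440787, 0), this puts C2 above 99.9% and C3 at exactly 0, which agrees with what powertop reports.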
This may be a complete red herring, but I added some printk logic to
acpi_idle_bm_check(), and it is getting called often, but bm_status is
always 1. [I infer from this that the idle logic is trying to go into
C3, but this check is stopping it... Unless I misread something.]
Is this expected behavior or is this a legitimate problem?
How might I investigate this further?
Attaching dmesg, /proc/cpuinfo, powertop -d output.
Thanks,
Jeff Garrett
This is the info from my laptop (using Core 2 processors):
powertop's output:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (10.6%)         2.00 Ghz     1.9%
C0                0.0ms ( 0.0%)         1.67 Ghz     0.1%
C1 mwait          0.0ms ( 0.0%)         1333 Mhz     0.0%
C2 mwait          0.0ms ( 0.0%)         1000 Mhz    98.0%
C3 mwait          1.1ms (89.4%)
and power things:
huang@huang-laptop:~$ cat /proc/acpi/processor/CPU0/power
active state: C0
max_cstate: C8
maximum allowed latency: 2000000000 usec
states:
C1: type[C1] promotion[--] demotion[--]
latency[001] usage[00002364] duration[00000000000000000000]
C2: type[C2] promotion[--] demotion[--]
latency[001] usage[00070662] duration[00000000000006013816]
C3: type[C3] promotion[--] demotion[--]
latency[017] usage[04774185] duration[00000000010838418152]
You can see C3 in powertop, so I think your BIOS has deep
C-states enabled.
-huang
On Tue, 2010-01-26 at 02:47 -0600, Jeff Garrett wrote:
--
peng huang <huangpe...@gmail.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Yea, I'm pretty sure my cpu usage is nearly zero. powertop shows fewer
than 20 wakeups per second, shows 99% C2 residency, top shows 100% idle,
perf top shows acpi_idle_enter_simple as the most common function
(~50%). Very little is running on the box, and I've compiled out the
heavier parts of the kernel (such as USB)...
(It still has ordinary userspace running, e.g. udev & hal, and still has
sshd and network traffic, as examples.)
This is all consistent with a very idle machine, I think.
> this is the info of my laptop(using core 2 processors):
> powertop's output:
> Cn Avg residency P-states (frequencies)
> C0 (cpu running) (10.6%) 2.00 Ghz 1.9%
> C0 0.0ms ( 0.0%) 1.67 Ghz 0.1%
> C1 mwait 0.0ms ( 0.0%) 1333 Mhz 0.0%
> C2 mwait 0.0ms ( 0.0%) 1000 Mhz 98.0%
> C3 mwait 1.1ms (89.4%)
Yea, my laptop also (also core 2) has 700-1000 wakeups/sec and spends
greater than 80% of its time in C3... That's partly why I'm curious
about what my core i7 desktop is doing.
> and power things:
> huang@huang-laptop:~$ cat /proc/acpi/processor/CPU0/power
> active state: C0
> max_cstate: C8
> maximum allowed latency: 2000000000 usec
> states:
> C1: type[C1] promotion[--] demotion[--]
> latency[001] usage[00002364] duration[00000000000000000000]
> C2: type[C2] promotion[--] demotion[--]
> latency[001] usage[00070662] duration[00000000000006013816]
> C3: type[C3] promotion[--] demotion[--]
> latency[017] usage[04774185] duration[00000000010838418152]
>
> you can see C3 with powertop,so i think your BIOS has enabled Deep
> C-state.
Here's my power files...
/proc/acpi/processor/CPU0/power:
active state: C0
max_cstate: C8
maximum allowed latency: 2000000000 usec
states:
C1: type[C1] promotion[--] demotion[--] latency[001] usage[00001470] duration[00000000000000000000]
C2: type[C2] promotion[--] demotion[--] latency[017] usage[00234416] duration[00000000017165798539]
C3: type[C3] promotion[--] demotion[--] latency[017] usage[00000000] duration[00000000000000000000]
/proc/acpi/processor/CPU1/power:
active state: C0
max_cstate: C8
maximum allowed latency: 2000000000 usec
states:
C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000481] duration[00000000000000000000]
C2: type[C2] promotion[--] demotion[--] latency[017] usage[00090169] duration[00000000017188463157]
C3: type[C3] promotion[--] demotion[--] latency[017] usage[00000000] duration[00000000000000000000]
/proc/acpi/processor/CPU2/power:
active state: C0
max_cstate: C8
maximum allowed latency: 2000000000 usec
states:
C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000418] duration[00000000000000000000]
C2: type[C2] promotion[--] demotion[--] latency[017] usage[00068874] duration[00000000017193805291]
C3: type[C3] promotion[--] demotion[--] latency[017] usage[00000000] duration[00000000000000000000]
/proc/acpi/processor/CPU3/power:
active state: C0
max_cstate: C8
maximum allowed latency: 2000000000 usec
states:
C1: type[C1] promotion[--] demotion[--] latency[001] usage[00001356] duration[00000000000000000000]
C2: type[C2] promotion[--] demotion[--] latency[017] usage[00362752] duration[00000000017156707397]
C3: type[C3] promotion[--] demotion[--] latency[017] usage[00000000] duration[00000000000000000000]
> Hi,
>
> I was trying to chase down a theory that my desktop machine (a core i7)
> is running warm (the fan sounds like it's at full speed all the time,
> and I think it's not always acted this way -- hence the theory).
>
> powertop is never showing it spending any time in C3...
>
> I compiled a kernel without USB/sound/radeon, and ran without X. I was
> able to get the wakeups/sec down below 20, but no time is spent in C3.
[...]
> This may be a complete red herring, but I added some printk logic to
> acpi_idle_bm_check(), and it is getting called often, but bm_status is
> always 1. [I infer from this that the idle logic is trying to go into
> C3, but this check is stopping it... Unless I misread something.]
Normally a Core i7 (or any modern Intel system) should not use
bm_check at all. That's only for older systems that didn't support
MWAIT with a C-state hint, but relied on the old port-based interface.
So something is already confused there.
I think it should still work though.
Of course if you really have a lot of bus mastering in the background
then yes there will be no C3.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Yes, your processor is always in C2, which does mean your system is nearly
idle.
> > this is the info of my laptop(using core 2 processors):
> > powertop's output:
> > Cn Avg residency P-states (frequencies)
> > C0 (cpu running) (10.6%) 2.00 Ghz 1.9%
> > C0 0.0ms ( 0.0%) 1.67 Ghz 0.1%
> > C1 mwait 0.0ms ( 0.0%) 1333 Mhz 0.0%
> > C2 mwait 0.0ms ( 0.0%) 1000 Mhz 98.0%
> > C3 mwait 1.1ms (89.4%)
>
> Yea, my laptop also (also core 2) has 700-1000 wakeups/sec and spends
> greater than 80% of its time in C3... That's partly why I'm curious
> about what my core i7 desktop is doing.
So I think it is a Core i7 thing. I have heard that some Intel CPUs have a
problem in the C2 state; maybe that is why your CPU can't enter the C3 state.
I think there is some configuration for deep C-states in the BIOS; maybe
you can try it (though it might not solve this problem...).
Some BIOSes also have an enhanced idle state setting, but I
don't know whether it is the reason the CPU cannot enter the C3 state. You can
try it anyway.
With deep C-states disabled, your BIOS will not give C3 info to the
OS, and then you would see no C3 state in the OS.
> This may be a complete red herring, but I added some printk logic to
> acpi_idle_bm_check(), and it is getting called often, but bm_status is
> always 1. [I infer from this that the idle logic is trying to go into
> C3, but this check is stopping it... Unless I misread something.]
>
> Is this expected behavior or is this a legitimate problem?
>
> How might I investigate this further?
DMA keeps system awake? Possibly USB?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
bm_check = 1, bm_control = 0
I don't know what any of this means. :)
I tried changing processor_idle.c. It reads (for C3):
1106 state->enter = pr->flags.bm_check ?
1107 acpi_idle_enter_bm :
1108 acpi_idle_enter_simple;
So it always calls acpi_idle_enter_bm in my case. I tried modifying it
to call acpi_idle_enter_simple for entering C3 instead. When I did
this, it did make it into C3 according to powertop, but the wakeups per
second grew by at least 10x. I couldn't get that below ~400-800/s, and
the residency in C3 was limited to about ~50%, as reported by powertop.
> So something is already confused there.
Might just be me. :)
> I think it should still work though.
> Of course if you really have a lot of bus mastering in the background
> then yes there will be no C3.
>
> -Andi
I have no idea what counts as bus mastering (is it just DMA transfers to
PCI devices?)... But with a fairly idle system, with things like USB
configured out, what could be doing it if it exists? Would there be
some nice function I could instrument with a few printk's to see? I
compiled with PCI_DEBUG=y, and "bus master" doesn't show up in the
dmesg.
-Jeff
My BIOS has very few options. Almost nothing with regard to
power management. They do let me choose whether to enable
Intel SpeedStep. They also let me choose which state to use
for ACPI suspend (set to S3) but that appears to be it.
> And in some bios there is a enhanced idle state configuration ,but i
> dont known if it is the reason why the cpu cannot enter c3-state.You can
> try it anyway.
I would if I could. :)
> With disable the deep c-state you BIOS will not give c3-info to the
> OS,then you would see there is no c3-state in the OS.
I also found a thread (I think LKML) which said the max c-state is
limited when HT is disabled. I had HT disabled in the BIOS.
Re-enabling HT in the BIOS (with CONFIG_X86_HT=y) doesn't
appear to make any difference.
Thanks,
Jeff
X is not running. USB is not enabled at all (CONFIG_USB_SUPPORT is not
set), and likewise for sound & drm. What could be doing the DMA? :)
If there is DMA, it could be disk or network related. However, I
find it hard to believe that this activity is so constant on a nearly
idle system... Of course, I haven't thought of a way to test this yet.
Attaching my config & dmesg of my latest try.
Thanks,
-Jeff
Jeff,
What do you see if you apply just the patch below?

Also, in addition to "powertop -d" to show what the kernel requests,
please run turbostat to show what the hardware actually did:
http://userweb.kernel.org/~lenb/acpi/utils/pmtools-latest/turbostat/turbostat.c
eg.
# turbostat -d -v sleep 5
thanks,
-Len Brown, Intel Open Source Technology Center
---
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 7c0441f..f528625 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -763,7 +763,7 @@ static const struct file_operations acpi_processor_power_fops = {
 static int acpi_idle_bm_check(void)
 {
 	u32 bm_status = 0;
-
+return bm_status;
 	acpi_read_bit_register(ACPI_BITREG_BUS_MASTER_STATUS, &bm_status);
 	if (bm_status)
 		acpi_write_bit_register(ACPI_BITREG_BUS_MASTER_STATUS, 1);
With the patch, powertop reports good C3 residency and wakeups
remain very low. Seems to work. :)
I attached the powertop & turbostat output with this patch.
However, this confuses me. In a previous experiment in the
acpi_processor_setup_cpuidle() function, I replaced the pointer
to acpi_idle_enter_bm() with a pointer to
acpi_idle_enter_simple() even when bm_check is nonzero. With
that, I was able to get into C3, but the wakeups ballooned. But
the difference between what I did, and what you did, is the
difference between acpi_idle_enter_bm() with acpi_idle_bm_check()
returning zero and acpi_idle_enter_simple(). Those code paths
look almost identical. The bm path calls acpi_unlazy_tlb(), and
doesn't appear to call the ACPI_FLUSH_CPU_CACHE(), and they call
sched_clock_idle_sleep_event() in different places. I don't
understand why any of these differences would have had any
significant effect on wakeups.
I'm left wondering if it's a problem on my part. I should repeat
that previous experiment and see if there really is something
significantly different there.
BTW, getting a bit off topic, but since the two code paths are
almost identical, is there any reason not to unite them?
Something like the attached patch might work?
Thanks,
Jeff
> Jeff,
> What do you see if you apply just the patch below?
>
> Also, in addition to "powertop -d" to show what the kernel requests,
> please run turbostat to show what the hardware actually did:
>
> http://userweb.kernel.org/~lenb/acpi/utils/pmtools-latest/turbostat/turbostat.c
>
> eg.
> # turbostat -d -v sleep 5
>
> thanks,
> -Len Brown, Intel Open Source Technology Center
> ---
To resurrect this thread...
I have a Gigabyte GA-P55M-UD4 motherboard and I have this same problem
as well. Len's patch "works" in that I see C6 being used, but it also
cripples the system - if I do a make -j16 kernel build, I see most jobs
serialized onto one or two cores. Without the patch, I see the
full utilization of all 8 hyper-threads as expected.
Now, Gigabyte have already b0rked these boards up by using the UHCI
controllers on the PCH instead of the rate matching hubs. Maybe that's
directly the cause of BM activity - maybe they screwed something else
up - is it possible for BIOS/ACPI mistakes to lead to this behaviour?
Jeff - is your board Gigabyte too?
--phil
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 7c0441f..f528625 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -763,7 +763,7 @@ static const struct file_operations acpi_processor_power_fops = {
>  static int acpi_idle_bm_check(void)
>  {
>  	u32 bm_status = 0;
> -
> +return bm_status;
>  	acpi_read_bit_register(ACPI_BITREG_BUS_MASTER_STATUS, &bm_status);
>  	if (bm_status)
>  		acpi_write_bit_register(ACPI_BITREG_BUS_MASTER_STATUS, 1);
>
--phil
> On Fri, 05 Feb 2010 12:45:21 -0500 (EST)
> Len Brown <le...@kernel.org> wrote:
>
> > Jeff,
> > What do you see if you apply just the patch below?
> >
> > Also, in addition to "powertop -d" to show what the kernel requests,
> > please run turbostat to show what the hardware actually did:
> >
> > http://userweb.kernel.org/~lenb/acpi/utils/pmtools-latest/turbostat/turbostat.c
> >
> > eg.
> > # turbostat -d -v sleep 5
> >
> > thanks,
> > -Len Brown, Intel Open Source Technology Center
> > ---
>
> To resurrect this thread...
>
> I have a giga-byte GA-P55M-UD4 motherboard and I have this same problem
> as well. Len's patch "works" in that I see C6 being used, but it also
> cripples the system - if I do a make -j16 kernel build, I see most jobs
> serialized onto one or two cores. Without the patch, I see the
> full utilization of all 8 hyper-threads as expected.
Curious failure.
I could imagine that there is something in the design of this board
where we want to not enter a deep C-state, and thus the board and
Linux are doing the right thing by avoiding the C-state.
However, ignoring the bm-status check and blindly going to that state
I would expect to impact throughput and latency, but don't see
how that might 'serialize' the workload or otherwise cause it
to use some cores and not others.
It is possible that we jump into those deep states just to be
immediately forced to jump right back out. You'd see this in
high usage counts under /sys/devices/system/cpu/cpu*/cpuidle.
turbostat, of course, would tell you the actual residency in those states.
Of course there is a twist... The hardware has a feature to recognize
thrashing and may demote an OS request for a deep state into
an actual hardware request for a shallower state. This is one reason
that the output of powertop (request) and turbostat (result)
may be different.
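One way to look for the jump-in-and-right-back-out pattern described above is to divide a state's cpuidle `time` counter by its `usage` counter, which gives the average microseconds spent per entry. The sketch below does just that. `avg_residency_us` and `short_lived` are hypothetical helpers, and treating "average no longer than the exit latency" as suspicious is an assumption made for illustration, not a rule taken from the kernel.

```c
/*
 * Average time per entry into an idle state, in microseconds.
 * `time_us` and `usage` are the cumulative sysfs cpuidle counters.
 * Returns 0 if the state was never entered.
 */
unsigned long long avg_residency_us(unsigned long long time_us,
                                    unsigned long long usage)
{
    return usage ? time_us / usage : 0;
}

/*
 * Heuristic (assumed threshold): the state was entered at least once,
 * yet on average it was held no longer than its advertised exit latency.
 */
int short_lived(unsigned long long time_us, unsigned long long usage,
                unsigned long long latency_us)
{
    return usage != 0 && avg_residency_us(time_us, usage) <= latency_us;
}
```

With the C2 counters posted at the start of the thread (time 873440787, usage 239038, latency 17), the average comes out at 3653 us per entry, far above the latency, so that state at least is not being bounced. As noted above, this still only reflects what the kernel requested, so turbostat remains the place to check what the hardware actually did.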
cheers,
-Len
My board identifies itself as a Dell. No idea if they rebranded a Gigabyte.
The patch seems to work for me as well, powertop shows 97.5% c3,
turbostat shows 93.6% c6 now. I do get weird latency spikes (on I/O)
from time to time.
When I was investigating, I completely configured USB off, and it still
wouldn't go into deep sleep. Not sure how well that meshes with your
UHCI theory.
-Jeff
>
>
> Curious failure.
> I could imagine that there is something in the design of this board
> where we want to not enter a deep C-state, and thus the board and
> Linux are doing the right thing by avoiding the C-state.
> However, ignoring the bm-status check and blindly going to that state
> I would expect to impact throughput and latency, but don't see
> how that might 'serialize' the workload or otherwise cause it
> to use some cores and not others.
Hmm - and now I can't reproduce it. I got proper parallelization across
the kernel compile. I guess some sort of runtime state was messed up,
and I obviously lost that when I rebooted. :-/
> It is possible that we jump into those deep states just to be
> immediately forced to jump right back out. You'd see this in
> high usage counts under /sys/devices/system/cpu/cpu*/cpuidle
>
> turbostat, of course, would tell you the actual residency in those
> states. Of course there is a twist... The hardware has a feature to
> recognize thrashing and may demote an OS request for a deep state into
> an actual hardware request for a shallower state. this is one reason
> that the output of powertop (request) and turbostat (result)
> may be different.
Without the patch, Turbostat showed C3 residency of 99% for most
hyper-threads with one or two getting ~15% C6 residency. PC3 was 75%.
Cores were at their lowest P state.
With the patch, C6 residency is 99%, PC6 is 75% and 7 hyper-threads at
lowest P state with one stubbornly running at a higher level.
I have a very similarly configured machine with an asus motherboard and
it doesn't have this problem - which is another reason I'm wondering if
it's an OEM screwup.
--phil
> My board identifies it as a Dell. No idea if they rebranded a
> gigabyte.
>
> The patch seems to work for me as well, powertop shows 97.5% c3,
> turbostat shows 93.6% c6 now. I do get weird latency spikes (on I/O)
> from time to time.
>
> When I was investigating, I completely configured USB off, and it
> still wouldn't go into deep sleep. Not sure how well that meshes
> with your UHCI theory.
Does your board expose UHCI controllers or just EHCI with the rate
matching hubs? When you say 'configured USB off', do you mean off
in the BIOS or just no drivers?
--phil
I am hopeful that the "right thing to do" is to not look at bm-status
and that perhaps there is a bug where we are looking at it
"by mistake".
However, it is important that the system be operating properly
when it is indeed using C6, as it does with that one-line patch
earlier in this thread. So if that patch causes abnormal
operation, please let me know.
thanks,
Len Brown, Intel Open Source Technology Center
> I am hopeful that the "right thin to do" is to not look at bm-status
> and that perhaps there is a bug where we are looking at it
> "by mistake".
https://patchwork.kernel.org/patch/58962/ - it seems to be a win.
--
Matthew Garrett | mj...@srcf.ucam.org
Indeed. This patch does solve the C6 problem. I'm not in a position to
speak about whether there's any undesirable I/O latency, but it
passes the basic sanity check.
I have filed https://bugzilla.kernel.org/show_bug.cgi?id=15886 with
my acpi dump - assuming that's still useful.
--phil
Luming's patch above basically deletes acpi_idle_bm_check() --
the BM_STS check -- from the C3 path on all Intel SMP boxes.
This is effectively the same as my test patch
https://patchwork.kernel.org/patch/77370/
that made acpi_idle_bm_check() do nothing.
I'm told by the hardware guys that BM_STS is _not_ always
a NOP, and so we're not supposed to simply ignore it on C3 --
though it should be extremely rare that we see it set.
If it is ever set, it should go on and off depending on
activity on some latency sensitive device, like out on the LPC.
It may be possible for the BIOS writer to configure the chipset
so that BM_STS is enabled always, presumably to accommodate
some latency-sensitive device -- or maybe by mistake.
(is it observed to be set always on your systems, or does
it ever change its value?)
The logic in Luming's patch doesn't make sense to me.
bm_check and bm_control relate to C3's need to flush
the cache or the ability to invoke ARB_DIS. They are not
directly related to BM_STS -- which is a bit that tells
us if there has recently been bus master activity of
a type that would break us out of C3.
-Len Brown, Intel Open Source Technology Center
On some platforms like NHM-EX, I was told that it's a NOP,
but I might have been given wrong information at the time I wrote that patch.
IIRC, the ACPI spec just says it's optional.
Thanks,
Luming
> I'm told by the hardware guys that BM_STS is _not_ always
> a NOP, and so we're not supposed to simply ignore it on C3 --
> though it should be extremely rare that we see it set.
> If it is ever set, it should go on and off depending on
> activity on some latency sensitive device, like out on the LPC.
> It may be possible for the BIOS writer to configure the chipset
> so that BM_STS is enabled always, presumably to accomodate
> some latency sensitve device -- or maybe by mistake.
On some hardware we've seen BM_STS be enabled approximately 50% of the
time without any obvious cause.
--
Matthew Garrett | mj...@srcf.ucam.org
> On some platforms like NHM-EX, I was told that it's a NOP,
> But I might be given wrong information at that time when I wrote that patch.
>
> IIRC, acpi spec just say it's optional..
Implementing it is optional, but the spec implies that it should be used
if it's present.
--
Matthew Garrett | mj...@srcf.ucam.org
On the other hand, the relevant section of spec is:
"OSPM uses the BM_STS bit to determine the power state to enter when
considering a transition to or from the C2/C3 power state. The BM_STS is
an optional bit that indicates when bus masters are active. OSPM uses
this bit to determine the policy between the C2 and C3 power states: a
lot of bus master activity demotes the CPU power state to the C2 (or C1
if C2 is not supported), no bus master activity promotes the CPU power
state to the C3 power state. OSPM keeps a running history of the BM_STS
bit to determine CPU power state policy."
while the description of the bit itself is:
"This is the bus master status bit. This bit is set any time a system
bus master requests the system bus, and can only be cleared by writing a
“1” to this bit position. Notice that this bit reflects bus master
activity, not CPU activity (this bit monitors any bus master that can
cause an incoherent cache for a processor in the C3 state when the bus
master performs a memory transaction)."
which implies that as long as you don't have any cache coherency
concerns, it's acceptable (if potentially suboptimal) to enter C3 even
if the bit is set.
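To make the "sticky, write 1 to clear" wording concrete, here is a small user-space model of such a status bit, together with a read-and-clear check shaped like the acpi_idle_bm_check() hunk quoted earlier in the thread. Everything here is simulated: `sim_bm_reg`, `SIM_BM_STS` and the `sim_*` functions are invented names, and no real ACPI register is touched.

```c
#include <stdint.h>

#define SIM_BM_STS 0x1u  /* bit standing in for BM_STS (assumed position) */

/* Simulated status register. */
struct sim_bm_reg {
    uint32_t value;
};

/* "Hardware" side: any bus master request sets the sticky bit. */
void sim_bus_master_request(struct sim_bm_reg *reg)
{
    reg->value |= SIM_BM_STS;
}

/* Write-1-to-clear: every bit written as 1 is cleared, 0 bits are kept. */
void sim_write(struct sim_bm_reg *reg, uint32_t bits)
{
    reg->value &= ~bits;
}

/*
 * Read the bit and, if it was set, clear it by writing 1 back.
 * Returns 1 if there was bus master activity since the previous check.
 */
int sim_bm_check(struct sim_bm_reg *reg)
{
    uint32_t status = reg->value & SIM_BM_STS;

    if (status)
        sim_write(reg, SIM_BM_STS);

    return status != 0;
}
```

In this model a check returns 1 once after any number of requests and then 0 until the next request. A bit that reads 1 on every single check, as reported earlier in the thread, therefore means either constant bus master activity between checks or a bit that is being held set.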
--
Matthew Garrett | mj...@srcf.ucam.org
> On the other hand, the relevant section of spec is:
>
> "OSPM uses the BM_STS bit to determine the power state to enter when
> considering a transition to or from the C2/C3 power state. The BM_STS is
> an optional bit that indicates when bus masters are active. OSPM uses
> this bit to determine the policy between the C2 and C3 power states: a
> lot of bus master activity demotes the CPU power state to the C2 (or C1
> if C2 is not supported), no bus master activity promotes the CPU power
> state to the C3 power state. OSPM keeps a running history of the BM_STS
> bit to determine CPU power state policy."
>
> while the description of the bit itself is:
>
> "This is the bus master status bit. This bit is set any time a system
> bus master requests the system bus, and can only be cleared by writing a
> “1” to this bit position. Notice that this bit reflects bus master
> activity, not CPU activity (this bit monitors any bus master that can
> cause an incoherent cache for a processor in the C3 state when the bus
> master performs a memory transaction)."
>
> which implies that as long as you don't have any cache coherency
> concerns, it's acceptable (if potentially suboptimal) to enter C3 even
> if the bit is set.
As I wrote, the HW people tell me that implication is usually correct,
but there exist cases where it is incorrect. (Of course the way
it is supposed to work is that when BM_STS is not meaningful,
it always returns zero)
The ACPI spec talks about BM_STS being set by traffic that is incoherent
with the frozen cache of C3, requiring a wake up of the processor
from C3 to snoop the traffic. It was written 10 years before the
hardware started automatically snooping the L3 when the processor was off,
and before the hardware learned how to automatically flush the cache
to get into deep C-states. So the description is stale, but the
underlying issue is unchanged. There exist devices which can not
handle the wakeup latency of some deep C-states. The BM_STS bit
is a chip-set bit that the BIOS writer can use to prevent the OS
from using the deep C-states when those devices are active.
I'm told that the cases in question are some legacy devices
hanging off the LPC bus, which should be rare. More interesting
is isochronous traffic over some 1394 controllers -- though
I don't know if Linux runs into that. If we do, one option
would be to ignore BM_STS, but to use pm_qos to disable the
deep c-state when needed -- a mechanism we've used for
several devices in the past.
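The pm_qos idea boils down to a latency budget: a driver declares the longest wakeup latency it can tolerate, and idle states with a larger exit latency are left unused. The following is a user-space sketch of that selection only. It is not the kernel's governor code, and `struct idle_state` and `deepest_allowed` are hypothetical names.

```c
#include <stddef.h>

/* One idle state: a label plus its advertised exit latency. */
struct idle_state {
    const char *name;
    unsigned int exit_latency_us;
};

/*
 * States are ordered shallow to deep. Return the index of the deepest
 * state whose exit latency fits within `limit_us`, falling back to
 * index 0 (the shallowest state) when nothing deeper fits.
 */
size_t deepest_allowed(const struct idle_state *states, size_t n,
                       unsigned int limit_us)
{
    size_t pick = 0;

    for (size_t i = 1; i < n; i++) {
        if (states[i].exit_latency_us <= limit_us)
            pick = i;
    }
    return pick;
}
```

With the latencies from the sysfs dump at the start of the thread (0, 1, 17, 17), the default limit of 2000000000 usec allows the deepest state, while a limit of 10 usec would stop at C1.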
I believe that the BIOS writer also has the option to keep
BM_STS set always. However, that doesn't make sense to me
as it would be simpler to just disable the C-state in _CST
on that platform.
So if we see a nehalem system that has BM_STS *always* set,
even when no devices are active in the system, my guess is
that the BIOS mis-configured the chip-set and we should
ignore that bit. If BM_STS is changing at run time, then
that is a more interesting situation, and we should endeavor
to find what device activity is changing it.
Assuming it is modern hardware, please get the acpidump and lspci -vv
output from that hardware to this bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=15886
thanks,
-Len Brown, Intel Open Source Technology Center
--
> So if we see a nehalem system that has BM_STS *always* set,
> even when no devices are active in the system, my guess is
> that the BIOS mis-configured the chip-set and we should
> ignore that bit. If BM_STS is changing at run time, then
> that is a more interesting situation, and we should endeavor
> to find what device activity is changing it.
Right. Determining that seems... awkward. FWIW, we've been shipping
Luming's patch for several months now without anything obviously
breaking in the process. This behaviour seems reasonably prevalent on
Nehalem-EX systems.
The BIOS exports deep C-states on modern Intel processors
as "C3-type" to satisfy various legacy Operating Systems.
However, the hardware actually supports C2-type, and does
not require the extra costs of C3-type.
One of the costs is to check the BM_STS (Bus Master Status)
bit before entering C3, and instead choose a shallower C-state
if there was "recent bus master activity".
We have found a number of systems in the field that erroneously
set BM_STS and prevent entry into deep C-states.
Re-define BIOS presented C3-type states as C2-type states
on modern processors to avoid this issue.
If a device in the system really does want to prevent use
of a deep C-state, its Linux driver should register its
constraints via pm_qos_add_request().
https://bugzilla.kernel.org/show_bug.cgi?id=15886
Signed-off-by: Len Brown <len....@intel.com>
---
drivers/acpi/processor_idle.c | 38 ++++++++++++++++++++++++++++++++++++++
1 files changed, 38 insertions(+), 0 deletions(-)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b1b3856..14d1a0c 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -607,6 +607,38 @@ static void acpi_processor_power_verify_c3(struct acpi_processor *pr,
 	return;
 }
 
+/*
+ * Modern Intel processors support only ACPI C2-type C-states.
+ * But the BIOS tends to report its deepest C-state as C3-type
+ * to satisfy various old operating systems. We can skip
+ * C3 OS overhead by treating the deep-states as C2-type.
+ * Also, we can avoid checking BM_STS, which on some systems
+ * erroneously prevents entry into C3-type states.
+ */
+static int acpi_c3type_is_really_c2type(void) {
+
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return 0;
+
+	if (boot_cpu_data.x86 != 6)
+		return 0;
+
+	switch(boot_cpu_data.x86_model) {
+	case 0x1A: /* Core i7, Xeon 5500 series */
+	case 0x1E: /* Core i7 and i5 Processor */
+	case 0x1F: /* Core i7 and i5 Processor */
+	case 0x2E: /* NHM-EX Xeon */
+	case 0x2F: /* WSM-EX Xeon */
+	case 0x25: /* WSM */
+	case 0x2C: /* WSM */
+	case 0x2A: /* SNB */
+	case 0x2D: /* SNB Xeon */
+		return 1;
+	default:
+		return 0;
+	}
+}
+
 static int acpi_processor_power_verify(struct acpi_processor *pr)
 {
 	unsigned int i;
@@ -617,6 +649,12 @@ static int acpi_processor_power_verify(struct acpi_processor *pr)
 	for (i = 1; i < ACPI_PROCESSOR_MAX_POWER && i <= max_cstate; i++) {
 		struct acpi_processor_cx *cx = &pr->power.states[i];
 
+		if ((cx->type == ACPI_STATE_C3)
+		    && acpi_c3type_is_really_c2type()) {
+			ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Redefining C3-type to C2\n"));
+			cx->type = ACPI_STATE_C2;
+		}
+
 		switch (cx->type) {
 		case ACPI_STATE_C1:
 			cx->valid = 1;
--
1.7.2.rc3.43.g24e7a
Agree with the intent. But, I think it's cleaner to keep all arch model
checks in arch/x86/kernel/acpi/cstate.c.
Thanks,
Venki
Please attach the output from acpidump for your dell to this bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=15886
I think we know how to cleanly solve the issue to make Philip's Gigabyte
happy, and I'd like to know if your Dell is the same situation
or a different one.
Unfortunately, Matthew's HP fails for a different reason than Philip's
Gigabyte. (The BIOS asks for it - so we'll have to poke at the chipset
to find out the reason it is setting BM_STS)
thanks,
Len Brown, Intel Open Source Technology Center
--
I reviewed the patch and it looks good to me.
I would suggest adding a command-line option for this too,
in case someone wants to run an older kernel on a new system not
known by your patch yet.
-Andi
As detailed in the bug report
https://bugzilla.kernel.org/show_bug.cgi?id=15886
we should be able to fix some of these boxes
by paying attention to an ACPI flag we didn't
realize existed until yesterday.
I'll follow-up with a new patch today.
However, we'll still have issues with systems
like the HP DL360 G6 which explicitly set the
flag to ask for BM_STS checking and configure
the chipset such that BM_STS is active.
That may require a BIOS fix, or we may
have to run intel_idle on that box --
since intel_idle ignores BM_STS always
and instead relies on drivers to use pm_qos
to register device latency constraints.
thanks,
Len Brown, Intel Open Source Technology Center
--
It turns out that there is a bit in the _CST for Intel FFH C3
that tells the OS if we should be checking BM_STS or not.
Linux has been unconditionally checking BM_STS.
If the chip-set is configured to enable BM_STS,
it can retard or completely prevent entry into
deep C-states -- as illustrated by turbostat:
http://userweb.kernel.org/~lenb/acpi/utils/pmtools/turbostat/
ref: Intel Processor Vendor-Specific ACPI Interface Specification
table 4 "_CST FFH GAS Field Encoding"
Bit 1: Set to 1 if OSPM should use Bus Master avoidance for this C-state
https://bugzilla.kernel.org/show_bug.cgi?id=15886
Signed-off-by: Len Brown <len....@intel.com>
---
arch/x86/kernel/acpi/cstate.c | 9 +++++++++
drivers/acpi/processor_idle.c | 2 +-
include/acpi/processor.h | 3 ++-
3 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index 2e837f5..fb7a5f0 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -145,6 +145,15 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
percpu_entry->states[cx->index].eax = cx->address;
percpu_entry->states[cx->index].ecx = MWAIT_ECX_INTERRUPT_BREAK;
}
+
+ /*
+ * For _CST FFH on Intel, if GAS.access_size bit 1 is cleared,
+ * then we should skip checking BM_STS for this C-state.
+ * ref: "Intel Processor Vendor-Specific ACPI Interface Specification"
+ */
+ if ((c->x86_vendor == X86_VENDOR_INTEL) && !(reg->access_size & 0x2))
+ cx->bm_sts_skip = 1;
+
return retval;
}
EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b1b3856..b351342 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -947,7 +947,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
if (acpi_idle_suspend)
return(acpi_idle_enter_c1(dev, state));
- if (acpi_idle_bm_check()) {
+ if (!cx->bm_sts_skip && acpi_idle_bm_check()) {
if (dev->safe_state) {
dev->last_state = dev->safe_state;
return dev->safe_state->enter(dev, dev->safe_state);
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index da565a4..a68ca8a 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -48,7 +48,7 @@ struct acpi_power_register {
u8 space_id;
u8 bit_width;
u8 bit_offset;
- u8 reserved;
+ u8 access_size;
u64 address;
} __attribute__ ((packed));
@@ -63,6 +63,7 @@ struct acpi_processor_cx {
u32 power;
u32 usage;
u64 time;
+ u8 bm_sts_skip;
char desc[ACPI_CX_DESC_LEN];
};
--
1.7.2
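For anyone following along without the spec at hand, the decision made in the cstate.c hunk can be sketched as a small user-space function. This is only an illustration, not the kernel code: `enum vendor`, `struct ffh_reg`, and `FFH_BM_AVOIDANCE` are hypothetical stand-ins for `boot_cpu_data.x86_vendor`, `struct acpi_power_register`, and the literal `0x2` used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's X86_VENDOR_* constants. */
enum vendor { VENDOR_INTEL, VENDOR_OTHER };

/* Bit 1 of the FFH GAS access_size byte: "use Bus Master avoidance". */
#define FFH_BM_AVOIDANCE 0x2

/* Only the field the patch looks at; the real struct has more members. */
struct ffh_reg {
    uint8_t access_size;
};

/*
 * Return true when the BM_STS check should be skipped for this C-state:
 * the CPU is Intel and the BIOS left the bus-master-avoidance bit clear.
 */
static bool bm_sts_skip(enum vendor v, const struct ffh_reg *reg)
{
    assert(reg != NULL);
    return v == VENDOR_INTEL && !(reg->access_size & FFH_BM_AVOIDANCE);
}
```

Under these assumptions, an access_size of 0x01 on Intel yields true (skip the check), while 0x03 yields false (the BIOS is asking for BM_STS checking). A non-Intel vendor always yields false, so the existing behaviour is kept there.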
I'm curious as to why you see a problem with the DL380G6 as the one I have here happily sits in C6 when idle.
your turbostat util shows:
CPU GHz TSC %c0 %c1 %c3 %c6 %pc3 %pc6
avg 1.64 2.27 0.16 0.12 0.00 99.71 0.00 90.15
and powertop has results like:
Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0,1%) Turbo Mode 0,0%
polling 0,0ms ( 0,0%) 2,27 Ghz 0,0%
C1 mwait 0,1ms ( 0,0%) 2,13 Ghz 0,0%
C2 mwait 1,0ms ( 0,0%) 2,00 Ghz 0,0%
C3 mwait 90,4ms (99,9%) 1,60 Ghz 100,0%
this is with v2.6.35-rc5-176-gcd5b8f8 and using acpi_idle. I've deliberately disabled intel_idle to test, however using intel_idle
gives almost identical results.
Looking at the bug 15886, the Access Size 0x03 entries you mentioned are all 0x01 on this system. I've also uploaded the acpidump
from this DL380G6 to that bug so that you can check I've not just looked in the wrong place.
Did the first acpidump come from a system with the 'HP Power Regulator' setting in the bios set to OS Control mode ? My system is
set this way and it seems to work as expected.
The other settings for this option appear to be designed to override OS power management controls, for example the description of
the 'Static High Performance' option suggests it'll somehow force the CPU to operate in the highest performance mode all of the
time: "HP Static High Performance Mode: Processors will run in their maximum power/performance state at all times regardless of the
OS power management policy".
If this does turn out to be as simple as a bios setting, should we really be trying to workaround what may be a legitimate decision
by the server's admin?
Iain
"processor.bm_check_disable=1" prevents Linux from checking BM_STS
before entering C3-type cpu power states.
This may be useful for a system running acpi_idle
where the BIOS exports FADT C-states, _CST IO C-states,
or _CST FFH C-states with the BM_STS bit set;
while configuring the chipset to set BM_STS
more frequently than perhaps is optimal.
Note that such systems may have been developed
using a tickful OS that would quickly clear BM_STS,
rather than a tickless OS that may go for some time
between checking and clearing BM_STS.
Note also that an alternative for newer systems
is to use the intel_idle driver, which always
ignores BM_STS, relying on Linux device drivers
to register constraints explicitly via PM_QOS.
https://bugzilla.kernel.org/show_bug.cgi?id=15886
Signed-off-by: Len Brown <len....@intel.com>
---
drivers/acpi/processor_idle.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b351342..1d41048 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -76,6 +76,8 @@ static unsigned int max_cstate __read_mostly = ACPI_PROCESSOR_MAX_POWER;
module_param(max_cstate, uint, 0000);
static unsigned int nocst __read_mostly;
module_param(nocst, uint, 0000);
+static unsigned int bm_check_disable __read_mostly;
+module_param(bm_check_disable, uint, 0000);
static unsigned int latency_factor __read_mostly = 2;
module_param(latency_factor, uint, 0644);
@@ -763,6 +765,9 @@ static int acpi_idle_bm_check(void)
{
u32 bm_status = 0;
+ if (bm_check_disable)
+ return 0;
+
acpi_read_bit_register(ACPI_BITREG_BUS_MASTER_STATUS, &bm_status);
if (bm_status)
acpi_write_bit_register(ACPI_BITREG_BUS_MASTER_STATUS, 1);
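One way to judge whether booting with the parameter changes anything is to compare the cpuidle residency counters before and after, using the same sysfs files quoted at the start of this thread. The following is a rough user-space sketch, not part of the patch; the helper names `read_counter`, `state_time`, and `share_percent` are made up for illustration, and only cpu0 is looked at.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned counter from a sysfs file; 0 on success, -1 on error. */
static int read_counter(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Fetch the "time" counter of cpuidle state idx on cpu0. */
static int state_time(int idx, unsigned long long *out)
{
    char path[128];

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cpuidle/state%d/time", idx);
    return read_counter(path, out);
}

/* Share (in percent) of the summed counters that belongs to state idx. */
static double share_percent(const unsigned long long *t, int n, int idx)
{
    unsigned long long total = 0;
    int i;

    assert(n > 0 && idx >= 0 && idx < n);
    for (i = 0; i < n; i++)
        total += t[i];
    return total ? 100.0 * (double)t[idx] / (double)total : 0.0;
}
```

Fed with the four counters from the original report (457, 308177, 873440787, 0), `share_percent` gives 0 for state3 and just under 100 for state2, which is the symptom being chased here. A non-zero share for state3 after a reboot with the parameter set would suggest BM_STS was what kept the CPU out of C3.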
Please ignore me, apologies for the noise.
Just noticed the problem system was a DL360 and mine is a DL380. Long day spent working with both 360's and 380's - I don't seem to
be able to tell them apart anymore..
yay!
> your turbostat util shows:
>
> CPU GHz TSC %c0 %c1 %c3 %c6 %pc3 %pc6
> avg 1.64 2.27 0.16 0.12 0.00 99.71 0.00 90.15
> Looking at the bug 15886, the Access Size 0x03 entries you mentioned are all
> 0x01 on this system. I've also uploaded the acpidump from this DL380G6 to that
> bug so that you can check I've not just looked in the wrong place.
You read it correctly, your BIOS does not request BM_STS, mjg59's does.
> Did the first acpidump come from a system with the 'HP Power Regulator'
> setting in the bios set to OS Control mode ? My system is set this way and it
> seems to work as expected.
I expect that is to enable PCC, which would change P-states,
but unlikely would have an effect on C-states.
If you can try it both ways that might be good to know.
(include powertop display once again)
Of course, the default setting is what 99% of customers use...
> The other settings for this option appear to be designed to override OS power
> management controls, for example the description of the 'Static High
> Performance' option suggests it'll somehow force the CPU to operate in the
> highest performance mode all of the time: "HP Static High Performance Mode:
> Processors will run in their maximum power/performance state at all times
> regardless of the OS power management policy".
This is BIOS writer "value add".
Unclear how it might be an improvement over what Linux has been shipping
for years.
> If this does turn out to be as simple as a bios setting, should we really be
> trying to workaround what may be a legitimate decision by the server's admin?
Ideally we will do exactly as the BIOS requests.
However, sometimes what they request makes lots of sense
on some version of Windows, and may make less sense
when running Linux.
Please upload the output from dmidecode to the bug report.
I am hopeful that you have a current BIOS and that
Matthew may have a pre-production BIOS.
thanks,
Len Brown, Intel Open Source Technology Center
--
Right, and on a DL360 G6 with the 07-24-2009 bios version I saw the same.
> I expect that is to enable PCC, which would change P-states,
> but unlikely would have an effect on C-states.
I found another option in the bios to limit or disable the C-states today, so plenty of opportunity to configure the system into an
odd state.
> If you can try it both ways that might be good to know.
> (include powertop display once again)
> Of course, the default setting is what 99% of customers use...
I'll upload an archive to the bugzilla entry with the details. What seems to happen is that when you set the default Balanced Power
and Performance mode the CST code vanishes completely and the processor manages to get to c6 some of the time. Enable OS Control
mode and the bad CST code appears.
> This is BIOS writer "value add".
> Unclear how it might be an improvement over what Linux has been shipping
> for years.
Well yes, having Linux and the bios fighting for control probably isn't going to help.
> Please upload the output from dmidecode to the bug report.
> I am hopeful that you have a current BIOS and that
> Matthew may have a pre-production BIOS.
I've uploaded an archive with dmidecode, turbostat and powertop dumps. There are dumps with the bios set to the default, and to OS
Control mode.
The original bios on my DL360G6 was 07-24-2009 and has the same issue as Matthew. I upgraded the machine to the latest 2010.05.15
and repeated the tests.
Good news is that the new bios has fixed the CST code so that the Access length values are all 0x01 when they're present and the
dumps show the processor getting into c6 much more.
So you were correct, bios fix was needed.
Iain
Thanks. I don't fully understand why the check for this option
is in a different place than the register check in the earlier patch?
This needs to be also documented in Documentation/kernel-parameters.txt
Other than that it looks good.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
upstream 718be4aaf3613cf7c2d097f925abc3d3553c0605
It turns out that there is a bit in the _CST for Intel FFH C3
that tells the OS if we should be checking BM_STS or not.
Linux has been unconditionally checking BM_STS.
If the chip-set is configured to enable BM_STS,
it can retard or completely prevent entry into
deep C-states -- as illustrated by turbostat:
http://userweb.kernel.org/~lenb/acpi/utils/pmtools/turbostat/
ref: Intel Processor Vendor-Specific ACPI Interface Specification
table 4 "_CST FFH GAS Field Encoding"
Bit 1: Set to 1 if OSPM should use Bus Master avoidance for this C-state
https://bugzilla.kernel.org/show_bug.cgi?id=15886
Signed-off-by: Len Brown <len....@intel.com>
---
this backport applies cleanly to 2.6.32.y.
It applies w/ a small offset to 2.6.33.y and 2.6.34.y
arch/x86/kernel/acpi/cstate.c | 9 +++++++++
drivers/acpi/processor_idle.c | 2 +-
include/acpi/processor.h | 3 ++-
3 files changed, 12 insertions(+), 2 deletions(-)
Index: linux-2.6.32.y/arch/x86/kernel/acpi/cstate.c
===================================================================
--- linux-2.6.32.y.orig/arch/x86/kernel/acpi/cstate.c
+++ linux-2.6.32.y/arch/x86/kernel/acpi/cstate.c
@@ -145,6 +145,15 @@ int acpi_processor_ffh_cstate_probe(unsi
percpu_entry->states[cx->index].eax = cx->address;
percpu_entry->states[cx->index].ecx = MWAIT_ECX_INTERRUPT_BREAK;
}
+
+ /*
+ * For _CST FFH on Intel, if GAS.access_size bit 1 is cleared,
+ * then we should skip checking BM_STS for this C-state.
+ * ref: "Intel Processor Vendor-Specific ACPI Interface Specification"
+ */
+ if ((c->x86_vendor == X86_VENDOR_INTEL) && !(reg->access_size & 0x2))
+ cx->bm_sts_skip = 1;
+
return retval;
}
EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
Index: linux-2.6.32.y/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.32.y.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.32.y/drivers/acpi/processor_idle.c
@@ -962,7 +962,7 @@ static int acpi_idle_enter_bm(struct cpu
if (acpi_idle_suspend)
return(acpi_idle_enter_c1(dev, state));
- if (acpi_idle_bm_check()) {
+ if (!cx->bm_sts_skip && acpi_idle_bm_check()) {
if (dev->safe_state) {
dev->last_state = dev->safe_state;
return dev->safe_state->enter(dev, dev->safe_state);
Index: linux-2.6.32.y/include/acpi/processor.h
===================================================================
--- linux-2.6.32.y.orig/include/acpi/processor.h
+++ linux-2.6.32.y/include/acpi/processor.h
@@ -48,7 +48,7 @@ struct acpi_power_register {
u8 space_id;
u8 bit_width;
u8 bit_offset;
- u8 reserved;
+ u8 access_size;
u64 address;
} __attribute__ ((packed));
@@ -74,6 +74,7 @@ struct acpi_processor_cx {
u32 power;
u32 usage;
u64 time;
+ u8 bm_sts_skip;
struct acpi_processor_cx_policy promotion;
struct acpi_processor_cx_policy demotion;
char desc[ACPI_CX_DESC_LEN];
Technically, it could have been.
There are a couple of constraints in the layout of this code.
The _CST flag is x86 (actually Intel) specific -- so the detection
went into arch/x86/kernel/acpi/cstate.c
However, the operation of that flag is per C-state,
not necessarily per system -- so we remember the flag
in a cx->bm_sts_skip flag and check it in the
'acpi generic' drivers/acpi/processor_idle.c
But we can't test a per cx flag inside acpi_idle_bm_check()
because it doesn't have access to the cx, so I put that
test at the site of its only caller.
In this 2nd patch...
we added a 'generic' ACPI bootparam that applies
to all C-states. So it overrides any per-cstate flag
and it is static to the processor_idle.c file,
so it seemed cleanest (to me)
to push it down inside acpi_idle_bm_check()
rather than in its only caller.
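To make the layering concrete, here is a minimal user-space sketch of how the two checks combine. It is a simplification under stated assumptions: `fake_bm_sts` stands in for the hardware BM_STS bit, `struct cx_sketch` for `struct acpi_processor_cx`, and `must_fall_back` for the test at the top of acpi_idle_enter_bm(). None of those names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the module parameter; file-scope, like the real one. */
static unsigned int bm_check_disable;

/* Stand-in for the hardware BM_STS bit so the sketch runs in user space. */
static bool fake_bm_sts;

/* Only the per-C-state flag from the first patch. */
struct cx_sketch {
    bool bm_sts_skip;
};

/* The global override lives inside the check, which has no cx to look at. */
static int bm_check(void)
{
    if (bm_check_disable)
        return 0;
    if (fake_bm_sts) {
        fake_bm_sts = false;    /* the real code clears the bit after reading */
        return 1;
    }
    return 0;
}

/* The per-state flag is tested by the caller, which does have the cx. */
static bool must_fall_back(const struct cx_sketch *cx)
{
    assert(cx != NULL);
    return !cx->bm_sts_skip && bm_check();
}
```

With this arrangement either switch alone is enough to suppress the fallback: a C-state whose BIOS entry cleared the bit never reaches bm_check(), and with the boot parameter set bm_check() reports no bus-master activity for every C-state.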
> This needs to be also documented in Documentation/kernel-parameters.txt
I thought about that and decided against it.
While we do document some driver specific modparams
in kernel-parameters.txt, I do not expect this one to
be used that often -- mostly for diagnosis of BIOS bugs.
I know of two machines that need it,
and both of those machines have a BIOS update
or a BIOS update in progress that make it unnecessary.
thanks for caring.
Len Brown, Intel Open Source Technology Center.
I think even obscure parameters should be documented.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> > > This needs to be also documented in Documentation/kernel-parameters.txt
> >
> > I thought about that and decided against it.
> > While we do document some driver specific modparams
> > in kernel-parameters.txt, I do not expect this one to
> > be used that often -- mostly for diagnosis of BIOS bugs.
> > I know of two machines that need it,
> > and both of those machines have a BIOS update
> > or a BIOS update in progress that make it unnecessary.
>
> I think even obscure parameters should be documented.
Where?
kernel-parameters.txt seems to be mostly about core kernel parameters.
While there are some key driver parameters in there, they
appear to be the exception.
-Len
Can you elaborate on this? I thought the only difference between
C2-type and C3-type is busmastering...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
This is better than cpu whitelist. Thanks!
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
The main difference is latency.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.