Ryzen public erratas

Konstantin Belousov

Jun 13, 2018, 6:39:31 AM
Today I noted that AMD published the public errata document for Ryzens,
https://developer.amd.com/wp-content/resources/55449_1.12.pdf

Some of the issues listed there look quite relevant to the potential
hangs that some people still experience with these machines. I wrote
a script that should apply the recommended workarounds for the errata
that I find interesting.

To run it, kldload cpuctl, then apply the latest firmware (microcode)
update to your CPU, then run the following shell script. Comments
indicate the erratum number for each workaround.

Please report the results. If the script helps, I will code the kernel
change to apply the workarounds.

#!/bin/sh

# Enable workarounds for errata listed in
# https://developer.amd.com/wp-content/resources/55449_1.12.pdf

# 1057, 1109
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt

# Needs the cpuctl module loaded (kldload cpuctl) and root privileges.
for x in /dev/cpuctl*; do
    # 1021
    cpucontrol -m '0xc0011029|=0x2000' "$x"
    # 1033
    cpucontrol -m '0xc0011020|=0x10' "$x"
    # 1049
    cpucontrol -m '0xc0011028|=0x10' "$x"
    # 1095
    cpucontrol -m '0xc0011020|=0x200000000000000' "$x"
done
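
Each cpucontrol line in the script ORs a single-bit mask into a
model-specific register. As a rough aid for matching those masks against
bit numbers in the errata document, the sketch below converts a mask to
its bit index. The helper name bit_index is made up for illustration and
is not part of cpucontrol; the sketch assumes a shell with 64-bit
arithmetic.

```sh
#!/bin/sh
# Hypothetical helper: print the index of the highest set bit in a mask.
# For the single-bit masks used in the script, that is the bit number.
bit_index() {
    mask=$(( $1 ))
    idx=0
    while [ "$mask" -gt 1 ]; do
        mask=$(( $mask >> 1 ))
        idx=$(( $idx + 1 ))
    done
    echo "$idx"
}

# Masks taken from the script in this message.
for m in 0x2000 0x10 0x200000000000000; do
    echo "$m -> bit $(bit_index "$m")"
done
```

For the three distinct masks in the script this reports bits 13, 4 and 57.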

_______________________________________________
freebsd...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-curre...@freebsd.org"

Johannes Lundberg

Jun 13, 2018, 7:10:33 AM
Hi

Thanks for the fix! I'm trying it now on my Ryzen 3 2200G, which does
experience occasional random resets.

About updating to the latest firmware: is this something that's done from
the BIOS or from FreeBSD? If the latter, how?

Eitan Adler

Jun 13, 2018, 7:21:47 AM
On 13 June 2018 at 03:35, Konstantin Belousov <kost...@gmail.com> wrote:
> Today I noted that AMD published the public errata document for Ryzens,
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Some of the issues listed there looks quite relevant to the potential
> hangs that some people still experience with the machines. I wrote
> a script which should apply the recommended workarounds to the erratas
> that I find interesting.
>
> To run it, kldload cpuctl, then apply the latest firmware update to your
> CPU, then run the following shell script. Comments indicate the errata
> number for the workarounds.
>
> Please report the results. If the script helps, I will code the kernel
> change to apply the workarounds.
>
> #!/bin/sh
>
> # Enable workarounds for erratas listed in
> # https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> # 1057, 1109
> sysctl machdep.idle_mwait=0
> sysctl machdep.idle=hlt


Is this needed if it was previously machdep.idle: acpi?


--
Eitan Adler

Konstantin Belousov

Jun 13, 2018, 7:51:43 AM
From FreeBSD: install sysutils/devcpu-data, then run
    service microcode_update start
And of course, you must flash the latest BIOS.

The microcode update must be applied before running the script.
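
The order described here (load cpuctl, update the microcode, then apply
the workarounds) could be wrapped in a small driver like the sketch
below. The DRY_RUN switch and the ./fix.sh path are assumptions made for
illustration; onestart is used so that the rc script runs even when
microcode_update_enable is not set in rc.conf.

```sh
#!/bin/sh
# Sketch only: run the three steps in the order given in the thread.
# DRY_RUN and FIX_SCRIPT are illustrative names, not FreeBSD conventions.
DRY_RUN=${DRY_RUN:-1}
FIX_SCRIPT=${FIX_SCRIPT:-./fix.sh}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

apply_all() {
    # kldload reports an error if cpuctl is already loaded; harmless here.
    run kldload cpuctl
    # Provided by the sysutils/devcpu-data port.
    run service microcode_update onestart
    # The workaround script posted in this thread, saved locally.
    run sh "$FIX_SCRIPT"
}

apply_all
```

By default this only prints the commands; setting DRY_RUN=0 and running
it as root would execute them.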

Gary Jennejohn

Jun 13, 2018, 10:57:05 AM
I added before-and-after output to my version of the script and
saw that my BIOS already sets all the relevant bits at startup.

So a BIOS update might help.

--
Gary Jennejohn
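
A before-and-after check of this kind can be approximated by reading
each register and testing the workaround bit. The sketch below assumes
that "cpucontrol -m <msr> <device>" prints a line of the form
"MSR 0xc0011029: 0x00000000 0x00002000" (high 32 bits, then low 32
bits); verify that against your cpucontrol version. The function names
are made up for illustration.

```sh
#!/bin/sh
# Sketch: report whether a workaround bit is already set in an MSR.
# Assumed cpucontrol output format: "MSR 0x<msr>: 0x<high32> 0x<low32>".

# msr_has_mask "<cpucontrol output line>" <mask>
# Exit status 0 if every bit of <mask> is set, 1 if not, 2 if unparsable.
msr_has_mask() {
    _high=$(echo "$1" | awk '{print $3}')
    _low=$(echo "$1" | awk '{print $4}')
    [ -n "$_high" ] && [ -n "$_low" ] || return 2
    _value=$(( ($_high << 32) | $_low ))
    [ $(( $_value & $2 )) -eq $(( $2 )) ]
}

# check_cpu <device>: test the register/mask pairs used by the script.
check_cpu() {
    _dev=$1
    for _pair in 0xc0011029:0x2000 0xc0011020:0x10 \
                 0xc0011028:0x10 0xc0011020:0x200000000000000; do
        _msr=${_pair%%:*}
        _mask=${_pair##*:}
        if msr_has_mask "$(cpucontrol -m "$_msr" "$_dev")" "$_mask"; then
            echo "$_dev $_msr mask $_mask: set"
        else
            echo "$_dev $_msr mask $_mask: not set or unreadable"
        fi
    done
}
```

With cpuctl loaded, running check_cpu as root on each /dev/cpuctl*
device before and after the workaround script would show which bits the
BIOS had already set.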

Mike Tancsa

Jun 13, 2018, 4:45:09 PM
On 6/13/2018 6:35 AM, Konstantin Belousov wrote:
> Today I noted that AMD published the public errata document for Ryzens,
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Some of the issues listed there looks quite relevant to the potential
> hangs that some people still experience with the machines. I wrote
> a script which should apply the recommended workarounds to the erratas
> that I find interesting.
>
> To run it, kldload cpuctl, then apply the latest firmware update to your
> CPU, then run the following shell script. Comments indicate the errata
> number for the workarounds.

Hi,

tl;dr: The microcode changes seem to fix a hard lockup I was able to
reliably reproduce back in Feb.



The BIOS on my AMD box is pretty up to date. I think it has the same
microcode as what's in ports. x86info -a shows:

root@ryzenbsd11:/home/mdtancsa # x86info -a | grep -i microc
Microcode patch level: 0x8001137
root@ryzenbsd11:/home/mdtancsa #

After running the microcode update, the patch level is unchanged:

root@ryzenbsd11:/home/mdtancsa # /usr/local/etc/rc.d/microcode_update onestart
Updating CPU Microcode...
Done.
root@ryzenbsd11:/home/mdtancsa # x86info -a | grep -i microc
Microcode patch level: 0x8001137
root@ryzenbsd11:/home/mdtancsa #

However, the dmesg after the microcode update adds this line:

AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr>

CPU: AMD Ryzen 5 1600X Six-Core Processor (3593.36-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x800f11 Family=0x17 Model=0x1 Stepping=1
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
TSC: P-state invariant, performance statistics

I ran the script

root@ryzenbsd11:/home/mdtancsa # cat fix.sh
#!/bin/sh

# Enable workarounds for erratas listed in
# https://developer.amd.com/wp-content/resources/55449_1.12.pdf

# 1057, 1109
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt

for x in /dev/cpuctl*; do
# 1021
cpucontrol -m '0xc0011029|=0x2000' $x
# 1033
cpucontrol -m '0xc0011020|=0x10' $x
# 1049
cpucontrol -m '0xc0011028|=0x10' $x
# 1095
cpucontrol -m '0xc0011020|=0x200000000000000' $x
echo $x
done
root@ryzenbsd11:/home/mdtancsa # sh ./fix.sh
machdep.idle_mwait: 1 -> 0
machdep.idle: acpi -> hlt
/dev/cpuctl0
/dev/cpuctl1
/dev/cpuctl10
/dev/cpuctl11
/dev/cpuctl2
/dev/cpuctl3
/dev/cpuctl4
/dev/cpuctl5
/dev/cpuctl6
/dev/cpuctl7
/dev/cpuctl8
/dev/cpuctl9
root@ryzenbsd11:/home/mdtancsa #

Using a FreeBSD stable from back in Feb, I was able to crash Ryzen- and
Epyc-based systems
(https://lists.freebsd.org/pipermail/freebsd-stable/2018-February/088439.html)
by generating a lot of traffic between the hypervisor and guests. The
same tests on an Intel-based box ran just fine.

E.g. start 3 guests in bhyve (amd64) and run combos of iperf3 between
them. It would not take long before the box hard locked, i.e.
blank screen, no crash dump, etc.
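
A reproduction along these lines could be scripted roughly as follows.
The guest addresses and the run duration are placeholders, each guest is
assumed to already run "iperf3 -s", and the sketch only prints the
client commands to run inside each guest rather than executing them.

```sh
#!/bin/sh
# Sketch: list every ordered pair of guests for iperf3 client runs.
# GUESTS holds placeholder addresses; substitute your own bhyve guests.
GUESTS=${GUESTS:-"10.0.0.11 10.0.0.12 10.0.0.13"}
DURATION=${DURATION:-600}   # seconds per run, an arbitrary choice

iperf_pairs() {
    for src in $GUESTS; do
        for dst in $GUESTS; do
            if [ "$src" != "$dst" ]; then
                echo "on $src: iperf3 -c $dst -t $DURATION"
            fi
        done
    done
}

iperf_pairs
```

With three guests this lists six client runs, which can be started
concurrently to load the paths between the guests.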

With the latest microcode update, I have been running the same sort of
tests, and so far so good. I will let them run overnight to see if things
are now stable on STABLE.

---Mike


--
-------------------
Mike Tancsa, tel +1 519 651 3400 x203
Sentex Communications, mi...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada

Eric van Gyzen

Jun 14, 2018, 9:40:00 AM
On 06/13/2018 05:35, Konstantin Belousov wrote:
> Today I noted that AMD published the public errata document for Ryzens,
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Some of the issues listed there looks quite relevant to the potential
> hangs that some people still experience with the machines. I wrote
> a script which should apply the recommended workarounds to the erratas
> that I find interesting.
>
> To run it, kldload cpuctl, then apply the latest firmware update to your
> CPU, then run the following shell script. Comments indicate the errata
> number for the workarounds.
>
> Please report the results. If the script helps, I will code the kernel
> change to apply the workarounds.
Kostik: This thread on the -stable list has a lot of positive feedback:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-June/089110.html

Eric

Mike Tancsa

Jun 14, 2018, 10:28:23 AM
On 6/14/2018 9:36 AM, Eric van Gyzen wrote:
> On 06/13/2018 05:35, Konstantin Belousov wrote:
>> Today I noted that AMD published the public errata document for Ryzens,
>> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>>
>> Some of the issues listed there looks quite relevant to the potential
>> hangs that some people still experience with the machines. I wrote
>> a script which should apply the recommended workarounds to the erratas
>> that I find interesting.
>>
>> To run it, kldload cpuctl, then apply the latest firmware update to your
>> CPU, then run the following shell script. Comments indicate the errata
>> number for the workarounds.
>>
>> Please report the results. If the script helps, I will code the kernel
>> change to apply the workarounds.
> Kostik: This thread on the -stable list has a lot of positive feedback:
>
> https://lists.freebsd.org/pipermail/freebsd-stable/2018-June/089110.html

I have a couple of Epyc boxes that showed the same lockup behaviour. I
will re-install FreeBSD on them and see if their microcode updates fix
this issue as well...

Should I run the same cpuctl commands on those CPUs? BTW, I am happy
to loan one out to you in the FreeBSD netperf cluster for a few weeks.

---Mike

Konstantin Belousov

Jun 14, 2018, 11:08:34 AM
On Thu, Jun 14, 2018 at 10:24:17AM -0400, Mike Tancsa wrote:
> On 6/14/2018 9:36 AM, Eric van Gyzen wrote:
> > On 06/13/2018 05:35, Konstantin Belousov wrote:
> >> Today I noted that AMD published the public errata document for Ryzens,
> >> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> >>
> >> Some of the issues listed there looks quite relevant to the potential
> >> hangs that some people still experience with the machines. I wrote
> >> a script which should apply the recommended workarounds to the erratas
> >> that I find interesting.
> >>
> >> To run it, kldload cpuctl, then apply the latest firmware update to your
> >> CPU, then run the following shell script. Comments indicate the errata
> >> number for the workarounds.
> >>
> >> Please report the results. If the script helps, I will code the kernel
> >> change to apply the workarounds.
> > Kostik: This thread on the -stable list has a lot of positive feedback:
> >
> > https://lists.freebsd.org/pipermail/freebsd-stable/2018-June/089110.html
>
> I have a couple of Epyc boxes that showed the same lockup behaviour. I
> will re-install FreeBSD on them and see if their microcode updates fix
> this issue as well...
I am not sure that the microcode update alone is enough. Depending on
the BIOS vendor and the current BIOS version, you may need all three:
a BIOS update, a microcode update using cpucontrol/devcpu-data, and
running the script I posted. In the best case, some of this is just
redundant.

Mike Tancsa

Jun 14, 2018, 11:15:55 AM
On 6/14/2018 11:03 AM, Konstantin Belousov wrote:
> I am not sure about only microcode update. Depending on the BIOS
> vendor and current BIOS, you may need all three: BIOS update, microcode
> update using cpucontrol/devcpu-data, and running the script I posted.
> In the best case, some of this is just redundand.

Thanks, I will run the tests on the Epyc system over the next few days.
It took a little longer to crash the Epyc than the Ryzen. The Ryzen has
now been running for 20 hours. Previously, 5-10 minutes were enough to
trigger the hard lockup.

Mike Tancsa

Jun 14, 2018, 5:16:52 PM
On 6/14/2018 11:03 AM, Konstantin Belousov wrote:
> I am not sure about only microcode update. Depending on the BIOS
> vendor and current BIOS, you may need all three: BIOS update, microcode
> update using cpucontrol/devcpu-data, and running the script I posted.
> In the best case, some of this is just redundand.

OK, before and after show the same microcode revision:

CPU: AMD EPYC 7281 16-Core Processor (2100.06-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x800f12 Family=0x17 Model=0x1 Stepping=2
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr>
SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
TSC: P-state invariant, performance statistics

# x86info -a | grep -i micro
Microcode patch level: 0x8001227
#

I then ran the fix script. I will let the box grind away over the
weekend to see if it survives. Previously, a couple of hours would lock
it up. I am running it now. One thing I did notice is a bunch of these
showing up:

Jun 14 17:11:18 r11epyc kernel: fpudna: fpcurthread == curthread

Oliver Pinter

Jun 14, 2018, 5:20:24 PM
This is a side effect of the enabled eager FPU switch. It is orthogonal
to this issue and already fixed in current (the printf has been removed).

Mike Tancsa

Jun 18, 2018, 10:13:56 AM
On 6/13/2018 6:35 AM, Konstantin Belousov wrote:
>
> Please report the results. If the script helps, I will code the kernel
> change to apply the workarounds.

The hard lockups I was seeing on Ryzen and Epyc boxes are now gone with
the microcode and script below.

Not sure if it's one or some combo of the settings, but all the steps
below have made my 2 test systems stable on RELENG_11 anyway.

This was on a Ryzen 5 1600X (ASUS PRIME X370-PRO, BIOS from 04/19/2018)
CPU microcode patch level: 0x8001137

and an EPYC 7281 16-Core (Supermicro H11SSL-i, BIOS from 04/27/2018)
Microcode patch level: 0x8001227



Details of the issue were discussed at

https://lists.freebsd.org/pipermail/freebsd-virtualization/2018-March/006187.html
and
https://lists.freebsd.org/pipermail/freebsd-stable/2018-January/088174.html

TL;DR: Generating traffic via iperf3 between VMs, either on bhyve or
VirtualBox, would make the box lock up: no crash, just a blank screen.

---Mike


>
> #!/bin/sh
>
> # Enable workarounds for erratas listed in
> # https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> # 1057, 1109
> sysctl machdep.idle_mwait=0
> sysctl machdep.idle=hlt
>
> for x in /dev/cpuctl*; do
> # 1021
> cpucontrol -m '0xc0011029|=0x2000' $x
> # 1033
> cpucontrol -m '0xc0011020|=0x10' $x
> # 1049
> cpucontrol -m '0xc0011028|=0x10' $x
> # 1095
> cpucontrol -m '0xc0011020|=0x200000000000000' $x
> done

Eitan Adler

Jun 19, 2018, 1:48:35 AM
On 13 June 2018 at 04:16, Eitan Adler <li...@eitanadler.com> wrote:
> On 13 June 2018 at 03:35, Konstantin Belousov <kost...@gmail.com> wrote:
>> Today I noted that AMD published the public errata document for Ryzens,
>> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>>
>> Some of the issues listed there looks quite relevant to the potential
>> hangs that some people still experience with the machines. I wrote
>> a script which should apply the recommended workarounds to the erratas
>> that I find interesting.
>>
>> To run it, kldload cpuctl, then apply the latest firmware update to your
>> CPU, then run the following shell script. Comments indicate the errata
>> number for the workarounds.
>>
>> Please report the results. If the script helps, I will code the kernel
>> change to apply the workarounds.
>>
>> #!/bin/sh
>>
>> # Enable workarounds for erratas listed in
>> # https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>>
>> # 1057, 1109
>> sysctl machdep.idle_mwait=0
>> sysctl machdep.idle=hlt
>
>
> Is this needed if it was previously machdep.idle: acpi ?

This might explain why I've never seen the lockup issues mentioned by
other people. What would cause my machine to differ from others?

Gary Jennejohn

Jun 19, 2018, 5:55:03 AM
On Mon, 18 Jun 2018 22:44:13 -0700
Eitan Adler <li...@eitanadler.com> wrote:

> On 13 June 2018 at 04:16, Eitan Adler <li...@eitanadler.com> wrote:
> > On 13 June 2018 at 03:35, Konstantin Belousov <kost...@gmail.com> wrote:
> >> Today I noted that AMD published the public errata document for Ryzens,
> >> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> >>
> >> Some of the issues listed there looks quite relevant to the potential
> >> hangs that some people still experience with the machines. I wrote
> >> a script which should apply the recommended workarounds to the erratas
> >> that I find interesting.
> >>
> >> To run it, kldload cpuctl, then apply the latest firmware update to your
> >> CPU, then run the following shell script. Comments indicate the errata
> >> number for the workarounds.
> >>
> >> Please report the results. If the script helps, I will code the kernel
> >> change to apply the workarounds.
> >>
> >> #!/bin/sh
> >>
> >> # Enable workarounds for erratas listed in
> >> # https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> >>
> >> # 1057, 1109
> >> sysctl machdep.idle_mwait=0
> >> sysctl machdep.idle=hlt
> >
> >
> > Is this needed if it was previously machdep.idle: acpi ?
>
> This might explain why I've never seen the lockup issues mentioned by
> other people. What would cause my machine to differ from others?
>

I had sysctl machdep.idle_mwait=1 and machdep.idle=acpi before
applying the shell script. I had multiple lockups every week,
sometimes multiple lockups per day.

With the idle settings from the script it still locks up, but
not as often.

I suspect I also need to update the CPU firmware, although I
expect that the new BIOS version I installed last week would
have done that already.

--
Gary Jennejohn
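
Since the idle sysctls differ between machines, it may help to record
the current values before changing them, so that the change can be
reverted or reported accurately. A minimal sketch, assuming only
"sysctl -n"; the function names and the saved-file format are
illustrative.

```sh
#!/bin/sh
# Sketch: remember the idle settings so they can be restored later.

# restore_cmds <idle-method> <idle_mwait-value>
# Print the sysctl commands that would restore the given settings.
restore_cmds() {
    echo "sysctl machdep.idle=$1"
    echo "sysctl machdep.idle_mwait=$2"
}

# save_idle_settings <file>
# Capture the current values into <file> (works on FreeBSD only).
save_idle_settings() {
    restore_cmds "$(sysctl -n machdep.idle)" \
                 "$(sysctl -n machdep.idle_mwait)" > "$1"
}
```

Calling save_idle_settings with a file name before running the
workaround script leaves a file that can later be run with sh to put the
previous values back.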