[PATCH v11 0/6] Add RAS virtualization support for armv8 SEA and SEI

Dongjiu Geng

Aug 18, 2017, 10:10:33 AM

On ARMv8 platforms, the main hardware error notification types for the
processor are the Synchronous External Abort (SEA) and the SError Interrupt
(SEI). For an ARMv8 SEA/SEI, KVM or the host kernel delivers SIGBUS, or uses
another interface, to notify user space. After user space gets the
notification, it records the CPER to simulate GHES for the guest OS and
injects an exception (SEA/SEI) into the guest through KVM.

This patch series has two parts. One part handles the Synchronous External
Abort (SEA) and SError Interrupt (SEI) exceptions; the other generates the
APEI tables when the guest OS boots and dynamically records CPER entries
for the guest OS about generic hardware errors. Currently user space only
handles memory section hardware errors. Before QEMU records a CPER, it
checks the ACK value written by the guest OS to avoid a read-write race
condition. In the simulated APEI/GHESv2/CPER tables, the maximum number of
error sources is 11, one per notification type. For now only the SEA/SEI
notification type error sources are enabled, to avoid an OS boot warning.
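
The ACK handshake described above can be modelled in a few lines of C. This
is only a sketch under stated assumptions, not the code from this series:
the helper names ghes_try_claim_block() and ospm_ack() are hypothetical, the
preserve/write masks are the ones shown in the HEST dump in this letter, and
it assumes that bit 0 set means "the guest has acknowledged the last record".

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks as shown in the HEST dump ("Read Ack Preserve" / "Read Ack Write"). */
#define READ_ACK_PRESERVE 0x00000000FFFFFFFEULL
#define READ_ACK_WRITE    0x0000000000000001ULL

/*
 * Producer side (hypothetical helper): claim the error status block only
 * if the guest has acknowledged the previous record. Returns false, and
 * changes nothing, while the previous record is still pending.
 */
static bool ghes_try_claim_block(uint64_t *ack_value)
{
    if ((*ack_value & READ_ACK_WRITE) == 0) {
        return false;               /* last CPER not consumed yet */
    }
    *ack_value &= ~READ_ACK_WRITE;  /* block is busy until the next ack */
    return true;
}

/*
 * Consumer side (hypothetical helper): the GHESv2 acknowledgement keeps
 * the bits selected by the preserve mask and ORs in the write value.
 */
static void ospm_ack(uint64_t *ack_value)
{
    *ack_value = (*ack_value & READ_ACK_PRESERVE) | READ_ACK_WRITE;
}
```

With the ack value initialised to 1, the first claim succeeds, a second
claim is refused until ospm_ack() runs, and the next claim succeeds again.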

The overall solution was discussed previously here:
https://patchwork.kernel.org/patch/9633105/

Below is the APEI/GHESv2/CPER table layout; the maximum number of error sources is 11:

etc/acpi/tables etc/hardware_errors
==================== ==========================================
+ +--------------------------+ +------------------+
| | HEST | | address | +--------------+
| +--------------------------+ | registers | | Error Status |
| | GHES0 | | +----------------+ | Data Block 0 |
| +--------------------------+ +--------->| |status_address0 |------------->| +------------+
| | ................. | | | +----------------+ | | CPER |
| | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER |
| | ................. | | | +----------------+ | | | .... |
| | read_ack_register--------+-+ | | ............. | | | | CPER |
| | read_ack_preserve | | | +------------------+ | | +------------+
| | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status |
+ +--------------------------+ | | | | +----------------+ | | | Data Block 1 |
| | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+
+ +--------------------------+ | | | +----------------+ | | | CPER |
| | ................. | | | +--->| | ack_value1 | | | | CPER |
| | error_status_address-----+---+ | | | +----------------+ | | | .... |
| | ................. | | | | | ............. | | | | CPER |
| | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+
| | read_ack_preserve | | +->| | ack_value10 | | | |.......... |
| | read_ack_write | | | | +----------------+ | | +------------+
+ +--------------------------| | | | | Error Status |
| | ............... | | | | | Data Block 10|
+ +--------------------------+ | | +---->| +------------+
| | GHES10 | | | | | CPER |
+ +--------------------------+ | | | | CPER |
| | ................. | | | | | .... |
| | error_status_address-----+-----+ | | | CPER |
| | ................. | | +-+------------+
| | read_ack_register--------+---------+
| | read_ack_preserve |
| | read_ack_write |
+ +--------------------------+
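
The offsets implied by the etc/hardware_errors layout above can be written
down as a small sketch. The function and macro names are hypothetical; the
block length 0x1000 is the "Error Status Block Length" from the HEST dump in
this letter, and the arithmetic assumes the three regions are packed back to
back with 8-byte address and ack slots.

```c
#include <stdint.h>

#define GHES_SOURCES      11       /* one slot per notification type */
#define GHES_BLOCK_LENGTH 0x1000   /* "Error Status Block Length" in the dump */

/* Offset of status_address<i> inside etc/hardware_errors. */
static uint64_t status_address_offset(unsigned i)
{
    return (uint64_t)i * sizeof(uint64_t);
}

/* Offset of ack_value<i>: the ack array follows the 11 address registers. */
static uint64_t ack_value_offset(unsigned i)
{
    return (uint64_t)(GHES_SOURCES + i) * sizeof(uint64_t);
}

/* Offset of Error Status Data Block <i>: the blocks follow both arrays. */
static uint64_t status_block_offset(unsigned i)
{
    return 2ULL * GHES_SOURCES * sizeof(uint64_t) +
           (uint64_t)i * GHES_BLOCK_LENGTH;
}
```

Assuming the blob was loaded at 0x785D0000 (inferred from the dump, not
stated in it), index 8 (SEA) gives 0x785D0040, 0x785D0098 and 0x785D80B0,
which agree with the Error Status Address, the Read Ack Register address and
the xp output shown in the SEA example.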

----------------------------------------------------------------------------------------------
How to test SEA/SEI recovery in the guest OS:

1. In the guest OS, trigger an SEA or SEI.
2. The error log below is then printed by the memory failure handling.
3. The memory failure handling performs the recovery for the error.

For example, the kernel log looks like this:
[ 21.101216] Synchronous External Abort: synchronous external abort (0x96000010) at 0xffffff8008064018
[ 21.104969] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 8
[ 21.106918] {1}[Hardware Error]: event severity: recoverable
[ 21.109027] {1}[Hardware Error]: Error 0, type: recoverable
[ 21.110362] {1}[Hardware Error]: section_type: memory error
[ 21.111705] {1}[Hardware Error]: physical_address: 0x000000007a200000
[ 21.113255] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 21.118528] Internal error: : 96000010 [#1] SMP
[ 21.119587] Modules linked in:
[ 21.120307] CPU: 0 PID: 509 Comm: devmem Not tainted 4.12.0-rc4ajb-00990-g954379b-dirty #67
[ 21.122307] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 21.123915] task: ffffffc03da32900 task.stack: ffffffc03dbbc000
[ 21.125302] PC is at __do_user_fault+0x58/0x110
[ 21.126370] LR is at __do_user_fault+0x54/0x110
[ 21.127433] pc : [<ffffff8008097528>] lr : [<ffffff8008097524>] pstate: 80000145
[ 21.129164] sp : ffffffc03dbbfd20
[ 21.129940] x29: ffffffc03dbbfd20 x28: ffffffc03da32900
[ 21.131204] x27: 0000000000000000 x26: 0000007f7edc5001
[ 21.132439] x25: ffffff8008648438 x24: ffffffc03dbbfec0
[ 21.133689] x23: 0000000000030001 x22: 0000007f7edc5001
[ 21.134934] x21: 0000000000000007 x20: 0000000092000021
[ 21.136195] x19: ffffffc03da32900 x18: 0000007fdd4c18f0
[ 21.137439] x17: 0000007f7ecb9ebc x16: 0000000000412058
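
The syndrome value 0x96000010 in the first log line can be decoded by hand.
The sketch below extracts the relevant ESR_ELx fields; the helper names are
hypothetical, and the field positions are those of the ARMv8-A ESR_ELx
register (EC in bits [31:26], IL in bit 25, DFSC in bits [5:0]).

```c
#include <stdint.h>

/* Exception class, ESR_ELx[31:26]. */
static unsigned esr_ec(uint32_t esr)
{
    return (esr >> 26) & 0x3f;
}

/* Instruction length bit, ESR_ELx[25]. */
static unsigned esr_il(uint32_t esr)
{
    return (esr >> 25) & 0x1;
}

/* Data fault status code, ESR_ELx[5:0] for data aborts. */
static unsigned esr_dfsc(uint32_t esr)
{
    return esr & 0x3f;
}
```

For 0x96000010 this gives EC = 0x25 (data abort taken without a change in
exception level), IL = 1 and DFSC = 0x10 (synchronous external abort, not on
a translation table walk), consistent with the "Synchronous External Abort"
text in the log.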

------------------------------------------------------------------------------------------------
How to test guest OS APEI/GHES:
1. In the guest OS, use this command to dump the HEST (APEI) table:
"iasl -p ./HEST -d /sys/firmware/acpi/tables/HEST"
2. Find the address of the generic error status block
for the notification type of interest.
3. Find the CPER record through the generic error status block.

For example (notification type is SEA):

(1) root@genericarmv8:~# iasl -p ./HEST -d /sys/firmware/acpi/tables/HEST
(2) root@genericarmv8:~# cat HEST.dsl
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20170728 (64-bit version)
* Copyright (c) 2000 - 2017 Intel Corporation
*
* Disassembly of /sys/firmware/acpi/tables/HEST, Mon Sep 5 07:59:17 2016
*
* ACPI Data Table [HEST]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/

..................................................................................
[308h 0776 2] Subtable Type : 000A [Generic Hardware Error Source V2]
[30Ah 0778 2] Source Id : 0008
[30Ch 0780 2] Related Source Id : FFFF
[30Eh 0782 1] Reserved : 00
[30Fh 0783 1] Enabled : 01
[310h 0784 4] Records To Preallocate : 00000001
[314h 0788 4] Max Sections Per Record : 00000001
[318h 0792 4] Max Raw Data Length : 00001000

[31Ch 0796 12] Error Status Address : [Generic Address Structure]
[31Ch 0796 1] Space ID : 00 [SystemMemory]
[31Dh 0797 1] Bit Width : 40
[31Eh 0798 1] Bit Offset : 00
[31Fh 0799 1] Encoded Access Width : 04 [QWord Access:64]
[320h 0800 8] Address : 00000000785D0040

[328h 0808 28] Notify : [Hardware Error Notification Structure]
[328h 0808 1] Notify Type : 08 [SEA]
[329h 0809 1] Notify Length : 1C
[32Ah 0810 2] Configuration Write Enable : 0000
[32Ch 0812 4] PollInterval : 00000000
[330h 0816 4] Vector : 00000000
[334h 0820 4] Polling Threshold Value : 00000000
[338h 0824 4] Polling Threshold Window : 00000000
[33Ch 0828 4] Error Threshold Value : 00000000
[340h 0832 4] Error Threshold Window : 00000000

[344h 0836 4] Error Status Block Length : 00001000
[348h 0840 12] Read Ack Register : [Generic Address Structure]
[348h 0840 1] Space ID : 00 [SystemMemory]
[349h 0841 1] Bit Width : 40
[34Ah 0842 1] Bit Offset : 00
[34Bh 0843 1] Encoded Access Width : 04 [QWord Access:64]
[34Ch 0844 8] Address : 00000000785D0098

[354h 0852 8] Read Ack Preserve : 00000000FFFFFFFE
[35Ch 0860 8] Read Ack Write : 0000000000000001

[364h 0868 2] Subtable Type : 000A [Generic Hardware Error Source V2]
[366h 0870 2] Source Id : 0009
[368h 0872 2] Related Source Id : FFFF
[36Ah 0874 1] Reserved : 00
[36Bh 0875 1] Enabled : 01
[36Ch 0876 4] Records To Preallocate : 00000001
[370h 0880 4] Max Sections Per Record : 00000001
[374h 0884 4] Max Raw Data Length : 00001000

[378h 0888 12] Error Status Address : [Generic Address Structure]
[378h 0888 1] Space ID : 00 [SystemMemory]
[379h 0889 1] Bit Width : 40
[37Ah 0890 1] Bit Offset : 00
[37Bh 0891 1] Encoded Access Width : 04 [QWord Access:64]
[37Ch 0892 8] Address : 00000000785D0048

[384h 0900 28] Notify : [Hardware Error Notification Structure]
[384h 0900 1] Notify Type : 09 [SEI]
[385h 0901 1] Notify Length : 1C
[386h 0902 2] Configuration Write Enable : 0000
[388h 0904 4] PollInterval : 00000000
[38Ch 0908 4] Vector : 00000000
[390h 0912 4] Polling Threshold Value : 00000000
[394h 0916 4] Polling Threshold Window : 00000000
[398h 0920 4] Error Threshold Value : 00000000
[39Ch 0924 4] Error Threshold Window : 00000000

[3A0h 0928 4] Error Status Block Length : 00001000
[3A4h 0932 12] Read Ack Register : [Generic Address Structure]
[3A4h 0932 1] Space ID : 00 [SystemMemory]
[3A5h 0933 1] Bit Width : 40
[3A6h 0934 1] Bit Offset : 00
[3A7h 0935 1] Encoded Access Width : 04 [QWord Access:64]
[3A8h 0936 8] Address : 00000000785D00A0

[3B0h 0944 8] Read Ack Preserve : 00000000FFFFFFFE
[3B8h 0952 8] Read Ack Write : 0000000000000001
.....................................................................................
(3) According to the table above, the address that holds the physical address of the
memory block containing the error status data for the SEA notification error source is 0x00000000785D0040.
(4) Read that location to get the address of the generic error status block for the
SEA notification error source, which is 0x785d80b0:
(qemu) xp /1 0x00000000785D0040
00000000785d0040: 0x785d80b0

(5) check the content of generic error status block and generic error data entry
(qemu) xp /100x 0x785d80b0
00000000785d80b0: 0x00000001 0x00000000 0x00000000 0x00000098
00000000785d80c0: 0x00000000 0xa5bc1114 0x4ede6f64 0x833e63b8
00000000785d80d0: 0xb1837ced 0x00000000 0x00000300 0x00000050
00000000785d80e0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d80f0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8100: 0x00000000 0x00000000 0x00000000 0x00004002
00000000785d8110: 0x00000000 0x00000000 0x00000000 0x00001111
00000000785d8120: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8130: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8140: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8150: 0x00000000 0x00000003 0x00000000 0x00000000
00000000785d8160: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8170: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8180: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8190: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d81a0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d81b0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d81c0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d81d0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d81e0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d81f0: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8200: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8210: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8220: 0x00000000 0x00000000 0x00000000 0x00000000
00000000785d8230: 0x00000000 0x00000000 0x00000000 0x00000000
(6) check the OSPM's ACK value(for example SEA)
/* Before OSPM acknowledges the error, check the ACK value */
(qemu) xp /1 0x00000000785D0098
00000000785d00f0: 0x00000000

/* After OSPM acknowledges the error, check the ACK value */
(qemu) xp /1 0x00000000785D0098
00000000785d00f0: 0x00000001
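
The first rows of the dump in step (5) can be checked against the structure
definitions added in patch 1. The sketch below copies the first 24 words of
that dump and picks out the header fields; the struct and function names are
hypothetical, and the word indices assume the packed layouts of
AcpiGenericErrorStatus (20 bytes) and AcpiGenericErrorData (72 bytes).

```c
#include <stdint.h>

/* First 24 words of the `xp /100x 0x785d80b0` output, in dump order. */
static const uint32_t gesb_dump[24] = {
    0x00000001, 0x00000000, 0x00000000, 0x00000098,   /* 0x785d80b0 */
    0x00000000, 0xa5bc1114, 0x4ede6f64, 0x833e63b8,   /* 0x785d80c0 */
    0xb1837ced, 0x00000000, 0x00000300, 0x00000050,   /* 0x785d80d0 */
    0x00000000, 0x00000000, 0x00000000, 0x00000000,   /* 0x785d80e0 */
    0x00000000, 0x00000000, 0x00000000, 0x00000000,   /* 0x785d80f0 */
    0x00000000, 0x00000000, 0x00000000, 0x00004002,   /* 0x785d8100 */
};

struct gesb_summary {
    uint32_t block_status;       /* status block, offset 0 */
    uint32_t data_length;        /* status block, offset 12 */
    uint16_t revision;           /* data entry, offset 20 */
    uint32_t error_data_length;  /* data entry, offset 24 */
    uint32_t mem_valid_bits_lo;  /* low half of the memory section bits */
};

static struct gesb_summary decode_gesb(const uint32_t *w)
{
    struct gesb_summary s;

    s.block_status = w[0];
    s.data_length = w[3];
    /* w[4] is the block severity, w[5]..w[8] the section type GUID,
     * w[9] the entry severity.
     */
    s.revision = (uint16_t)(w[10] & 0xffff);
    s.error_data_length = w[11];
    /* The memory section starts 20 + 72 = 92 bytes in, i.e. at word 23. */
    s.mem_valid_bits_lo = w[23];
    return s;
}
```

Decoding gesb_dump yields block_status 1 (uncorrectable), data_length 0x98
(72 + 80 bytes), revision 0x300, error_data_length 0x50 and validation bits
0x4002, i.e. UEFI_CPER_MEM_VALID_PA | UEFI_CPER_MEM_VALID_ERROR_TYPE.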

Dongjiu Geng (6):
ACPI: add APEI/HEST/CPER structures and macros
ACPI: Add APEI GHES Table Generation support
ACPI: build and enable APEI GHES in the Makefile and configuration
target-arm: kvm64: detect guest RAS EXTENSION feature
target-arm: kvm64: handle SIGBUS signal for synchronous External Abort
target-arm: kvm64: Handle SError interrupt from the guest OS

default-configs/arm-softmmu.mak | 1 +
hw/acpi/Makefile.objs | 1 +
hw/acpi/aml-build.c | 2 +
hw/acpi/hest_ghes.c | 345 ++++++++++++++++++++++++++++++++++++++++
hw/arm/virt-acpi-build.c | 6 +
include/hw/acpi/acpi-defs.h | 193 ++++++++++++++++++++++
include/hw/acpi/aml-build.h | 1 +
include/hw/acpi/hest_ghes.h | 47 ++++++
include/sysemu/kvm.h | 2 +-
linux-headers/asm-arm64/kvm.h | 5 +
linux-headers/linux/kvm.h | 2 +
target/arm/cpu.h | 3 +
target/arm/internals.h | 14 ++
target/arm/kvm.c | 34 ++++
target/arm/kvm64.c | 186 ++++++++++++++++++++++
target/arm/kvm_arm.h | 1 +
16 files changed, 842 insertions(+), 1 deletion(-)
create mode 100644 hw/acpi/hest_ghes.c
create mode 100644 include/hw/acpi/hest_ghes.h

--
1.8.3.1

Dongjiu Geng

Aug 18, 2017, 10:10:35 AM

(1) Add the related APEI/HEST table structures and macros; these
definitions follow the ACPI 6.1 and UEFI 2.6 specifications.
(2) Add the generic error status block and CPER memory section
definitions; user space only handles memory section errors.

Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
---
include/hw/acpi/acpi-defs.h | 193 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 193 insertions(+)

diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
index 72be675..3b4bad7 100644
--- a/include/hw/acpi/acpi-defs.h
+++ b/include/hw/acpi/acpi-defs.h
@@ -297,6 +297,44 @@ typedef struct AcpiMultipleApicTable AcpiMultipleApicTable;
#define ACPI_APIC_GENERIC_TRANSLATOR 15
#define ACPI_APIC_RESERVED 16 /* 16 and greater are reserved */

+/* UEFI Spec 2.6, "N.2.5 Memory Error Section" */
+#define UEFI_CPER_MEM_VALID_ERROR_STATUS 0x0001
+#define UEFI_CPER_MEM_VALID_PA 0x0002
+#define UEFI_CPER_MEM_VALID_PA_MASK 0x0004
+#define UEFI_CPER_MEM_VALID_NODE 0x0008
+#define UEFI_CPER_MEM_VALID_CARD 0x0010
+#define UEFI_CPER_MEM_VALID_MODULE 0x0020
+#define UEFI_CPER_MEM_VALID_BANK 0x0040
+#define UEFI_CPER_MEM_VALID_DEVICE 0x0080
+#define UEFI_CPER_MEM_VALID_ROW 0x0100
+#define UEFI_CPER_MEM_VALID_COLUMN 0x0200
+#define UEFI_CPER_MEM_VALID_BIT_POSITION 0x0400
+#define UEFI_CPER_MEM_VALID_REQUESTOR 0x0800
+#define UEFI_CPER_MEM_VALID_RESPONDER 0x1000
+#define UEFI_CPER_MEM_VALID_TARGET 0x2000
+#define UEFI_CPER_MEM_VALID_ERROR_TYPE 0x4000
+#define UEFI_CPER_MEM_VALID_RANK_NUMBER 0x8000
+#define UEFI_CPER_MEM_VALID_CARD_HANDLE 0x10000
+#define UEFI_CPER_MEM_VALID_MODULE_HANDLE 0x20000
+#define UEFI_CPER_MEM_ERROR_TYPE_MULTI_ECC 3
+
+/* From the ACPI 6.1 spec, "18.3.2.9 Hardware Error Notification" */
+
+enum AcpiHestNotifyType {
+ ACPI_HEST_NOTIFY_POLLED = 0,
+ ACPI_HEST_NOTIFY_EXTERNAL = 1,
+ ACPI_HEST_NOTIFY_LOCAL = 2,
+ ACPI_HEST_NOTIFY_SCI = 3,
+ ACPI_HEST_NOTIFY_NMI = 4,
+ ACPI_HEST_NOTIFY_CMCI = 5, /* ACPI 5.0 */
+ ACPI_HEST_NOTIFY_MCE = 6, /* ACPI 5.0 */
+ ACPI_HEST_NOTIFY_GPIO = 7, /* ACPI 6.0 */
+ ACPI_HEST_NOTIFY_SEA = 8, /* ACPI 6.1 */
+ ACPI_HEST_NOTIFY_SEI = 9, /* ACPI 6.1 */
+ ACPI_HEST_NOTIFY_GSIV = 10, /* ACPI 6.1 */
+ ACPI_HEST_NOTIFY_RESERVED = 11 /* 11 and greater are reserved */
+};
+
/*
* MADT sub-structures (Follow MULTIPLE_APIC_DESCRIPTION_TABLE)
*/
@@ -474,6 +512,161 @@ struct AcpiSystemResourceAffinityTable {
} QEMU_PACKED;
typedef struct AcpiSystemResourceAffinityTable AcpiSystemResourceAffinityTable;

+/* Hardware Error Notification, from the ACPI 6.1
+ * spec, "18.3.2.9 Hardware Error Notification"
+ */
+struct AcpiHestNotify {
+ uint8_t type;
+ uint8_t length;
+ uint16_t config_write_enable;
+ uint32_t poll_interval;
+ uint32_t vector;
+ uint32_t polling_threshold_value;
+ uint32_t polling_threshold_window;
+ uint32_t error_threshold_value;
+ uint32_t error_threshold_window;
+} QEMU_PACKED;
+typedef struct AcpiHestNotify AcpiHestNotify;
+
+/* From ACPI 6.1, sections "18.3.2.1 IA-32 Architecture Machine
+ * Check Exception" through "18.3.2.8 Generic Hardware Error Source version 2".
+ */
+enum AcpiHestSourceType {
+ ACPI_HEST_SOURCE_IA32_CHECK = 0,
+ ACPI_HEST_SOURCE_IA32_CORRECTED_CHECK = 1,
+ ACPI_HEST_SOURCE_IA32_NMI = 2,
+ ACPI_HEST_SOURCE_AER_ROOT_PORT = 6,
+ ACPI_HEST_SOURCE_AER_ENDPOINT = 7,
+ ACPI_HEST_SOURCE_AER_BRIDGE = 8,
+ ACPI_HEST_SOURCE_GENERIC_ERROR = 9,
+ ACPI_HEST_SOURCE_GENERIC_ERROR_V2 = 10,
+ ACPI_HEST_SOURCE_RESERVED = 11 /* 11 and greater are reserved */
+};
+
+/* Block status bitmasks from ACPI 6.1, "18.3.2.7.1 Generic Error Data" */
+#define ACPI_GEBS_UNCORRECTABLE (1)
+#define ACPI_GEBS_CORRECTABLE (1 << 1)
+#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE (1 << 2)
+#define ACPI_GEBS_MULTIPLE_CORRECTABLE (1 << 3)
+/* 10 bits, error data entry count */
+#define ACPI_GEBS_ERROR_ENTRY_COUNT (0x3FF << 4)
+
+/* Generic Hardware Error Source Structure, refer to ACPI 6.1
+ * "18.3.2.7 Generic Hardware Error Source". In this struct the
+ * "type" field has to be ACPI_HEST_SOURCE_GENERIC_ERROR
+ */
+
+struct AcpiGenericHardwareErrorSource {
+ uint16_t type;
+ uint16_t source_id;
+ uint16_t related_source_id;
+ uint8_t flags;
+ uint8_t enabled;
+ uint32_t number_of_records;
+ uint32_t max_sections_per_record;
+ uint32_t max_raw_data_length;
+ struct AcpiGenericAddress error_status_address;
+ struct AcpiHestNotify notify;
+ uint32_t error_status_block_length;
+} QEMU_PACKED;
+typedef struct AcpiGenericHardwareErrorSource AcpiGenericHardwareErrorSource;
+
+/* Generic Hardware Error Source, version 2, ACPI 6.1, "18.3.2.8 Generic
+ * Hardware Error Source version 2", in this struct the "type" field has to
+ * be ACPI_HEST_SOURCE_GENERIC_ERROR_V2
+ */
+struct AcpiGenericHardwareErrorSourceV2 {
+ uint16_t type;
+ uint16_t source_id;
+ uint16_t related_source_id;
+ uint8_t flags;
+ uint8_t enabled;
+ uint32_t number_of_records;
+ uint32_t max_sections_per_record;
+ uint32_t max_raw_data_length;
+ struct AcpiGenericAddress error_status_address;
+ struct AcpiHestNotify notify;
+ uint32_t error_status_block_length;
+ struct AcpiGenericAddress read_ack_register;
+ uint64_t read_ack_preserve;
+ uint64_t read_ack_write;
+} QEMU_PACKED;
+typedef struct AcpiGenericHardwareErrorSourceV2
+ AcpiGenericHardwareErrorSourceV2;
+
+/* Generic Error Status block, from ACPI 6.1,
+ * "18.3.2.7.1 Generic Error Data"
+ */
+struct AcpiGenericErrorStatus {
+ /* It is a bitmask composed of ACPI_GEBS_xxx macros */
+ uint32_t block_status;
+ uint32_t raw_data_offset;
+ uint32_t raw_data_length;
+ uint32_t data_length;
+ uint32_t error_severity;
+} QEMU_PACKED;
+typedef struct AcpiGenericErrorStatus AcpiGenericErrorStatus;
+
+enum AcpiGenericErrorSeverity {
+ ACPI_CPER_SEV_RECOVERABLE,
+ ACPI_CPER_SEV_FATAL,
+ ACPI_CPER_SEV_CORRECTED,
+ ACPI_CPER_SEV_NONE,
+};
+
+/* Generic Error Data entry, revision number is 0x0300,
+ * ACPI 6.1, "18.3.2.7.1 Generic Error Data"
+ */
+struct AcpiGenericErrorData {
+ uint8_t section_type_le[16];
+ /* The "error_severity" field takes its value from
+ * AcpiGenericErrorSeverity
+ */
+ uint32_t error_severity;
+ uint16_t revision;
+ uint8_t validation_bits;
+ uint8_t flags;
+ uint32_t error_data_length;
+ uint8_t fru_id[16];
+ uint8_t fru_text[20];
+ uint64_t time_stamp;
+} QEMU_PACKED;
+typedef struct AcpiGenericErrorData AcpiGenericErrorData;
+
+/* From UEFI 2.6, "N.2.5 Memory Error Section" */
+struct UefiCperSecMemErr {
+ uint64_t validation_bits;
+ uint64_t error_status;
+ uint64_t physical_addr;
+ uint64_t physical_addr_mask;
+ uint16_t node;
+ uint16_t card;
+ uint16_t module;
+ uint16_t bank;
+ uint16_t device;
+ uint16_t row;
+ uint16_t column;
+ uint16_t bit_pos;
+ uint64_t requestor_id;
+ uint64_t responder_id;
+ uint64_t target_id;
+ uint8_t error_type;
+ uint8_t reserved;
+ uint16_t rank;
+ uint16_t mem_array_handle; /* card handle in UEFI 2.4 */
+ uint16_t mem_dev_handle; /* module handle in UEFI 2.4 */
+} QEMU_PACKED;
+typedef struct UefiCperSecMemErr UefiCperSecMemErr;
+
+/*
+ * HEST Description Table
+ */
+struct AcpiHardwareErrorSourceTable {
+ ACPI_TABLE_HEADER_DEF /* ACPI common table header */
+ uint32_t error_source_count;
+} QEMU_PACKED;
+typedef struct AcpiHardwareErrorSourceTable AcpiHardwareErrorSourceTable;
+
#define ACPI_SRAT_PROCESSOR_APIC 0
#define ACPI_SRAT_MEMORY 1
#define ACPI_SRAT_PROCESSOR_x2APIC 2
--
1.8.3.1
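
As a cross-check of the packed layouts in this patch, the sizes and offsets
can be compared with the iasl output in the cover letter. This is a
standalone sketch, not part of the patch: the sketch_* names are
hypothetical copies of the structures above, and SKETCH_PACKED uses the
GCC/Clang attribute as a stand-in for QEMU_PACKED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang spelling; stands in for QEMU_PACKED in this sketch. */
#define SKETCH_PACKED __attribute__((packed))

struct sketch_gas {                 /* Generic Address Structure */
    uint8_t  space_id;
    uint8_t  bit_width;
    uint8_t  bit_offset;
    uint8_t  access_width;
    uint64_t address;
} SKETCH_PACKED;

struct sketch_notify {              /* mirrors AcpiHestNotify */
    uint8_t  type;
    uint8_t  length;
    uint16_t config_write_enable;
    uint32_t poll_interval;
    uint32_t vector;
    uint32_t polling_threshold_value;
    uint32_t polling_threshold_window;
    uint32_t error_threshold_value;
    uint32_t error_threshold_window;
} SKETCH_PACKED;

struct sketch_ghes_v2 {             /* mirrors AcpiGenericHardwareErrorSourceV2 */
    uint16_t type;
    uint16_t source_id;
    uint16_t related_source_id;
    uint8_t  flags;
    uint8_t  enabled;
    uint32_t number_of_records;
    uint32_t max_sections_per_record;
    uint32_t max_raw_data_length;
    struct sketch_gas error_status_address;
    struct sketch_notify notify;
    uint32_t error_status_block_length;
    struct sketch_gas read_ack_register;
    uint64_t read_ack_preserve;
    uint64_t read_ack_write;
} SKETCH_PACKED;

/* Expected values are differences between offsets in the HEST dump. */
static_assert(sizeof(struct sketch_gas) == 12, "GAS is 12 bytes");
static_assert(sizeof(struct sketch_notify) == 0x1C, "Notify Length is 1C");
static_assert(sizeof(struct sketch_ghes_v2) == 0x364 - 0x308, "entry stride");
static_assert(offsetof(struct sketch_ghes_v2, error_status_address) ==
              0x31C - 0x308, "Error Status Address offset");
static_assert(offsetof(struct sketch_ghes_v2, read_ack_register) ==
              0x348 - 0x308, "Read Ack Register offset");
```

If any field were missing or unpacked, the static assertions would fail at
compile time, which makes this a cheap way to validate the definitions
against a real disassembly.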

Dongjiu Geng

Aug 18, 2017, 10:10:39 AM

This implements the APEI GHES table by passing the error CPER info
to the guest via a fw_cfg blob. After a CPER entry is recorded, an
SEA (Synchronous External Abort) or SEI (SError Interrupt) exception
is injected into the guest OS.

Below is the table layout; the maximum number of error sources is 11,
one per notification type.

etc/acpi/tables etc/hardware_errors
==================== ==========================================
+ +--------------------------+ +------------------+
| | HEST | | address | +--------------+
| +--------------------------+ | registers | | Error Status |
| | GHES0 | | +----------------+ | Data Block 0 |
| +--------------------------+ +--------->| |status_address0 |------------->| +------------+
| | ................. | | | +----------------+ | | CPER |
| | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER |
| | ................. | | | +----------------+ | | | .... |
| | read_ack_register--------+-+ | | ............. | | | | CPER |
| | read_ack_preserve | | | +------------------+ | | +-+------------+
| | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status |
+ +--------------------------+ | | | | +----------------+ | | | Data Block 1 |
| | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+
+ +--------------------------+ | | | +----------------+ | | | CPER |
| | ................. | | | +--->| | ack_value1 | | | | CPER |
| | error_status_address-----+---+ | | | +----------------+ | | | .... |
| | ................. | | | | | ............. | | | | CPER |
| | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+
| | read_ack_preserve | | +->| | ack_value10 | | | |.......... |
| | read_ack_write | | | | +----------------+ | | +------------+
+ +--------------------------| | | | | Error Status |
| | ............... | | | | | Data Block 10|
+ +--------------------------+ | | +---->| +------------+
| | GHES10 | | | | | CPER |
+ +--------------------------+ | | | | CPER |
| | ................. | | | | | .... |
| | error_status_address-----+-----+ | | | CPER |
| | ................. | | +-+------------+
| | read_ack_register--------+---------+
| | read_ack_preserve |
| | read_ack_write |
+ +--------------------------+

For a GHESv2 error source, the OSPM must acknowledge the error via the Read Ack register,
so user space must check the ack value to avoid a read-write race condition.

Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
---

hw/acpi/aml-build.c | 2 +
hw/acpi/hest_ghes.c | 345 ++++++++++++++++++++++++++++++++++++++++++++
hw/arm/virt-acpi-build.c | 6 +
include/hw/acpi/aml-build.h | 1 +
include/hw/acpi/hest_ghes.h | 47 ++++++
5 files changed, 401 insertions(+)
create mode 100644 hw/acpi/hest_ghes.c
create mode 100644 include/hw/acpi/hest_ghes.h

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 36a6cc4..6849e5f 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1561,6 +1561,7 @@ void acpi_build_tables_init(AcpiBuildTables *tables)
tables->table_data = g_array_new(false, true /* clear */, 1);
tables->tcpalog = g_array_new(false, true /* clear */, 1);
tables->vmgenid = g_array_new(false, true /* clear */, 1);
+ tables->hardware_errors = g_array_new(false, true /* clear */, 1);
tables->linker = bios_linker_loader_init();
}

@@ -1571,6 +1572,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre)
g_array_free(tables->table_data, true);
g_array_free(tables->tcpalog, mfre);
g_array_free(tables->vmgenid, mfre);
+ g_array_free(tables->hardware_errors, mfre);
}

/* Build rsdt table */
diff --git a/hw/acpi/hest_ghes.c b/hw/acpi/hest_ghes.c
new file mode 100644
index 0000000..ff6b5ef
--- /dev/null
+++ b/hw/acpi/hest_ghes.c
@@ -0,0 +1,345 @@
+/*
+ * APEI GHES table Generation
+ *
+ * Copyright (C) 2017 huawei.
+ *
+ * Author: Dongjiu Geng <gengd...@huawei.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qmp-commands.h"
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
+#include "hw/acpi/hest_ghes.h"
+#include "hw/nvram/fw_cfg.h"
+#include "sysemu/sysemu.h"
+#include "qemu/error-report.h"
+
+/* The structure that describes the layout of the
+ * GHES_ERRORS_FW_CFG_FILE fw_cfg blob
+ *
+ * etc/hardware_errors
+ * ==========================================
+ * +------------------+
+ * | address | +--------------+
+ * | registers | | Error Status |
+ * | +----------------+ | Data Block 0 |
+ * | |status_address0 |------------->| +------------+
+ * | +----------------+ | | CPER |
+ * | |status_address1 |----------+ | | CPER |
+ * | +----------------+ | | | .... |
+ * | |............. | | | | CPER |
+ * | +----------------+ | | +------------+
+ * | |status_address10|-----+ | | Error Status |
+ * | +----------------+ | | | Data Block 1 |
+ * | |ack_value0 | | +-->| +------------+
+ * | +----------------+ | | | CPER |
+ * | |ack_value1 | | | | CPER |
+ * | +----------------+ | | | .... |
+ * | | ............. | | | | CPER |
+ * | +----------------+ | +-+------------+
+ * | |ack_value10 | | | |.......... |
+ * | +----------------+ | | +------------+
+ * | | Error Status |
+ * | | Data Block10 |
+ * +------->+------------+
+ * | | CPER |
+ * | | CPER |
+ * | | .... |
+ * | | CPER |
+ * +-+------------+
+ */
+struct hardware_errors_buffer {
+ /* Generic Error Status Block register */
+ uint64_t gesb_address[GHES_ACPI_HEST_NOTIFY_RESERVED];
+ uint64_t ack_value[GHES_ACPI_HEST_NOTIFY_RESERVED];
+ char gesb[GHES_MAX_RAW_DATA_LENGTH][GHES_ACPI_HEST_NOTIFY_RESERVED];
+};
+
+static int ghes_record_cper(uint64_t error_block_address,
+ uint64_t error_physical_addr)
+{
+ AcpiGenericErrorStatus block;
+ AcpiGenericErrorData *gdata;
+ UefiCperSecMemErr *mem_err;
+ uint64_t current_block_length;
+ unsigned char *buffer;
+ /* memory section */
+ char mem_section_id_le[] = {0x14, 0x11, 0xBC, 0xA5, 0x64, 0x6F, 0xDE,
+ 0x4E, 0xB8, 0x63, 0x3E, 0x83, 0xED, 0x7C,
+ 0x83, 0xB1};
+
+ cpu_physical_memory_read(error_block_address, &block,
+ sizeof(AcpiGenericErrorStatus));
+
+ /* Get the current generic error status block length */
+ current_block_length = sizeof(AcpiGenericErrorStatus) +
+ le32_to_cpu(block.data_length);
+
+ /* If the Generic Error Status Block is empty, initialize
+ * the block header
+ */
+ if (!block.block_status) {
+ block.block_status = cpu_to_le32(ACPI_GEBS_UNCORRECTABLE);
+ block.error_severity = cpu_to_le32(ACPI_CPER_SEV_RECOVERABLE);
+ }
+
+ block.data_length = cpu_to_le32(le32_to_cpu(block.data_length) +
+ sizeof(AcpiGenericErrorData) + sizeof(UefiCperSecMemErr));
+
+ /* check whether it runs out of the preallocated memory */
+ if ((le32_to_cpu(block.data_length) + sizeof(AcpiGenericErrorStatus)) >
+ GHES_MAX_RAW_DATA_LENGTH) {
+ error_report("Record CPER out of boundary!!!");
+ return GHES_CPER_FAIL;
+ }
+
+ /* Write back the Generic Error Status Block to guest memory */
+ cpu_physical_memory_write(error_block_address, &block,
+ sizeof(AcpiGenericErrorStatus));
+
+ /* Fill in Generic Error Data Entry. g_malloc0() returns zeroed
+ * memory, so no separate memset() is needed.
+ */
+ buffer = g_malloc0(sizeof(AcpiGenericErrorData) +
+ sizeof(UefiCperSecMemErr));
+
+ gdata = (AcpiGenericErrorData *)buffer;
+
+ /* Memory section */
+ memcpy(&(gdata->section_type_le), &mem_section_id_le,
+ sizeof(mem_section_id_le));
+
+ /* error severity is recoverable */
+ gdata->error_severity = cpu_to_le32(ACPI_CPER_SEV_RECOVERABLE);
+ gdata->revision = cpu_to_le16(0x300); /* the revision number is 0x300 */
+ gdata->error_data_length = cpu_to_le32(sizeof(UefiCperSecMemErr));
+
+ mem_err = (UefiCperSecMemErr *) (gdata + 1);
+
+ /* User space only handles the memory section CPER */
+
+ /* Hard code to Multi-bit ECC error; error_type is a single byte */
+ mem_err->validation_bits |= cpu_to_le64(UEFI_CPER_MEM_VALID_ERROR_TYPE);
+ mem_err->error_type = UEFI_CPER_MEM_ERROR_TYPE_MULTI_ECC;
+
+ /* Record the 64-bit physical address at which the memory error occurred */
+ mem_err->validation_bits |= cpu_to_le64(UEFI_CPER_MEM_VALID_PA);
+ mem_err->physical_addr = cpu_to_le64(error_physical_addr);
+
+ /* Write back the Generic Error Data Entry to guest memory */
+ cpu_physical_memory_write(error_block_address + current_block_length,
+ buffer, sizeof(AcpiGenericErrorData) + sizeof(UefiCperSecMemErr));
+
+ g_free(buffer);
+ return GHES_CPER_OK;
+}
+
+static void
+build_address(GArray *table_data, BIOSLinker *linker,
+ uint32_t dst_patched_offset, uint32_t src_offset,
+ uint8_t address_space_id , uint8_t register_bit_width,
+ uint8_t register_bit_offset, uint8_t access_size)
+{
+ uint32_t address_size = sizeof(struct AcpiGenericAddress) -
+ offsetof(struct AcpiGenericAddress, address);
+
+ /* Address space */
+ build_append_int_noprefix(table_data, address_space_id, 1);
+ /* register bit width */
+ build_append_int_noprefix(table_data, register_bit_width, 1);
+ /* register bit offset */
+ build_append_int_noprefix(table_data, register_bit_offset, 1);
+ /* access size */
+ build_append_int_noprefix(table_data, access_size, 1);
+ acpi_data_push(table_data, address_size);
+
+ /* Patch address of ERRORS fw_cfg blob into the TABLE fw_cfg blob so OSPM
+ * can retrieve and read it. the address size is 64 bits.
+ */
+ bios_linker_loader_add_pointer(linker,
+ ACPI_BUILD_TABLE_FILE, dst_patched_offset, sizeof(uint64_t),
+ GHES_ERRORS_FW_CFG_FILE, src_offset);
+}
+
+void ghes_build_acpi(GArray *table_data, GArray *hardware_error,
+ BIOSLinker *linker)
+{
+ uint32_t ghes_start = table_data->len;
+ uint32_t address_size, error_status_address_offset;
+ uint32_t read_ack_register_offset, i;
+
+ address_size = sizeof(struct AcpiGenericAddress) -
+ offsetof(struct AcpiGenericAddress, address);
+
+ error_status_address_offset = ghes_start +
+ sizeof(AcpiHardwareErrorSourceTable) +
+ offsetof(AcpiGenericHardwareErrorSourceV2, error_status_address) +
+ offsetof(struct AcpiGenericAddress, address);
+
+ read_ack_register_offset = ghes_start +
+ sizeof(AcpiHardwareErrorSourceTable) +
+ offsetof(AcpiGenericHardwareErrorSourceV2, read_ack_register) +
+ offsetof(struct AcpiGenericAddress, address);
+
+ acpi_data_push(hardware_error,
+ offsetof(struct hardware_errors_buffer, ack_value));
+ for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++)
+ /* Initialize read ack register */
+ build_append_int_noprefix((void *)hardware_error, 1, 8);
+
+ /* Reserve the total size for the ERRORS fw_cfg blob
+ */
+ acpi_data_push(hardware_error, sizeof(struct hardware_errors_buffer));
+
+ /* Allocate guest memory for the Data fw_cfg blob */
+ bios_linker_loader_alloc(linker, GHES_ERRORS_FW_CFG_FILE, hardware_error,
+ 1, false);
+ /* Reserve table header size */
+ acpi_data_push(table_data, sizeof(AcpiTableHeader));
+
+ build_append_int_noprefix(table_data, GHES_ACPI_HEST_NOTIFY_RESERVED, 4);
+
+ for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++) {
+ build_append_int_noprefix(table_data,
+ ACPI_HEST_SOURCE_GENERIC_ERROR_V2, 2); /* type */
+ /* source id */
+ build_append_int_noprefix(table_data, cpu_to_le16(i), 2);
+ /* related source id */
+ build_append_int_noprefix(table_data, 0xffff, 2);
+ build_append_int_noprefix(table_data, 0, 1); /* flags */
+
+ /* Currently only enable SEA notification type to avoid the kernel
+ * warning, reserve the space for other notification error source
+ */
+ if (i == ACPI_HEST_NOTIFY_SEA) {
+ build_append_int_noprefix(table_data, 1, 1); /* enabled */
+ } else {
+ build_append_int_noprefix(table_data, 0, 1); /* enabled */
+ }
+
+ /* The number of error status block per generic hardware error source */
+ build_append_int_noprefix(table_data, 1, 4);
+ /* Max sections per record */
+ build_append_int_noprefix(table_data, 1, 4);
+ /* Max raw data length */
+ build_append_int_noprefix(table_data, GHES_MAX_RAW_DATA_LENGTH, 4);
+
+ /* Build error status address*/
+ build_address(table_data, linker, error_status_address_offset + i *
+ sizeof(AcpiGenericHardwareErrorSourceV2), i * address_size,
+ AML_SYSTEM_MEMORY, 0x40, 0, 4 /* QWord access */);
+
+ /* Hardware error notification structure */
+ build_append_int_noprefix(table_data, i, 1); /* type */
+ /* length */
+ build_append_int_noprefix(table_data, sizeof(AcpiHestNotify), 1);
+ build_append_int_noprefix(table_data, 0, 26);
+
+ /* Error Status Block Length */
+ build_append_int_noprefix(table_data,
+ cpu_to_le32(GHES_MAX_RAW_DATA_LENGTH), 4);
+
+ /* Build read ack register */
+ build_address(table_data, linker, read_ack_register_offset + i *
+ sizeof(AcpiGenericHardwareErrorSourceV2),
+ offsetof(struct hardware_errors_buffer, ack_value) +
+ i * address_size, AML_SYSTEM_MEMORY, 0x40, 0,
+ 4 /* QWord access */);
+
+ /* Read ack preserve */
+ build_append_int_noprefix(table_data, cpu_to_le64(0xfffffffe), 8);
+
+ /* Read ack write */
+ build_append_int_noprefix(table_data, cpu_to_le64(0x1), 8);
+ }
+
+ for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++)
+ /* Patch address of generic error status block into
+ * the address register so OSPM can retrieve and read it.
+ */
+ bios_linker_loader_add_pointer(linker,
+ GHES_ERRORS_FW_CFG_FILE, address_size * i, address_size,
+ GHES_ERRORS_FW_CFG_FILE,
+ offsetof(struct hardware_errors_buffer, gesb) +
+ i * GHES_MAX_RAW_DATA_LENGTH);
+
+ /* Patch address of ERRORS fw_cfg blob into the ADDR fw_cfg blob
+ * so QEMU can write the ERRORS there. The address is expected to be
+ * < 4GB, but write 64 bits anyway.
+ */
+ bios_linker_loader_write_pointer(linker, GHES_DATA_ADDR_FW_CFG_FILE,
+ 0, address_size, GHES_ERRORS_FW_CFG_FILE,
+ offsetof(struct hardware_errors_buffer, gesb));
+
+ build_header(linker, table_data,
+ (void *)(table_data->data + ghes_start), "HEST",
+ table_data->len - ghes_start, 1, NULL, "GHES");
+}
+
+static GhesState ges;
+void ghes_add_fw_cfg(FWCfgState *s, GArray *hardware_error)
+{
+
+ size_t request_block_size = sizeof(uint64_t) + GHES_MAX_RAW_DATA_LENGTH;
+ size_t size = GHES_ACPI_HEST_NOTIFY_RESERVED * request_block_size;
+
+ /* Create a read-only fw_cfg file for GHES */
+ fw_cfg_add_file(s, GHES_ERRORS_FW_CFG_FILE, hardware_error->data,
+ size);
+ /* Create a read-write fw_cfg file for Address */
+ fw_cfg_add_file_callback(s, GHES_DATA_ADDR_FW_CFG_FILE, NULL, NULL,
+ &ges.ghes_addr_le, sizeof(ges.ghes_addr_le), false);
+}
+
+bool ghes_update_guest(uint32_t notify, uint64_t physical_address)
+{
+ uint64_t error_block_addr;
+ uint64_t ack_value_addr, ack_value = 0;
+ int loop = 0, ack_value_size;
+ bool ret = GHES_CPER_FAIL;
+
+ ack_value_size = (offsetof(struct hardware_errors_buffer, gesb) -
+ offsetof(struct hardware_errors_buffer, ack_value)) /
+ GHES_ACPI_HEST_NOTIFY_RESERVED;
+
+ if (physical_address && notify < GHES_ACPI_HEST_NOTIFY_RESERVED) {
+ error_block_addr = ges.ghes_addr_le + notify * GHES_MAX_RAW_DATA_LENGTH;
+ error_block_addr = le32_to_cpu(error_block_addr);
+
+ ack_value_addr = ges.ghes_addr_le -
+ (GHES_ACPI_HEST_NOTIFY_RESERVED - notify) * ack_value_size;
+retry:
+ cpu_physical_memory_read(ack_value_addr, &ack_value, ack_value_size);
+ if (!ack_value) {
+ if (loop < 3) {
+ usleep(100 * 1000);
+ loop++;
+ goto retry;
+ } else {
+ error_report("Last time OSPM does not acknowledge the error,"
+ " record CPER failed this time, set the ack value to"
+ " avoid blocking next time CPER record! exit");
+ ack_value = 1;
+ cpu_physical_memory_write(ack_value_addr,
+ &ack_value, ack_value_size);
+ return ret;
+ }
+ } else {
+ /* A zero value in ghes_addr means that BIOS has not yet written
+ * the address
+ */
+ if (error_block_addr) {
+ ack_value = 0;
+ cpu_physical_memory_write(ack_value_addr,
+ &ack_value, ack_value_size);
+ ret = ghes_record_cper(error_block_addr, physical_address);
+ }
+ }
+ }
+ return ret;
+}
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 3d78ff6..def1ec1 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -45,6 +45,7 @@
#include "hw/arm/virt.h"
#include "sysemu/numa.h"
#include "kvm_arm.h"
+#include "hw/acpi/hest_ghes.h"

#define ARM_SPI_BASE 32
#define ACPI_POWER_BUTTON_DEVICE "PWRB"
@@ -771,6 +772,9 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
acpi_add_table(table_offsets, tables_blob);
build_spcr(tables_blob, tables->linker, vms);

+ acpi_add_table(table_offsets, tables_blob);
+ ghes_build_acpi(tables_blob, tables->hardware_errors, tables->linker);
+
if (nb_numa_nodes > 0) {
acpi_add_table(table_offsets, tables_blob);
build_srat(tables_blob, tables->linker, vms);
@@ -887,6 +891,8 @@ void virt_acpi_setup(VirtMachineState *vms)
fw_cfg_add_file(vms->fw_cfg, ACPI_BUILD_TPMLOG_FILE, tables.tcpalog->data,
acpi_data_len(tables.tcpalog));

+ ghes_add_fw_cfg(vms->fw_cfg, tables.hardware_errors);
+
build_state->rsdp_mr = acpi_add_rom_blob(build_state, tables.rsdp,
ACPI_BUILD_RSDP_FILE, 0);

diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 88d0738..7f7b55c 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -211,6 +211,7 @@ struct AcpiBuildTables {
GArray *rsdp;
GArray *tcpalog;
GArray *vmgenid;
+ GArray *hardware_errors;
BIOSLinker *linker;
} AcpiBuildTables;

diff --git a/include/hw/acpi/hest_ghes.h b/include/hw/acpi/hest_ghes.h
new file mode 100644
index 0000000..0772756
--- /dev/null
+++ b/include/hw/acpi/hest_ghes.h
@@ -0,0 +1,47 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Dongjiu Geng <gengd...@huawei.com>
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef ACPI_GHES_H
+#define ACPI_GHES_H
+
+#include "hw/acpi/bios-linker-loader.h"
+
+#define GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors"
+#define GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
+
+#define GHES_GAS_ADDRESS_OFFSET 4
+#define GHES_ERROR_STATUS_ADDRESS_OFFSET 20
+#define GHES_NOTIFICATION_STRUCTURE 32
+
+#define GHES_CPER_OK 1
+#define GHES_CPER_FAIL 0
+
+#define GHES_ACPI_HEST_NOTIFY_RESERVED 11
+/* The max size in Bytes for one error block */
+#define GHES_MAX_RAW_DATA_LENGTH 0x1000
+
+
+typedef struct GhesState {
+ uint64_t ghes_addr_le;
+} GhesState;
+
+void ghes_build_acpi(GArray *table_data, GArray *hardware_error,
+ BIOSLinker *linker);
+void ghes_add_fw_cfg(FWCfgState *s, GArray *hardware_errors);
+bool ghes_update_guest(uint32_t notify, uint64_t error_physical_addr);
+#endif
--
1.8.3.1
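
For readers following the acknowledgement logic in ghes_update_guest() above, the handshake can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: guest memory is simulated with a plain array, the 100 ms sleep between polls is left out, and FakeGuest, try_record() and guest_ack() are hypothetical names. The consumer side follows the GHESv2 rule that the Read Ack register is updated as (old & preserve) | write.

```c
#include <assert.h>
#include <stdint.h>

#define N_SOURCES   11  /* mirrors GHES_ACPI_HEST_NOTIFY_RESERVED */
#define MAX_RETRIES 3   /* mirrors the "loop < 3" retry bound */

/* Hypothetical stand-in for guest memory: one ack word per error source. */
typedef struct {
    uint64_t ack[N_SOURCES];      /* non-zero: the last record was consumed */
    unsigned records[N_SOURCES];  /* records written so far, per source */
} FakeGuest;

typedef enum { REC_OK, REC_NOT_ACKED } RecResult;

/* Producer side (QEMU in the patch): write a record only once acked. */
RecResult try_record(FakeGuest *g, unsigned source)
{
    assert(source < N_SOURCES);
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
        if (g->ack[source] != 0) {
            g->ack[source] = 0;     /* a record is now pending */
            g->records[source]++;   /* stands in for writing the CPER */
            return REC_OK;
        }
        /* The patch sleeps 100 ms here before polling again. */
    }
    /* Give up on this error, but re-arm so the next one is not blocked. */
    g->ack[source] = 1;
    return REC_NOT_ACKED;
}

/* Consumer side (OSPM): Read Ack update, (old & preserve) | write. */
uint64_t guest_ack(uint64_t old, uint64_t preserve, uint64_t write)
{
    return (old & preserve) | write;
}
```

With the preserve and write values used in the table above (0xfffffffe and 0x1), guest_ack() turns a cleared ack word back into 1, which is what lets the next try_record() succeed.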

Dongjiu Geng

Aug 18, 2017, 10:10:41 AM

Add a SIGBUS signal handler. The handler checks the exception
type, translates the host VA delivered by the host or KVM to a
guest PA, records this PA in the CPER, and finally injects an
error into the guest OS through KVM.

Add synchronous external abort injection logic: set up
spsr_elx, esr_elx, PSTATE, far_elx, elr_elx, etc., so that on
switching to the guest OS the vCPU jumps to the synchronous
external abort vector table entry.
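
As an illustration of the syndrome handling described above, here is a minimal standalone sketch of how an ESR value for a synchronous external abort can be composed and recognised. It is not the code from this patch: make_sea_esr() and fsc_is_sea() are hypothetical helpers, and the constants are the architectural ESR_ELx encodings (EC in bits [31:26], IL in bit 25, fault status code in bits [5:0]).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT 26
#define ESR_IL       (1u << 25)  /* 32-bit instruction length */
#define ESR_FSC_MASK 0x3fu       /* {I,D}FSC lives in bits [5:0] */

/* Exception classes (ESR_ELx.EC) for instruction and data aborts. */
enum {
    EC_INSN_ABORT_LOWER_EL = 0x20,
    EC_INSN_ABORT_SAME_EL  = 0x21,
    EC_DATA_ABORT_LOWER_EL = 0x24,
    EC_DATA_ABORT_SAME_EL  = 0x25,
};

/* True for the fault status codes the patch lists as FSC_SEA* / FSC_SECC*. */
bool fsc_is_sea(uint32_t fsc)
{
    fsc &= ESR_FSC_MASK;
    return fsc == 0x10 || (fsc >= 0x14 && fsc <= 0x17) ||
           fsc == 0x18 || (fsc >= 0x1c && fsc <= 0x1f);
}

/* Compose the syndrome for an abort taken to EL1 from EL0 or from EL1. */
uint32_t make_sea_esr(bool is_data_abort, bool from_el0, uint32_t fsc)
{
    uint32_t ec;

    assert(fsc_is_sea(fsc));
    if (is_data_abort) {
        ec = from_el0 ? EC_DATA_ABORT_LOWER_EL : EC_DATA_ABORT_SAME_EL;
    } else {
        ec = from_el0 ? EC_INSN_ABORT_LOWER_EL : EC_INSN_ABORT_SAME_EL;
    }
    return (ec << ESR_EC_SHIFT) | ESR_IL | (fsc & ESR_FSC_MASK);
}
```

The sketch compares all six FSC bits, so a code such as 0x11 is not treated as a synchronous external abort.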

Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
Signed-off-by: Quanming Wu <wuqua...@huawei.com>
---
include/sysemu/kvm.h | 2 +-
linux-headers/asm-arm64/kvm.h | 5 ++
target/arm/internals.h | 13 ++++
target/arm/kvm.c | 34 ++++++++++
target/arm/kvm64.c | 150 ++++++++++++++++++++++++++++++++++++++++++
target/arm/kvm_arm.h | 1 +
6 files changed, 204 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 3a458f5..90c1605 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -361,7 +361,7 @@ bool kvm_vcpu_id_is_valid(int vcpu_id);
/* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */
unsigned long kvm_arch_vcpu_id(CPUState *cpu);

-#ifdef TARGET_I386
+#if defined(TARGET_I386) || defined(TARGET_AARCH64)
#define KVM_HAVE_MCE_INJECTION 1
void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
#endif
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index d254700..5909c30 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -181,6 +181,11 @@ struct kvm_arch_memory_slot {
#define KVM_REG_ARM64_SYSREG_OP2_MASK 0x0000000000000007
#define KVM_REG_ARM64_SYSREG_OP2_SHIFT 0

+/* AArch64 fault registers */
+#define KVM_REG_ARM64_FAULT (0x0014 << KVM_REG_ARM_COPROC_SHIFT)
+#define KVM_REG_ARM64_FAULT_ESR_EC (0)
+#define KVM_REG_ARM64_FAULT_FAR (1)
+
#define ARM64_SYS_REG_SHIFT_MASK(x,n) \
(((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) & \
KVM_REG_ARM64_SYSREG_ ## n ## _MASK)
diff --git a/target/arm/internals.h b/target/arm/internals.h
index 1f6efef..fc0ad6d 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -235,6 +235,19 @@ enum arm_exception_class {
#define ARM_EL_ISV_SHIFT 24
#define ARM_EL_IL (1 << ARM_EL_IL_SHIFT)
#define ARM_EL_ISV (1 << ARM_EL_ISV_SHIFT)
+#define ARM_EL_EC_MASK ((0x3F) << ARM_EL_EC_SHIFT)
+#define ARM_EL_FSC_TYPE (0x3C)
+
+#define FSC_SEA (0x10)
+#define FSC_SEA_TTW0 (0x14)
+#define FSC_SEA_TTW1 (0x15)
+#define FSC_SEA_TTW2 (0x16)
+#define FSC_SEA_TTW3 (0x17)
+#define FSC_SECC (0x18)
+#define FSC_SECC_TTW0 (0x1c)
+#define FSC_SECC_TTW1 (0x1d)
+#define FSC_SECC_TTW2 (0x1e)
+#define FSC_SECC_TTW3 (0x1f)

/* Utility functions for constructing various kinds of syndrome value.
* Note that in general we follow the AArch64 syndrome values; in a
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 7c17f0d..2e1716a 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -129,6 +129,39 @@ void kvm_arm_destroy_scratch_host_vcpu(int *fdarray)
}
}

+typedef struct HWPoisonPage {
+ ram_addr_t ram_addr;
+ QLIST_ENTRY(HWPoisonPage) list;
+} HWPoisonPage;
+
+static QLIST_HEAD(, HWPoisonPage) hwpoison_page_list =
+ QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+static void kvm_unpoison_all(void *param)
+{
+ HWPoisonPage *page, *next_page;
+
+ QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+ QLIST_REMOVE(page, list);
+ qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+ g_free(page);
+ }
+}
+
+void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+ HWPoisonPage *page;
+
+ QLIST_FOREACH(page, &hwpoison_page_list, list) {
+ if (page->ram_addr == ram_addr) {
+ return;
+ }
+ }
+ page = g_new(HWPoisonPage, 1);
+ page->ram_addr = ram_addr;
+ QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
static void kvm_arm_host_cpu_class_init(ObjectClass *oc, void *data)
{
ARMHostCPUClass *ahcc = ARM_HOST_CPU_CLASS(oc);
@@ -182,6 +215,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)

cap_has_mp_state = kvm_check_extension(s, KVM_CAP_MP_STATE);

+ qemu_register_reset(kvm_unpoison_all, NULL);
type_register_static(&host_arm_cpu_type_info);

return 0;
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 0781367..d3bdab2 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -27,6 +27,8 @@
#include "kvm_arm.h"
#include "internals.h"
#include "hw/arm/arm.h"
+#include "hw/acpi/acpi-defs.h"
+#include "hw/acpi/hest_ghes.h"

static bool have_guest_debug;

@@ -590,6 +592,79 @@ int kvm_arm_cpreg_level(uint64_t regidx)
return KVM_PUT_RUNTIME_STATE;
}

+static int kvm_arm_cpreg_value(ARMCPU *cpu, ptrdiff_t fieldoffset)
+{
+ int i;
+
+ for (i = 0; i < cpu->cpreg_array_len; i++) {
+ uint32_t regidx = kvm_to_cpreg_id(cpu->cpreg_indexes[i]);
+ const ARMCPRegInfo *ri;
+ ri = get_arm_cp_reginfo(cpu->cp_regs, regidx);
+ if (!ri) {
+ continue;
+ }
+
+ if (ri->type & ARM_CP_NO_RAW) {
+ continue;
+ }
+
+ if (ri->fieldoffset == fieldoffset) {
+ cpu->cpreg_values[i] = read_raw_cp_reg(&cpu->env, ri);
+ return 0;
+ }
+ }
+ return -EINVAL;
+}
+
+/* Inject synchronous external abort */
+static int kvm_inject_arm_sea(CPUState *c)
+{
+ ARMCPU *cpu = ARM_CPU(c);
+ CPUARMState *env = &cpu->env;
+ unsigned long cpsr = pstate_read(env);
+ uint32_t esr = 0;
+ int ret;
+
+ c->exception_index = EXCP_DATA_ABORT;
+ /* Inject the exception into EL1 */
+ env->exception.target_el = 1;
+ CPUClass *cc = CPU_GET_CLASS(c);
+
+ esr |= (EC_DATAABORT << ARM_EL_EC_SHIFT);
+ /* This exception syndrome includes {I,D}FSC in the bits [5:0]
+ */
+ esr |= (env->exception.syndrome & 0x3f);
+
+ /* This exception is EL0 or EL1 fault. */
+ if ((cpsr & 0xf) == PSTATE_MODE_EL0t) {
+ esr |= (EC_INSNABORT << ARM_EL_EC_SHIFT);
+ } else {
+ esr |= (EC_INSNABORT_SAME_EL << ARM_EL_EC_SHIFT);
+ }
+
+ /* AArch64 has only 32-bit instructions */
+ esr |= ARM_EL_IL;
+ env->exception.syndrome = esr;
+ cc->do_interrupt(c);
+
+ /* set ESR_EL1 */
+ ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.esr_el[1]));
+
+ if (ret) {
+ fprintf(stderr, "<%s> failed to set esr_el1\n", __func__);
+ abort();
+ }
+
+ /* set FAR_EL1 */
+ ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.far_el[1]));
+ if (ret) {
+ fprintf(stderr, "<%s> failed to set far_el1\n", __func__);
+ abort();
+ }
+
+ return 0;
+}
+
#define AARCH64_CORE_REG(x) (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \
KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(x))

@@ -599,6 +674,9 @@ int kvm_arm_cpreg_level(uint64_t regidx)
#define AARCH64_SIMD_CTRL_REG(x) (KVM_REG_ARM64 | KVM_REG_SIZE_U32 | \
KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(x))

+#define AARCH64_FAULT_REG(x) (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \
+ KVM_REG_ARM64_FAULT | (x))
+
int kvm_arch_put_registers(CPUState *cs, int level)
{
struct kvm_one_reg reg;
@@ -873,6 +951,22 @@ int kvm_arch_get_registers(CPUState *cs)
}
vfp_set_fpcr(env, fpr);

+ if (is_a64(env)) {
+ reg.id = AARCH64_FAULT_REG(KVM_REG_ARM64_FAULT_ESR_EC);
+ reg.addr = (uintptr_t)(&env->exception.syndrome);
+ ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+ if (ret) {
+ return ret;
+ }
+
+ reg.id = AARCH64_FAULT_REG(KVM_REG_ARM64_FAULT_FAR);
+ reg.addr = (uintptr_t)(&env->exception.vaddress);
+ ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+ if (ret) {
+ return ret;
+ }
+ }
+
if (!write_kvmstate_to_list(cpu)) {
return EINVAL;
}
@@ -887,6 +981,62 @@ int kvm_arch_get_registers(CPUState *cs)
return ret;
}

+static bool is_abort_sea(unsigned long syndrome)
+{
+ unsigned long fault_status;
+ uint8_t ec = ((syndrome & ARM_EL_EC_MASK) >> ARM_EL_EC_SHIFT);
+ if ((ec != EC_INSNABORT) && (ec != EC_DATAABORT)) {
+ return false;
+ }
+
+ fault_status = syndrome & ARM_EL_FSC_TYPE;
+ switch (fault_status) {
+ case FSC_SEA:
+ case FSC_SEA_TTW0:
+ case FSC_SEA_TTW1:
+ case FSC_SEA_TTW2:
+ case FSC_SEA_TTW3:
+ case FSC_SECC:
+ case FSC_SECC_TTW0:
+ case FSC_SECC_TTW1:
+ case FSC_SECC_TTW2:
+ case FSC_SECC_TTW3:
+ return true;
+ default:
+ return false;
+ }
+}
+
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+{
+ ram_addr_t ram_addr;
+ hwaddr paddr;
+
+ ARMCPU *cpu = ARM_CPU(c);
+ CPUARMState *env = &cpu->env;
+ assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
+ if (addr) {
+ ram_addr = qemu_ram_addr_from_host(addr);
+ if (ram_addr != RAM_ADDR_INVALID &&
+ kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+ kvm_cpu_synchronize_state(c);
+ kvm_hwpoison_page_add(ram_addr);
+ if (is_abort_sea(env->exception.syndrome)) {
+ ghes_update_guest(ACPI_HEST_NOTIFY_SEA, paddr);
+ kvm_inject_arm_sea(c);
+ }
+ return;
+ }
+ fprintf(stderr, "Hardware memory error for memory used by "
+ "QEMU itself instead of guest system!\n");
+ }
+
+ if (code == BUS_MCEERR_AR) {
+ fprintf(stderr, "Hardware memory error!\n");
+ exit(1);
+ }
+}
+
/* C6.6.29 BRK instruction */
static const uint32_t brk_insn = 0xd4200000;

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 633d088..7cdde97 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -288,4 +288,5 @@ static inline const char *its_class_name(void)
}
}

+void kvm_hwpoison_page_add(ram_addr_t ram_addr);
#endif
--
1.8.3.1

Shannon Zhao

Aug 24, 2017, 8:40:09 AM

On 2017/8/18 22:23, Dongjiu Geng wrote:
> (1) Add related APEI/HEST table structures and macros, these
> definition refer to ACPI 6.1 and UEFI 2.6 spec.
> (2) Add generic error status block and CPER memory section
> definition, user space only handle memory section errors.
>
> Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
> ---
> include/hw/acpi/acpi-defs.h | 193 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 193 insertions(+)
>
> diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
> index 72be675..3b4bad7 100644
> --- a/include/hw/acpi/acpi-defs.h
> +++ b/include/hw/acpi/acpi-defs.h
> @@ -297,6 +297,44 @@ typedef struct AcpiMultipleApicTable AcpiMultipleApicTable;
> #define ACPI_APIC_GENERIC_TRANSLATOR 15
> #define ACPI_APIC_RESERVED 16 /* 16 and greater are reserved */
>
> +/* UEFI Spec 2.6, "N.2.5 Memory Error Section */

missing "

> +#define UEFI_CPER_MEM_VALID_ERROR_STATUS 0x0001
> +#define UEFI_CPER_MEM_VALID_PA 0x0002
> +#define UEFI_CPER_MEM_VALID_PA_MASK 0x0004
> +#define UEFI_CPER_MEM_VALID_NODE 0x0008
> +#define UEFI_CPER_MEM_VALID_CARD 0x0010
> +#define UEFI_CPER_MEM_VALID_MODULE 0x0020
> +#define UEFI_CPER_MEM_VALID_BANK 0x0040
> +#define UEFI_CPER_MEM_VALID_DEVICE 0x0080
> +#define UEFI_CPER_MEM_VALID_ROW 0x0100
> +#define UEFI_CPER_MEM_VALID_COLUMN 0x0200
> +#define UEFI_CPER_MEM_VALID_BIT_POSITION 0x0400
> +#define UEFI_CPER_MEM_VALID_REQUESTOR 0x0800
> +#define UEFI_CPER_MEM_VALID_RESPONDER 0x1000
> +#define UEFI_CPER_MEM_VALID_TARGET 0x2000
> +#define UEFI_CPER_MEM_VALID_ERROR_TYPE 0x4000
> +#define UEFI_CPER_MEM_VALID_RANK_NUMBER 0x8000
> +#define UEFI_CPER_MEM_VALID_CARD_HANDLE 0x10000
> +#define UEFI_CPER_MEM_VALID_MODULE_HANDLE 0x20000
> +#define UEFI_CPER_MEM_ERROR_TYPE_MULTI_ECC 3
> +
> +/* From the ACPI 6.1 spec, "18.3.2.9 Hardware Error Notification" */
> +

It's better to refer to the first spec version that defines this
structure, and the same goes for the others you define.

> +enum AcpiHestNotifyType {
> + ACPI_HEST_NOTIFY_POLLED = 0,
> + ACPI_HEST_NOTIFY_EXTERNAL = 1,
> + ACPI_HEST_NOTIFY_LOCAL = 2,
> + ACPI_HEST_NOTIFY_SCI = 3,
> + ACPI_HEST_NOTIFY_NMI = 4,
> + ACPI_HEST_NOTIFY_CMCI = 5, /* ACPI 5.0 */
> + ACPI_HEST_NOTIFY_MCE = 6, /* ACPI 5.0 */
> + ACPI_HEST_NOTIFY_GPIO = 7, /* ACPI 6.0 */
> + ACPI_HEST_NOTIFY_SEA = 8, /* ACPI 6.1 */
> + ACPI_HEST_NOTIFY_SEI = 9, /* ACPI 6.1 */
> + ACPI_HEST_NOTIFY_GSIV = 10, /* ACPI 6.1 */
> + ACPI_HEST_NOTIFY_RESERVED = 11 /* 11 and greater are reserved */

In ACPI 6.2, 11 is for Software Delegated Exception, is this useful for
your patchset?

> +};
> +
> /*
> * MADT sub-structures (Follow MULTIPLE_APIC_DESCRIPTION_TABLE)
> */
> @@ -474,6 +512,161 @@ struct AcpiSystemResourceAffinityTable {
> } QEMU_PACKED;
> typedef struct AcpiSystemResourceAffinityTable AcpiSystemResourceAffinityTable;
>
> +/* Hardware Error Notification, from the ACPI 6.1
> + * spec, "18.3.2.9 Hardware Error Notification"
> + */

Use the style below for multi-line comments:
/*
 * XXX
 */

> +struct AcpiHestNotify {
> + uint8_t type;
> + uint8_t length;
> + uint16_t config_write_enable;
> + uint32_t poll_interval;
> + uint32_t vector;
> + uint32_t polling_threshold_value;
> + uint32_t polling_threshold_window;
> + uint32_t error_threshold_value;
> + uint32_t error_threshold_window;
> +} QEMU_PACKED;
> +typedef struct AcpiHestNotify AcpiHestNotify;
> +
> +/* From ACPI 6.1, sections "18.3.2.1 IA-32 Architecture Machine
> + * Check Exception" through "18.3.2.8 Generic Hardware Error Source version 2".
> + */
> +enum AcpiHestSourceType {
> + ACPI_HEST_SOURCE_IA32_CHECK = 0,
> + ACPI_HEST_SOURCE_IA32_CORRECTED_CHECK = 1,
> + ACPI_HEST_SOURCE_IA32_NMI = 2,

What's 3, 4, 5 for?

Shannon

Shannon Zhao

Aug 24, 2017, 9:10:10 AM

On 2017/8/18 22:23, Dongjiu Geng wrote:

Doesn't the new file need to be added to hw/acpi/Makefile.objs?

Please unify this between this file and hest_ghes.h by referring to other files.

> +
> +#include "qemu/osdep.h"
> +#include "qmp-commands.h"

unnecessary include

So we add this table unconditionally. Is there any bad impact if QEMU
runs on old kvm? Does it need to check whether KVM supports RAS?

Shannon

gengdongjiu

Aug 25, 2017, 6:40:07 AM

Shannon,
Thanks for the review. Please see my replies below.

On 2017/8/24 20:33, Shannon Zhao wrote:
>
>
> On 2017/8/18 22:23, Dongjiu Geng wrote:
>> (1) Add related APEI/HEST table structures and macros, these
>> definition refer to ACPI 6.1 and UEFI 2.6 spec.
>> (2) Add generic error status block and CPER memory section
>> definition, user space only handle memory section errors.
>>
>> Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
>> ---
>> include/hw/acpi/acpi-defs.h | 193 ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 193 insertions(+)
>>
>> diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
>> index 72be675..3b4bad7 100644
>> --- a/include/hw/acpi/acpi-defs.h
>> +++ b/include/hw/acpi/acpi-defs.h
>> @@ -297,6 +297,44 @@ typedef struct AcpiMultipleApicTable AcpiMultipleApicTable;
>> #define ACPI_APIC_GENERIC_TRANSLATOR 15
>> #define ACPI_APIC_RESERVED 16 /* 16 and greater are reserved */
>>
>> +/* UEFI Spec 2.6, "N.2.5 Memory Error Section */
> missing "

Thanks for pointing that out.

>
>> +#define UEFI_CPER_MEM_VALID_ERROR_STATUS 0x0001
>> +#define UEFI_CPER_MEM_VALID_PA 0x0002
>> +#define UEFI_CPER_MEM_VALID_PA_MASK 0x0004
>> +#define UEFI_CPER_MEM_VALID_NODE 0x0008
>> +#define UEFI_CPER_MEM_VALID_CARD 0x0010
>> +#define UEFI_CPER_MEM_VALID_MODULE 0x0020
>> +#define UEFI_CPER_MEM_VALID_BANK 0x0040
>> +#define UEFI_CPER_MEM_VALID_DEVICE 0x0080
>> +#define UEFI_CPER_MEM_VALID_ROW 0x0100
>> +#define UEFI_CPER_MEM_VALID_COLUMN 0x0200
>> +#define UEFI_CPER_MEM_VALID_BIT_POSITION 0x0400
>> +#define UEFI_CPER_MEM_VALID_REQUESTOR 0x0800
>> +#define UEFI_CPER_MEM_VALID_RESPONDER 0x1000
>> +#define UEFI_CPER_MEM_VALID_TARGET 0x2000
>> +#define UEFI_CPER_MEM_VALID_ERROR_TYPE 0x4000
>> +#define UEFI_CPER_MEM_VALID_RANK_NUMBER 0x8000
>> +#define UEFI_CPER_MEM_VALID_CARD_HANDLE 0x10000
>> +#define UEFI_CPER_MEM_VALID_MODULE_HANDLE 0x20000
>> +#define UEFI_CPER_MEM_ERROR_TYPE_MULTI_ECC 3
>> +
>> +/* From the ACPI 6.1 spec, "18.3.2.9 Hardware Error Notification" */
>> +
> It's better to refer to the first spec version of this structure and
> same with others you define.

Which spec version do you mean? The definitions are aligned with the Linux kernel.

>
>> +enum AcpiHestNotifyType {
>> + ACPI_HEST_NOTIFY_POLLED = 0,
>> + ACPI_HEST_NOTIFY_EXTERNAL = 1,
>> + ACPI_HEST_NOTIFY_LOCAL = 2,
>> + ACPI_HEST_NOTIFY_SCI = 3,
>> + ACPI_HEST_NOTIFY_NMI = 4,
>> + ACPI_HEST_NOTIFY_CMCI = 5, /* ACPI 5.0 */
>> + ACPI_HEST_NOTIFY_MCE = 6, /* ACPI 5.0 */
>> + ACPI_HEST_NOTIFY_GPIO = 7, /* ACPI 6.0 */
>> + ACPI_HEST_NOTIFY_SEA = 8, /* ACPI 6.1 */
>> + ACPI_HEST_NOTIFY_SEI = 9, /* ACPI 6.1 */
>> + ACPI_HEST_NOTIFY_GSIV = 10, /* ACPI 6.1 */
>> + ACPI_HEST_NOTIFY_RESERVED = 11 /* 11 and greater are reserved */
> In ACPI 6.2, 11 is for Software Delegated Exception, is this useful for
> your patchset?

It is useful: I reserve space for all the error sources.
Because the space is allocated once rather than dynamically,
I use ACPI_HEST_NOTIFY_RESERVED to specify that there are 11 error sources.
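
To make the fixed reservation concrete, here is a small sketch of how per-source addresses fall out of a single allocation, mirroring the arithmetic in ghes_update_guest(). It is a sketch under assumptions, not the patch's code: the helper names are hypothetical, first_block stands for the guest address of the first error status block, and ACK_SIZE assumes one 64-bit ack word per source placed directly before the blocks.

```c
#include <assert.h>
#include <stdint.h>

#define N_SOURCES  11u      /* mirrors GHES_ACPI_HEST_NOTIFY_RESERVED */
#define BLOCK_SIZE 0x1000u  /* mirrors GHES_MAX_RAW_DATA_LENGTH */
#define ACK_SIZE   8u       /* assumed: one 64-bit ack word per source */

/* Address of the error status block reserved for one notification type. */
uint64_t error_block_addr(uint64_t first_block, unsigned source)
{
    assert(source < N_SOURCES);
    return first_block + (uint64_t)source * BLOCK_SIZE;
}

/* Address of the ack word for one notification type; the ack words are
 * assumed to sit immediately before the first error status block. */
uint64_t ack_value_addr(uint64_t first_block, unsigned source)
{
    assert(source < N_SOURCES);
    return first_block - (uint64_t)(N_SOURCES - source) * ACK_SIZE;
}
```

Because every notification type has a slot at a fixed offset, enabling another error source later does not change where the existing ones live.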

>
>> +};
>> +
>> /*
>> * MADT sub-structures (Follow MULTIPLE_APIC_DESCRIPTION_TABLE)
>> */
>> @@ -474,6 +512,161 @@ struct AcpiSystemResourceAffinityTable {
>> } QEMU_PACKED;
>> typedef struct AcpiSystemResourceAffinityTable AcpiSystemResourceAffinityTable;
>>
>> +/* Hardware Error Notification, from the ACPI 6.1
>> + * spec, "18.3.2.9 Hardware Error Notification"
>> + */
> Use below style for multiple comment lines
> /*
> * XXX
> */

You are right, thanks for pointing that out.

>
>> +struct AcpiHestNotify {
>> + uint8_t type;
>> + uint8_t length;
>> + uint16_t config_write_enable;
>> + uint32_t poll_interval;
>> + uint32_t vector;
>> + uint32_t polling_threshold_value;
>> + uint32_t polling_threshold_window;
>> + uint32_t error_threshold_value;
>> + uint32_t error_threshold_window;
>> +} QEMU_PACKED;
>> +typedef struct AcpiHestNotify AcpiHestNotify;
>> +
>> +/* From ACPI 6.1, sections "18.3.2.1 IA-32 Architecture Machine
>> + * Check Exception" through "18.3.2.8 Generic Hardware Error Source version 2".
>> + */
>> +enum AcpiHestSourceType {
>> + ACPI_HEST_SOURCE_IA32_CHECK = 0,
>> + ACPI_HEST_SOURCE_IA32_CORRECTED_CHECK = 1,
>> + ACPI_HEST_SOURCE_IA32_NMI = 2,
> What's 3, 4, 5 for?

The ACPI spec does not use 3, 4, 5, so we do not define them.

gengdongjiu

Aug 25, 2017, 7:30:08 AM

Hi Shannon,

I modified the Makefile.objs in another patch.

Ok, thanks.

>
>> +
>> +#include "qemu/osdep.h"
>> +#include "qmp-commands.h"
> unnecessary including

I will remove it.

This table is added before the guest OS boots, so we cannot use KVM to check it.
If an old KVM does not support RAS, there is no bad impact; it only wastes table memory.
Maybe we can make it a device? If this device is enabled in the QEMU
boot parameters, then we add this table?

Shannon Zhao

Aug 25, 2017, 9:10:10 PM

On 2017/8/25 18:37, gengdongjiu wrote:
>>> +
>>> +/* From the ACPI 6.1 spec, "18.3.2.9 Hardware Error Notification" */
>>> +
>> It's better to refer to the first spec version of this structure and
>> same with others you define.
> do you mean which spec version? the definition is aligned with the linux kernel.

What I mean here is that it's better to refer to the ACPI spec version
that first introduced Hardware Error Notification.

>>
>>> +enum AcpiHestNotifyType {
>>> + ACPI_HEST_NOTIFY_POLLED = 0,
>>> + ACPI_HEST_NOTIFY_EXTERNAL = 1,
>>> + ACPI_HEST_NOTIFY_LOCAL = 2,
>>> + ACPI_HEST_NOTIFY_SCI = 3,
>>> + ACPI_HEST_NOTIFY_NMI = 4,
>>> + ACPI_HEST_NOTIFY_CMCI = 5, /* ACPI 5.0 */
>>> + ACPI_HEST_NOTIFY_MCE = 6, /* ACPI 5.0 */
>>> + ACPI_HEST_NOTIFY_GPIO = 7, /* ACPI 6.0 */
>>> + ACPI_HEST_NOTIFY_SEA = 8, /* ACPI 6.1 */
>>> + ACPI_HEST_NOTIFY_SEI = 9, /* ACPI 6.1 */
>>> + ACPI_HEST_NOTIFY_GSIV = 10, /* ACPI 6.1 */
>>> + ACPI_HEST_NOTIFY_RESERVED = 11 /* 11 and greater are reserved */
>> In ACPI 6.2, 11 is for Software Delegated Exception, is this useful for
>> your patchset?
> it is usefull, for all the error source, I reserved the space for them.
> Because the space is allocated one time, is not dynamically allocated.
> so I use the ACPI_HEST_NOTIFY_RESERVED to specify that there is 11 error source.
>

I mean whether the new type Software Delegated Exception is useful for
RAS. If so, we could add this new type here.

Thanks,
--
Shannon

Shannon Zhao

Aug 25, 2017, 9:10:10 PM

On 2017/8/25 19:20, gengdongjiu wrote:
>>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>>> index 3d78ff6..def1ec1 100644
>>> --- a/hw/arm/virt-acpi-build.c
>>> +++ b/hw/arm/virt-acpi-build.c
>>> @@ -45,6 +45,7 @@
>>> #include "hw/arm/virt.h"
>>> #include "sysemu/numa.h"
>>> #include "kvm_arm.h"
>>> +#include "hw/acpi/hest_ghes.h"
>>>
>>> #define ARM_SPI_BASE 32
>>> #define ACPI_POWER_BUTTON_DEVICE "PWRB"
>>> @@ -771,6 +772,9 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>>> acpi_add_table(table_offsets, tables_blob);
>>> build_spcr(tables_blob, tables->linker, vms);
>>>
>>> + acpi_add_table(table_offsets, tables_blob);
>>> + ghes_build_acpi(tables_blob, tables->hardware_errors, tables->linker);
>>> +
>> So we add this table unconditionally. Is there any bad impact if QEMU
>> runs on old kvm? Does it need to check whether KVM supports RAS?
> this table is added before guest OS boot. so can not use KVM to check it.

No, we can check the RAS capability when we create vcpus, like you did
in another patch, and use that in table generation.

> if the old kvm does not support RAS, it does not have bad impact. only waste table memory.
> May be we can make it as device? if this device is enabled in the qemu
> boot parameters, then we will add this table?
>

And you need to add an option to the virt machine for (migration)
compatibility. On new virt machines it's on by default, while it's off
for old ones.
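
The two conditions discussed here (a per-machine-version default plus a host capability check) can be sketched as a single decision function. This is purely illustrative and assumes nothing about the actual QEMU machine API: TriState, MachineVersionSketch, should_build_hest() and the "virt-old"/"virt-new" names used with it are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-facing option: unset, forced on, or forced off. */
typedef enum { OPT_AUTO, OPT_ON, OPT_OFF } TriState;

/* Hypothetical per-machine-version default, fixed when the version ships. */
typedef struct {
    const char *name;
    bool hest_default;  /* off for old versions so migration stays stable */
} MachineVersionSketch;

/* Build the table only if it is wanted and the host can notify the guest. */
bool should_build_hest(const MachineVersionSketch *mv, TriState user_opt,
                       bool host_can_notify)
{
    bool wanted;

    assert(mv != NULL);
    wanted = (user_opt == OPT_AUTO) ? mv->hest_default : (user_opt == OPT_ON);
    return wanted && host_can_notify;
}
```

Keeping the default tied to the machine version means a guest migrated from an older QEMU keeps seeing the same set of ACPI tables.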

Thanks,
--
Shannon

gengdongjiu

Aug 25, 2017, 9:50:06 PM

On 2017/8/26 9:00, Shannon Zhao wrote:
>
>
> On 2017/8/25 18:37, gengdongjiu wrote:
>>>> +
>>>>>> +/* From the ACPI 6.1 spec, "18.3.2.9 Hardware Error Notification" */
>>>>>> +
>>>> It's better to refer to the first spec version of this structure and
>>>> same with others you define.
>> do you mean which spec version? the definition is aligned with the linux kernel.
> What I mean here is that it's better to refer to the ACPI spec version
> which introduces Hardware Error Notification first time.

OK, I understand what you mean. I will clarify that. Thanks.

>
>>>>
>>>>>> +enum AcpiHestNotifyType {
>>>>>> + ACPI_HEST_NOTIFY_POLLED = 0,
>>>>>> + ACPI_HEST_NOTIFY_EXTERNAL = 1,
>>>>>> + ACPI_HEST_NOTIFY_LOCAL = 2,
>>>>>> + ACPI_HEST_NOTIFY_SCI = 3,
>>>>>> + ACPI_HEST_NOTIFY_NMI = 4,
>>>>>> + ACPI_HEST_NOTIFY_CMCI = 5, /* ACPI 5.0 */
>>>>>> + ACPI_HEST_NOTIFY_MCE = 6, /* ACPI 5.0 */
>>>>>> + ACPI_HEST_NOTIFY_GPIO = 7, /* ACPI 6.0 */
>>>>>> + ACPI_HEST_NOTIFY_SEA = 8, /* ACPI 6.1 */
>>>>>> + ACPI_HEST_NOTIFY_SEI = 9, /* ACPI 6.1 */
>>>>>> + ACPI_HEST_NOTIFY_GSIV = 10, /* ACPI 6.1 */
>>>>>> + ACPI_HEST_NOTIFY_RESERVED = 11 /* 11 and greater are reserved */
>>>> In ACPI 6.2, 11 is for Software Delegated Exception, is this useful for
>>>> your patchset?
>> it is usefull, for all the error source, I reserved the space for them.
>> Because the space is allocated one time, is not dynamically allocated.
>> so I use the ACPI_HEST_NOTIFY_RESERVED to specify that there is 11 error source.
>>
> I mean whether the new type Software Delegated Exception is useful for
> RAS. If so, we could add this new type here.

I just checked the ACPI 6.2 spec, and it did indeed introduce the new SDEI type.

Currently we do not use the Software Delegated Exception type introduced by ACPI 6.2,
so we may not need to add the new type.

>
> Thanks,
>

gengdongjiu

Aug 25, 2017, 11:00:08 PM

Hi Shannon,

On 2017/8/26 9:08, Shannon Zhao wrote:
>
>
> On 2017/8/25 19:20, gengdongjiu wrote:
>>>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>>>>>> index 3d78ff6..def1ec1 100644
>>>>>> --- a/hw/arm/virt-acpi-build.c
>>>>>> +++ b/hw/arm/virt-acpi-build.c
>>>>>> @@ -45,6 +45,7 @@
>>>>>> #include "hw/arm/virt.h"
>>>>>> #include "sysemu/numa.h"
>>>>>> #include "kvm_arm.h"
>>>>>> +#include "hw/acpi/hest_ghes.h"
>>>>>>
>>>>>> #define ARM_SPI_BASE 32
>>>>>> #define ACPI_POWER_BUTTON_DEVICE "PWRB"
>>>>>> @@ -771,6 +772,9 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>>>>>> acpi_add_table(table_offsets, tables_blob);
>>>>>> build_spcr(tables_blob, tables->linker, vms);
>>>>>>
>>>>>> + acpi_add_table(table_offsets, tables_blob);
>>>>>> + ghes_build_acpi(tables_blob, tables->hardware_errors, tables->linker);
>>>>>> +
>>>> So we add this table unconditionally. Is there any bad impact if QEMU
>>>> runs on old kvm? Does it need to check whether KVM supports RAS?
>> this table is added before guest OS boot. so can not use KVM to check it.
> No, we can check the RAS capability when we create vcpus like you done
> in another patch ans can use that in table generation.

I understand what you mean.

James (ARM) previously made the comments below about the table generation.
----------------------------------------------------------------------------------
But you can use APEI in a guest on CPUs without the RAS extensions: the host may
signal memory errors to Qemu for any number of reasons, user-space shouldn't
care how it knows. Examples are PCI-AER, any APEI event notified by polling or
one of the flavours of irq.

I would expect Qemu to generate a HEST based on its abilities, i.e. if it
supports any mechanism of notifying the guest about errors. Choosing the
mechanism then depends on the type of error.

Ideally the Qemu code for HEST/GHES/CPER generation code using some of the irqs
and polling could be shared with x86, as these should be possible using common
KVM APIs.
-----------------------------------------------------------------------------------

He means we can use APEI on CPUs without RAS and maybe share this code with x86;
if QEMU supports any mechanism for notifying the guest about errors, it should
generate the table.
Now we depend on the macro KVM_HAVE_MCE_INJECTION to decide whether QEMU can support
notifying the guest.

What do you think we should depend on to generate the table?

>
>> if the old kvm does not support RAS, it does not have bad impact. only waste table memory.
>> May be we can make it as device? if this device is enabled in the qemu
>> boot parameters, then we will add this table?
>>
>
> And you need to add a option to virt machine for (migration)
> compatibility. On new virt machine it's on by default while off for old
> ones.

ok.

>
> Thanks,
>

Igor Mammedov

Aug 29, 2017, 6:30:09 AM

On Fri, 18 Aug 2017 22:23:43 +0800
Dongjiu Geng <gengd...@huawei.com> wrote:

> This implements APEI GHES Table by passing the error CPER info
> to the guest via a fw_cfg_blob. After a CPER info is recorded, an
> SEA(Synchronous External Abort)/SEI(SError Interrupt) exception
> will be injected into the guest OS.

it's a rather complex patch/functionality so I've mostly just skimmed it and
commented only on the structure of the patch and the changes I'd like to see
so it would be more structured and review-able.

I'd suggest to add doc patch first which will describe how it's
supposed to work between QEMU/firmware/guest OS with expected
flows.

these diagrams show the relations between tables, which is not necessarily bad,
but as a layout description they are useless.
* Probably there is not much sense to have HEST table here, it's described
well enough in spec. You might just put reference here.
* these diagrams should go into doc/spec patch
* when you describe layout you need to show what and at what offsets
in which order in which blob/file is located. See ACPI spec for example
and/or docs/specs/acpi_nvdimm.txt docs/specs/acpi_mem_hotplug.txt for inspiration.

looks redundant, g_malloc0 does it for you

> + gdata = (AcpiGenericErrorData *)buffer;
> +
> + /* Memory section */
> + memcpy(&(gdata->section_type_le), &mem_section_id_le,
> + sizeof(mem_section_id_le));
> +
> + /* error severity is recoverable */
> + gdata->error_severity = ACPI_CPER_SEV_RECOVERABLE;
> + gdata->revision = 0x300; /* the revision number is 0x300 */
> + gdata->error_data_length = cpu_to_le32(sizeof(UefiCperSecMemErr));
> +
> + mem_err = (UefiCperSecMemErr *) (gdata + 1);
> +
> + /* User space only handle the memory section CPER */
> +
> + /* Hard code to Multi-bit ECC error */
> + mem_err->validation_bits |= cpu_to_le32(UEFI_CPER_MEM_VALID_ERROR_TYPE);
> + mem_err->error_type = cpu_to_le32(UEFI_CPER_MEM_ERROR_TYPE_MULTI_ECC);
> +
> + /* Record the physical address at which the memory error occurred */
> + mem_err->validation_bits |= cpu_to_le32(UEFI_CPER_MEM_VALID_PA);
> + mem_err->physical_addr = cpu_to_le32(error_physical_addr);

I'd prefer for you to use build_append_int_noprefix() API to compose
whole error status block

and try to get rid of most structures you introduce in patch 1/6,
as they will be left unused after that.
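
A minimal sketch of the field-by-field style suggested above. It is self-contained, so `Blob` and `blob_append_int()` are hypothetical stand-ins for QEMU's GArray and build_append_int_noprefix(); only the little-endian append behaviour is mirrored. The field names and sizes are those of the Generic Error Status Block header as I read them in the ACPI spec and should be checked against it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a growable table blob (QEMU uses a GArray). */
typedef struct {
    uint8_t data[256];
    size_t len;
} Blob;

/*
 * Append the low `size` bytes of `value`, least significant byte first.
 * Stand-in for build_append_int_noprefix(); returns 0 on success and
 * -1 if the field is wider than 8 bytes or the blob is full.
 */
static int blob_append_int(Blob *b, uint64_t value, size_t size)
{
    if (size > 8 || b->len + size > sizeof(b->data)) {
        return -1;
    }
    for (size_t i = 0; i < size; i++) {
        b->data[b->len++] = (uint8_t)(value & 0xff);
        value >>= 8;
    }
    return 0;
}

/* Generic Error Status Block header: one call and one comment per field. */
static int build_error_status_block(Blob *b, uint32_t block_status,
                                    uint32_t data_length, uint32_t severity)
{
    int ret = 0;

    ret |= blob_append_int(b, block_status, 4); /* Block Status */
    ret |= blob_append_int(b, 0, 4);            /* Raw Data Offset */
    ret |= blob_append_int(b, 0, 4);            /* Raw Data Length */
    ret |= blob_append_int(b, data_length, 4);  /* Data Length */
    ret |= blob_append_int(b, severity, 4);     /* Error Severity */
    return ret;
}
```

Because every field is appended with an explicit width, no packed struct and no cpu_to_le32() calls are needed to describe the layout.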

It's mostly generic GAS structure with linker addition.
I'd suggest to reuse something like
https://github.com/imammedo/qemu/commit/3d2fd6d13a3ea298d2ee814835495ce6241d085c
to build GAS and use bios_linker_loader_add_pointer() directly in ghes_build_acpi().

> +void ghes_build_acpi(GArray *table_data, GArray *hardware_error,
> + BIOSLinker *linker)
> +{
> + uint32_t ghes_start = table_data->len;
> + uint32_t address_size, error_status_address_offset;
> + uint32_t read_ack_register_offset, i;
> +
> + address_size = sizeof(struct AcpiGenericAddress) -
> + offsetof(struct AcpiGenericAddress, address);

it's confusing name for var,
AcpiGenericAddress::address is fixed unsigned 64 bit integer per spec
also, I'm not sure why it's needed at all.

> +
> + error_status_address_offset = ghes_start +
> + sizeof(AcpiHardwareErrorSourceTable) +
> + offsetof(AcpiGenericHardwareErrorSourceV2, error_status_address) +
> + offsetof(struct AcpiGenericAddress, address);
> +
> + read_ack_register_offset = ghes_start +
> + sizeof(AcpiHardwareErrorSourceTable) +
> + offsetof(AcpiGenericHardwareErrorSourceV2, read_ack_register) +
> + offsetof(struct AcpiGenericAddress, address);

it's really hard to understand why you use offsetof() so much in this function;
to me the above code is totally unreadable.
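
One possible way to make these offsets readable is to name each constant after its spec field. This is only a sketch: the macro and function names are hypothetical, and the sizes and offsets (36-byte ACPI table header, 4-byte Error Source Count, 92-byte GHESv2 entry, 12-byte GAS with the address at byte 4) are as I read them in ACPI 6.1 and should be verified against the spec.

```c
#include <stdint.h>

/* All values below are assumptions taken from a reading of ACPI 6.1. */
#define ACPI_TABLE_HEADER_SIZE         36
#define HEST_ERROR_SOURCE_COUNT_SIZE   4
#define HEST_HEADER_SIZE \
    (ACPI_TABLE_HEADER_SIZE + HEST_ERROR_SOURCE_COUNT_SIZE)

/* GAS: space id, bit width, bit offset, access size, then the address */
#define GAS_ADDR_OFFSET                4
#define GAS_ADDR_SIZE                  8

#define GHES_V2_SIZE                   92
#define GHES_ERROR_STATUS_ADDR_OFFSET  20  /* Error Status Address (GAS) */
#define GHES_READ_ACK_REG_OFFSET       64  /* Read Ack Register (GAS) */

/* Blob offset of the 64-bit address inside Error Status Address. */
static uint32_t ghes_error_status_addr_field(uint32_t hest_start,
                                             uint32_t source_index)
{
    return hest_start + HEST_HEADER_SIZE + source_index * GHES_V2_SIZE +
           GHES_ERROR_STATUS_ADDR_OFFSET + GAS_ADDR_OFFSET;
}

/* Blob offset of the 64-bit address inside Read Ack Register. */
static uint32_t ghes_read_ack_addr_field(uint32_t hest_start,
                                         uint32_t source_index)
{
    return hest_start + HEST_HEADER_SIZE + source_index * GHES_V2_SIZE +
           GHES_READ_ACK_REG_OFFSET + GAS_ADDR_OFFSET;
}
```

These are the offsets a linker pointer command would patch; GAS_ADDR_SIZE is the width of the patched field.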

> + acpi_data_push(hardware_error,
> + offsetof(struct hardware_errors_buffer, ack_value));

it looks like you are trying to build several tables within one function,
so it's hard to see what's going on.
I'd suggest building separate tables independently where possible.

i.e. build independent tables first
and only then build dependent tables, passing them pointers
to the previously built tables if necessary.

> + for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++)
> + /* Initialize read ack register */
> + build_append_int_noprefix((void *)hardware_error, 1, 8);
> +
> + /* Reserved the total size for ERRORS fw_cfg blob
> + */
> + acpi_data_push(hardware_error, sizeof(struct hardware_errors_buffer));
> +
> + /* Allocate guest memory for the Data fw_cfg blob */
> + bios_linker_loader_alloc(linker, GHES_ERRORS_FW_CFG_FILE, hardware_error,
> + 1, false);
> + /* Reserve table header size */
> + acpi_data_push(table_data, sizeof(AcpiTableHeader));
> +
> + build_append_int_noprefix(table_data, GHES_ACPI_HEST_NOTIFY_RESERVED, 4);

GHES_ACPI_HEST_NOTIFY_RESERVED - the name doesn't actually tell what it is.
I'd suggest using the spec field name with a table prefix, e.g.:
ACPI_HEST_ERROR_SOURCE_COUNT

also, beside build_append_int_noprefix() you need to at least
add comment that exactly matches field from spec.

the same applies to other fields you are adding in this patch

just do something like this instead of build_address():
build_append_gas()
bios_linker_loader_add_pointer()

also register width 0x40 looks suspicious, where does it come from?
While at it, do you have real hardware with a HEST table that you're trying to model this after?
I'd like to see the HEST and other related tables from it.

gengdongjiu

Aug 29, 2017, 7:30:09 AM

Igor,
Thank you very much for your review and comments, I will check your comments in detail and reply to you.


gengdongjiu

Sep 1, 2017, 6:00:07 AM

Hi Igor,

On 2017/8/29 18:20, Igor Mammedov wrote:

> On Fri, 18 Aug 2017 22:23:43 +0800
> Dongjiu Geng <gengd...@huawei.com> wrote:
>
>> This implements APEI GHES Table by passing the error CPER info
>> to the guest via a fw_cfg_blob. After a CPER info is recorded, an
>> SEA(Synchronous External Abort)/SEI(SError Interrupt) exception
>> will be injected into the guest OS.
>
> it's a bit complex patch/functionality so I've just mosty skimmed and
> commented only on structure of the patch and changes I'd like to see
> so it would be more structured and review-able.

I will make it more structured and review-able, so that it is easily readable.

>
> I'd suggest to add doc patch first which will describe how it's
> supposed to work between QEMU/firmware/guest OS with expected
> flows.

Sure, I will add a doc patch.

Maybe not. It shows how the table is generated and how the CPER is recorded through the fw_cfg blobs
between QEMU and guest firmware, for example etc/acpi/tables and etc/hardware_errors.
Anyway, I will move it to the spec/doc.

> * Probably there is not much sense to have HEST table here, it's described
> well enough in spec. You might just put reference here.
> * these diagrams should go into doc/spec patch

I will move these diagrams to doc/spec patch

> * when you describe layout you need to show what and at what offsets
> in which order in which blob/file is located. See ACPI spec for example
> and/or docs/specs/acpi_nvdimm.txt docs/specs/acpi_mem_hotplug.txt for inspiration.

Ok, thanks for the suggestion.

>
>> For GHESv2 error source, the OSPM must acknowledges the error via Read Ack register.
>> so user space must check the ack value to avoid read-write race condition.
>>
>> Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
>> ---

cut----

>> + /* Write back the Generic Error Status Block to guest memory */
>> + cpu_physical_memory_write(error_block_address, &block,
>> + sizeof(AcpiGenericErrorStatus));
>> +
>> + /* Fill in Generic Error Data Entry */
>> + buffer = g_malloc0(sizeof(AcpiGenericErrorData) +
>> + sizeof(UefiCperSecMemErr));
>> +
>> +
>> + memset(buffer, 0, sizeof(AcpiGenericErrorData) + sizeof(UefiCperSecMemErr));
> looks redundant, g_malloc0 does it for you

I will remove it.

>
>> + gdata = (AcpiGenericErrorData *)buffer;
>> +
>> + /* Memory section */
>> + memcpy(&(gdata->section_type_le), &mem_section_id_le,
>> + sizeof(mem_section_id_le));
>> +
>> + /* error severity is recoverable */
>> + gdata->error_severity = ACPI_CPER_SEV_RECOVERABLE;
>> + gdata->revision = 0x300; /* the revision number is 0x300 */
>> + gdata->error_data_length = cpu_to_le32(sizeof(UefiCperSecMemErr));
>> +
>> + mem_err = (UefiCperSecMemErr *) (gdata + 1);
>> +
>> + /* User space only handle the memory section CPER */
>> +
>> + /* Hard code to Multi-bit ECC error */
>> + mem_err->validation_bits |= cpu_to_le32(UEFI_CPER_MEM_VALID_ERROR_TYPE);
>> + mem_err->error_type = cpu_to_le32(UEFI_CPER_MEM_ERROR_TYPE_MULTI_ECC);
>> +
>> + /* Record the physical address at which the memory error occurred */
>> + mem_err->validation_bits |= cpu_to_le32(UEFI_CPER_MEM_VALID_PA);
>> + mem_err->physical_addr = cpu_to_le32(error_physical_addr);
>
> I'd prefer for you to use build_append_int_noprefix() API to compose
> whole error status block

I will use the build_append_int_noprefix() API to compose the error status block.

>
> and try to get rid of most structures you introduce in patch 1/6,
> as they will be left unused after that.

I will clean up the structures and remove the ones that this patch does not use.

thanks, OK.

>
>> +void ghes_build_acpi(GArray *table_data, GArray *hardware_error,
>> + BIOSLinker *linker)
>> +{
>> + uint32_t ghes_start = table_data->len;
>> + uint32_t address_size, error_status_address_offset;
>> + uint32_t read_ack_register_offset, i;
>> +
>> + address_size = sizeof(struct AcpiGenericAddress) -
>> + offsetof(struct AcpiGenericAddress, address);
> it's confusing name for var,
> AcpiGenericAddress::address is fixed unsigned 64 bit integer per spec
> also, I'm not sure why it's needed at all.

It is because other people had concerns about where the "unsigned 64 bit integer"
comes from; they were confused by it,
so they suggested using sizeof. Anyway, I will directly use an unsigned 64 bit integer.

>
>> +
>> + error_status_address_offset = ghes_start +
>> + sizeof(AcpiHardwareErrorSourceTable) +
>> + offsetof(AcpiGenericHardwareErrorSourceV2, error_status_address) +
>> + offsetof(struct AcpiGenericAddress, address);
>> +
>> + read_ack_register_offset = ghes_start +
>> + sizeof(AcpiHardwareErrorSourceTable) +
>> + offsetof(AcpiGenericHardwareErrorSourceV2, read_ack_register) +
>> + offsetof(struct AcpiGenericAddress, address);
> it's really hard to get why you use offsetof() so much in this function,
> to me above code totally unreadable.

I will find a way to make it more readable.

>
>> + acpi_data_push(hardware_error,
>> + offsetof(struct hardware_errors_buffer, ack_value));
> it looks like you are trying to build several tables within one function,
> so it's hard to get what's going on.
> I'd suggest to build separate table independently where it's possible.
>
> i.e. build independent tables first
> and only then build dependent tables passing to it pointers
> to previously build table if necessary.

Thanks for the suggestion, I will build the tables independently as far as possible.

>
>> + for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++)
>> + /* Initialize read ack register */
>> + build_append_int_noprefix((void *)hardware_error, 1, 8);
>> +
>> + /* Reserved the total size for ERRORS fw_cfg blob
>> + */
>> + acpi_data_push(hardware_error, sizeof(struct hardware_errors_buffer));
>> +
>> + /* Allocate guest memory for the Data fw_cfg blob */
>> + bios_linker_loader_alloc(linker, GHES_ERRORS_FW_CFG_FILE, hardware_error,
>> + 1, false);
>> + /* Reserve table header size */
>> + acpi_data_push(table_data, sizeof(AcpiTableHeader));
>> +
>> + build_append_int_noprefix(table_data, GHES_ACPI_HEST_NOTIFY_RESERVED, 4);
> GHES_ACPI_HEST_NOTIFY_RESERVED - name doesn't actually tell what it is
> I'd suggest to use spec field name wit table prefix, ex:
> ACPI_HEST_ERROR_SOURCE_COUNT

thanks for the suggestion, I will use your suggested name.

>
> also, beside build_append_int_noprefix() you need to at least
> add comment that exactly matches field from spec.
>
> the same applies to other fields you are adding in this patch

OK, will add it.

Thanks for the suggestion.

>
> also register width 0x40 looks suspicious, where does it come from?
> While at it do you have a real hardware which has HEST table that you re trying to model after?
> I'd like to see HEST and other related tables from it.

Igor, what looks suspicious to you? The register width 0x40 comes from our host BIOS, which records errors to the System Memory space.

For SEA/SEI, QEMU (user space) only handles memory section hardware errors, not processor errors, so it may not
involve other address spaces such as System I/O space.

Yes, we do have hardware. I share our BIOS code that generates the HEST and other related tables here:
https://github.com/hisilicon/uefi.git
We have also submitted the host BIOS code to open source for review; this code generates the HEST and other related tables.
Thanks.

>
>> +
>> + /* Hardware error notification structure */
>> + build_append_int_noprefix(table_data, i, 1); /* type */
>> + /* length */
>> + build_append_int_noprefix(table_data, sizeof(AcpiHestNotify), 1);
>> + build_append_int_noprefix(table_data, 0, 26);
>> +
>> + /* Error Status Block Length */
>> + build_append_int_noprefix(table_data,
>> + cpu_to_le32(GHES_MAX_RAW_DATA_LENGTH), 4);
>> +

cut -----

Igor Mammedov

Sep 1, 2017, 8:00:09 AM

On Fri, 1 Sep 2017 17:58:55 +0800
gengdongjiu <gengd...@huawei.com> wrote:

> Hi Igor,
>
> On 2017/8/29 18:20, Igor Mammedov wrote:
> > On Fri, 18 Aug 2017 22:23:43 +0800
> > Dongjiu Geng <gengd...@huawei.com> wrote:

[...]

> >
> >> +void ghes_build_acpi(GArray *table_data, GArray *hardware_error,
> >> + BIOSLinker *linker)
> >> +{
> >> + uint32_t ghes_start = table_data->len;
> >> + uint32_t address_size, error_status_address_offset;
> >> + uint32_t read_ack_register_offset, i;
> >> +
> >> + address_size = sizeof(struct AcpiGenericAddress) -
> >> + offsetof(struct AcpiGenericAddress, address);
> > it's confusing name for var,
> > AcpiGenericAddress::address is fixed unsigned 64 bit integer per spec
> > also, I'm not sure why it's needed at all.
> it is because other people have concern about where does the "unsigned 64 bit integer"
> come from, they are confused about the "unsigned 64 bit integer"
> so they suggested use sizeof. anyway I will directly use unsigned 64 bit integer.

Maybe properly named macro instead of sizeof(foo) would do the job

[...]

> >> +
> >> + /* Build error status address*/
> >> + build_address(table_data, linker, error_status_address_offset + i *
> >> + sizeof(AcpiGenericHardwareErrorSourceV2), i * address_size,
> >> + AML_SYSTEM_MEMORY, 0x40, 0, 4 /* QWord access */);
> > just do something like this instead of build_address():
> > build_append_gas()
> > bios_linker_loader_add_pointer()
> Thanks for the suggestion.
>
> >
> > also register width 0x40 looks suspicious, where does it come from?
> > While at it do you have a real hardware which has HEST table that you re trying to model after?
> > I'd like to see HEST and other related tables from it.
>
> Igor, what is your suspicious point? The register width 0x40 come from our host BIOS record to the System Memory space.

maybe s/0x40/ERROR_STATUS_BLOCK_POINTER_SIZE/

[...]

Peter Maydell

Sep 5, 2017, 1:50:08 PM

On 18 August 2017 at 15:23, Dongjiu Geng <gengd...@huawei.com> wrote:
> Add SIGBUS signal handler. In this handler, it checks
> the exception type, translates the host VA which is
> delivered by host or KVM to guest PA, then fills this
> PA to CPER, finally injects a Error to guest OS through
> KVM.
>
> Add synchronous external abort injection logic, setup
> spsr_elx, esr_elx, PSTATE, far_elx, elr_elx etc, when
> switch to guest OS, it will jump to the synchronous
> external abort vector table entry.
>
> Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
> Signed-off-by: Quanming Wu <wuqua...@huawei.com>
> ---
> include/sysemu/kvm.h | 2 +-
> linux-headers/asm-arm64/kvm.h | 5 ++
> target/arm/internals.h | 13 ++++
> target/arm/kvm.c | 34 ++++++++++
> target/arm/kvm64.c | 150 ++++++++++++++++++++++++++++++++++++++++++
> target/arm/kvm_arm.h | 1 +
> 6 files changed, 204 insertions(+), 1 deletion(-)

Have you tested whether this patchset builds OK on aarch32 ?

Again, linux-headers changes need to go in their own header sync patch.

This code has all just been copied-and-pasted from target/i386/kvm.c.
Please instead abstract it out properly into a cpu-independent
source file.

What is this ??? You should never need to look up things in
the cpreg arrays by fieldoffset.

The code for handling debug exits (software step, watchpoint, etc)
is probably a good place to look for how to deal with register state.

> +}
> +
> +/* Inject synchronous external abort */
> +static int kvm_inject_arm_sea(CPUState *c)
> +{
> + ARMCPU *cpu = ARM_CPU(c);
> + CPUARMState *env = &cpu->env;
> + unsigned long cpsr = pstate_read(env);
> + uint32_t esr = 0;
> + int ret;
> +
> + c->exception_index = EXCP_DATA_ABORT;
> + /* Inject the exception to El1 */
> + env->exception.target_el = 1;
> + CPUClass *cc = CPU_GET_CLASS(c);
> +
> + esr |= (EC_DATAABORT << ARM_EL_EC_SHIFT);

We have functions in internals.h for constructing ESR values,
please use them.

This looks dubious. exception.syndrome and exception.vaddress
are just internal information QEMU uses, not guest visible things.
And only synchronizing them if the CPU happens to be in AArch64
at the point when this function is called is also odd.

> + if (ret) {
> + return ret;
> + }
> + }
> +
> if (!write_kvmstate_to_list(cpu)) {
> return EINVAL;
> }
> @@ -887,6 +981,62 @@ int kvm_arch_get_registers(CPUState *cs)
> return ret;
> }
>
> +static bool is_abort_sea(unsigned long syndrome)
> +{
> + unsigned long fault_status;

Don't use "unsigned long" when you really mean uint32_t.
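
A sketch of the same check written with fixed-width types. The macro names are local to this example, and the fault status codes are the synchronous external abort and synchronous parity/ECC encodings as I read them in the ARMv8 ESR_ELx description. Unlike the patch's 0x3C mask, this compares the full 6-bit fault status field, which is an assumption about the intended behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT  26
#define ESR_EC_MASK   (UINT32_C(0x3f) << ESR_EC_SHIFT)
#define ESR_FSC_MASK  UINT32_C(0x3f)

enum {
    EC_INSNABORT = 0x20,  /* instruction abort from a lower EL */
    EC_DATAABORT = 0x24,  /* data abort from a lower EL */
};

/* Return true if the 32-bit syndrome describes a synchronous external abort. */
static bool is_abort_sea(uint32_t syndrome)
{
    uint32_t ec = (syndrome & ESR_EC_MASK) >> ESR_EC_SHIFT;
    uint32_t fsc = syndrome & ESR_FSC_MASK;

    if (ec != EC_INSNABORT && ec != EC_DATAABORT) {
        return false;
    }

    switch (fsc) {
    case 0x10:                                  /* SEA, not on a table walk */
    case 0x14: case 0x15: case 0x16: case 0x17: /* SEA on walk, level 0..3 */
    case 0x18:                                  /* parity/ECC, not on a walk */
    case 0x1c: case 0x1d: case 0x1e: case 0x1f: /* parity/ECC on walk, 0..3 */
        return true;
    default:
        return false;
    }
}
```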

This looks a bit odd. There are cases where we hwpoison the page
but don't tell the guest about it? When would we get this kind
of sigbus when the exception syndrome wasn't the right kind ?

Are we guaranteed not to get this kind of signal if we told
the kernel not to expose RAS to the guest ?

> + return;
> + }
> + fprintf(stderr, "Hardware memory error for memory used by "
> + "QEMU itself instead of guest system!\n");
> + }
> +
> + if (code == BUS_MCEERR_AR) {
> + fprintf(stderr, "Hardware memory error!\n");
> + exit(1);
> + }
> +}
> +
> /* C6.6.29 BRK instruction */
> static const uint32_t brk_insn = 0xd4200000;
>
> diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
> index 633d088..7cdde97 100644
> --- a/target/arm/kvm_arm.h
> +++ b/target/arm/kvm_arm.h
> @@ -288,4 +288,5 @@ static inline const char *its_class_name(void)
> }
> }
>
> +void kvm_hwpoison_page_add(ram_addr_t ram_addr);

Any new globally-visible function prototype in a header should
have a doc-comment formatted documentation comment, please.

> #endif
> --
> 1.8.3.1

thanks
-- PMM

gengdongjiu

Sep 8, 2017, 12:21:14 PM

Hi Peter,
Sorry for the late response.

>
> On 18 August 2017 at 15:23, Dongjiu Geng <gengd...@huawei.com> wrote:
> > Add SIGBUS signal handler. In this handler, it checks the exception
> > type, translates the host VA which is delivered by host or KVM to
> > guest PA, then fills this PA to CPER, finally injects a Error to guest
> > OS through KVM.
> >
> > Add synchronous external abort injection logic, setup spsr_elx,
> > esr_elx, PSTATE, far_elx, elr_elx etc, when switch to guest OS, it
> > will jump to the synchronous external abort vector table entry.
> >
> > Signed-off-by: Dongjiu Geng <gengd...@huawei.com>
> > Signed-off-by: Quanming Wu <wuqua...@huawei.com>
> > ---
> > include/sysemu/kvm.h | 2 +-
> > linux-headers/asm-arm64/kvm.h | 5 ++
> > target/arm/internals.h | 13 ++++
> > target/arm/kvm.c | 34 ++++++++++
> > target/arm/kvm64.c | 150 ++++++++++++++++++++++++++++++++++++++++++
> > target/arm/kvm_arm.h | 1 +
> > 6 files changed, 204 insertions(+), 1 deletion(-)
>
> Have you tested whether this patchset builds OK on aarch32 ?

Sorry, I have not tested the build on aarch32, because we only support RAS extension on aarch64 in software.
I will fix the build issue on aarch32.

Ok.

>
> > diff --git a/target/arm/internals.h b/target/arm/internals.h
> > index 1f6efef..fc0ad6d 100644
> > --- a/target/arm/internals.h
> > +++ b/target/arm/internals.h
> > @@ -235,6 +235,19 @@ enum arm_exception_class {
> > #define ARM_EL_ISV_SHIFT 24
> > #define ARM_EL_IL (1 << ARM_EL_IL_SHIFT)
> > #define ARM_EL_ISV (1 << ARM_EL_ISV_SHIFT)

> > +#define ARM_EL_EC_MASK ((0x3F) << ARM_EL_EC_SHIFT)
> > +#define ARM_EL_FSC_TYPE (0x3C)

> > + HWPoisonPage *page, *next_page;
> > +
> > + QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
> > + QLIST_REMOVE(page, list);
> > + qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> > + g_free(page);
> > + }
> > +}
> > +
> > +void kvm_hwpoison_page_add(ram_addr_t ram_addr) {

> > + HWPoisonPage *page;
> > +
> > + QLIST_FOREACH(page, &hwpoison_page_list, list) {
> > + if (page->ram_addr == ram_addr) {
> > + return;
> > + }
> > + }
> > + page = g_new(HWPoisonPage, 1);
> > + page->ram_addr = ram_addr;
> > + QLIST_INSERT_HEAD(&hwpoison_page_list, page, list); }
>

> This code has all just been copied-and-pasted from target/i386/kvm.c.
> Please instead abstract it out properly into a cpu-independent source file.

Yes, it is copied from x86.
Do you mean abstracting this code into a common folder so that the i386 and ARM platforms share it?

>
> > static void kvm_arm_host_cpu_class_init(ObjectClass *oc, void *data)
> > {
> > ARMHostCPUClass *ahcc = ARM_HOST_CPU_CLASS(oc);
> > @@ -182,6 +215,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
> >
> > cap_has_mp_state = kvm_check_extension(s, KVM_CAP_MP_STATE);
> >

[...]

> > +static int kvm_arm_cpreg_value(ARMCPU *cpu, ptrdiff_t fieldoffset) {

> > + int i;
> > +
> > + for (i = 0; i < cpu->cpreg_array_len; i++) {
> > + uint32_t regidx = kvm_to_cpreg_id(cpu->cpreg_indexes[i]);
> > + const ARMCPRegInfo *ri;
> > + ri = get_arm_cp_reginfo(cpu->cp_regs, regidx);
> > + if (!ri) {
> > + continue;
> > + }
> > +
> > + if (ri->type & ARM_CP_NO_RAW) {
> > + continue;
> > + }
> > +
> > + if (ri->fieldoffset == fieldoffset) {
> > + cpu->cpreg_values[i] = read_raw_cp_reg(&cpu->env, ri);
> > + return 0;
> > + }
> > + }
> > + return -EINVAL;
>
> What is this ??? You should never need to look up things in the cpreg arrays by fieldoffset.

I used it to set the values of esr_el1 and far_el1, for example:

/* set ESR_EL1 */
ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.esr_el[1]));

/* set FAR_EL1 */
ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.far_el[1]));

Other people suggested that I inject the synchronous external abort in user space, so I need to set the values of esr_el1 and far_el1.
That is why I use the added API kvm_arm_cpreg_value(). If I should not use it, do you have a better method to set their values?

>
> The code for handling debug exits (software step, watchpoint, etc) is probably a good place to look for how to deal with register state.
>
> > +}
> > +

> > +/* Inject synchronous external abort */
> > +static int kvm_inject_arm_sea(CPUState *c) {

> > + ARMCPU *cpu = ARM_CPU(c);
> > + CPUARMState *env = &cpu->env;
> > + unsigned long cpsr = pstate_read(env);
> > + uint32_t esr = 0;
> > + int ret;
> > +
> > + c->exception_index = EXCP_DATA_ABORT;
> > + /* Inject the exception to El1 */
> > + env->exception.target_el = 1;
> > + CPUClass *cc = CPU_GET_CLASS(c);
> > +
> > + esr |= (EC_DATAABORT << ARM_EL_EC_SHIFT);
>
> We have functions in internals.h for constructing ESR values, please use them.

Ok, thanks for the reminder, I will check it in internals.h

>
> > + /* This exception syndrome includes {I,D}FSC in the bits [5:0]
> > + */
> > + esr |= (env->exception.syndrome & 0x3f);
> > +
> > + /* This exception is EL0 or EL1 fault. */
> > + if ((cpsr & 0xf) == PSTATE_MODE_EL0t) {
> > + esr |= (EC_INSNABORT << ARM_EL_EC_SHIFT);
> > + } else {
> > + esr |= (EC_INSNABORT_SAME_EL << ARM_EL_EC_SHIFT);
> > + }
> > +
> > + /* In the aarch64, there is only 32-bit instruction*/
> > + esr |= ARM_EL_IL;
> > + env->exception.syndrome = esr;
> > + cc->do_interrupt(c);
> > +
> > + /* set ESR_EL1 */
> > + ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState,

> > + cp15.esr_el[1]));

I need to get exception.syndrome (esr_el2) to judge whether the exception is a synchronous or an asynchronous abort,
and exception.vaddress (far_el2) to inject the fault address for the synchronous external abort.
Currently these two EL2 values are not exposed to user space, so I need to call an ioctl to get them.
Meanwhile, we only support RAS in AArch64, not in AArch32, so the two values are only fetched in AArch64.

>
> > + if (ret) {
> > + return ret;
> > + }
> > + }
> > +
> > if (!write_kvmstate_to_list(cpu)) {
> > return EINVAL;
> > }
> > @@ -887,6 +981,62 @@ int kvm_arch_get_registers(CPUState *cs)
> > return ret;
> > }
> >
> > +static bool is_abort_sea(unsigned long syndrome) {

> > + unsigned long fault_status;
>
> Don't use "unsigned long" when you really mean uint32_t.

Ok, thanks, I will change it.

>
> > + uint8_t ec = ((syndrome & ARM_EL_EC_MASK) >> ARM_EL_EC_SHIFT);
> > + if ((ec != EC_INSNABORT) && (ec != EC_DATAABORT)) {
> > + return false;
> > + }
> > +
> > + fault_status = syndrome & ARM_EL_FSC_TYPE;
> > + switch (fault_status) {
> > + case FSC_SEA:
> > + case FSC_SEA_TTW0:
> > + case FSC_SEA_TTW1:
> > + case FSC_SEA_TTW2:
> > + case FSC_SEA_TTW3:
> > + case FSC_SECC:
> > + case FSC_SECC_TTW0:
> > + case FSC_SECC_TTW1:
> > + case FSC_SECC_TTW2:
> > + case FSC_SECC_TTW3:
> > + return true;
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > +void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr) {

> > + ram_addr_t ram_addr;
> > + hwaddr paddr;
> > +
> > + ARMCPU *cpu = ARM_CPU(c);
> > + CPUARMState *env = &cpu->env;
> > + assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
> > + if (addr) {
> > + ram_addr = qemu_ram_addr_from_host(addr);
> > + if (ram_addr != RAM_ADDR_INVALID &&
> > + kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
> > + kvm_cpu_synchronize_state(c);
> > + kvm_hwpoison_page_add(ram_addr);
> > + if (is_abort_sea(env->exception.syndrome)) {
> > + ghes_update_guest(ACPI_HEST_NOTIFY_SEA, paddr);
> > + kvm_inject_arm_sea(c);
> > + }
>
> This looks a bit odd. There are cases where we hwpoison the page but don't tell the guest about it? When would we get this kind of sigbus
> when the exception syndrome wasn't the right kind ?

Let me explain it in detail.

1. When an abort happens in the guest, it first traps to host EL3 firmware, then jumps to the hypervisor exception entry, and then exits from the guest.
On guest exit, KVM records the exception syndrome in the VCPU structure.
2. KVM calls the host ACPI driver; the host APEI driver gets the error address from the APEI table and calls the host memory_failure(), which marks this page
as poisoned and then sends the SIGBUS.
3. QEMU receives this SIGBUS and calls kvm_cpu_synchronize_state(), which fetches the exception syndrome from KVM; this step makes sure the exception syndrome
is the right kind.
4. QEMU calls ghes_update_guest() to record the CPER in the APEI table for the guest, and notifies the guest of the error through kvm_inject_arm_sea().
5. When the guest receives the error notification, it parses the guest APEI table, which contains the CPER record.
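
The user-space part of this flow (steps 3 and 4), as written in the quoted kvm_arch_on_sigbus_vcpu(), reduces to the decision sketched below. The enum and function names are hypothetical and the real handler performs the actions rather than returning a code; the sketch only makes the branches explicit, including the case asked about above where the page is poisoned but the guest is not notified.

```c
#include <stdbool.h>

/* Stand-ins for the BUS_MCEERR_AR / BUS_MCEERR_AO signal codes. */
typedef enum {
    MCE_ACTION_REQUIRED,
    MCE_ACTION_OPTIONAL,
} MceCode;

typedef enum {
    ACT_IGNORE,            /* nothing further is done */
    ACT_POISON_ONLY,       /* page recorded as poisoned, guest not notified */
    ACT_POISON_AND_INJECT, /* page recorded, CPER written, SEA injected */
    ACT_FATAL,             /* QEMU reports the error and exits */
} SigbusAction;

/* Mirror of the branch structure in the quoted SIGBUS handler. */
static SigbusAction sigbus_action(MceCode code, bool have_addr,
                                  bool addr_is_guest_ram,
                                  bool syndrome_is_sea)
{
    if (have_addr && addr_is_guest_ram) {
        return syndrome_is_sea ? ACT_POISON_AND_INJECT : ACT_POISON_ONLY;
    }
    /* The error hit memory used by QEMU itself, or no address was given. */
    return code == MCE_ACTION_REQUIRED ? ACT_FATAL : ACT_IGNORE;
}
```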

>
> Are we guaranteed not to get this kind of signal if we told the kernel not to expose RAS to the guest ?

We are still discussing whether user space needs to check the RAS extension when it receives the SIGBUS.

>
> > + return;
> > + }
> > + fprintf(stderr, "Hardware memory error for memory used by "
> > + "QEMU itself instead of guest system!\n");
> > + }
> > +
> > + if (code == BUS_MCEERR_AR) {
> > + fprintf(stderr, "Hardware memory error!\n");
> > + exit(1);
> > + }
> > +}
> > +
> > /* C6.6.29 BRK instruction */
> > static const uint32_t brk_insn = 0xd4200000;
> >
> > diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
> > index 633d088..7cdde97 100644
> > --- a/target/arm/kvm_arm.h
> > +++ b/target/arm/kvm_arm.h
> > @@ -288,4 +288,5 @@ static inline const char *its_class_name(void)
> > }
> > }
> >
> > +void kvm_hwpoison_page_add(ram_addr_t ram_addr);
>
> Any new globally-visible function prototype in a header should have a doc-comment formatted documentation comment, please.

Ok, thanks for this reminder. Do you mean I need to add comments for this globally-visible function, such as below:

/*
* xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
*/
void kvm_hwpoison_page_add(ram_addr_t ram_addr);

Peter Maydell

Sep 8, 2017, 12:30:10 PM

On 8 September 2017 at 17:17, gengdongjiu <gengd...@huawei.com> wrote:
>>
>> This code has all just been copied-and-pasted from target/i386/kvm.c.
>> Please instead abstract it out properly into a cpu-independent source file.
>
>
> Yes, it copied from x86.
> Do you mean abstracting this code to a common folder so that i386 and arm platform share it?

I mean it should go into a common source file (perhaps
accel/kvm/kvm-all.c).

>> What is this ??? You should never need to look up things in the cpreg arrays by fieldoffset.
>
>
> I used it to set the esr_el1's and far_el1's value, for example:
>
> /* set ESR_EL1 */
> ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.esr_el[1]));
>
> /* set FAR_EL1 */
> ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.far_el[1]));

Yes, I saw that, but I have no idea why you think that's
the right way to set register values. No other code in
QEMU does this.

> So use the added API kvm_arm_cpreg_value() to set their value.
> If not used it, do you better method to set their value?

I suggest:

>> The code for handling debug exits (software step, watchpoint,
>> etc) is probably a good place to look for how to deal with register state.

>> Any new globally-visible function prototype in a header should
>> have a doc-comment formatted documentation comment, please.
>
>
> Ok, thanks for this reminder. Do you mean I need to add comments
> for this globally-visible function, such as below:
>
> /*
> * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> */
> void kvm_hwpoison_page_add(ram_addr_t ram_addr);

It should be in the doc-comment format, which begins
"/**" and has some stylization of how you list parameters
and so on. Lots of examples in the existing headers.
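
As an illustration of that format, a doc-comment for this function might look like the sketch below. The wording of the description is an assumption about the function's intent, `ram_addr_t` is replaced by a stand-in typedef, and the list implementation and `hwpoison_page_count()` are hypothetical additions so the example compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t ram_addr_t; /* stand-in for QEMU's ram_addr_t */

typedef struct HWPoisonPage {
    ram_addr_t ram_addr;
    struct HWPoisonPage *next;
} HWPoisonPage;

static HWPoisonPage *hwpoison_pages;

/**
 * kvm_hwpoison_page_add:
 * @ram_addr: offset of the poisoned page within guest RAM
 *
 * Record @ram_addr in the list of hardware-poisoned pages so that the
 * page can be remapped later. Adding an address that is already on the
 * list has no effect.
 */
void kvm_hwpoison_page_add(ram_addr_t ram_addr)
{
    HWPoisonPage *page;

    for (page = hwpoison_pages; page; page = page->next) {
        if (page->ram_addr == ram_addr) {
            return;
        }
    }
    page = malloc(sizeof(*page));
    if (!page) {
        return;
    }
    page->ram_addr = ram_addr;
    page->next = hwpoison_pages;
    hwpoison_pages = page;
}

/**
 * hwpoison_page_count:
 *
 * Returns: the number of pages currently recorded as poisoned.
 */
size_t hwpoison_page_count(void)
{
    size_t n = 0;

    for (HWPoisonPage *page = hwpoison_pages; page; page = page->next) {
        n++;
    }
    return n;
}
```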

thanks
-- PMM

gengdongjiu

Sep 8, 2017, 12:50:10 PM

[...]

> >
> > /*
> > * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > */
> > void kvm_hwpoison_page_add(ram_addr_t ram_addr);
>
> It should be in the doc-comment format, which begins "/**" and has some stylization of how you list parameters and so on. Lots of
> examples in the existing headers.

Understood, thanks for the explanation.

>
> thanks
> -- PMM