ioatdma(Intel(R) I/OAT DMA Engine init failed)

Gavin Guo

May 16, 2016, 6:08:30 AM

to dmae...@vger.kernel.org, linux-kernel, vinod...@intel.com, dan.j.w...@intel.com, dave....@intel.com

The following error messages can be observed on the Intel Haswell-E
chipset with the v3.13 kernel. After analysis, I found that the logic
behind these error messages is unchanged in the current upstream
kernel. I also searched the git log and couldn't find any commit that
fixes the error (correct me if I am wrong). The details follow, and I
would really appreciate any comments. :)

ioatdma 0000:00:04.0: channel error register unreachable
ioatdma 0000:00:04.0: channel enumeration error
ioatdma 0000:00:04.0: Intel(R) I/OAT DMA Engine init failed
ioatdma 0000:00:04.1: channel error register unreachable
ioatdma 0000:00:04.1: channel enumeration error
ioatdma 0000:00:04.1: Intel(R) I/OAT DMA Engine init failed
..
ioatdma 0000:00:04.7: channel error register unreachable
ioatdma 0000:00:04.7: channel enumeration error
ioatdma 0000:00:04.7: Intel(R) I/OAT DMA Engine init failed
mei_me 0000:00:16.0: initialization failed.

There are 8 I/OAT DMA controllers on the Haswell-E chipset:
8086:2f20 ~ 8086:2f27
80:04.0 System peripheral: Intel Corporation Haswell-E DMA Channel 0 (rev 02)
80:04.1 System peripheral: Intel Corporation Haswell-E DMA Channel 1 (rev 02)
80:04.2 System peripheral: Intel Corporation Haswell-E DMA Channel 2 (rev 02)
80:04.3 System peripheral: Intel Corporation Haswell-E DMA Channel 3 (rev 02)
80:04.4 System peripheral: Intel Corporation Haswell-E DMA Channel 4 (rev 02)
80:04.5 System peripheral: Intel Corporation Haswell-E DMA Channel 5 (rev 02)
80:04.6 System peripheral: Intel Corporation Haswell-E DMA Channel 6 (rev 02)
80:04.7 System peripheral: Intel Corporation Haswell-E DMA Channel 7 (rev 02)

Analysis:
The bug happens when the driver resets the DMA controller. This is
the sequence: ioat_pci_probe is called when the DMA controller is
detected on the PCI bus. The call chain is then
ioat3_dma_probe -> ioat_probe -> ioat2_enumerate_channels ->
ioat3_reset_hw. The following code can be found in ioat3_reset_hw:

drivers/dma/ioat/dma_v3.c:
    chanerr = readl(chan->reg_base + IOAT_CHANERR_OFFSET);
    writel(chanerr, chan->reg_base + IOAT_CHANERR_OFFSET);
    ..
    err = pci_read_config_dword(pdev,
                                IOAT_PCI_CHANERR_INT_OFFSET, &chanerr);
    if (err) {
        dev_err(&pdev->dev,
                "channel error register unreachable\n");
        return err;
    }

Obviously, something goes wrong in the channel error register reset
process. The error then propagates all the way back to ioat_probe().
Because of the error, dma->chancnt is set to 0:

drivers/dma/ioat/dma.c:
    if (!dma->chancnt) {
        dev_err(dev, "channel enumeration error\n");
        goto err_setup_interrupts;
    }

Finally, back in ioat_pci_probe:

drivers/dma/ioat/pci.c:
        err = ioat3_dma_probe(device, ioat_dca_enabled);
    else
        return -ENODEV;

    if (err) {
        dev_err(dev, "Intel(R) I/OAT DMA Engine init failed\n");
        return -ENODEV;
    }
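
The failure path traced above can be sketched as a small userspace model. This is only an illustration of how one failed config read could produce all three log lines; it is not kernel code. Every name prefixed with model_ is hypothetical, the struct is a stand-in for the driver's real structures, and the error codes are chosen for the model rather than copied from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver state; not the kernel's types. */
struct model_dev {
    int cfg_read_fails; /* 1 = simulate pci_read_config_dword() failing */
    int chancnt;        /* plays the role of dma->chancnt */
};

/* Models ioat3_reset_hw(): a failed config read aborts the reset. */
static int model_reset_hw(const struct model_dev *d)
{
    assert(d != NULL);
    if (d->cfg_read_fails) {
        puts("channel error register unreachable");
        return -EIO; /* error code chosen for the model only */
    }
    return 0;
}

/* Models ioat2_enumerate_channels(): a failed reset yields 0 channels. */
static int model_enumerate_channels(struct model_dev *d)
{
    d->chancnt = (model_reset_hw(d) != 0) ? 0 : 1;
    return d->chancnt;
}

/* Models ioat_probe(): zero channels is reported as an enumeration error. */
static int model_probe(struct model_dev *d)
{
    model_enumerate_channels(d);
    if (!d->chancnt) {
        puts("channel enumeration error");
        return -ENODEV; /* error code chosen for the model only */
    }
    return 0;
}

/* Models ioat_pci_probe(): any probe error becomes -ENODEV. */
static int model_pci_probe(struct model_dev *d)
{
    if (model_probe(d) != 0) {
        puts("Intel(R) I/OAT DMA Engine init failed");
        return -ENODEV;
    }
    return 0;
}
```

Calling model_pci_probe() on a struct with cfg_read_fails set to 1 prints the three messages in the same order as the dmesg excerpt and leaves chancnt at 0.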

Vinod Koul

May 17, 2016, 6:00:52 AM

to Gavin Guo, dmae...@vger.kernel.org, linux-kernel, dan.j.w...@intel.com, dave....@intel.com

On Mon, May 16, 2016 at 06:08:20PM +0800, Gavin Guo wrote:
> The following error messages can be observed on the Intel Haswell-E
> chipset with v3.13 kernel. After the analysis, I found there is no
> difference in the logic of these error messages in the current
> upstream kernel. I also searched the git log and can't find any commit
> which is fix to the error(correct me if I am wrong). The following is
> the detail, and I'll really appreciate if there is any comment. :)

3.13 is ancient; can you check this on the latest kernel?

>
> ioatdma 0000:00:04.0: channel error register unreachable
> ioatdma 0000:00:04.0: channel enumeration error
> ioatdma 0000:00:04.0: Intel(R) I/OAT DMA Engine init failed
> ioatdma 0000:00:04.1: channel error register unreachable
> ioatdma 0000:00:04.1: channel enumeration error
> ioatdma 0000:00:04.1: Intel(R) I/OAT DMA Engine init failed

> ...

> ioatdma 0000:00:04.7: channel error register unreachable
> ioatdma 0000:00:04.7: channel enumeration error
> ioatdma 0000:00:04.7: Intel(R) I/OAT DMA Engine init failed
> mei_me 0000:00:16.0: initialization failed.
>
> There are 8 I/OAT DMA controllers on the Haswell-E chipset:
> 8086:2f20 ~ 8086:2f27
> 80:04.0 System peripheral: Intel Corporation Haswell-E DMA Channel 0 (rev 02)
> 80:04.1 System peripheral: Intel Corporation Haswell-E DMA Channel 1 (rev 02)
> 80:04.2 System peripheral: Intel Corporation Haswell-E DMA Channel 2 (rev 02)
> 80:04.3 System peripheral: Intel Corporation Haswell-E DMA Channel 3 (rev 02)
> 80:04.4 System peripheral: Intel Corporation Haswell-E DMA Channel 4 (rev 02)
> 80:04.5 System peripheral: Intel Corporation Haswell-E DMA Channel 5 (rev 02)
> 80:04.6 System peripheral: Intel Corporation Haswell-E DMA Channel 6 (rev 02)
> 80:04.7 System peripheral: Intel Corporation Haswell-E DMA Channel 7 (rev 02)
>
> Analysis:
> The bug happens when the driver is resetting DMA controller, this is
> the sequence: The function, ioat_pci_probe, is called when the DMA
> controller is detected by the PCI bus. Then,
> ioat3_dma_probe -> ioat_probe -> ioat2_enumerate_channels ->
> ioat3_reset_hw. The following code can be found in the ioat3_reset_hw:
>
> drivers/dma/ioat/dma_v3.c:
> chanerr = readl(chan->reg_base + IOAT_CHANERR_OFFSET);
> writel(chanerr, chan->reg_base + IOAT_CHANERR_OFFSET);

> ...

> err = pci_read_config_dword(pdev,
> IOAT_PCI_CHANERR_INT_OFFSET, &chanerr);
> if (err) {
> dev_err(&pdev->dev,
> "channel error register unreachable\n");
> return err;
> }
>
> Obviously, there are something wrong in the channel error register
> reset process. Then all the way back to ioat_probe(). Because the
> error happens, the dma->chancnt will be set to 0:
>
> drivers/dma/ioat/dma.c:
> if (!dma->chancnt) {
> dev_err(dev, "channel enumeration error\n");
> goto err_setup_interrupts;
> }
>
> Finally back to ioat_pci_probe:
>
> drivers/dma/ioat/pci.c:
> err = ioat3_dma_probe(device, ioat_dca_enabled);
> else
> return -ENODEV;
>
> if (err) {
> dev_err(dev, "Intel(R) I/OAT DMA Engine init
> failed\n");
> return -ENODEV;

--
~Vinod

Gavin Guo

May 18, 2016, 9:27:19 AM

to Vinod Koul, dmae...@vger.kernel.org, linux-kernel, dan.j.w...@intel.com, dave....@intel.com

On Tue, May 17, 2016 at 6:06 PM, Vinod Koul <vinod...@intel.com> wrote:
> On Mon, May 16, 2016 at 06:08:20PM +0800, Gavin Guo wrote:
>> The following error messages can be observed on the Intel Haswell-E
>> chipset with v3.13 kernel. After the analysis, I found there is no
>> difference in the logic of these error messages in the current
>> upstream kernel. I also searched the git log and can't find any commit
>> which is fix to the error(correct me if I am wrong). The following is
>> the detail, and I'll really appreciate if there is any comment. :)
>
> 3.13 is ancient, can you check this on latest kernel

Thank you for the comment. It's running on the production system. However,
I'll try to figure out if it's possible to test the latest kernel.

Jiang, Dave

May 18, 2016, 12:49:19 PM

to Koul, Vinod, gavi...@canonical.com, dmae...@vger.kernel.org, linux-...@vger.kernel.org, Williams, Dan J

On Wed, 2016-05-18 at 13:27 +0000, Gavin Guo wrote:
> On Tue, May 17, 2016 at 6:06 PM, Vinod Koul <vinod...@intel.com>
> wrote:
> >
> > On Mon, May 16, 2016 at 06:08:20PM +0800, Gavin Guo wrote:
> > >
> > > The following error messages can be observed on the Intel
> > > Haswell-E
> > > chipset with v3.13 kernel. After the analysis, I found there is
> > > no
> > > difference in the logic of these error messages in the current
> > > upstream kernel. I also searched the git log and can't find any
> > > commit
> > > which is fix to the error(correct me if I am wrong). The
> > > following is
> > > the detail, and I'll really appreciate if there is any comment.
> > > :)
> > 3.13 is ancient, can you check this on latest kernel
> Thank you for the comment. It's running on the production system.
> However,
> I'll try to figure out if it's possible to test the latest kernel.

I wonder if you don't have the extended PCI config space access enabled
in your kernel config.
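
One rough way to check this from userspace, sketched here under stated assumptions: on Linux the sysfs "config" file of a PCI device is 256 bytes long when only conventional config space is accessible and 4096 bytes long when extended config space is accessible. The helper names below are hypothetical, and the device path in the usage note is only an example taken from the log above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Returns 1 if the config file at `path` is larger than the 256-byte
 * conventional config space, 0 if it is not, and -1 if stat() fails.
 */
static int has_ext_config(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return st.st_size > 256 ? 1 : 0;
}

/* Prints a one-line summary for the given config file. */
static void print_cfg_status(const char *path)
{
    int r = has_ext_config(path);

    if (r < 0)
        printf("%s: cannot stat\n", path);
    else
        printf("%s: extended config space %s\n", path,
               r ? "visible" : "not visible");
}
```

For example, print_cfg_status("/sys/bus/pci/devices/0000:00:04.0/config") would report on the first device from the error log, assuming that path exists on the system being checked.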

Jiang, Dave

May 19, 2016, 10:49:41 AM

to Gavin Guo, Koul, Vinod, dmae...@vger.kernel.org, linux-...@vger.kernel.org, Williams, Dan J

> -----Original Message-----
> From: Gavin Guo [mailto:gavi...@canonical.com]
> Sent: Wednesday, May 18, 2016 8:19 PM
> To: Jiang, Dave <dave....@intel.com>
> Cc: Koul, Vinod <vinod...@intel.com>; dmae...@vger.kernel.org; linux-...@vger.kernel.org; Williams, Dan J
> <dan.j.w...@intel.com>
> Subject: Re: ioatdma(Intel(R) I/OAT DMA Engine init failed)

>
> On Thu, May 19, 2016 at 12:49 AM, Jiang, Dave <dave....@intel.com> wrote:
> > On Wed, 2016-05-18 at 13:27 +0000, Gavin Guo wrote:
> >> On Tue, May 17, 2016 at 6:06 PM, Vinod Koul <vinod...@intel.com>
> >> wrote:
> >> >
> >> > On Mon, May 16, 2016 at 06:08:20PM +0800, Gavin Guo wrote:
> >> > >
> >> > > The following error messages can be observed on the Intel
> >> > > Haswell-E
> >> > > chipset with v3.13 kernel. After the analysis, I found there is
> >> > > no
> >> > > difference in the logic of these error messages in the current
> >> > > upstream kernel. I also searched the git log and can't find any
> >> > > commit
> >> > > which is fix to the error(correct me if I am wrong). The
> >> > > following is
> >> > > the detail, and I'll really appreciate if there is any comment.
> >> > > :)
> >> > 3.13 is ancient, can you check this on latest kernel
> >> Thank you for the comment. It's running on the production system.
> >> However,
> >> I'll try to figure out if it's possible to test the latest kernel.
> >
> > I wonder if you don't have the extended PCI config space access enabled
> > in your kernel config.
>

> Thanks a lot for your advice. :)
>
> I searched the internet about the extended PCI config space and found
> the link:
>
> [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
> http://lwn.net/Articles/263288/

Can you try calling pci_enable_ext_config() in the PCI probe for your kernel? I just haven't seen this issue in the latest kernel.

>
> And I checked the config and found the CONFIG_PCI_MMCONFIG=y. The
> following string also can be observed in the dmesg:
>
> [ 1.419853] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
> [ 1.419855] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
>
> It seems the extended PCI config space is enabled. Is there
> anything I missed?

Vinod Koul

May 19, 2016, 1:16:42 PM

to Jiang, Dave, Gavin Guo, dmae...@vger.kernel.org, linux-...@vger.kernel.org, Williams, Dan J

Do we need that to be called explicitly by the driver? Shouldn't it be
enabled by default?

--
~Vinod

Jiang, Dave

May 19, 2016, 4:18:08 PM

to Koul, Vinod, Gavin Guo, dmae...@vger.kernel.org, linux-...@vger.kernel.org, Williams, Dan J

It used to be, but the patch seems to indicate it's an opt-in thing. What I don't get is that I have never encountered this issue.

Yinghai Lu

May 19, 2016, 6:18:21 PM

to Jiang, Dave, Koul, Vinod, Gavin Guo, dmae...@vger.kernel.org, linux-...@vger.kernel.org, Williams, Dan J

On Thu, May 19, 2016 at 1:17 PM, Jiang, Dave <dave....@intel.com> wrote:
>> > > And I checked the config and found the CONFIG_PCI_MMCONFIG=y. The
>> > > following string also can be observed in the dmesg:
>> > >
>> > > [ 1.419853] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
>> > > [ 1.419855] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
>> > >
>> > > It seems the extended PCI config space is enabled. Is there
>> > > anything I missed?

How about the output of "lspci -vvxxxx"?
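
As a narrower complement to the full dump, the single dword the driver reads could be fetched from the sysfs "config" file directly. The sketch below makes several assumptions: read_cfg_dword is a hypothetical helper, the register offset has to be taken from the IOAT_PCI_CHANERR_INT_OFFSET definition in your own tree (0x180 is my reading of the driver's registers.h and should be verified), and reading past the first 64 bytes of config space through sysfs typically requires root.

```c
#define _XOPEN_SOURCE 700 /* for pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Reads one little-endian 32-bit dword at byte offset `off` from a PCI
 * config file. Returns 0 on success, -1 on any failure, including a
 * short read when the file does not extend to that offset.
 */
static int read_cfg_dword(const char *path, off_t off, uint32_t *val)
{
    unsigned char b[4];
    ssize_t n;
    int fd;

    assert(path != NULL && val != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, b, sizeof(b), off);
    close(fd);
    if (n != (ssize_t)sizeof(b))
        return -1;

    /* PCI config space is little-endian; assemble the bytes explicitly. */
    *val = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    return 0;
}
```

A return of -1 for an offset above 0xff on a device whose config file is only 256 bytes long would be consistent with extended config space not being reachable for that device.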

Gavin Guo

May 25, 2016, 2:57:50 AM

to Yinghai Lu, Jiang, Dave, Koul, Vinod, dmae...@vger.kernel.org, linux-...@vger.kernel.org, Williams, Dan J

Sorry, my client is slow to respond. I'll follow up when I
receive any information.