
NVMe performance 4x slower than expected


Tobias Oberstein

Apr 1, 2015, 6:16:46 AM
to freebsd...@freebsd.org, jimh...@freebsd.org, Michael Fuckner, k...@freebsd.org
Hi,

I am testing performance of a NVMe device (Intel P3700) using FIO at the
block device level and get 4x slower performance than expected:

4kB Random Read

           Intel Datasheet   FIO Measurement   Match
P3700      450,000           107,092           24%
DC S3700   75,000            67,186            90%

The second line shows results for an Intel DC S3700 for comparison (with
that device I do see the expected performance, but not with the P3700).
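For reference, the workload is a plain 4 kB random-read job run directly
against the raw device node. A minimal job file along these lines (just a
sketch - the actual control file I used is in the repo linked below):

[global]
ioengine=sync
thread
bs=4k
runtime=180
time_based
numjobs=64
group_reporting

[randread-4k]
rw=randread
filename=/dev/nvd7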

Hardware:

- 4 sockets, 48 core x86-64, 3TB RAM
- 8 x Intel P3700 2TB
- 12 x Intel DC S3700 800GB (via LSI HBAs)

Software:

FreeBSD 11-CURRENT with patches (DMAR and ZFS patches; otherwise the box
doesn't boot at all .. because of the 3TB RAM and the number of peripherals).

Complete info and test logs are here:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/perftests.md

Right now I am running Linux on the box (openSUSE 13.2). Using the exact
same FIO control file, the values for the DC S3700 are very close to
FreeBSD, but the values for the P3700 are much higher:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/perftests.md#more-numbers-linux

I am looking for tuning hints or general advice for FreeBSD and NVMe.

I would like to go with FreeBSD (a major aspect is ZFS), but the
performance issues with NVMe might be a deal breaker.

Cheers,
/Tobias

Alan Somers

Apr 1, 2015, 10:43:19 AM
to Tobias Oberstein, freebsd...@freebsd.org, Michael Fuckner, k...@freebsd.org, jimh...@freebsd.org
On Wed, Apr 1, 2015 at 4:16 AM, Tobias Oberstein
<tobias.o...@gmail.com> wrote:
> Hi,
>
> I am testing performance of a NVMe device (Intel P3700) using FIO at the
> block device level and get 4x slower performance than expected:
>
> 4kB Random Read
>
> Intel Datasheet FIO Measurement Match
> P3700 450,000 107,092 24%
> DC S3700 75,000 67,186 90%
>
> The 2nd line are results for an Intel DC S3700 for comparison (with this
> device, I do see the performance expected, but not for the P3700).
>
> Hardware:
>
> - 4 sockets, 48 core x86-64, 3TB RAM
> - 8 x Intel P3700 2TB
> - 12 x Intel DC S3700 800GB (via LSI HBAs)
>
> Software:
>
> FreeBSD 11 Current with patches (DMAR and ZFS patches, otherwise the box
> doesn't boot at all .. because of 3TB RAM and the amount of periphery).


Do you still have WITNESS and INVARIANTS turned on in your kernel
config? They're turned on by default for Current, but they do have
some performance impact. To turn them off, just build a
GENERIC-NODEBUG kernel.
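For example, roughly (assuming the -CURRENT sources are already in /usr/src):

cd /usr/src
make -j48 buildkernel KERNCONF=GENERIC-NODEBUG
make installkernel KERNCONF=GENERIC-NODEBUG
shutdown -r now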

Jim Harris

Apr 1, 2015, 4:05:12 PM
to Alan Somers, freebsd...@freebsd.org, Tobias Oberstein, Michael Fuckner, Konstantin Belousov
On Wed, Apr 1, 2015 at 7:42 AM, Alan Somers <aso...@freebsd.org> wrote:

> On Wed, Apr 1, 2015 at 4:16 AM, Tobias Oberstein
> <tobias.o...@gmail.com> wrote:
> > Hi,
> >
> > I am testing performance of a NVMe device (Intel P3700) using FIO at the
> > block device level and get 4x slower performance than expected:
> >
> > 4kB Random Read
> >
> > Intel Datasheet FIO Measurement Match
> > P3700 450,000 107,092 24%
> > DC S3700 75,000 67,186 90%
> >
> > The 2nd line are results for an Intel DC S3700 for comparison (with this
> > device, I do see the performance expected, but not for the P3700).
> >
> > Hardware:
> >
> > - 4 sockets, 48 core x86-64, 3TB RAM
> > - 8 x Intel P3700 2TB
> > - 12 x Intel DC S3700 800GB (via LSI HBAs)
> >
> > Software:
> >
> > FreeBSD 11 Current with patches (DMAR and ZFS patches, otherwise the box
> > doesn't boot at all .. because of 3TB RAM and the amount of periphery).
>
>
> Do you still have WITNESS and INVARIANTS turned on in your kernel
> config? They're turned on by default for Current, but they do have
> some performance impact. To turn them off, just build a
> GENERIC-NODEBUG kernel .
>
>
Could you also post full dmesg output as well as vmstat -i?

Tobias Oberstein

Apr 1, 2015, 4:52:31 PM
to Jim Harris, Alan Somers, freebsd...@freebsd.org, Michael Fuckner, Konstantin Belousov
> > FreeBSD 11 Current with patches (DMAR and ZFS patches, otherwise the box
> > doesn't boot at all .. because of 3TB RAM and the amount of periphery).
>
> Do you still have WITNESS and INVARIANTS turned on in your kernel
> config? They're turned on by default for Current, but they do have
> some performance impact. To turn them off, just build a
> GENERIC-NODEBUG kernel .

WITNESS is off, INVARIANTS is still on.

Here is complete config:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_kernel_conf.md

This is the aggregated patch (work was done by Konstantin - thanks again
btw!)

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_patch.md

> Could you also post full dmesg output as well as vmstat -i?

dmesg:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_dmesg.md

vmstat:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_vmstat.md

===

Here are results from FIO under FreeBSD:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd.md

Here are results using _same_ FIO control file under Linux:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/linux.md

===

The firmware for the P3700 cards was updated to the very latest as of
today (using isdct under Linux).

Konstantin Belousov

Apr 1, 2015, 5:23:25 PM
to Tobias Oberstein, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
On Wed, Apr 01, 2015 at 10:52:18PM +0200, Tobias Oberstein wrote:
> > > FreeBSD 11 Current with patches (DMAR and ZFS patches, otherwise the box
> > > doesn't boot at all .. because of 3TB RAM and the amount of periphery).
> >
> > Do you still have WITNESS and INVARIANTS turned on in your kernel
> > config? They're turned on by default for Current, but they do have
> > some performance impact. To turn them off, just build a
> > GENERIC-NODEBUG kernel .
>
> WITNESS is off, INVARIANTS is still on.
INVARIANTS are costly.
Is this vmstat from after the test?
Somewhat funny is that nvme does not use MSI(X).

I have had the following patch for a long time; it allowed an increase in
pps in iperf and similar tests when DMAR is enabled. In your case it could
reduce the rate of the DMAR interrupts.

diff --git a/sys/x86/iommu/intel_ctx.c b/sys/x86/iommu/intel_ctx.c
index a18adcf..b23a4c1 100644
--- a/sys/x86/iommu/intel_ctx.c
+++ b/sys/x86/iommu/intel_ctx.c
@@ -586,6 +586,18 @@ dmar_ctx_unload_entry(struct dmar_map_entry *entry, bool free)
}
}

+static struct dmar_qi_genseq *
+dmar_ctx_unload_gseq(struct dmar_ctx *ctx, struct dmar_map_entry *entry,
+ struct dmar_qi_genseq *gseq)
+{
+
+ if (TAILQ_NEXT(entry, dmamap_link) != NULL)
+ return (NULL);
+ if (ctx->batch_no++ % dmar_batch_coalesce != 0)
+ return (NULL);
+ return (gseq);
+}
+
void
dmar_ctx_unload(struct dmar_ctx *ctx, struct dmar_map_entries_tailq *entries,
bool cansleep)
@@ -619,8 +631,7 @@ dmar_ctx_unload(struct dmar_ctx *ctx, struct dmar_map_entries_tailq *entries,
entry->gseq.gen = 0;
entry->gseq.seq = 0;
dmar_qi_invalidate_locked(ctx, entry->start, entry->end -
- entry->start, TAILQ_NEXT(entry, dmamap_link) == NULL ?
- &gseq : NULL);
+ entry->start, dmar_ctx_unload_gseq(ctx, entry, &gseq));
}
TAILQ_FOREACH_SAFE(entry, entries, dmamap_link, entry1) {
entry->gseq = gseq;
diff --git a/sys/x86/iommu/intel_dmar.h b/sys/x86/iommu/intel_dmar.h
index 2865ab5..6e0ab7f 100644
--- a/sys/x86/iommu/intel_dmar.h
+++ b/sys/x86/iommu/intel_dmar.h
@@ -93,6 +93,7 @@ struct dmar_ctx {
u_int entries_cnt;
u_long loads;
u_long unloads;
+ u_int batch_no;
struct dmar_gas_entries_tree rb_root;
struct dmar_map_entries_tailq unload_entries; /* Entries to unload */
struct dmar_map_entry *first_place, *last_place;
@@ -339,6 +340,7 @@ extern dmar_haddr_t dmar_high;
extern int haw;
extern int dmar_tbl_pagecnt;
extern int dmar_match_verbose;
+extern int dmar_batch_coalesce;
extern int dmar_check_free;

static inline uint32_t
diff --git a/sys/x86/iommu/intel_drv.c b/sys/x86/iommu/intel_drv.c
index c239579..e7dc3f9 100644
--- a/sys/x86/iommu/intel_drv.c
+++ b/sys/x86/iommu/intel_drv.c
@@ -153,7 +153,7 @@ dmar_count_iter(ACPI_DMAR_HEADER *dmarh, void *arg)
return (1);
}

-static int dmar_enable = 0;
+static int dmar_enable = 1;
static void
dmar_identify(driver_t *driver, device_t parent)
{
diff --git a/sys/x86/iommu/intel_utils.c b/sys/x86/iommu/intel_utils.c
index f696f9d..d3c3267 100644
--- a/sys/x86/iommu/intel_utils.c
+++ b/sys/x86/iommu/intel_utils.c
@@ -624,6 +624,7 @@ dmar_barrier_exit(struct dmar_unit *dmar, u_int barrier_id)
}

int dmar_match_verbose;
+int dmar_batch_coalesce = 100;

static SYSCTL_NODE(_hw, OID_AUTO, dmar, CTLFLAG_RD, NULL, "");
SYSCTL_INT(_hw_dmar, OID_AUTO, tbl_pagecnt, CTLFLAG_RD,
@@ -632,6 +633,9 @@ SYSCTL_INT(_hw_dmar, OID_AUTO, tbl_pagecnt, CTLFLAG_RD,
SYSCTL_INT(_hw_dmar, OID_AUTO, match_verbose, CTLFLAG_RWTUN,
&dmar_match_verbose, 0,
"Verbose matching of the PCI devices to DMAR paths");
+SYSCTL_INT(_hw_dmar, OID_AUTO, batch_coalesce, CTLFLAG_RW | CTLFLAG_TUN,
+ &dmar_batch_coalesce, 0,
+ "Number of qi batches between interrupt");
#ifdef INVARIANTS
int dmar_check_free;
SYSCTL_INT(_hw_dmar, OID_AUTO, check_free, CTLFLAG_RWTUN,
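With the patch applied, the coalescing factor shows up as a regular
sysctl/tunable and can be adjusted at runtime, e.g. (values here are only
examples):

sysctl hw.dmar.batch_coalesce          # defaults to 100 in the patch
sysctl hw.dmar.batch_coalesce=200      # coalesce more invalidations per interrupt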

Jim Harris

Apr 1, 2015, 5:35:08 PM
to Konstantin Belousov, freebsd...@freebsd.org, Tobias Oberstein, Michael Fuckner, Alan Somers
On Wed, Apr 1, 2015 at 2:23 PM, Konstantin Belousov <kost...@gmail.com>
wrote:
Yes - this is exactly the problem.

nvme does use MSI-X if it can allocate the vectors (one per core). With
48 cores, I suspect we are quickly running out of vectors, so NVMe is
reverting to INTx.

Could you actually send vmstat -ia (I left off the 'a' previously) - just
so we can see all allocated interrupt vectors.

As an experiment, can you try disabling hyperthreading - this will reduce
the number of cores and should let you get MSI-X vectors allocated for at
least the first couple of NVMe controllers. Then please re-run your
performance test on one of those controllers.

sys/x86/x86/local_apic.c defines APIC_NUM_IOINTS to 191 - it looks like
this is the actual limit for MSI-X vectors, even though NUM_MSI_INTS is
set to 512.

Tobias Oberstein

Apr 1, 2015, 6:04:31 PM
to Jim Harris, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Alan Somers
> Is this vmstat after the test ?

No, it wasn't (I ran vmstat hours after the test).

Here it is right after the test (shortened test duration, otherwise exactly
the same FIO config):

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_vmstat.md#nvd7

> Somewhat funny is that nvme does not use MSI(X).
>
>
> Yes - this is exactly the problem.
>
> nvme does use MSI-X if it can allocate the vectors (one per core). With
> 48 cores,
> I suspect we are quickly running out of vectors, so NVMe is reverting to
> INTx.
>
> Could you actually send vmstat -ia (I left off the 'a' previously) -
> just so we can
> see all allocated interrupt vectors.
>
> As an experiment, can you try disabling hyperthreading - this will
> reduce the

The CPUs in this box

root@s4l-zfs:~/src/sys/amd64/conf # sysctl hw.model
hw.model: Intel(R) Xeon(R) CPU E7-8857 v2 @ 3.00GHz

don't have hyperthreading (we deliberately selected CPU model for max.
clock rather than HT)

http://ark.intel.com/products/75254/Intel-Xeon-Processor-E7-8857-v2-30M-Cache-3_00-GHz

> number of cores and should let you get MSI-X vectors allocated for at least
> the first couple of NVMe controllers. Then please re-run your performance
> test on one of those controllers.
>

You mean I should run against nvdN where N is a controller that still
got MSI-X while other controllers did not?

How would I find out which controller N is? I don't know which nvdN is
mounted in a PCIe slot directly assigned to which CPU socket, and I
don't know which ones still got MSI-X and which did not.

I could arrange for disabling all but 1 CPU and retest. Would that help?

===

Right after running against nvd7

irq56: nvme0 6440 0
..
irq106: nvme7 145056 3


Then, immediately thereafter, running against nvd0

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_vmstat.md#nvd0

irq56: nvme0 9233 0
..
irq106: nvme7 145056 3

===

Earlier today, I ran multiple longer tests .. all against nvd7. So if
these are cumulative numbers since the last boot, that would make sense.

Tobias Oberstein

Apr 1, 2015, 6:24:59 PM
to Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
Am 01.04.2015 um 23:23 schrieb Konstantin Belousov:
> On Wed, Apr 01, 2015 at 10:52:18PM +0200, Tobias Oberstein wrote:
>>> > FreeBSD 11 Current with patches (DMAR and ZFS patches, otherwise the box
>>> > doesn't boot at all .. because of 3TB RAM and the amount of periphery).
>>>
>>> Do you still have WITNESS and INVARIANTS turned on in your kernel
>>> config? They're turned on by default for Current, but they do have
>>> some performance impact. To turn them off, just build a
>>> GENERIC-NODEBUG kernel .
>>
>> WITNESS is off, INVARIANTS is still on.
> INVARIANTS are costly.

ah, ok. will rebuild without this option.

> I have the following patch for a long time, it allowed to increase pps
> in iperf and similar tests when DMAR is enabled. In your case it could
> reduce the rate of the DMAR interrupts.

You mean these lines from vmstat?

irq257: dmar0:qi 22312 0
irq259: dmar1:qi 22652 0
irq261: dmar2:qi 261874194 6911
irq263: dmar3:qi 124939 3

So these dmar2 interrupts come from DMAR region 2 which is used by nvd7?

From dmesg:

dmar0: <DMA remap> iomem 0xc7ffc000-0xc7ffcfff on acpi0
dmar1: <DMA remap> iomem 0xe3ffc000-0xe3ffcfff on acpi0
dmar2: <DMA remap> iomem 0xfbffc000-0xfbffcfff on acpi0
dmar3: <DMA remap> iomem 0xabffc000-0xabffcfff on acpi0

mpr0: dmar3 pci0:4:0:0 rid 400 domain 4 mgaw 48 agaw 48 re-mapped
mpr1: dmar2 pci0:195:0:0 rid c300 domain 2 mgaw 48 agaw 48 re-mapped

nvme0: dmar0 pci0:65:0:0 rid 4100 domain 0 mgaw 48 agaw 48 re-mapped
nvme1: dmar0 pci0:67:0:0 rid 4300 domain 1 mgaw 48 agaw 48 re-mapped
nvme2: dmar0 pci0:69:0:0 rid 4500 domain 2 mgaw 48 agaw 48 re-mapped
nvme3: dmar1 pci0:129:0:0 rid 8100 domain 0 mgaw 48 agaw 48 re-mapped
nvme4: dmar1 pci0:131:0:0 rid 8300 domain 1 mgaw 48 agaw 48 re-mapped
nvme5: dmar1 pci0:132:0:0 rid 8400 domain 2 mgaw 48 agaw 48 re-mapped
nvme6: dmar2 pci0:193:0:0 rid c100 domain 0 mgaw 48 agaw 48 re-mapped
nvme7: dmar2 pci0:194:0:0 rid c200 domain 1 mgaw 48 agaw 48 re-mapped

unknown: dmar3 pci0:0:29:0 rid e8 domain 0 mgaw 48 agaw 48 re-mapped
unknown: dmar3 pci0:0:26:0 rid d0 domain 1 mgaw 48 agaw 48 re-mapped

ix0: dmar3 pci0:1:0:0 rid 100 domain 2 mgaw 48 agaw 48 re-mapped
ix1: dmar3 pci0:1:0:1 rid 101 domain 3 mgaw 48 agaw 48 re-mapped

ix0: Using MSIX interrupts with 49 vectors
ix1: Using MSIX interrupts with 49 vectors

--

So the LSI HBAs, Intel NICs and NVMe are all using DMAR, but only the
NICs use MSI-X?

But 2 * 49 = 98, and that is smaller than the 191 which Jim mentions.

And what are those "unknown" devices on dmar3?

Actually, I don't really know what I am talking about here .. just puzzled.

/Tobias

Adrian Chadd

Apr 1, 2015, 6:55:32 PM
to Tobias Oberstein, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
Out of curiosity - why not cap how many MSIs are created? Why does it
need one per CPU?


-a

Jim Harris

Apr 1, 2015, 7:25:05 PM
to Tobias Oberstein, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Alan Somers
vmstat -ia should show you which controllers were assigned per-core
vectors - you'll see all of them in the irq256+ range instead of the
single vector per controller you see now in the lower irq index range.

>
> I could arrange for disabling all but 1 CPU and retest. Would that help?
>

Yes - that would help. Depending on how your system is configured, and
which CPU socket the NVMe controllers are attached to, you may need to
keep 2 CPU sockets enabled.

You can also try a debug tunable that is in the nvme driver:

hw.nvme.per_cpu_io_queues=0

This would just try to allocate a single MSI-X vector per controller - so
all cores would still share a single I/O queue pair, but it would be MSI-X
instead of INTx. (This actually should be the first fallback if we cannot
allocate per-core vectors.) Would at least show we are able to allocate
some number of MSI-X vectors for NVMe.
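Since it is a loader tunable, it has to go into /boot/loader.conf and takes
effect on the next boot, e.g.:

# /boot/loader.conf
hw.nvme.per_cpu_io_queues="0"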


>
> ===
>
> Right after running against nvd7
>
> irq56: nvme0 6440 0
> ...
> irq106: nvme7 145056 3
>
>
> Then, immediately thereafter, running against nvd0
>
> https://github.com/oberstet/scratchbox/blob/master/
> freebsd/cruncher/results/freebsd_vmstat.md#nvd0
>
> irq56: nvme0 9233 0
> ...

Konstantin Belousov

Apr 2, 2015, 6:15:38 AM
to Tobias Oberstein, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
On Thu, Apr 02, 2015 at 12:24:45AM +0200, Tobias Oberstein wrote:
> Am 01.04.2015 um 23:23 schrieb Konstantin Belousov:
> > On Wed, Apr 01, 2015 at 10:52:18PM +0200, Tobias Oberstein wrote:
> >>> > FreeBSD 11 Current with patches (DMAR and ZFS patches, otherwise the box
> >>> > doesn't boot at all .. because of 3TB RAM and the amount of periphery).
> >>>
> >>> Do you still have WITNESS and INVARIANTS turned on in your kernel
> >>> config? They're turned on by default for Current, but they do have
> >>> some performance impact. To turn them off, just build a
> >>> GENERIC-NODEBUG kernel .
> >>
> >> WITNESS is off, INVARIANTS is still on.
> > INVARIANTS are costly.
>
> ah, ok. will rebuild without this option.
>
> > I have the following patch for a long time, it allowed to increase pps
> > in iperf and similar tests when DMAR is enabled. In your case it could
> > reduce the rate of the DMAR interrupts.
>
> You mean these lines from vmstat?
>
> irq257: dmar0:qi 22312 0
> irq259: dmar1:qi 22652 0
> irq261: dmar2:qi 261874194 6911
> irq263: dmar3:qi 124939 3
>
> So these dmar2 interrupts come from DMAR region 2 which is used by nvd7?
Dmar unit 2.

In modern machines, there is one translation unit (or sometimes two)
per CPU package, which handles devices from the PCIe buses rooted in that
socket.

The interrupt stats above mean that the load on your machine is unbalanced
WRT PCIe buses: most of the DMA transfers were performed by devices
attached to the bus(es) on the socket where DMAR 2 is located.

>
> From dmesg:
>
> dmar0: <DMA remap> iomem 0xc7ffc000-0xc7ffcfff on acpi0
> dmar1: <DMA remap> iomem 0xe3ffc000-0xe3ffcfff on acpi0
> dmar2: <DMA remap> iomem 0xfbffc000-0xfbffcfff on acpi0
> dmar3: <DMA remap> iomem 0xabffc000-0xabffcfff on acpi0
>
> mpr0: dmar3 pci0:4:0:0 rid 400 domain 4 mgaw 48 agaw 48 re-mapped
> mpr1: dmar2 pci0:195:0:0 rid c300 domain 2 mgaw 48 agaw 48 re-mapped
>
> nvme0: dmar0 pci0:65:0:0 rid 4100 domain 0 mgaw 48 agaw 48 re-mapped
> nvme1: dmar0 pci0:67:0:0 rid 4300 domain 1 mgaw 48 agaw 48 re-mapped
> nvme2: dmar0 pci0:69:0:0 rid 4500 domain 2 mgaw 48 agaw 48 re-mapped
> nvme3: dmar1 pci0:129:0:0 rid 8100 domain 0 mgaw 48 agaw 48 re-mapped
> nvme4: dmar1 pci0:131:0:0 rid 8300 domain 1 mgaw 48 agaw 48 re-mapped
> nvme5: dmar1 pci0:132:0:0 rid 8400 domain 2 mgaw 48 agaw 48 re-mapped
> nvme6: dmar2 pci0:193:0:0 rid c100 domain 0 mgaw 48 agaw 48 re-mapped
> nvme7: dmar2 pci0:194:0:0 rid c200 domain 1 mgaw 48 agaw 48 re-mapped
>
> unknown: dmar3 pci0:0:29:0 rid e8 domain 0 mgaw 48 agaw 48 re-mapped
> unknown: dmar3 pci0:0:26:0 rid d0 domain 1 mgaw 48 agaw 48 re-mapped
>
> ix0: dmar3 pci0:1:0:0 rid 100 domain 2 mgaw 48 agaw 48 re-mapped
> ix1: dmar3 pci0:1:0:1 rid 101 domain 3 mgaw 48 agaw 48 re-mapped
>
> ix0: Using MSIX interrupts with 49 vectors
> ix1: Using MSIX interrupts with 49 vectors
>
> --
>
> So the LSI HBAs, Intel NICs and NVMe are all using DMAR, but only the
> NICs use MSI-X?
MSI-X is the method of reporting interrupt requests to CPUs.
DMARs are engines that translate the addresses of DMA requests (and also
translate interrupts).

>
> But 2 * 49 = 98, and that is smaller than the 191 which Jim mentions.
>
> And what are those "unknown" devices on dmar3?
0:26:0 and 0:29:0 are most likely USB controllers; the b/d/f numbers are
typical for the Intel PCH. "unknown" is displayed when a PCI device does
not have a driver attached; you probably do not have USB loaded. DMAR still
has to enable the translation context for the USB controllers, since the
BIOS performs transfers behind the OS and instructs the DMAR driver to
enable mappings.
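If you want to confirm what sits at those addresses, pciconf shows the
vendor/device info even when no driver is attached, for instance:

pciconf -lv | grep -A3 'pci0:0:2[69]:0'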

Tobias Oberstein

Apr 2, 2015, 10:13:04 AM
to Jim Harris, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Alan Somers
> You can also try a debug tunable that is in the nvme driver.
>
> hw.nvme.per_cpu_io_queues=0

I have rerun the tests with a kernel that has INVARIANTS off and with the
above tunable set in loader.conf.

The results are the same.

vmstat now:

root@s4l-zfs:~/oberstet # vmstat -ia | grep nvme
irq371: nvme0 8 0
irq372: nvme0 7478 0
irq373: nvme1 8 0
irq374: nvme1 7612 0
irq375: nvme2 8 0
irq376: nvme2 7695 0
irq377: nvme3 7 0
irq378: nvme3 7716 0
irq379: nvme4 8 0
irq380: nvme4 7622 0
irq381: nvme5 7 0
irq382: nvme5 7561 0
irq383: nvme6 8 0
irq384: nvme6 7609 0
irq385: nvme7 7 0
irq386: nvme7 15373417 1174

===

I was advised (off list) to run tests against a pure ramdisk.

Here are results from a single socket E3:

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#xeon-e3-machine

and here are results for the 48 core box

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#48-core-big-machine

Performance of this box on this test is 1/10 of the single-socket E3!

Something is severely wrong. It seems there might be multiple issues
(not only NVMe). And this is after already running with 3 patches just
to make it boot at all.

Well. I'm running out of ideas to try, and also patience with the users
waiting for this box =(

/Tobias

Kurt Lidl

Apr 2, 2015, 10:44:25 AM
to freebsd...@freebsd.org
On 4/2/15 10:12 AM, Tobias Oberstein wrote:
> I was advised (off list) to run tests against a pure ramdisk.
>
> Here are results from a single socket E3:
>
> https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#xeon-e3-machine
>
>
> and here are results for the 48 core box
>
> https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#48-core-big-machine
>
>
> Performance with this box is 1/10 on this test compared to single socket
> E3!
>
> Something is severely wrong. It seems, there might be multiple issues
> (not only NVMe). And this is after already running with 3 patches to
> make it even boot.

Offhand, I'd guess the performance difference between the single-socket
machine and the quad-socket machine has to do with the NUMA effects of
the memory in the multi-socket system.

FreeBSD does not have per-socket memory allocation/affinity at this
time. So, some of the memory allocated to your ramdisk might be
accessible to your process only over the QPI interconnect between the
different CPU sockets.

You could install the latest and greatest intel-pcm tools from
/usr/ports and see what that says about the memory while you are
running your randomio/fio tests.
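Something along these lines should work (assuming the port still installs
the memory-bandwidth tool as pcm-memory.x; it needs the cpuctl(4) driver
for MSR access):

pkg install intel-pcm      # or build sysutils/intel-pcm from ports
kldload cpuctl
pcm-memory.x 1             # per-socket memory bandwidth, 1 s refresh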

-Kurt

Konstantin Belousov

Apr 2, 2015, 6:10:51 PM
to Tobias Oberstein, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
The speed of dd already looks wrong.

Check the CPU frequency settings in BIOS, and check what sysctl dev.cpu
reports. Ensure that cpufreq.ko is loaded from loader.conf.
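For example:

# /boot/loader.conf
cpufreq_load="YES"

# after reboot, check the current and available frequencies:
sysctl dev.cpu.0.freq dev.cpu.0.freq_levels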

Tobias Oberstein

Apr 4, 2015, 7:15:15 AM
to Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
>> I was advised (off list) to run tests against a pure ramdisk.
>>
>> Here are results from a single socket E3:
>>
>> https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#xeon-e3-machine
>>
>> and here are results for the 48 core box
>>
>> https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#48-core-big-machine
>>
>> Performance with this box is 1/10 on this test compared to single socket E3!
>>
>> Something is severely wrong. It seems, there might be multiple issues
>> (not only NVMe). And this is after already running with 3 patches to
>> make it even boot.
> The speed of dd already looks wrong.

Yes.

>
> Check the CPU frequency settings in BIOS, and check what sysctl dev.cpu
> reports. Ensure that cpufreq.ko is loaded from loader.conf.

It's loaded now, and CPU clock is at maximum: dev.cpu.0.freq: 3000

Unfortunately, performance (randomio/fio) did not change ..

Konstantin Belousov

Apr 4, 2015, 7:48:59 PM
to Tobias Oberstein, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
On Sat, Apr 04, 2015 at 01:14:59PM +0200, Tobias Oberstein wrote:
> >> I was advised (off list) to run tests against a pure ramdisk.
> >>
> >> Here are results from a single socket E3:
> >>
> >> https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#xeon-e3-machine
> >>
> >> and here are results for the 48 core box
> >>
> >> https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_ramdisk.md#48-core-big-machine
> >>
> >> Performance with this box is 1/10 on this test compared to single socket E3!
> >>
> >> Something is severely wrong. It seems, there might be multiple issues
> >> (not only NVMe). And this is after already running with 3 patches to
> >> make it even boot.
> > The speed of dd already looks wrong.
>
> Yes.
Try some simple checks for the performance anomalies. E.g., verify the
raw memory bandwidth, check that it is in the expected range. Quick search
popped up the following tool, e.g.:
https://zsmith.co/bandwidth.html
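Even a crude single-threaded dd copy gives a first sanity check before
running the real tool, e.g.:

dd if=/dev/zero of=/dev/null bs=1m count=10000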

Tobias Oberstein

Apr 7, 2015, 8:39:55 AM
to Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
>>>> Something is severely wrong. It seems, there might be multiple issues
>>>> (not only NVMe). And this is after already running with 3 patches to
>>>> make it even boot.
>>> The speed of dd already looks wrong.
>>
>> Yes.
> Try some simple checks for the performance anomalies. E.g., verify the
> raw memory bandwidth, check that it is in the expected range. Quick search
> popped up the following tool, e.g.:
> https://zsmith.co/bandwidth.html

Here are results for those 2 machines (single socket 4 core E3, quad
socket 48 core E7):

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_memperf.md

https://raw.githubusercontent.com/oberstet/scratchbox/master/freebsd/cruncher/results/bw_4core.bmp

https://raw.githubusercontent.com/oberstet/scratchbox/master/freebsd/cruncher/results/bw_48core.bmp

Jim Harris

Apr 9, 2015, 1:16:42 PM
to Konstantin Belousov, freebsd...@freebsd.org, Tobias Oberstein, Michael Fuckner, Alan Somers
On Wed, Apr 1, 2015 at 2:34 PM, Jim Harris <jim.h...@gmail.com> wrote:

>
>
> Yes - this is exactly the problem.
>
> nvme does use MSI-X if it can allocate the vectors (one per core). With
> 48 cores,
> I suspect we are quickly running out of vectors, so NVMe is reverting to
> INTx.
>
> Could you actually send vmstat -ia (I left off the 'a' previously) - just
> so we can
> see all allocated interrupt vectors.
>
> As an experiment, can you try disabling hyperthreading - this will reduce
> the
> number of cores and should let you get MSI-X vectors allocated for at least
> the first couple of NVMe controllers. Then please re-run your performance
> test on one of those controllers.
>
> sys/x86/x86/local_apic.c defines APIC_NUM_IOINTS to 191 - it looks like
> this
> is the actual limit for MSI-X vectors, even though NUM_MSI_INTS is set to
> 512.
>

Update on this vector allocation discussion:

The DC P3*00 SSDs have a 32-entry MSI-X vector table, which is fewer than
the 48 cores on your system, so it falls back to INTx. I committed r281280
which will make it fall back to 2 MSI-X vectors (one for the admin queue,
one for a single I/O queue), which is better than INTx. In the future this
driver will be improved to utilize the available number of interrupts
rather than falling back to a single I/O queue. (Based on your ramdisk
performance data, it does not appear that lack of per-CPU NVMe I/O queues
is the cause of the performance issues on this system - but I'm working to
confirm on a system in my lab.)

I also root caused a vector allocation issue and filed PR199321. In
summary, driver initialization occurs before APs are started, so the MSI-X
allocation code only allocates IRQs from the BSP's lapic. The BSP's lapic
can only accommodate 191 (APIC_NUM_IOINTS) total vectors. The allocation
code is designed to round-robin allocations across all lapics, but does
not allow it until after the APs are started, which is after driver
initialization during boot. Drivers that are loaded after boot should
round-robin IRQ allocation across all physical cores, up to the
NUM_MSI_INTS (512) limit.

Tobias Oberstein

Apr 9, 2015, 5:08:21 PM
to Jim Harris, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Alan Somers
Hi Jim,

thanks for coming back to this and for your work / info - highly appreciated!

> (Based on your ramdisk performance data, it does not
> appear that lack of per-CPU NVMe I/O queues is the cause of the performance
> issues on this system -

My unscientific gut feeling is: it might be related to NUMA in general.

The memory performance

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_memperf.md#results-48-core-numa-machine

is slower than a E3 single socket Xeon

https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_memperf.md#results-small-xeon-machine

The E3 is Haswell at 3.4 GHz, whereas the E7 is one gen. older and 3.0
GHz, but I don't think this explains the very large difference.

The 4 socket box should have an aggregate main memory bandwidth of

4 x 85GB/s = 340 GB/s

The measured numbers are orders of magnitude smaller.

> but I'm working to confirm on a system in my lab.)

FWIW, the box I am testing is

http://www.quantaqct.com/Product/Servers/Rackmount-Servers/4U/QuantaGrid-Q71L-4U-p18c77c70c83c79

The box is maxed out on RAM, CPU (mostly), internal SSDs, as well as
PCIe cards (it has 10 slots). There are very few x86 systems with bigger
scale-up; it tops out with the SGI UltraViolet UV2000. But that is totally
exotic, whereas the above is a pure Intel design.

How about Intel donating such a baby to FBSD foundation to get NUMA and
everything sorted out? Street price is roughly 150k, but given most of
the components are made by Intel, should be cheaper for Intel;)

==

Sadly, given the current state of affairs, I couldn't support targeting
FreeBSD on this system any longer. Customer wants to go to production
soonish. We'll be using Linux / SLES12. Performance at block device
level there is as expected from Intel datasheets. Means: massive! We now
"only" need to translate those millions of IOPS from block device to
filesystem level and then database (PostgreSQL). Ha, will be fun;) And I
will miss ZFS and all the FreeBSD goodies =(

Cheers,
/Tobias

Adrian Chadd

Apr 9, 2015, 5:22:19 PM
to Tobias Oberstein, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
Hi,

Dell has graciously loaned me a bunch of hardware to continue doing
NUMA development on, but I have no NVMe hardware. I'm hoping people at
Intel can continue kicking along any desires for NUMA that they
require. (Which they have, fwiw.)



-adrian

Tobias Oberstein

Apr 10, 2015, 12:08:04 PM
to Adrian Chadd, Konstantin Belousov, freebsd...@freebsd.org, Michael Fuckner, Jim Harris, Alan Somers
Hi Adrian,

> Dell has graciously loaned me a bunch of hardware to continue doing

FWIW, Dell has a roughly comparable system: Dell R920. But they don't
have Intel NVMe's on their menu, only Samsung (and FusionIO, but that's
not NVMe).

> NUMA development on, but I have no NVMe hardware. I'm hoping people at

The 8 NVMe PCIe SSDs in the box we're deploying are a key feature of
this system (it will be a data warehouse). A single NVMe probably wouldn't
have triggered (all of) the issues we experienced.

We are using the largest model (2TB), and this amounts to 50k bucks for
all eight. The smallest model (400GB) is 1.5k, so 12k in total.

> Intel can continue kicking along any desires for NUMA that they
> require. (Which they have, fwiw.)

It's already awesome that Intel has senior engineers working on FreeBSD
driver code! And it would underline Intel's Open-source commitment and
tech leadership if they donated a couple of these beefy NVMes.

The specs are incredible, but extracting all the performance is
non-trivial ..

Cheers,
/Tobias

Jim Harris

Apr 10, 2015, 12:58:16 PM
to Tobias Oberstein, Konstantin Belousov, freebsd...@freebsd.org, Adrian Chadd, Michael Fuckner, Alan Somers
On Fri, Apr 10, 2015 at 9:07 AM, Tobias Oberstein <
tobias.o...@gmail.com> wrote:

> Hi Adrian,
>
> > Dell has graciously loaned me a bunch of hardware to continue doing
>
> FWIW, Dell has a roughly comparable system: Dell R920. But they don't have
> Intel NVMe's on their menu, only Samsung (and FusionIO, but that's not
> NVMe).
>
> NUMA development on, but I have no NVMe hardware. I'm hoping people at
>>
>
> The 8 NVMe PCIe SSDs in the box we're deploying are a key feature of this
> system (will be a data-warehouse). A single NVMe probably won't have
> triggered (all) issues we experienced.
>
> We are using the largest model (2TB), and this amounts to 50k bucks for
> all eight. The smallest model (400GB) is 1.5k, so 12k in total.
>
> Intel can continue kicking along any desires for NUMA that they
>> require. (Which they have, fwiw.)
>>
>
> It's already awesome that Intel has senior engineers working on FreeBSD
> driver code! And it would underline Intel's Open-source commitment and tech
> leadership if they donated a couple of these beefy NVMes.
>

Intel has agreed to send DC P3700 samples to the FreeBSD Foundation to put
in the cluster for this kind of work - we are working on getting these
through the internal sample distribution process at the moment.

-Jim

Tobias Oberstein

Apr 16, 2015, 4:31:59 PM
to Jim Harris, Konstantin Belousov, freebsd...@freebsd.org, Adrian Chadd, Michael Fuckner, Alan Somers
> It's already awesome that Intel has senior engineers working on
> FreeBSD driver code! And it would underline Intel's Open-source
> commitment and tech leadership if they donated a couple of these
> beefy NVMes.
>
> Intel has agreed to send DC P3700 samples to the FreeBSD Foundation to
> put in the cluster for this kind of work - we are working on getting
> these through the internal sample distribution process at the moment.

This is awesome! Intel rocks. I am spreading the word about this.

==

OT (slightly): I've done more measurements under Linux, now with all
eight cards (md RAID-0 over all 8) under fire from FIO.

The results are very good. Here is 1 number (block device level):

2647.4K random 4kB read IOPS @ <800us (95% quantile)

The Intel datasheet says 450k IOPS per card, which would be 3600k in total,
so the scale-up is quite good. The random _write_ IOPS I've measured is
almost too high to believe (2390.3K), where the Intel datasheet suggests
1400k - I am running a longer test overnight now. Let's see if it can be
sustained. Anyway: these Intel NVMes are great tech!

We'll now move on testing at filesystem level (XFS) and database level
(PostgreSQL).

Cheers,
/Tobias



FIO results for 1 NVMe card
===========================

bvr-sql18:~ # fio --filename /dev/nvme7n1 control.fio
random-read-4k: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
iodepth=1
..
random-write-4k: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
ioengine=sync, iodepth=1
..
sequential-read-128k: (g=2): rw=read, bs=128K-128K/128K-128K/128K-128K,
ioengine=sync, iodepth=1
..
sequential-write-128k: (g=3): rw=write,
bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
..
fio-2.2.6
Starting 256 threads
Jobs: 64 (f=64): [_(192),W(64)] [2.4% done] [0KB/1902MB/0KB /s]
[0/15.3K/0 iops] [eta 09h:36m:00s] ]
random-read-4k: (groupid=0, jobs=64): err= 0: pid=21898: Thu Apr 16
18:56:01 2015
read : io=381205MB, bw=2117.9MB/s, iops=542157, runt=180000msec
clat (usec): min=6, max=5568, avg=116.92, stdev=103.46
lat (usec): min=6, max=5568, avg=117.00, stdev=103.47
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 14], 10.00th=[ 18], 20.00th=[ 26],
| 30.00th=[ 36], 40.00th=[ 54], 50.00th=[ 127], 60.00th=[ 151],
| 70.00th=[ 171], 80.00th=[ 193], 90.00th=[ 223], 95.00th=[ 249],
| 99.00th=[ 310], 99.50th=[ 346], 99.90th=[ 828], 99.95th=[ 1880],
| 99.99th=[ 2864]
bw (KB /s): min= 0, max=36016, per=1.56%, avg=33798.82,
stdev=1884.10
lat (usec) : 10=0.57%, 20=11.28%, 50=26.41%, 100=5.80%, 250=51.15%
lat (usec) : 500=4.64%, 750=0.04%, 1000=0.02%
lat (msec) : 2=0.04%, 4=0.05%, 10=0.01%
cpu : usr=1.57%, sys=3.76%, ctx=114070395, majf=0, minf=3150
IO depths : 1=116.9%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=97588427/w=0/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
random-write-4k: (groupid=1, jobs=64): err= 0: pid=21994: Thu Apr 16
18:56:01 2015
write: io=301353MB, bw=1674.2MB/s, iops=428591, runt=180000msec
clat (usec): min=7, max=18212, avg=147.90, stdev=202.69
lat (usec): min=8, max=18212, avg=147.99, stdev=202.69
clat percentiles (usec):
| 1.00th=[ 14], 5.00th=[ 24], 10.00th=[ 33], 20.00th=[ 52],
| 30.00th=[ 76], 40.00th=[ 109], 50.00th=[ 145], 60.00th=[ 173],
| 70.00th=[ 195], 80.00th=[ 213], 90.00th=[ 247], 95.00th=[ 290],
| 99.00th=[ 454], 99.50th=[ 564], 99.90th=[ 1112], 99.95th=[ 1928],
| 99.99th=[13120]
bw (KB /s): min= 0, max=29584, per=1.56%, avg=26719.94,
stdev=1839.10
lat (usec) : 10=0.03%, 20=3.11%, 50=15.64%, 100=18.69%, 250=53.11%
lat (usec) : 500=8.69%, 750=0.52%, 1000=0.10%
lat (msec) : 2=0.07%, 4=0.03%, 10=0.01%, 20=0.01%
cpu : usr=1.52%, sys=3.20%, ctx=89437032, majf=0, minf=2657
IO depths : 1=115.9%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=77146473/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
sequential-read-128k: (groupid=2, jobs=64): err= 0: pid=22088: Thu Apr
16 18:56:01 2015
read : io=424549MB, bw=2358.6MB/s, iops=18868, runt=180006msec
clat (usec): min=213, max=9712, avg=3390.92, stdev=1888.31
lat (usec): min=213, max=9713, avg=3391.02, stdev=1888.31
clat percentiles (usec):
| 1.00th=[ 548], 5.00th=[ 764], 10.00th=[ 924], 20.00th=[ 1272],
| 30.00th=[ 1832], 40.00th=[ 2608], 50.00th=[ 3376], 60.00th=[ 4128],
| 70.00th=[ 4896], 80.00th=[ 5408], 90.00th=[ 5856], 95.00th=[ 6176],
| 99.00th=[ 6752], 99.50th=[ 6944], 99.90th=[ 7584], 99.95th=[ 7968],
| 99.99th=[ 8640]
bw (KB /s): min= 4, max=40031, per=1.56%, avg=37670.97,
stdev=2609.28
lat (usec) : 250=0.01%, 500=0.59%, 750=4.10%, 1000=7.60%
lat (msec) : 2=20.12%, 4=25.78%, 10=41.81%
cpu : usr=0.07%, sys=0.29%, ctx=3980959, majf=0, minf=1920
IO depths : 1=117.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=3396390/w=0/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
sequential-write-128k: (groupid=3, jobs=64): err= 0: pid=22153: Thu Apr
16 18:56:01 2015
write: io=337530MB, bw=1875.9MB/s, iops=15000, runt=180008msec
clat (usec): min=52, max=25625, avg=4258.89, stdev=2628.32
lat (usec): min=52, max=25625, avg=4259.05, stdev=2628.31
clat percentiles (usec):
| 1.00th=[ 76], 5.00th=[ 314], 10.00th=[ 780], 20.00th=[ 1608],
| 30.00th=[ 2576], 40.00th=[ 3440], 50.00th=[ 4256], 60.00th=[ 5024],
| 70.00th=[ 5792], 80.00th=[ 6624], 90.00th=[ 7584], 95.00th=[ 8256],
| 99.00th=[ 9792], 99.50th=[12352], 99.90th=[18304], 99.95th=[19584],
| 99.99th=[21888]
bw (KB /s): min= 4, max=32572, per=1.56%, avg=29945.45,
stdev=1770.29
lat (usec) : 100=2.65%, 250=1.83%, 500=2.21%, 750=2.89%, 1000=3.12%
lat (msec) : 2=11.60%, 4=22.73%, 10=52.03%, 20=0.89%, 50=0.04%
cpu : usr=0.25%, sys=0.22%, ctx=3151153, majf=0, minf=0
IO depths : 1=116.7%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=2700240/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: io=381205MB, aggrb=2117.9MB/s, minb=2117.9MB/s,
maxb=2117.9MB/s, mint=180000msec, maxt=180000msec

Run status group 1 (all jobs):
WRITE: io=301353MB, aggrb=1674.2MB/s, minb=1674.2MB/s,
maxb=1674.2MB/s, mint=180000msec, maxt=180000msec

Run status group 2 (all jobs):
READ: io=424549MB, aggrb=2358.6MB/s, minb=2358.6MB/s,
maxb=2358.6MB/s, mint=180006msec, maxt=180006msec

Run status group 3 (all jobs):
WRITE: io=337530MB, aggrb=1875.9MB/s, minb=1875.9MB/s,
maxb=1875.9MB/s, mint=180008msec, maxt=180008msec

Disk stats (read/write):
nvme7n1: ios=118062872/92590109, merge=0/0, ticks=26066308/26109972,
in_queue=55971668, util=100.00%


FIO results for RAID-0 over all 8 NVMe cards
============================================

bvr-sql18:/home/oberstet/scm/scratchbox/freebsd/cruncher # mdadm
--detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Apr 16 19:31:34 2015
Raid Level : raid0
Array Size : 15627067392 (14903.13 GiB 16002.12 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent

Update Time : Thu Apr 16 19:31:34 2015
State : clean
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0

Chunk Size : 256K

Name : bvr-sql18:0 (local to host bvr-sql18)
UUID : 7280eb72:929d263f:4604a091:3fe38c91
Events : 0

Number Major Minor RaidDevice State
0 259 0 0 active sync /dev/nvme0n1
1 259 1 1 active sync /dev/nvme1n1
2 259 2 2 active sync /dev/nvme2n1
3 259 3 3 active sync /dev/nvme3n1
4 259 4 4 active sync /dev/nvme4n1
5 259 5 5 active sync /dev/nvme5n1
6 259 6 6 active sync /dev/nvme6n1
7 259 7 7 active sync /dev/nvme7n1


bvr-sql18:~ # fio --filename /dev/md0 control2.fio
random-read-4k: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
ioengine=libaio, iodepth=32
..
random-write-4k: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
ioengine=libaio, iodepth=32
..
sequential-read-128k: (g=2): rw=read, bs=128K-128K/128K-128K/128K-128K,
ioengine=libaio, iodepth=32
..
sequential-write-128k: (g=3): rw=write,
bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=32
..
fio-2.2.6
Starting 256 threads
random-read-4k: (groupid=0, jobs=64): err= 0: pid=31860: Thu Apr 16
20:22:41 2015
read : io=620514MB, bw=10341MB/s, iops=2647.4K, runt= 60005msec
slat (usec): min=1, max=29604K, avg=22.55, stdev=6631.04
clat (usec): min=0, max=37611K, avg=717.16, stdev=34318.22
lat (usec): min=8, max=37611K, avg=739.82, stdev=35080.01
clat percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 374], 10.00th=[ 406], 20.00th=[ 442],
| 30.00th=[ 466], 40.00th=[ 486], 50.00th=[ 510], 60.00th=[ 532],
| 70.00th=[ 556], 80.00th=[ 588], 90.00th=[ 636], 95.00th=[ 692],
| 99.00th=[ 1048], 99.50th=[16512], 99.90th=[16768], 99.95th=[24448],
| 99.99th=[28544]
bw (KB /s): min= 0, max=341456, per=1.78%, avg=188940.35,
stdev=43167.55
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.05%, 500=45.93%, 750=51.16%, 1000=1.81%
lat (msec) : 2=0.18%, 4=0.02%, 10=0.04%, 20=0.72%, 50=0.09%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 2000=0.01%, >=2000=0.01%
cpu : usr=3.42%, sys=68.44%, ctx=63458, majf=0, minf=18647
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=158851602/w=0/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
random-write-4k: (groupid=1, jobs=64): err= 0: pid=31935: Thu Apr 16
20:22:41 2015
write: io=560381MB, bw=9336.8MB/s, iops=2390.3K, runt= 60019msec
slat (usec): min=1, max=867805, avg=21.12, stdev=515.87
clat (usec): min=0, max=892513, avg=833.67, stdev=5252.13
lat (usec): min=8, max=892527, avg=854.96, stdev=5280.37
clat percentiles (usec):
| 1.00th=[ 34], 5.00th=[ 251], 10.00th=[ 414], 20.00th=[ 462],
| 30.00th=[ 494], 40.00th=[ 516], 50.00th=[ 532], 60.00th=[ 556],
| 70.00th=[ 572], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 756],
| 99.00th=[16512], 99.50th=[16512], 99.90th=[24448], 99.95th=[29568],
| 99.99th=[171008]
bw (KB /s): min= 5, max=275816, per=1.59%, avg=151755.38,
stdev=43952.51
lat (usec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.43%, 50=0.99%
lat (usec) : 100=1.12%, 250=2.41%, 500=27.32%, 750=62.61%, 1000=2.59%
lat (msec) : 2=0.56%, 4=0.24%, 10=0.18%, 20=1.34%, 50=0.14%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
cpu : usr=5.60%, sys=61.80%, ctx=1940986, majf=0, minf=29148
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=143457578/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
sequential-read-128k: (groupid=2, jobs=64): err= 0: pid=32020: Thu Apr
16 20:22:41 2015
read : io=758042MB, bw=12611MB/s, iops=100885, runt= 60111msec
slat (usec): min=4, max=9475, avg=25.05, stdev=34.23
clat (usec): min=0, max=1609.2K, avg=20264.21, stdev=88933.36
lat (usec): min=264, max=1609.2K, avg=20289.38, stdev=88933.66
clat percentiles (usec):
| 1.00th=[ 442], 5.00th=[ 596], 10.00th=[ 716], 20.00th=[ 924],
| 30.00th=[ 1144], 40.00th=[ 1416], 50.00th=[ 1976], 60.00th=[ 4512],
| 70.00th=[11328], 80.00th=[21120], 90.00th=[34048], 95.00th=[60672],
| 99.00th=[288768], 99.50th=[602112], 99.90th=[1351680],
99.95th=[1449984],
| 99.99th=[1548288]
bw (KB /s): min=11636, max=1197056, per=1.57%, avg=202645.32,
stdev=164421.79
lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=2.18%, 750=9.42%, 1000=12.10%
lat (msec) : 2=26.54%, 4=8.62%, 10=9.60%, 20=10.52%, 50=14.83%
lat (msec) : 100=3.39%, 250=1.66%, 500=0.50%, 750=0.16%, 1000=0.15%
lat (msec) : 2000=0.32%
cpu : usr=0.33%, sys=5.35%, ctx=5444833, majf=0, minf=699080
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=6064339/w=0/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
sequential-write-128k: (groupid=3, jobs=64): err= 0: pid=32084: Thu Apr
16 20:22:41 2015
write: io=841679MB, bw=13993MB/s, iops=111944, runt= 60150msec
slat (usec): min=4, max=6855, avg=23.85, stdev=25.17
clat (usec): min=0, max=901229, avg=18237.98, stdev=63454.35
lat (usec): min=53, max=901246, avg=18262.01, stdev=63454.86
clat percentiles (usec):
| 1.00th=[ 55], 5.00th=[ 75], 10.00th=[ 98], 20.00th=[ 155],
| 30.00th=[ 251], 40.00th=[ 446], 50.00th=[ 852], 60.00th=[ 1656],
| 70.00th=[ 4192], 80.00th=[13504], 90.00th=[34560], 95.00th=[86528],
| 99.00th=[350208], 99.50th=[432128], 99.90th=[749568],
99.95th=[790528],
| 99.99th=[856064]
bw (KB /s): min= 2981, max=971776, per=1.57%, avg=224887.01,
stdev=155980.09
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.09%
lat (usec) : 100=10.23%, 250=19.57%, 500=11.90%, 750=6.27%, 1000=4.41%
lat (msec) : 2=9.95%, 4=7.15%, 10=7.44%, 20=7.82%, 50=7.51%
lat (msec) : 100=3.25%, 250=2.55%, 500=1.52%, 750=0.25%, 1000=0.09%
cpu : usr=3.89%, sys=5.37%, ctx=5690696, majf=0, minf=336258
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=6733432/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=620514MB, aggrb=10341MB/s, minb=10341MB/s, maxb=10341MB/s,
mint=60005msec, maxt=60005msec

Run status group 1 (all jobs):
WRITE: io=560381MB, aggrb=9336.8MB/s, minb=9336.8MB/s,
maxb=9336.8MB/s, mint=60019msec, maxt=60019msec

Run status group 2 (all jobs):
READ: io=758042MB, aggrb=12611MB/s, minb=12611MB/s, maxb=12611MB/s,
mint=60111msec, maxt=60111msec

Run status group 3 (all jobs):
WRITE: io=841679MB, aggrb=13993MB/s, minb=13993MB/s, maxb=13993MB/s,
mint=60150msec, maxt=60150msec

Disk stats (read/write):
md0: ios=164916853/150191138, merge=0/0, ticks=0/0, in_queue=0,
util=0.00%, aggrios=20614611/18773876, aggrmerge=0/0,
aggrticks=16975295/18070966, aggrin_queue=35215176, aggrutil=99.55%
nvme0n1: ios=20613912/18773078, merge=0/0, ticks=85132920/4244536,
in_queue=89694924, util=96.14%
nvme1n1: ios=20610214/18771214, merge=0/0, ticks=11432236/7672636,
in_queue=19248200, util=98.60%
nvme2n1: ios=20618968/18778997, merge=0/0, ticks=2527348/3999968,
in_queue=6608776, util=96.34%
nvme3n1: ios=20616191/18777118, merge=0/0, ticks=8099032/85948268,
in_queue=94354244, util=99.55%
nvme4n1: ios=20610527/18772284, merge=0/0, ticks=11455980/5902592,
in_queue=17468636, util=98.45%
nvme5n1: ios=20620340/18777158, merge=0/0, ticks=2224840/3858944,
in_queue=6176544, util=98.01%
nvme6n1: ios=20615107/18768738, merge=0/0, ticks=5476296/7946824,
in_queue=13518332, util=97.72%
nvme7n1: ios=20611629/18772423, merge=0/0, ticks=9453712/24993964,
in_queue=34651756, util=98.89%
bvr-sql18:~ #

Tobias Oberstein

Apr 27, 2015, 12:46:38 PM
to Jim Harris, Konstantin Belousov, freebsd...@freebsd.org, Adrian Chadd, Michael Fuckner, Alan Somers
Hi Jim,

I have now done extensive tests under Linux (SLES12) at the block device
level.

8kB Random IO results:
http://tavendo.com.s3.amazonaws.com/scratch/fio_p3700_8kB_random.pdf

All results:
http://tavendo.com.s3.amazonaws.com/scratch/fio_p3700.pdf

What becomes apparent is:

1) IOPS scales nicely, almost linearly, for (software) RAID-0

It scales up to roughly 2.2 million 8kB random reads and 750k 8kB random
writes. Extrapolating Intel's datasheet would give: 2.36 million / 720k

Awesome!

2) It does not scale for RAID-1.

In fact, the write performance fully collapses for more than 4 devices.

Note: I don't know which NVMe is wired to which CPU socket, and to which
block device - IOW: I did not "hand-place" the devices into RAID sets or
anything.
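On Linux the locality can at least be read back from sysfs; the PCI
address below is just an example taken from the earlier FreeBSD dmesg
(verify with lspci), not something I have checked on this box:

# which NUMA node a given NVMe controller is attached to (-1 = unknown)
cat /sys/bus/pci/devices/0000:c2:00.0/numa_node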

==

I am currently running the same set of tests against 10 DC S3700 via SAS.

This should reveal if it's a general mdadm thing, or NVMe related.

==

For now, we likely will use the NVMes in a RAID-0 setup to leverage the
maximum performance.
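For the record, the RAID-0 set is assembled with plain mdadm; a sketch
matching the --detail output earlier in the thread (8 devices, 256K chunk):

mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=256 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
    /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1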

Cheers,
/Tobias

CadSoftware

Sep 15, 2015, 11:18:44 AM
to freebsd...@freebsd.org

I believe FreeBSD 10.2-RELEASE includes an updated nvme(4) driver. Do these
changes fix the performance issues described in this thread?

Thank you.




Tobias Oberstein

Sep 15, 2015, 11:29:53 AM
to CadSoftware, freebsd...@freebsd.org
> I believe FreeBSD 10.2-RELEASE includes an updated nvme (4) driver. Do these
> changes fix the performance issues described in this thread?

I don't know.

The problems with the machine we've deployed, however, run deeper than
"just NVMe":

RAM: 3TB .. the FBSD kernel patches that fixed this might be in 10.2 ..
dunno

amount of PCIe resources: the box has 8x NVMe, 2x SAS controllers, 2x
dual-port 10GbE .. FBSD freaked out due to this .. there were kernel
patches also .. again no clue if those are in mainline now

And finally, the bummer: NUMA performance. The box has 4 sockets, 48
cores, and we are moving to 64 cores.

My unscientific impression was that FBSD isn't yet able to cope with that.

Anyway, in the meantime, we've moved from SLES 12 to Ubuntu 15.04
(Linux kernel 3.19.3) .. due to recent Linux developments: blkmq and such.

8kB random write performance is now at >1 million IOPS (over a test
duration of 10h and a 1TB dataset at block device level - RAID-0 over 8x
2TB NVMe). That's nearly 40% more than on SLES 12, and the CPU load is
lower too. So we can saturate the hardware IOPS-wise.

Cheers,
/Tobias