Intel P3700 NVMe (/dev/nvme0 and /dev/nvme0n1) disappears? Bug?

Paul Rabinowitz

Sep 16, 2017, 1:43:54 PM
to CoreOS User
Very, very strange behavior.

The exact same behavior happened on two separate machines, so it is unlikely to be hardware or BIOS.

I just moved this machine from a CoreOS install on disk (sda, a 72 TB RAID 6 array) to a matchbox / Ignition / iPXE setup.

I could access both /dev/sda and /dev/nvme0n1 with no problems.

After the first iPXE boot I could still see both /dev/sda and /dev/nvme0n1, no problems.

I then logged in and did the following:

sudo cfdisk /dev/sda

Made everything free space and created a single 72 TB partition, sda1, for ext4.

Then ran:

sudo mkfs.ext4 /dev/sda1

Everything looked good, so I rebooted.

After rebooting I can access /dev/sda1 OK, but /dev/nvme0 and /dev/nvme0n1 have disappeared.

----

I have tried removing all drive mounting via Ignition, and I have also tried installing back to disk (sda).

dmesg | grep nvme  (it's no longer there)

Why would I lose my NVMe after wiping, repartitioning and formatting sda, and a reboot?

And how can I get it back?

Thanks in advance!


Paul Rabinowitz

Sep 16, 2017, 1:47:45 PM
to CoreOS User
Forgot to mention, I'm running 1465.7.0 x64.

Seán C. McCord

Sep 16, 2017, 1:59:32 PM
to Paul Rabinowitz, CoreOS User

Have you validated the nvme drive externally after it disappeared, to make sure it didn't really die (or become dislodged) coincidentally?  The two drives really should have nothing to do with each other, and even if you accidentally repartitioned the nvme drive, it would still show up.

Validating externally aside, you should also check the output of `dmesg`.  It is highly unlikely that this problem is due to anything outside of the kernel itself.  It is certainly possible that between the boots, you changed kernels.

I've been running dozens of nvme servers for about two years on Container Linux (CoreOS) without similar problems, both with and without PXE booting and with and without other drives (i.e., /dev/sda).  It is, however, possible that the particular combination of devices you have is causing some internal conflict.  If it is not unreasonable, you might try removing the non-nvme drive(s) from the system and see if the nvme drive returns.  (This is purely for diagnostic purposes.)
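For anyone following along, one way to tell "the kernel never saw the card" apart from "the driver failed to bind" is to look for the controller on the PCI bus before reading `dmesg`. The sketch below is not from the thread. It assumes a POSIX shell and the standard sysfs layout under /sys/bus/pci/devices, and `count_nvme_pci` is a made-up helper name. It counts PCI functions whose class code is 0x010802 (mass storage, non-volatile memory, NVMe interface).

```shell
#!/bin/sh
# count_nvme_pci: count PCI functions that identify themselves as NVMe controllers.
# $1: sysfs PCI device directory (hypothetical parameter, defaults to the real path).
count_nvme_pci() {
    dir="${1:-/sys/bus/pci/devices}"
    n=0
    for f in "$dir"/*/class; do
        # Skip entries that do not exist or cannot be read (also covers an unmatched glob).
        [ -r "$f" ] || continue
        # 0x010802 = mass storage / non-volatile memory / NVMe programming interface
        if [ "$(cat "$f")" = "0x010802" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

found=$(count_nvme_pci)
if [ "$found" -eq 0 ]; then
    echo "no NVMe controller on the PCI bus; the kernel cannot see the card at all"
else
    echo "$found NVMe controller(s) on the PCI bus; check the driver messages next"
    dmesg 2>/dev/null | grep -i nvme || echo "no nvme lines in dmesg"
fi
```

A count of 0 would suggest the card is not enumerating on the bus at all, which points below the OS (firmware, slot, or the card's own state) rather than at anything done to sda.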


--
Seán C McCord
CyCore Systems, Inc

Paul Rabinowitz

Sep 16, 2017, 2:18:23 PM
to CoreOS User
Thank you for your response.

I initially suspected a hardware failure as well, so I went on to convert another box with identical hardware, and exactly the same thing happened.

If I had a third box (which I unfortunately don't) I would try a second iPXE reboot without the cfdisk and mkfs commands to see whether those commands screwed it up.

Also, there is no way it booted a different kernel, since I am feeding it 1465.7.0 via matchbox.
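A quick way to confirm which image and kernel actually booted, rather than inferring it from the matchbox profile, is to ask the running system. This is only a sketch: it assumes /etc/os-release carries a `VERSION=` line (Container Linux ships one), and `os_version` is a hypothetical helper name.

```shell
#!/bin/sh
# os_version: print the VERSION field from an os-release file.
# $1: path to the file (hypothetical parameter, defaults to /etc/os-release).
os_version() {
    # Match only the VERSION= line (not VERSION_ID=) and strip optional quotes.
    sed -n 's/^VERSION=//p' "${1:-/etc/os-release}" | tr -d '"'
}

echo "image:  $(os_version)"
echo "kernel: $(uname -r)"
```

Comparing these two lines before and after the reboot would rule a kernel change in or out directly.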

Maybe I should try an earlier CoreOS version?

Unfortunately I would need to fly across the country to Hurricane Electric for direct hardware access. I might have to do this.

Hardware is:

Dell PowerEdge 730xd, 12 x 8 TB on a PERC 730 in RAID 6, and a 1.6 TB Intel P3700 PCIe. 128 GB RAM.

Paul Rabinowitz

Sep 16, 2017, 3:27:44 PM
to CoreOS User
Tried 1185.5.0. Still no NVMe.

Will try to disable sda next...

Seán C. McCord

Sep 16, 2017, 3:30:01 PM
to Paul Rabinowitz, CoreOS User
My fleet is currently running 1520.3.0, and I don't have any history as to whether I passed that particular release, but I've certainly been running these boxes for longer than that.

You definitely have different hardware (I've never used Dell equipment), but if anything, I would expect the PERC to be the flaky side.  Apparently, that's not the case for you, though.

Examining `dmesg` would be my next step.

Paul Rabinowitz

Sep 19, 2017, 6:28:54 PM
to CoreOS User
Just an update: the issue is solved.

It took a hard power cycle to solve the problem on both machines.

(Reboots didn't cut it.)




