Big trouble with FibreChannel


Marian Knichala

unread,
Jul 8, 2017, 4:52:48 AM7/8/17
to esos-users
Hi,

we are having big trouble with our new Fibre Channel setup.

Our setup consists of 2 storage servers running ESOS 1.1.2 and 5 Proxmox virtualisation servers. We use QLogic 4 Gb Fibre Channel equipment.
On ESOS we use ZFS with 2 pools: one mirror pool for the VMs and one raidz2 pool for backups.
Two of the Proxmox virtualisation hosts have local disks and use Fibre Channel only for backups.

We tested this setup for months and all seemed fine. Now it is in production and we have big trouble. Something with Fibre Channel seems to be broken in ESOS, causing all virtualisation hosts to crash with kernel panics, even the ones that use Fibre Channel for backups only.
Both ESOS storage servers have encountered the same problem:
In this situation the virtualisation hosts hang, and even after a reboot they can't connect to the ESOS server. The TUI shows the frozen connections but no changes. If I try "zpool status", that command hangs, too. Running "reboot" takes a long time before the reboot actually begins.

dmesg on the ESOS host that crashed today shows:
[411099.049815] qla2x00t(13): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411107.214766] qla2x00t(11): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411112.738999] qla2x00t(15): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411127.368155] qla2x00t(11): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411132.892020] qla2x00t(15): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411133.713648] qla2x00t(13): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411153.045048] qla2x00t(15): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411153.607072] qla2x00t(11): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411153.607169] qla2x00t(11): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411153.628181] qla2x00t(13): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411153.652184] qla2x00t(13): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411181.939723] qla2xxx [0000:09:00.1]-505f:13:
[411181.939725] Link is operational (4 Gbps).
[411181.946492] qla2xxx [0000:0a:00.1]-505f:11:
[411181.946494] Link is operational (4 Gbps).
[411182.021531] qla2xxx [0000:02:00.1]-505f:15:
[411182.021533] Link is operational (4 Gbps).
[411522.889595] qla2x00t(12): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411543.831539] qla2x00t(10): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411564.858017] qla2x00t(14): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411567.522206] qla2x00t(12): RSCN registration failed: 0x2 (OK for non-fabric setups)
[411598.119940] qla2xxx [0000:09:00.0]-505f:12:
[411598.119942] Link is operational (4 Gbps).
[411598.139469] qla2xxx [0000:0a:00.0]-505f:10:
[411598.139471] Link is operational (4 Gbps).
[411598.418171] qla2xxx [0000:02:00.0]-505f:14:
[411598.418173] Link is operational (4 Gbps).

on the Proxmox host:
Jul  8 08:35:19 pve04 kernel: [14428.224565] qla2xxx [0000:05:00.0]-801c:1: Abort command issued nexus=1:2:0 --  1 2002.
Jul  8 08:35:20 pve04 kernel: [14429.216704] qla2xxx [0000:05:00.1]-801c:2: Abort command issued nexus=2:2:0 --  1 2002.

The screenshot shows a Proxmox host, too.

Hope you can help
Greetings Marian
pve2crash.png

Marc Smith

unread,
Jul 8, 2017, 12:33:12 PM7/8/17
to esos-...@googlegroups.com
Are you using Fibre Channel switches? One or two? Or none?


--
You received this message because you are subscribed to the Google Groups "esos-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to esos-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Patrick

unread,
Jul 8, 2017, 1:20:16 PM7/8/17
to esos-users
We use two EMC DS-4100B switches.
Each storage system and each Proxmox host is connected to both switches.
There is no trunk between them; completely separate fabrics.

Marc Smith

unread,
Jul 8, 2017, 1:32:36 PM7/8/17
to esos-...@googlegroups.com
Is the issue only occurring on one of the switches? Or both? And to confirm, it was working fine in production and then all of a sudden it threw errors, and absolutely no changes were made, not even a reboot of anything?

Marc

Marian Knichala

unread,
Jul 8, 2017, 5:47:47 PM7/8/17
to esos-users
No, we made a few reboots and added a second zpool to the production environment.

All targets from the first storage server hang on both switches, but the other storage server works fine at the moment. However, 3 days ago the other storage server had exactly the same issue.

Marc Smith

unread,
Jul 8, 2017, 5:55:57 PM7/8/17
to esos-...@googlegroups.com
Get me a support package/bundle and I'll take a look (email me directly).

Patrick

unread,
Jul 9, 2017, 6:39:23 AM7/9/17
to esos-users
The one-year support contract?

Marc Smith

unread,
Jul 9, 2017, 2:33:47 PM7/9/17
to esos-...@googlegroups.com
No, the config/log tarball you can generate using the TUI under the Interface menu, or from the shell with "support_pkg.sh"; email me the file it generates.

Marc

Marian Knichala

unread,
Jul 15, 2017, 3:20:32 PM7/15/17
to esos-users
Hi,

maybe we found our mistake. Our servers have 64 GB of physical RAM, and in our configuration we increased the maximum ZFS ARC size to 60 GB. Normally ZFS should shrink the ARC if other system processes need RAM too, and we saw that ZFS had already reduced the ARC max size to 57 GB. After the crashes we set the maximum to 50 GB, and since then it has been stable.
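For anyone hitting the same thing, here is a small sketch of the arithmetic behind that 50 GB cap, assuming the stock ZFS-on-Linux `zfs_arc_max` module parameter (which takes plain bytes; ESOS may expose this through its own config instead):

```python
# Sketch: compute the zfs_arc_max value for a given ARC cap in GiB.
# On a 64 GiB box, a 50 GiB cap leaves ~14 GiB of headroom for SCST,
# ZFS metadata, and the rest of the system; the earlier 60 GiB cap
# left only ~4 GiB.
GIB = 1024 ** 3

def arc_max_option(cap_gib: int) -> str:
    """Render a modprobe options line for the given ARC cap."""
    return f"options zfs zfs_arc_max={cap_gib * GIB}"

# e.g. for /etc/modprobe.d/zfs.conf on a stock ZFS-on-Linux system:
print(arc_max_option(50))   # options zfs zfs_arc_max=53687091200
```

The file path and mechanism above are an assumption from stock ZFS on Linux, not something confirmed for ESOS in this thread.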

Thanks for your support.
Greetings Marian

Patrick

unread,
Jul 17, 2017, 2:53:49 PM7/17/17
to esos-users
Today we had another fatal failure.

I don't know the exact time (no log entry), but one disk of our raidz2 pool failed.

        NAME                                 STATE     READ WRITE CKSUM
        STORAGE01_RAIDZ2_POOL_062017         DEGRADED     0     0     0
          raidz2-0                           DEGRADED     0     0     0
            ata-ST2000DM001-1CH164_W1E5H5N4  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z1E5G4KL  FAULTED      0    96     0  too many errors
            ata-ST2000DM001-1CH164_Z1E5G4Z3  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z1E5G6M8  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z1E5QYP4  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z1E62SLQ  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z1E6BGZ5  ONLINE       0     0     0



From 05:58:43 onwards we got these log entries:

Jul 17 05:58:43 storage01 kernel: [764231.342106] dev_vdisk: ***ERROR***: BLOCKIO for cmd ffff880fcff67248 finished with error -28
Jul 17 05:58:43 storage01 kernel: [764231.342110] dev_vdisk: ***ERROR***: BLOCKIO for cmd ffff88092de68d88 finished with error -28
Jul 17 05:58:43 storage01 kernel: [764231.350739] dev_vdisk: ***ERROR***: BLOCKIO for cmd ffff880fcff65848 finished with error -28


All connected hosts/initiators went mad:
Jul 17 05:59:43 pve01 kernel: [60249.511130] o2net: Connection to node pve03 (num 3) at 172.16.105.3:7777 has been idle for 30.80 secs.
Jul 17 05:59:46 pve01 kernel: [60252.459500] o2cb: o2dlm has evicted node 3 from domain 8E19F406F1F045A69BC428A75E44ED86
Jul 17 05:59:46 pve01 kernel: [60252.947085] o2dlm: Waiting on the recovery of node 3 in domain 8E19F406F1F045A69BC428A75E44ED86
Jul 17 05:59:46 pve01 kernel: [60252.947101] o2dlm: Waiting on the recovery of node 3 in domain 96DF8C8A72094F4C96DBDF8B13A8936B

Jul 17 05:58:08 pve03 pvedaemon[27353]: <root@pam> successful auth for user 'srueck@pve'
Jul 17 08:45:35 pve03 kernel: [   75.385237] (ocfs2rec-96DF8C,2918,15):__ocfs2_recovery_thread:1437 ERROR: Error -5 recovering node 4 on device (251,3)!

The last host (pve03) crashed completely; there are no log entries for almost 3 hours.
The degraded zpool is exported as one vdev_blockio device carrying OCFS2.

As of now, that LUN still cannot be seen by some hosts/initiators.

On another host/initiator, the storage01 log entries above appear when mounting or unmounting the LUN.

I'm desperate.

Regards
Patrick

Marc Smith

unread,
Jul 17, 2017, 3:13:55 PM7/17/17
to esos-...@googlegroups.com
Error code 28 is "No space left on device"... ZFS allows you to create
volumes bigger than the amount of free space in a ZFS storage pool
(e.g., thin provisioned). Any chance your volume(s) are larger than the
amount of free space in your ZFS storage pool, and now your ZFS storage
pool is out of free space? If so, I suppose the solution would be to
add space (disks) to the storage pool?
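The "-28" in the dev_vdisk BLOCKIO errors is a negated kernel errno; the mapping can be confirmed from the standard errno tables (a quick sketch, nothing ESOS-specific):

```python
# Decode the "-28" from the BLOCKIO errors: kernel code returns
# negated errno values, so -28 corresponds to errno 28.
import errno
import os

code = 28
print(errno.errorcode[code])   # ENOSPC
print(os.strerror(code))       # No space left on device
```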

--Marc

Patrick

unread,
Jul 17, 2017, 4:06:31 PM7/17/17
to esos-users
You are right. Shame we didn't check that. But why is this causing so much trouble in the entire SAN?

Marc Smith

unread,
Jul 17, 2017, 4:11:36 PM7/17/17
to esos-...@googlegroups.com
Can't say for sure unless we look at all of the initiators, but in my
experience issues on one set of storage cause issues for the
initiators, even when there are no issues with other target systems.
ESXi, for example, does not handle APD (all paths down) well at all:
even though you have other storage that is available and working, ESXi
itself freaks out and becomes unmanageable, even if the missing
storage re-appears after some time. A reboot is the only solution when
ESXi gets this way.

I'm not familiar with your hypervisor solution, but perhaps it too
does not handle this well.


--Marc

On Mon, Jul 17, 2017 at 4:06 PM, Patrick <numb...@gmx.de> wrote:
> You are right. Shame we didn't checked that but why is this causing so much
> trouble in the entire SAN?
>

Marc Smith

unread,
Jul 18, 2017, 1:53:44 PM7/18/17
to esos-...@googlegroups.com
Also, see this: https://github.com/zfsonlinux/zfs/issues/5334

Sounds like the actual amount of available space is not shown by the
"zpool" command (it only reports data that has been written), but you
can use the "zfs list" command to see the available space for a pool's
datasets. Sounds like a number of other ZFS users got burned by this
"feature" as well. =)
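A toy illustration of the trap, with made-up numbers (the `overprovisioned` helper and the figures are hypothetical, not this pool's real layout): sparse zvols can add up to more than the pool can actually hold, and writes then fail with ENOSPC once the pool fills.

```python
# Toy over-provisioning check. Sparse (thin-provisioned) volumes are
# allowed to exceed the pool's usable capacity in total; the shortfall
# only surfaces as ENOSPC when enough data has actually been written.
TIB = 1024 ** 4

def overprovisioned(usable_capacity: int, volume_sizes: list[int]) -> bool:
    """True if the volumes, if fully written, would exceed pool capacity."""
    return sum(volume_sizes) > usable_capacity

# e.g. a ~9 TiB usable pool carrying 8 TiB + 4 TiB of sparse volumes
print(overprovisioned(9 * TIB, [8 * TIB, 4 * TIB]))   # True
print(overprovisioned(9 * TIB, [4 * TIB, 4 * TIB]))   # False
```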


--Marc