Hi All,
I'm testing BeeGFS 6.17 on CentOS 7.3.1611 with kernel 4.4.116.
The setup is a simple two-node cluster plus one client node.
The cluster status (before the failure test) was as follows:
# beegfs-net
mgmt_nodes
=============
localhost.localdomain [ID: 1]
meta_nodes
=============
localhost2.localdomain [ID: 1]
Connections: <none>
localhost.localdomain [ID: 2]
storage_nodes
=============
localhost2.localdomain [ID: 1]
localhost.localdomain [ID: 3]
# beegfs-ctl --listnodes --nodetype=storage --details
localhost2.localdomain [ID: 1]
Ports: UDP: 8003; TCP: 8003
Interfaces: enp0s3(TCP)
localhost.localdomain [ID: 3]
Ports: UDP: 8003; TCP: 8003
Interfaces: enp0s3(TCP)
The buddy mirror group is configured with target 301 as the primary and target 101 as the secondary:
# beegfs-ctl --listtargets --mirrorgroups
MirrorGroupID MGMemberType TargetID NodeID
============= ============ ======== ======
100 primary 301 3
100 secondary 101 1
# beegfs-ctl --listmirrorgroups --nodetype=storage
BuddyGroupID PrimaryTargetID SecondaryTargetID
============ =============== =================
100 301 101
The test writes 32 KB zero-filled files in an infinite loop from the client node:
# x=0; while true; do dd if=/dev/zero of=file$x bs=1K count=32; echo $x; x=$((x+1)); done
While the files were being written, I failed the primary storage target (301) by removing its device from the SCSI bus:
# echo "scsi remove-single-device 5 0 0 0" > /proc/scsi/scsi
The client immediately started reporting the following write error:
dd: error writing ‘file2000’: Remote I/O error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0186692 s, 0.0 kB/s
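To pin down exactly which file fails first and when, the write loop above could be changed to stop at the first dd error, which makes it easier to match the failure against the storage log timestamps. This is only a minimal sketch: `write_until_failure` is a hypothetical helper name, and the directory and file-count arguments are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical variant of the test loop: write 32 KB zero-filled files into
# directory $1 until dd fails or $2 files have been written.
write_until_failure() {
    dir=$1; max=$2; x=0
    while [ "$x" -lt "$max" ]; do
        # Same dd invocation as the original loop, with dd's statistics silenced.
        if ! dd if=/dev/zero of="$dir/file$x" bs=1K count=32 2>/dev/null; then
            # Print a wall-clock timestamp to compare with beegfs-storage.log.
            echo "$(date +%T) first failure at file$x"
            return 1
        fi
        x=$((x+1))
    done
    echo "wrote $x files without error"
}
```

For example, `write_until_failure /mnt/beegfs/test 100000` would run in the test directory and return as soon as a write fails.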
Surprisingly, the storage target status still showed target 301 as online, even when I queried it several minutes after the failure:
# beegfs-ctl --listtargets --nodetype=storage --state
TargetID Reachability Consistency NodeID
======== ============ =========== ======
101 Online Good 1
301 Online Good 3
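To see whether and when the reported state ever changes, the table printed by this command could be polled and filtered for a single target. The following is a minimal sketch that assumes the column layout shown above (TargetID first, Reachability second) and that `beegfs-ctl` is in the PATH; `target_reachability` and `poll_target` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: read a "--listtargets --state" table on stdin and
# print the Reachability column for the target ID given as $1.
target_reachability() {
    awk -v id="$1" '$1 == id { print $2 }'
}

# Poll once per second and print a timestamped state for target $1.
# Runs until interrupted (Ctrl-C).
poll_target() {
    while true; do
        state=$(beegfs-ctl --listtargets --nodetype=storage --state \
                | target_reachability "$1")
        echo "$(date +%T) target $1: ${state:-unknown}"
        sleep 1
    done
}
```

Running `poll_target 301` in a second terminal before removing the device would give a timeline of what the management daemon reports for the failed target.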
BeeGFS reports the following entry information for the file:
# beegfs-ctl --getentryinfo /mnt/beegfs/test/file2000
Path: /test/file2000
Mount: /mnt/beegfs
EntryID: A8-5A8D076C-2
Metadata buddy group: 100
Current primary metadata node: localhost.localdomain [ID: 2]
Stripe pattern details:
+ Type: Buddy Mirror
+ Chunksize: 1M
+ Number of storage targets: desired: 1; actual: 1
+ Storage mirror buddy groups:
+ 100
Meanwhile, the OS shows the file as empty:
# ls -lh /mnt/beegfs/test/file2000
-rw-r--r-- 1 root root 0 Feb 20 21:45 /mnt/beegfs/test/file2000
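To find out how many files ended up empty or truncated like this one, the test directory could be scanned for files whose size differs from the expected one. This is a minimal sketch, assuming every test file should be exactly 32768 bytes (32 x 1 KiB, as written by the dd loop); `find_short_files` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list regular files directly under directory $1
# whose size is not exactly $2 bytes.
find_short_files() {
    find "$1" -maxdepth 1 -type f ! -size "${2}c" -print
}
```

For example, `find_short_files /mnt/beegfs/test 32768` would list file2000 and any other file that did not receive its full contents.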
The storage daemon log (/var/log/beegfs-storage.log) shows the following errors:
(0) Feb20 21:44:26 Worker7 [SessionLocalFile (open)] >> Failed to open chunkFile: u0/5A8D/0/0-5A8D02F8-2/37F-5A8D0730-2
(0) Feb20 21:44:26 Worker12 [ChunkStore.cpp:682] >> Failed to create file. chunkFilePathStr: u0/5A8D/0/0-5A8D02F8-2/380-5A8D0730-2; retVal: Internal error (1)
(0) Feb20 21:44:26 Worker12 [SessionLocalFile (open)] >> Failed to open chunkFile: u0/5A8D/0/0-5A8D02F8-2/380-5A8D0730-2
(0) Feb20 21:44:26 Worker10 [ChunkStore.cpp:682] >> Failed to create file. chunkFilePathStr: u0/5A8D/0/0-5A8D02F8-2/381-5A8D0730-2; retVal: Internal error (1)
.
.
.
So my questions are:
1. Why did BeeGFS keep showing the primary target as online for several minutes after the failure?
2. Why was BeeGFS able to create the file but not write to it?
3. Why does the entry info (--getentryinfo) output show no problem with the file?
Regards,
Bishoy