Creating a ScoutFS filesystem that includes multiple block devices


Kevin Buterbaugh

Jun 12, 2020, 1:42:19 PM
to scoutfs developer email list
Hi,

I'm new to ScoutFS and am trying to set it up on one of our clusters to do some benchmarking.  The TL;DR version of my question is: how do I create a ScoutFS filesystem that consists of more than one shared block device?  Just specifying multiple block devices doesn't work, as ScoutFS only uses the first one.

In my specific case I've got a couple of NetApps and 4 servers, all connected via InfiniBand.  I've got four 8+2P RAID 6 LUNs on the NetApps that I'd like to make into one big ScoutFS filesystem.  Should I use something like mdadm to make a big RAID 0 device on each of my servers and pass that to scoutfs mkfs?

Once I have the filesystem created, I may have some followup questions on how I make sure the clients can access the filesystem and utilize all 4 I/O servers.

Thanks...

Kevin B.

Ben McClelland

Jun 12, 2020, 2:42:44 PM
to Kevin Buterbaugh, scoutfs developer email list
Hi Kevin,
Yes, mdadm using raid0 is the way to go here.  You can create the device on the first host with the create command, something like:
sudo mdadm --create /dev/md/scoutfs_0 --level=0 --chunk=1024 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd

and then you can get the rest of the hosts to see it with the following:
sudo mdadm --assemble --scan
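
Before running mkfs it's also worth double-checking that every host assembled the same array.  These are plain mdadm/procfs checks (nothing scoutfs-specific), so treat them as a sketch:

cat /proc/mdstat
sudo mdadm --detail /dev/md/scoutfs_0    # the Name and UUID lines should match on every host

If you want the array reassembled automatically at boot, you can also append the scan output to mdadm.conf, e.g. sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf (the exact config file path varies by distro).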

We are still working on stabilizing some of the cluster components over the next set of updates, so there may be some instability with cluster tests in the current RPMs.

Also, a word of warning: we have not set a stable on-disk format yet, so updating the scoutfs RPMs will likely result in the need to run mkfs again, wiping all data.
The next set of changes will include an increased metadata block size, so it will fall under a block format change.

Thanks for the interest in testing scoutfs.  We are very interested to hear feedback and test results.

thanks,
-Ben


Kevin Buterbaugh

Jul 7, 2020, 3:06:31 PM
to scoutfs developer email list, kevin.bu...@gmail.com
Hi Ben,

Thanks ... I have the RAID 0 device created and can create a filesystem on it:

[root@hazel-scout4 ~]# scoutfs mkfs -Q 2 /dev/md127
Created scoutfs filesystem:
  device path:          /dev/md127
  fsid:                 52383b3c6cb89ea8
  format hash:          e0c61ab2f6c93ec2
  uuid:                 65b098b3-8200-4cc9-8133-906a6ecdf7f9
  device blocks:        31251807232 (116.42 TB)
  metadata blocks:      6250366976 (23.28 TB)
  data blocks:          25001440256 (93.14 TB)
  quorum count:         2
[root@hazel-scout4 ~]#

However, when I try to mount it, the command hangs:

[root@hazel-scout4 ~]# mount -t scoutfs -o server_addr=hazel-scout4 /dev/md127 /mnt

Am I doing something wrong with the above?

Also, you may note that I'm running that command on "hazel-scout4" ... yes, there's a hazel-scout[1-3].  They can all "see" /dev/md127, the RAID 0 device I created.  What I want to accomplish is, first of all, to get the filesystem mounted on all 4 server nodes.  But once that's done, I will also want to mount it on a half-dozen client nodes.  I don't particularly want to "bind" a client to a particular server.  What I want is to have the clients talk to the servers in parallel (kind of like GPFS and Lustre) ... is that possible?

Thanks again...

Kevin

Kevin Buterbaugh

Jul 7, 2020, 3:15:11 PM
to scoutfs developer email list
I reran the mount with strace and I see:

access("/run/mount/utab", R_OK|W_OK)    = 0
readlink("/dev", 0x7fffffffa840, 4096)  = -1 EINVAL (Invalid argument)
readlink("/dev/md127", 0x7fffffffa840, 4096) = -1 EINVAL (Invalid argument)
readlink("/mnt", 0x7fffffffa7a0, 4096)  = -1 EINVAL (Invalid argument)
stat("/sbin/mount.scoutfs", 0x7fffffffb850) = -1 ENOENT (No such file or directory)
stat("/sbin/fs.d/mount.scoutfs", 0x7fffffffb850) = -1 ENOENT (No such file or directory)
stat("/sbin/fs/mount.scoutfs", 0x7fffffffb850) = -1 ENOENT (No such file or directory)
mount("/dev/md127", "/mnt", "scoutfs", MS_MGC_VAL, "server_addr=hazel-scout4"^C
strace: Process 232848 detached
 <detached ...>
[root@hazel-scout4 ~]#
[root@hazel-scout4 ~]# find /sbin -name mount.scoutfs -ls
[root@hazel-scout4 ~]#

Kevin Buterbaugh

Jul 7, 2020, 3:20:10 PM
to scoutfs developer email list
[root@hazel-scout4 ~]# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Fri Jun 12 12:54:29 2020
        Raid Level : raid0
        Array Size : 125007228928 (119216.18 GiB 128007.40 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Jun 12 12:54:29 2020
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 1024K

Consistency Policy : none

              Name : hazel-scout1:scoutfs_0
              UUID : 17dcfda8:6ac74bc6:c01d1cf7:2886c799
            Events : 0

    Number   Major   Minor   RaidDevice State
       0      65      208        0      active sync   /dev/sdad
       1      65      144        1      active sync   /dev/sdz
       2       8       80        2      active sync   /dev/sdf
       3       8       96        3      active sync   /dev/sdg
[root@hazel-scout4 ~]#

Ben McClelland

Jul 11, 2020, 12:06:01 AM
to Kevin Buterbaugh, scoutfs developer email list
Hi Kevin,
Sorry for the delay in responding.  We are still working on stabilizing the quorum protocol, and there are currently times when this can just hang.  The best way to test multiple hosts is to use a single quorum server and have the rest be non-quorum clients.  You can do this with the following on hazel-scout4:

scoutfs mkfs -Q 1 /dev/md127
mount -t scoutfs -o server_addr=hazel-scout4 /dev/md127 /mnt
(Also, I tend to use the IP address instead of the hostname above, since this is what specifies which interface to listen on if there is more than one.)

and then for the rest of the clients just mount like this:
mount -t scoutfs /dev/md127 /mnt

With a quorum of 1 there is no high availability of the quorum set (the quorum host must be mounted for any client to function), but the clients still function as they would in the other quorum cases.  We will have several cluster mount improvements coming in the near future.
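
If it helps to make those mounts persistent, a rough (untested) /etc/fstab sketch would be the following, with the address below just standing in for whichever IP the quorum mount should listen on:

/dev/md127  /mnt  scoutfs  server_addr=192.168.105.58  0 0    # quorum mount; replace with your quorum host's IP
/dev/md127  /mnt  scoutfs  defaults  0 0                      # non-quorum mounts on the other hosts

The md array has to be assembled before these mounts can run, so depending on boot ordering you may need something like the _netdev option or a small script that runs mdadm --assemble --scan first.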

thanks,
-Ben



Kevin Buterbaugh

Jul 13, 2020, 5:57:38 PM
to scoutfs developer email list, kevin.bu...@gmail.com
Hi Ben,

Thanks for the response.  I had to do the mount below specifying the IP address, as it still hung when I tried to specify the hostname.  Oddly enough, while it was hanging with the hostname, in the output of dmesg I saw:

[ 5225.816639] scoutfs f.b4823e.r.3fd531: server setting up at 8.0.0.0:0
[ 5225.816694] scoutfs f.b4823e.r.3fd531 error: server failed to bind to 8.0.0.0:0, err -99 (Bad address?)
[ 5225.830316] scoutfs f.b4823e.r.3fd531: server stopped at 8.0.0.0:0

Once it was mounted on hazel-scout4, I moved on to hazel-scout1-3 and decided to try:

[root@hazel-scout1 ~]# mount -t scoutfs -o server_addr=192.168.105.55 /dev/md127 /mnt
[root@hazel-scout1 ~]# df -h
Filesystem                            Size  Used Avail Use% Mounted on
192.168.105.1:/gmi/images/toss3-prod  2.2T  383G  1.7T  19% /
devtmpfs                               95G     0   95G   0% /dev
tmpfs                                  95G     0   95G   0% /dev/shm
tmpfs                                  95G  9.7M   95G   1% /run
tmpfs                                  95G     0   95G   0% /sys/fs/cgroup
tmpfs                                 500M  2.0M  499M   1% /ram
/dev/sda1                             549G   73M  521G   1% /localdisk
coldstart:/admin                      2.2T  383G  1.7T  19% /admin
/dev/md127                            117T  480M  117T   1% /mnt
[root@hazel-scout1 ~]#

That obviously worked.  But in the output of dmesg on hazel-scout4 I see:

[ 5395.785316] scoutfs f.b4823e.r.741384: server setting up at 192.168.105.58:0
[ 5395.785740] scoutfs f.b4823e.r.741384: server ready at 192.168.105.58:36411
[ 5395.885795] scoutfs f.b4823e.r.741384: client connected 192.168.105.58:60312 -> 192.168.105.58:36411
[ 5395.885860] scoutfs f.b4823e.r.741384: server accepted 192.168.105.58:36411 -> 192.168.105.58:60312
[ 5447.561856] scoutfs f.b4823e.r.741384: server accepted 192.168.105.58:36411 -> 192.168.105.57:54550
[ 5484.747408] scoutfs f.b4823e.r.741384: server accepted 192.168.105.58:36411 -> 192.168.105.56:58130
[ 5517.354579] scoutfs f.b4823e.r.741384: server accepted 192.168.105.58:36411 -> 192.168.105.55:60286

Does that mean that if I do I/O on hazel-scout1, for example, it's actually going to use hazel-scout4 as the server?

What I really want is for each of hazel-scout1-4 to do its own I/O.  And then I have some true clients - i.e. compute nodes with no access to the block storage - that I'd like to have "mount" the filesystem...

Thanks...

Kevin

Ben McClelland

Jul 13, 2020, 6:02:11 PM
to Kevin Buterbaugh, scoutfs developer email list
The server just handles coarse lock requests to enforce POSIX consistency.  All nodes in the scoutfs cluster do their own data and metadata I/O directly to the block devices.

Oh, it looks like we don’t yet support specifying hostnames in the server_addr option.  It was changing that name into an integer representation of the IP and trying to use that.
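
In the meantime, a simple workaround (standard tools, just a sketch) is to resolve the name yourself and pass the result:

ADDR=$(getent hosts hazel-scout4 | awk '{print $1}')    # whatever IP hazel-scout4 resolves to
mount -t scoutfs -o server_addr="$ADDR" /dev/md127 /mnt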

thanks,
-Ben


Kevin Buterbaugh

Jul 13, 2020, 6:22:01 PM
to scoutfs developer email list, kevin.bu...@gmail.com
Hi Ben,

Thanks - makes sense.  What do I do for the compute nodes that don't have access to the shared block device?  Do I set up hazel-scout1-4 as NFS servers or something like that?

Kevin

Ben McClelland

Jul 13, 2020, 7:20:12 PM
to Kevin Buterbaugh, scoutfs developer email list
Yes, NFS is the best path for this.
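
For what it's worth, a minimal sketch of that setup with the standard Linux NFS tooling (the service name and export options here are examples for a RHEL-style host; adjust for your distro and client network):

# on one of the scout nodes, e.g. hazel-scout4
echo '/mnt 192.168.105.0/24(rw,no_root_squash)' >> /etc/exports
systemctl enable --now nfs-server
exportfs -ra

# on a compute node
mount -t nfs hazel-scout4:/mnt /mnt/scoutfs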

Also, I wanted to make sure the quorum configuration is clear:
the -Q option during mkfs specifies the maximum number of quorum mounts
the -o server_addr option tells scoutfs that this mount is a quorum mount (and which interface, by IP, to use for cluster communications)

Once a majority of the quorum mounts have started, mounts can proceed and the lock leader role is determined.
Only one lock server runs at a time, on one of the quorum mounts, but if the lock server goes down this role will automatically fail over to another quorum mount as long as a majority remains.

It's very important that no more than the originally specified number of quorum mounts are mounted with the server_addr option; otherwise there can be a split-brain scenario and device corruption.
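
As a concrete illustration of those rules (just an example layout for once the quorum issues mentioned earlier are sorted out, using your 192.168.105.58 address for hazel-scout4 and placeholders for the others): if you were to mkfs with -Q 3, you could run exactly three quorum mounts and leave everything else as plain clients:

scoutfs mkfs -Q 3 /dev/md127                                      # run once; wipes the filesystem

# quorum mounts, one per quorum host, each passing its own IP
mount -t scoutfs -o server_addr=192.168.105.58 /dev/md127 /mnt    # on hazel-scout4
mount -t scoutfs -o server_addr=<scout3-ip> /dev/md127 /mnt       # on hazel-scout3
mount -t scoutfs -o server_addr=<scout2-ip> /dev/md127 /mnt       # on hazel-scout2

# non-quorum mount, no server_addr
mount -t scoutfs /dev/md127 /mnt                                  # on hazel-scout1 and any other clients

The lock server would run on one of the three quorum mounts and fail over between them as long as two of the three stay up.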

thanks,
-Ben
