Mounting local file-system in Alluxio


John Ouellette

Sep 14, 2017, 6:54:46 PM
to Alluxio Users
Hi -- we're experimenting with using Alluxio to serve data from multiple hosts.  Each host has its own local POSIX file-system, with different data on each host.  The idea is to use Alluxio (initially) to provide a unified namespace for the data stored on these nodes.

In my test set up, I have three containers (all using the 1.6.0-RC1 image created by Dustin Jenkins):
  • an Alluxio master container on a host which stores no data and has no file-systems in common with the other hosts. 
  • an Alluxio worker container on a host with data I would like to make available through the Alluxio unified namespace.  These data are available under /archive on the container, and alluxio.underfs.address=/archive on this container.
  • an Alluxio proxy container on the same host as the worker container is on.
On the worker, if I run 'alluxio fs mount /stor file:///archive' I get:

Ufs path /archive does not exist


However, if I switched 'file:///archive' for 'archive:///tmp', the mount worked.

After a bunch of testing I found that the original command succeeded if I created an empty '/archive' directory on the master.  However, 'alluxio fs ls /stor' returned nothing.

  1. Does the master need access (through NFS or other) to the same '/archive' directory that the worker has?
  2. If the above is true, if a file is retrieved by a client, does the file get served via the master or the worker?
  3. If the above is not true, what might I be missing in my config?
On the master, the contents of alluxio-site.properties are:

alluxio.user.file.writetype.default=CACHE_THROUGH

alluxio.underfs.address=/underStorage  <---- note: I was assuming that this made no difference on the master

alluxio.logs.dir=/opt/alluxio/logs

alluxio.master.hostname=test3


On the worker, the contents are:

alluxio.worker.memory.size=5GB

alluxio.underfs.object.store.mount.shared.publicly=true

alluxio.underfs.address=/archive

alluxio.worker.hostname=test1

alluxio.master.hostname=test3


Thanks for any insight you can give,

John Ouellette

Yupeng Fu

Sep 14, 2017, 10:32:18 PM
to John Ouellette, Alluxio Users
Hi John,

The under storage file system must be a distributed file system. The Alluxio master runs certain metadata commands on the under storage, such as the folder creation in your example.
If you want to mount local folders, you can export them over NFS instead of using the local file system directly.
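
The behaviour reported in the first post can be pictured with a toy model. This is only a sketch, not Alluxio code: two temporary directories stand in for the master's and the worker's local file systems, and the helper names (`ufs_dir_exists`, `ufs_listing`, `simulate`) are made up for illustration. It assumes the master checks a `file://` path against its own local file system, which is what the thread describes.

```python
import os
import tempfile


def ufs_dir_exists(host_root, ufs_path):
    """Check whether ufs_path is a directory in the file system rooted at host_root."""
    return os.path.isdir(os.path.join(host_root, ufs_path.lstrip("/")))


def ufs_listing(host_root, ufs_path):
    """List ufs_path as seen from the file system rooted at host_root."""
    return sorted(os.listdir(os.path.join(host_root, ufs_path.lstrip("/"))))


def simulate():
    """Replay the scenario with two temporary directories standing in for
    the master's and the worker's local file systems (hypothetical model)."""
    with tempfile.TemporaryDirectory() as master, tempfile.TemporaryDirectory() as worker:
        # Only the worker host holds /archive and its data.
        os.makedirs(os.path.join(worker, "archive"))
        open(os.path.join(worker, "archive", "file1"), "w").close()

        before = ufs_dir_exists(master, "/archive")   # the master cannot see /archive
        os.makedirs(os.path.join(master, "archive"))  # create an empty /archive on the master
        after = ufs_dir_exists(master, "/archive")
        return before, after, ufs_listing(master, "/archive"), ufs_listing(worker, "/archive")


print(simulate())  # (False, True, [], ['file1'])
```

In this model the mount check fails until the directory exists on the master, and the listing stays empty afterwards because the master only sees its own empty directory, not the worker's data.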

Hope this helps,

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alluxio-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

John Ouellette

Sep 15, 2017, 11:50:10 AM
to Alluxio Users
Thanks Yupeng -- That is what I was thinking, but I wanted to be sure.  

So, in the simplistic example I outlined above, we have node1 with some local data (/archive1).  If we NFS mount /archive1 from node1 on the master, we can use Alluxio to create a namespace (call it /stor1) with 'alluxio fs mount /stor1 file:///archive1' as long as node1:/archive1 is NFS mounted on the master as /archive1.  Is this correct?

If we extend the above example and add node2:/archive2 and node3:/archive3, we would need to NFS mount those data volumes on the master as well if we wanted to use an Alluxio namespace for each (e.g. /stor2 and /stor3).  Is this correct?  Am I correct in assuming that node2:/archive2 and node3:/archive3 would only need to be mounted on the master and not on node1 (and node1:/archive on node2 and node3, etc.)?  Am I also correct in assuming that, if we wanted to make a completely unified namespace with Alluxio for the data on all three of those nodes, we *would* have to cross-mount the local file-systems on all nodes?  

As an alternative to the above, I suspect it would make more sense to deploy several stand-alone Alluxio workers with no local data, which then NFS mount the data from the above nodes (which simplifies their role to just NFS servers) -- the Alluxio master would NFS mount the same volumes.  If each worker node had some available 'work space' (e.g. lots of memory, some SSD, and some HDD) it would allow Alluxio's tiered storage to be used more efficiently.  Is this a viable scenario?

Thanks again, and sorry for my naive questions.
John

Yupeng Fu

Sep 15, 2017, 12:03:10 PM
to John Ouellette, Alluxio Users
Hi John,

I want to clarify some concepts:
- Under storage is a distributed file system such as S3, HDFS, or NFS that can be mounted into the Alluxio namespace. The data on such a file system should be available to all Alluxio masters and workers, and the masters and workers assume that it is.
- Tiered storage is your local storage resources (MEM, SSD, and HDD) managed by Alluxio. You do not need to mount any of those local folders; Alluxio takes care of that.
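
To make the first point concrete for the three-node example discussed earlier, here is a small sketch that enumerates which NFS mounts would be needed if every Alluxio host must see every under storage path. The host names, the `exports` mapping, and the function `required_nfs_mounts` are hypothetical and only illustrate the counting; nothing here talks to Alluxio or NFS.

```python
def required_nfs_mounts(alluxio_hosts, exports):
    """Return (host, server, path) triples for every path a host must NFS mount.

    exports maps an under storage path to the host that stores it locally.
    A host that already stores a path locally does not need to mount it.
    """
    mounts = []
    for host in alluxio_hosts:
        for path, server in exports.items():
            if host != server:
                mounts.append((host, server, path))
    return sorted(mounts)


# Hypothetical layout: one master plus three workers that each hold local data.
exports = {"/archive1": "node1", "/archive2": "node2", "/archive3": "node3"}
hosts = ["master", "node1", "node2", "node3"]

mounts = required_nfs_mounts(hosts, exports)
for host, server, path in mounts:
    print(f"{host}: mount {server}:{path} at {path}")
print(len(mounts))  # 9
```

With four Alluxio hosts and three exported paths, each host needs all three paths, and three of those twelve are already local, which leaves nine cross-mounts.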

Hope this makes sense to you. It would also help to know the goal you are trying to achieve. If your goal is to migrate the local resources to Alluxio and have Alluxio manage them, then I suggest you use the bin/alluxio fs copyFromLocal command.

Cheers,

John Ouellette

Sep 15, 2017, 2:47:33 PM
to Alluxio Users

Hi Yupeng -- I think I'm starting to get the idea.  All 'under storage' must be available to the master for it to reference and operate on; the same under storage must also be made available to all workers that are registered with that master because the master may cause any of those workers to cache the files into Alluxio's operating space.  Is that correct?

If I can horribly mangle Alluxio's intent, could I summarize it as a fast cache layer between processing and slower storage?  The common namespace allows users to reference files the same way, regardless of what under storage they are located on, and regardless of whether they have been brought into Alluxio's operating storage (the tiered storage, which is or could be separate from the under storage) for access optimization.  Is that an accurate, albeit horribly simplified, view?

Our goal, which we were/are hoping to use Alluxio for, is to provide a common namespace for several *geographically separated* storage sites and, on top of that, to use a processing framework such as Flink to enforce data migration rules among those sites.  For example, take four sites: A, B, C and D.  All four have large data storage installations: sites A and B are running, e.g., Ceph locally, which presents an S3 interface; site C is running, e.g., GlusterFS; site D has a large monolithic file-system with minio on top for S3 API compatibility.  We'd like to be able to provide a common namespace for all four sites and ensure that some rule-defined portions of the data at each site are replicated to other sites (e.g. some data at Site A should be replicated to C and D, but not B; some data from Site B should be replicated to C and D, but not A; C and D should be replicated both ways (C<->D); data put into storage at either C or D in A's namespace should be replicated back to A).  A client should be able to go to (for example) /alluxio/A/sub/dir/file1 and not care whether the file is coming from site A, C, or D.  Admittedly, there are some processes in this description which are not part of Alluxio: that's what we're planning on implementing (hopefully using a framework like Flink).  We were looking into using Alluxio for the namespace aspect.
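
As a rough illustration of the example rules above, they could be written down as a small lookup table. This is only a sketch: the names `REPLICATE_TO` and `replication_targets` are made up, it ignores which "rule-defined portion" of the data a rule applies to, and it has nothing to do with Alluxio or Flink APIs.

```python
# Site where data is written -> sites that should receive a copy,
# taken from the example: A -> C, D; B -> C, D; C <-> D.
REPLICATE_TO = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"D"},
    "D": {"C"},
}


def replication_targets(written_at, namespace_owner):
    """Return the set of sites that should receive data written at one site.

    written_at is the site where the data landed; namespace_owner is the
    site whose namespace the path belongs to (e.g. "A" for /alluxio/A/...).
    """
    targets = set(REPLICATE_TO.get(written_at, set()))
    # Data put into storage at C or D in A's namespace is replicated back to A.
    if namespace_owner == "A" and written_at in {"C", "D"}:
        targets.add("A")
    targets.discard(written_at)  # never replicate a site to itself
    return targets


print(sorted(replication_targets("A", "A")))  # ['C', 'D']
print(sorted(replication_targets("C", "A")))  # ['A', 'D']
```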

Assuming the above is even remotely possible, it looks like we'd need to have a master which has access to A, B, C, and D, and workers which have the same access.  This might not be feasible.

John


Yupeng Fu

Sep 15, 2017, 9:36:31 PM
to John Ouellette, Alluxio Users
> All 'under storage' must be available to the master for it to reference and operate on; the same under storage must also be made available to all workers that are registered with that master because the master may cause any of those workers to cache the files into Alluxio's operating space.  Is that correct?

Yup, your understanding is correct. And I think your view of Alluxio is also close to our vision. You can draw an analogy to the Mac's Finder, where a newly added device shows up as a mounted folder: at the data center level, disparate storage systems can be mounted as different folders in Alluxio.
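
The mounted-folder analogy can be sketched as a mount table with longest-prefix resolution. This is a simplified illustration, not Alluxio's implementation: the `MountTable` class and the under storage URIs are hypothetical.

```python
import posixpath


class MountTable:
    """Toy mount table: maps namespace paths to under storage URIs."""

    def __init__(self):
        self._mounts = {}

    def mount(self, alluxio_path, ufs_uri):
        """Register ufs_uri under alluxio_path; reject duplicate mount points."""
        mount_point = posixpath.normpath(alluxio_path)
        if mount_point in self._mounts:
            raise ValueError(f"{mount_point} is already a mount point")
        self._mounts[mount_point] = ufs_uri

    def resolve(self, alluxio_path):
        """Translate a namespace path to a URI using the longest matching mount point."""
        path = posixpath.normpath(alluxio_path)
        best = None
        for mount_point in self._mounts:
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        base = self._mounts[best]
        relative = path[len(best):].lstrip("/")
        if not relative:
            return base
        return (base if base.endswith("/") else base + "/") + relative


table = MountTable()
table.mount("/stor1", "file:///archive1")        # an NFS-backed local path
table.mount("/A", "s3a://site-a-bucket/data")    # hypothetical bucket name
print(table.resolve("/stor1/sub/file1"))  # file:///archive1/sub/file1
print(table.resolve("/A/sub/dir/file1"))  # s3a://site-a-bucket/data/sub/dir/file1
```

A client only ever uses the left-hand paths; which storage system sits behind each one is a detail of the table.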

Alluxio works best when its workers are co-located with the compute clients. So I assume your compute framework should be able to access all these sites in order to apply the replication rules. If so, it should be feasible for Alluxio to access those sites, too. Does this make sense?

Cheers,

 

