Extending RubiX for storage other than HDFS or S3


Manish Malhotra

Dec 16, 2018, 11:31:40 PM
to RubiX
Hello,

This question is about extending RubiX.
I'm trying to understand what is needed to make it work with another storage layer; theoretically speaking, it could be anything.
I believe it's possible to extend RubiX for custom storage,
where the custom storage either 1. implements the Hadoop FileSystem class, or 2. does not.

So, is a FileSystem implementation mandatory, and how do you configure RubiX to use the custom storage layer?


Thanks for the help here.

-Manish


Shubham Tagra

Dec 17, 2018, 11:33:42 PM
to Manish Malhotra, RubiX
Hi Manish,

Rubix was built for the Hadoop ecosystem. The locality-based scheduling used to achieve a higher hit rate is tightly bound to Hadoop FS semantics.
You could use the core parts of Rubix to build a generic cache, but you would also have to write a scheduler in your clients that could
schedule data reads according to Rubix's layout, which I believe would not be an acceptable solution unless you are building a customized version
of Rubix that works with only one client.

So yes, it is mandatory to use FS implementations in Rubix. It is trivial to extend to a storage layer that already has a Hadoop FS implementation;
check the CachingNativeAzureFileSystem class for reference.
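As for configuration, the usual mechanism is Hadoop's `fs.<scheme>.impl` property, which maps a URI scheme to a FileSystem class. A rough illustration only: the `mystore` scheme and the class name below are hypothetical placeholders, not actual Rubix classes; substitute the caching FS class that wraps your storage's own Hadoop FileSystem.

```xml
<!-- core-site.xml (illustrative): route the hypothetical mystore:// scheme
     through a caching FileSystem. "com.example.CachingMyStoreFileSystem" is a
     placeholder name, not a real Rubix class. -->
<property>
  <name>fs.mystore.impl</name>
  <value>com.example.CachingMyStoreFileSystem</value>
</property>
```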

--
You received this message because you are subscribed to the Google Groups "RubiX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rubix-users...@googlegroups.com.
To post to this group, send email to rubix...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rubix-users/fe3aae14-da1c-4b92-a689-07ad615fbf6c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Thanks,
Shubham

Manish Malhotra

Jan 4, 2019, 3:06:58 AM
to RubiX
Thanks, Shubham!

Manish Malhotra

Jan 15, 2019, 4:02:05 PM
to RubiX
Shubham,

Say we are not worried about data locality, as we would like to have a dedicated RubiX cache cluster serving multiple clusters/jobs.
For example, a workflow of jobs where the first one reads data and generates output/aggregates, which the remaining four jobs then use to produce different outputs.
These jobs could run on on-demand clusters rather than a dedicated Yarn/Presto cluster.
And if the data resides on, say, S3, then these jobs can access RubiX data using the FS implementation provided by RubiX, as with HDFS (rubix:///).

Also, I understand that right now RubiX is not designed to run as a separate service or dedicated cluster; it is implemented for Yarn or Presto.
I'm assuming that if we can write another RubiX Master, we can achieve the above requirement.

I have a few questions about the current implementation of RubiX as well (related to both of the changes we are looking for: 1. a RubiX service, and 2. use across different jobs/clusters):

1. When RubiX is used, does every data/HDFS/S3 read request go through the Yarn master, or do read requests go directly to the RubiX child nodes? My understanding is that the master node just uses the cluster information to find which nodes have the RubiX service running; if the data is available it routes the request to that node, otherwise it gives the request to a less loaded node, which fetches the data from the underlying storage and caches it locally.
This would also make sense, as RubiX doesn't have a single point of failure the way Alluxio does,
though RubiX uses the Yarn master to achieve the routing.

2. My understanding is that each RubiX child node's metadata is contained in that node. Does it also know about the metadata of the other nodes in the cluster, like a Cassandra ring?

3. If we build RubiX as a cache service, could it then also have a single point of failure like Alluxio?



thanks,
Manish

Goden Yao

Jan 15, 2019, 5:58:08 PM
to Manish Malhotra, RubiX
"We are not worried about data locality": this seems odd to me, as the whole purpose of a caching system is data locality and reducing I/O latency.

Arguably, the intra-cluster network I/O may still be faster than cloud storage I/O.
Intra-cluster communication through another master seems to complicate the current use cases.

Architecturally, this design also introduces another external dependency.
I've answered your questions inline below...


On Tue, Jan 15, 2019 at 1:02 PM Manish Malhotra <manish.mal...@gmail.com> wrote:
Shubham,

Say we are not worried about data locality, as we would like to have a dedicated RubiX cache cluster serving multiple clusters/jobs.
For example, a workflow of jobs where the first one reads data and generates output/aggregates, which the remaining four jobs then use to produce different outputs.
These jobs could run on on-demand clusters rather than a dedicated Yarn/Presto cluster.
And if the data resides on, say, S3, then these jobs can access RubiX data using the FS implementation provided by RubiX, as with HDFS (rubix:///).

Also, I understand that right now RubiX is not designed to run as a separate service or dedicated cluster; it is implemented for Yarn or Presto.
I'm assuming that if we can write another RubiX Master, we can achieve the above requirement.

I have a few questions about the current implementation of RubiX as well (related to both of the changes we are looking for: 1. a RubiX service, and 2. use across different jobs/clusters):

1. When RubiX is used, does every data/HDFS/S3 read request go through the Yarn master, or do read requests go directly to the RubiX child nodes? My understanding is that the master node just uses the cluster information to find which nodes have the RubiX service running; if the data is available it routes the request to that node, otherwise it gives the request to a less loaded node, which fetches the data from the underlying storage and caches it locally.
This would also make sense, as RubiX doesn't have a single point of failure the way Alluxio does,
though RubiX uses the Yarn master to achieve the routing.
[G] Your understanding is correct; routing via the master is the core design of Rubix right now.

2. My understanding is that each RubiX child node's metadata is contained in that node. Does it also know about the metadata of the other nodes in the cluster, like a Cassandra ring?
[G] Metadata: do you mean which node contains which file piece? I believe that information is only on the master node.

--
-Goden

Goden Yao

Jan 15, 2019, 6:43:39 PM
to Manish Malhotra, RubiX
Sorry, let me correct my answers, as I was confused by the first question:

1. When RubiX is used, does every data/HDFS/S3 read request go through the Yarn master, or do read requests go directly to the RubiX child nodes? My understanding is that the master node just uses the cluster information to find which nodes have the RubiX service running; if the data is available it routes the request to that node, otherwise it gives the request to a less loaded node, which fetches the data from the underlying storage and caches it locally.
 - You seem to be describing Alluxio's design. The Rubix daemon on the master node doesn't do any routing. When a worker node needs to access a file block, the Rubix daemon on that worker node knows whether it itself (local) or another node in the cluster contains the block; it doesn't go through the master.

2. My understanding is that each RubiX child node's metadata is contained in that node. Does it also know about the metadata of the other nodes in the cluster, like a Cassandra ring?
- My previous answer assumed you were talking about NameNode functionality. If by "metadata" you mean the location of RubiX-cached files: every node invokes a consistent hashing function to locate which node may contain a given file.
Every node keeps a list of all the active nodes in the cluster that may be used for RubiX.
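As an illustration of that lookup (a toy sketch, not Rubix's actual hash function or node-membership code), each node can independently map a file key onto the same owner with a consistent-hash ring built over the shared list of active nodes:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: maps a key (e.g. a file path) to one of the
    active nodes. Every node holding the same node list computes the same
    owner, so no central lookup is needed."""

    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
# Any node computes the same owner for the same file, independently.
print(ring.owner("s3://bucket/table/part-0001.parquet"))
```

One useful property of this scheme: when a node leaves the ring, only the keys that node owned move elsewhere; every other key keeps its owner, so membership changes don't reshuffle the whole cache.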

I think it would be better if we published a RubiX architecture and data-flow overview on the doc website.
--
-Goden

Shubham Tagra

Jan 15, 2019, 10:55:16 PM
to Goden Yao, Manish Malhotra, RubiX
An architecture wiki would be useful.

Rubix will need to be extended to support a separate Rubix cluster option, as the current architecture strictly depends on the co-existence of the engine and Rubix daemons. The following would be needed at the very least:
- Split generation would no longer need any input from Rubix; I guess this is why you mentioned locality not being a requirement.
- A RubixClusterManager to return the list of nodes, which engine workers will use to find the right owner of each block in the split.
- CachingInputStream to always issue non-local read requests to the Rubix node that was found to be the owner of the block.

To answer your questions, Manish:
1. On the master, the node that would own the file block is calculated and passed as a hint to the engine scheduler, so that the split is scheduled on that node if possible, to achieve as many local reads as possible (this is the locality part).
   But the scheduler is not bound to follow the hints, and in such cases non-local reads happen on the worker node.

2. The ownership of a block is calculated each time a node sees a split (be it master or worker), using the consistent hash Goden mentioned. This is what removes the bottleneck from the master node: during the actual reads, each worker can find the Rubix node it has to read data from without burdening the master. So a worker does not know about the metadata on other nodes, but it can calculate which node is the owner.

3. It wouldn't be any worse than the current implementation. See my comment at the start: we would continue to calculate block ownership at each node, and instead of the Yarn/engine master, use a new Rubix Cluster Manager to get the list of nodes.
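To make (1) and (2) concrete, here is a minimal sketch of the block-level read path. The block size, the function names, and the modulo hash standing in for the consistent hash are all assumptions for illustration, not Rubix's actual API:

```python
import hashlib

BLOCK_SIZE = 1 * 1024 * 1024  # assumed fixed cache-block size for this sketch

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def block_owners(path, offset, length, nodes):
    """Map every block covered by [offset, offset + length) to an owner node.
    Ownership is a pure function of (path, block number, node list), so any
    node in the cluster computes the same answer without asking a master."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    # Modulo hashing stands in for the consistent hash used in practice.
    return {b: nodes[_hash(f"{path}:{b}") % len(nodes)] for b in range(first, last + 1)}

def read(path, offset, length, nodes, local_node):
    """Decide, per block, between a local cached read and a non-local request."""
    for block, owner in sorted(block_owners(path, offset, length, nodes).items()):
        if owner == local_node:
            print(f"block {block}: local cached read on {local_node}")
        else:
            print(f"block {block}: non-local read request to {owner}")

read("s3://bucket/table/part-0001.parquet", 0, 3 * BLOCK_SIZE, ["w1", "w2", "w3"], "w1")
```

With scheduler hints honored, `local_node` is usually the owner and reads stay local; without hints (or with a separate cache cluster), every block simply becomes a non-local request to its computed owner.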





--
Thanks,
Shubham

Manish Malhotra

Jan 22, 2019, 3:23:30 PM
to RubiX


Goden, your understanding was correct for all of my questions; thanks for the answers!

Shubham, thanks for your answer/explanation as well!

Please see my comments/questions inline.

"An architecture wiki would be useful." Yes, it would be a great help!


On Tuesday, January 15, 2019 at 7:55:16 PM UTC-8, Shubham Tagra wrote:
An architecture wiki would be useful.

Rubix will need to be extended to support a separate Rubix cluster option, as the current architecture strictly depends on the co-existence of the engine and Rubix daemons. The following would be needed at the very least:
- Split generation would no longer need any input from Rubix; I guess this is why you mentioned locality not being a requirement.
Yes; if RubiX is a separate dedicated cache cluster (say, like Redis), then data locality will not be possible.
This would be helpful when the network is fast enough to access data from remote machines.

- A RubixClusterManager to return the list of nodes, which engine workers will use to find the right owner of each block in the split.
Makes sense. I looked at the code at a high level, and this should suffice for the requirements here.

- CachingInputStream to always issue non-local read requests to the Rubix node that was found to be the owner of the block.
Makes sense.
My other questions on this thread are:

1. As we said, consistent hashing can be used by any node to identify the owner of a file/block.
But what happens when the block is not cached yet? Would the node chosen by consistent hashing read the data from S3/HDFS, cache it, and return it?

2. Is there any replication of the data to mitigate the risk of hosts going down, or the service going down on hosts?

3. Currently, RubiX uses the Yarn/Presto cluster infrastructure to run the child-node services like BookKeeper.
So, if the RubiX daemons on a worker node go down, is there a way to restart them and maintain the lifecycle of the RubiX daemons?
And I assume that in the dedicated-service mode we would have to build the intelligence to manage the lifecycle of the RubiX daemons on the nodes ourselves?
That would not be trivial, and depends on how sophisticated we need it to be.

4. Since it uses consistent hashing, adding and removing nodes should not be a problem,
should not impact the whole cluster,
and should not create imbalanced nodes, correct?

5. On the distribution of data among RubiX nodes: if we have lots of small files and a few big files,
can that create imbalanced RubiX nodes (a few nodes caching the small files, and a few the large files)?

Regards,
Manish