Shubham,
Say we are not worried about data locality, since we would like a dedicated RubiX cache cluster serving multiple clusters/jobs.
For example, a workflow of jobs where the first one reads the data and generates output/aggregates, which the remaining four jobs then use to produce different outputs.
These jobs could run on on-demand clusters rather than on a dedicated YARN/Presto cluster.
And if the data resides, say, on S3, then these jobs can access the cached data through the HDFS-compatible FileSystem implementation provided by RubiX (the rubix:/// scheme).
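Something like the following core-site.xml wiring is what I have in mind for the clients; the property name and class name here are my assumption of how the RubiX FileSystem would be registered, not verified against the RubiX docs:

```xml
<!-- core-site.xml: register a RubiX caching FileSystem for the
     rubix:// scheme (property/class names assumed, check the docs) -->
<property>
  <name>fs.rubix.impl</name>
  <value>com.qubole.rubix.core.CachingFileSystem</value>
</property>
```

With that in place, any job on the on-demand clusters could open rubix:/// paths and transparently hit the shared cache cluster.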
I also understand that right now RubiX is not designed to run as a separate service or dedicated cluster; it is implemented for YARN or Presto.
I'm assuming that if we write another RubiX Master, we can achieve the above requirement.
I also had a few questions on the current implementation of RubiX (related to both of the changes we are looking for: 1. RubiX as a service, and 2. use across different jobs/clusters):
1. When RubiX is used, does every data/HDFS/S3 read request go through the YARN master, or do read requests go directly to the RubiX child nodes? My understanding is that the master node just uses the cluster information to find which nodes have the RubiX service running; if the data is cached on a node, it routes the request to that node, otherwise it hands the request to a less loaded node, which fetches from the underlying storage and caches it locally.
This would also make sense, as RubiX would then have no single point of failure the way Alluxio does, even though it relies on the YARN master for the routing work.
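For what it's worth, the routing I have in mind for point 1 is consistent hashing of (file, block) onto the live node list, so every client independently picks the same cache node without asking a master per read. A minimal sketch (node names and class structure are illustrative only, not RubiX's actual code):

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    # Stable hash so every client maps a given block to the same node.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps cache keys (file path + block id) onto cache nodes."""

    def __init__(self, nodes, replicas=100):
        # replicas = virtual points per node; smooths key distribution
        # and limits reshuffling when nodes join or leave.
        self.ring = sorted((_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def node_for(self, path: str, block: int) -> str:
        h = _hash(f"{path}:{block}")
        idx = bisect(self.keys, h) % len(self.keys)  # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
owner = ring.node_for("s3://bucket/data/part-0000", 7)
# The client reads the block from `owner`; on a cache miss, that node
# fetches the block from the underlying store (S3) and caches it locally.
```

If routing works roughly like this, the master only has to publish cluster membership, which fits the "no single point of failure on the read path" picture above.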
2. My understanding is that each RubiX child node's metadata is contained within that node. Does it also know about the metadata of the other nodes in the cluster, like a Cassandra ring?
3. If we build RubiX as a cache service, could it then also have a single point of failure, like Alluxio?
thanks,
Manish