Rubix Architecture?


Manish Malhotra

Nov 27, 2018, 2:19:33 AM
to RubiX
I'm trying to compare caching frameworks for the big data world. I see RubiX is lightweight and not a big, heavyweight solution.

Is there any description of the RubiX architecture?

A few questions come to mind:

What components are there? Is there any dependency on a database/RDBMS?
Can we deploy RubiX easily on Mesos-style infrastructure?
Does it need as much memory as disk?
How can the Spark scheduler benefit from the location of the data?
Can we pre-warm the cache?

Thanks for your time and help!

Regards,
Manish

ad...@qubole.com

Nov 28, 2018, 12:55:28 AM
to RubiX

What components are there? Is there any dependency on a database/RDBMS?
There is no dependency on a database/RDBMS.

Can we deploy RubiX easily on Mesos-style infrastructure?
Currently RubiX is not supported on Mesos, but it can be extended by implementing an appropriate ClusterManager.

Does it need as much memory as disk?
Currently RubiX is a file-based cache, so the memory requirement is not the same as the disk requirement. We run two daemons, and the memory requirement for those two is small.

How can the Spark scheduler benefit from the location of the data?
We provide a hint to the scheduler by overriding getFileBlockLocations in CachingFileSystem (which derives from the Hadoop FileSystem).
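For intuition, here is a plain-Java sketch of what that hint amounts to (illustrative only, not RubiX's actual code; the `BlockLocator` class, the 64 MB block size, and the simple hash-based node choice are all assumptions of this sketch): for each block of a file, report the cache node expected to hold it, so a locality-aware scheduler can place the task on that node.

```java
import java.util.*;

// Illustrative sketch of a locality hint (not RubiX's actual code).
// A real implementation would override FileSystem#getFileBlockLocations
// and return BlockLocation objects; here we just return one host per block.
class BlockLocator {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed cache block size

    private final List<String> nodes; // cluster nodes running the cache daemon

    BlockLocator(List<String> nodes) {
        this.nodes = nodes;
    }

    // For each block overlapping [start, start+len), name the node that
    // would cache it. The hash-based pick stands in for consistent hashing.
    List<String> blockHosts(String path, long start, long len) {
        List<String> hosts = new ArrayList<>();
        for (long off = (start / BLOCK_SIZE) * BLOCK_SIZE; off < start + len; off += BLOCK_SIZE) {
            String blockKey = path + ":" + off;
            hosts.add(nodes.get(Math.floorMod(blockKey.hashCode(), nodes.size())));
        }
        return hosts;
    }
}
```

A scheduler that honors such hints can run the task on the node that already holds the block, turning a remote object-store read into a local disk read.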

Can we pre-warm the cache?
The cache can be pre-warmed by running a select query on the tables used in the main query; there is no separate command to pre-warm the cache. But we have a feature that warms the cache asynchronously (no cache-warmup penalty). The feature is turned off by default and can be turned on by enabling rubix.parallel.warmup. This needs to be enabled on the Spark side (assuming you are using Spark) and on the BookKeeper daemon side.
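The rubix.parallel.warmup flag itself comes straight from the answer above; exactly where it is set is version-dependent. As a hedged sketch, Spark forwards properties carrying the `spark.hadoop.` prefix into the Hadoop configuration, so on the Spark side it might go in `spark-defaults.conf` as:

```
# Illustrative placement -- verify against your RubiX/Spark versions.
spark.hadoop.rubix.parallel.warmup  true
```

The same `rubix.parallel.warmup` property would also need to be set in the configuration read by the BookKeeper daemon.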

Hope this answers your questions. Feel free to write back if you have any more.

Abhishek

Manish Malhotra

Nov 28, 2018, 8:28:45 PM
to RubiX
Thanks, Abhishek, for the quick reply!
I'm trying to test it locally and am not finding it easy using the RPM,
so I was trying to start the BookKeeper and LocalDataTransferServer based on the rubix.service script.
But they require the Hadoop libraries to start; is there a way to start these services without any Hadoop dependencies?

Please see a few comments inline.

Plus I have a few more questions; some may take time to answer and are subjective, so whatever you can answer quickly is appreciated :)
I would also like to meet the RubiX developers to discuss in detail.

1. How is RubiX different from, say, Alluxio?
2. Does RubiX use the Amazon S3 APIs to use S3 as the underlying storage?
3. Is there any SPOF, such as BookKeeper or other components?
4. What is the largest scale at which you have seen RubiX tested, and what are the use cases?
My understanding is that Microsoft HDInsight is using RubiX?
5. Can RubiX swap out old data for new data if the cache is full?
6. If data is fetched from S3/Blob because it is not cached yet, is there any overhead compared to accessing the S3/Blob store directly?

thanks
- Manish


On Tuesday, November 27, 2018 at 9:55:28 PM UTC-8, ad...@qubole.com wrote:

Hi Manish.

Rubix architecture:


What components are there? Is there any dependency on a database/RDBMS?
There is no dependency on a database/RDBMS.

Can we deploy RubiX easily on Mesos-style infrastructure?
Currently RubiX is not supported on Mesos, but it can be extended by implementing an appropriate ClusterManager.

I see that BookKeeper is also implemented using the Hadoop Tools API; is there any specific reason for the dependency on the Hadoop APIs?
Or am I missing something, since you are talking about the ClusterManager?
I'm coming from a world where we don't want to run this in the Hadoop/YARN/EMR world but in, say, the Mesos world.


Does it need as much memory as disk?
Currently RubiX is a file-based cache, so the memory requirement is not the same as the disk requirement. We run two daemons, and the memory requirement for those two is small.
This is super good!

How can the Spark scheduler benefit from the location of the data?
We provide a hint to the scheduler by overriding getFileBlockLocations in CachingFileSystem (which derives from the Hadoop FileSystem).

OK, if Spark is running on Mesos/Kubernetes rather than on YARN, will that impact data locality?
In the Mesos or Kubernetes world there is no data-locality concept by default;
data and compute are separate.


 

Can we pre-warm the cache?
The cache can be pre-warmed by running a select query on the tables used in the main query; there is no separate command to pre-warm the cache. But we have a feature that warms the cache asynchronously (no cache-warmup penalty). The feature is turned off by default and can be turned on by enabling rubix.parallel.warmup. This needs to be enabled on the Spark side (assuming you are using Spark) and on the BookKeeper daemon side.

So basically the user has to run the queries pre-emptively before running the real one.

Manish Malhotra

Dec 4, 2018, 5:07:15 PM
to RubiX
Hey Abhishek,

Let me know if you need more information on any of the questions.

Regards,
Manish 

ad...@qubole.com

Dec 4, 2018, 5:57:21 PM
to RubiX



Hi Manish,

Please find my answers inline. 

But they require the Hadoop libraries to start; is there a way to start these services without any Hadoop dependencies?

We start the RubiX daemons with the hadoop jar command. Currently that is the only way to start the daemons.


1. How is RubiX different from, say, Alluxio?
RubiX is a lightweight, easily configurable caching solution. It doesn't have a centralized system; it uses consistent hashing to determine the membership between nodes and files.
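The consistent-hashing idea can be sketched like this (illustrative only, not RubiX's actual code; the MD5-based ring and the virtual-node count are assumptions of this sketch): each node is placed at several points on a hash ring, a file is owned by the first node at or after the file's hash, and removing a node only remaps the files that node owned.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Illustrative consistent-hash ring (not RubiX's actual code).
// Each node appears at several points ("virtual nodes") on a hash ring;
// a file is owned by the first node at or after the file's hash, wrapping
// around, so membership changes only remap the affected keys.
class Ring {
    private static final int VNODES = 64; // assumed; smooths the distribution

    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.remove(hash(node + "#" + i));
    }

    // Owner of a file: first node at or after the file's hash, wrapping around.
    String nodeFor(String fileKey) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(fileKey));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Because there is no central directory, every node can compute the same file-to-node mapping independently, which is what lets RubiX-style designs avoid a master.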
2. Does RubiX use the Amazon S3 APIs to use S3 as the underlying storage?
It doesn't use the Amazon S3 API directly. It uses the Hadoop FileSystem API and the S3 FileSystem class to fetch data from S3.
3. Is there any SPOF, such as BookKeeper or other components?
I'm not sure I got the question right. RubiX doesn't have a centralized component. If the BookKeeper daemon is not running on a node, the tasks on that node fall back to reading directly from the object store, so the job is not going to fail.
4. What is the largest scale at which you have seen RubiX tested, and what are the use cases?
A lot of Qubole's customers are using RubiX with Presto. It is mainly used for ad-hoc analysis.

The HDInsight blog states that they are using RubiX as a caching solution.

5. Can RubiX swap out old data for new data if the cache is full?
Do you mean cache eviction when the cache is full? If so, yes: RubiX takes care of evicting old data (based on LRU) to make space for new data.
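The LRU policy can be sketched with Java's LinkedHashMap in access order (illustrative only, not RubiX's actual eviction code, which evicts cached file blocks from local disk rather than entries in an in-memory map):

```java
import java.util.*;

// Illustrative LRU cache built on LinkedHashMap in access order
// (not RubiX's code): once capacity is exceeded, the least recently
// accessed entry is dropped to make room for the new one.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration order tracks recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once we grow past capacity
    }
}
```

The key property is the same as in the answer above: data that keeps being read stays cached, while cold data is the first to go when space runs out.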

6. If data is fetched from S3/Blob because it is not cached yet, is there any overhead compared to accessing the S3/Blob store directly?
No, there is no difference.


Let me know if you have any queries.

Regards,
Abhishek

Manish Malhotra

Dec 4, 2018, 8:15:23 PM
to RubiX
Thanks, Abhishek.

A few more fundamental questions:

>> Can I start a RubiX cache cluster as a standalone cluster, decoupled from the YARN/Spark/Presto cluster?
That is, if I start a new YARN/Spark/Presto cluster in AWS, can it still use the cached data if the cache lives outside my YARN/Spark/Presto cluster?

For example: an on-demand Spark/AWS cluster comes up and dies with each Spark job, but the data is cached outside the Spark cluster and can be read by the next Spark job/cluster.

My understanding is that this is how the Alluxio architecture works, running as a separate cluster.
   
thanks,
Manish

Goden Yao

Dec 6, 2018, 7:48:11 PM
to Manish Malhotra, RubiX
RubiX right now is not designed to run across clusters.

--
You received this message because you are subscribed to the Google Groups "RubiX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rubix-users...@googlegroups.com.
To post to this group, send email to rubix...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rubix-users/040b1909-82b7-477d-bdd2-28c63b2b8ae5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
-Goden

Manish Malhotra

Dec 7, 2018, 4:23:42 PM
to RubiX
Thanks, Goden.

I feel both within-cluster and across-cluster caching are valid use cases in today's world.

Based on the RubiX code, I see it's doable,
though it needs a lightweight master that knows about the cache cluster rather than the YARN one.

WDYT?

Goden Yao

Dec 7, 2018, 4:44:40 PM
to Manish Malhotra, RubiX
I think across-cluster caching (or dedicated-cluster caching) is certainly a valid use case.
When we first came up with the RubiX idea, it was tailored to cloud use cases where clusters are ephemeral and independent of each other.

If we want to make RubiX a cross-cluster caching system, a component like the NameNode service in Hadoop is certainly required, and we welcome ideas/contributions from the community.

From the end user's perspective, if users want total control and expect all other clusters to depend on a caching cluster, this is certainly doable. From the perspective of a service provider who may want to offer caching as a service, so that users don't have to worry about maintaining the caching cluster, I think this is also an appealing use case to explore.


--
-Goden

Manish Malhotra

Dec 8, 2018, 4:04:32 AM
to Go...@qubole.com, rubix...@googlegroups.com
Yeah, I think both flavors make sense, and both have their place.
As we now see with solutions like S3 Select and AWS Athena (which is Presto) on a pay-as-you-go model, it would be good to have the cache outside the analytics cluster as well.

"If we want to make RubiX a cross-cluster caching system, a component like the NameNode service in Hadoop is certainly required, and we welcome ideas/contributions from the community."

[Manish]
Yeah, I think this is a good idea. Alluxio is implemented in a similar way in this space (having a master node), though I believe RubiX depends on the YARN master not to maintain all the metadata but just for cluster information. I feel that a more Cassandra-like architecture might be better: there is no single master, and no load on a single process maintaining the metadata, like the NameNode, which can become a problem later on.

"From the end user's perspective, if users want total control and expect all other clusters to depend on a caching cluster, this is certainly doable. From the perspective of a service provider who may want to offer caching as a service, so that users don't have to worry about maintaining the caching cluster, I think this is also an appealing use case to explore."

[Manish]
Can't say I disagree :)

Plus, another relevant feature for RubiX could be tiered-storage support.

 