Deploy Rubix with Presto


Manish Malhotra

Apr 4, 2019, 3:07:35 AM
to RubiX

We want to test RubiX with presto-sql or prestodb.

1. Are there any best practices or pointers for bundling RubiX with Presto?
There will be a new CachingFileSystem implementation, so we need to add that jar as well.

2. We are thinking of depending on the RubiX jars from Maven (below). Is that Maven repo up to date and suitable for use?

3. Is there a tar file or uber jar available for RubiX?

4. We checked rubix-admin, but using RPM is not easy in our environment.

Appreciate your time and help!

thanks,
-Manish

Abhishek Das

Apr 4, 2019, 6:49:05 PM
to Manish Malhotra, RubiX
Hi Manish,

Presto RubiX integration depends on which remote file system you are using. For S3, Presto does not allow overriding the fs.s3.impl configuration; we changed that in our internal Presto repo to make it work. For other schemes, you can try setting the config in your Presto config file. If you are using a scheme that we do not currently support, or an internal scheme of your own, then you need to create a file system class similar to rubix-presto/src/main/java/com/qubole/rubix/presto/CachingPrestoS3FileSystem.java and add its jar to the class path of the Presto server.

If you need to add a dependency on RubiX, you can add this to your pom:
<dependency>
    <groupId>com.qubole.rubix</groupId>
    <artifactId><RUBIX-MODULE></artifactId>
    <version><VERSION></version>
</dependency>
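For example, to pull in the Presto integration module (the artifact name and version here are just an illustration; the version matches the 0.3.3.1 release discussed later in this thread, but check Maven Central for the current coordinates):

<dependency>
    <groupId>com.qubole.rubix</groupId>
    <artifactId>rubix-presto</artifactId>
    <version>0.3.3.1</version>
</dependency>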

There is no tar file or uber jar. You can build the jars from the source code: just download the code and build it.

rubix-admin is just a tool for installing RubiX on a cluster and starting the daemons. If you are not able to use RPM, you can build the jars from source, ship them, and place them in the proper locations.

Hope this helps. Let me know if you have any other questions.

Regards,
Abhishek


Manish Malhotra

Apr 4, 2019, 7:42:32 PM
to Abhishek Das, RubiX
Thanks Abhishek!

Just wanted to make sure the Maven jars in the public repo are the latest and greatest :)
All modules were published in Nov 2018, which would correspond to this stable branch: https://github.com/qubole/rubix/tree/rubix-root-0.3.3.1

Yes, for our other scheme/FS we will implement something like the CachingPrestoS3FileSystem class.

regards,
-Manish





Manish Malhotra

Apr 5, 2019, 8:29:44 PM
to Abhishek Das, RubiX
Hi Abhishek,

Looks like we also need to modify the CacheType class in the rubix-spi project to add a new CustomCacheFileSystem, as that is used from the BookKeeper to initialize the CachingFileSystem.
And the BookKeeper will use the custom one to read data on the child nodes.
To do this, does the SPI need to be modified, built, and used?
Am I thinking in the right direction?

thanks,
- Manish


Manish Malhotra

Apr 8, 2019, 3:51:09 PM
to Abhishek Das, RubiX
Correcting myself: the above changes (CacheType) are not required :), that was my bad.
Since we just need to use a different scheme for caching with Presto, we only need to add the custom caching scheme and update the fs.rubix.impl variable.

The above changes would be required only if we wanted to use a custom CachingManager as well.

-Manish

Abhishek Das

Apr 10, 2019, 1:00:05 AM
to Manish Malhotra, RubiX
Not sure I got the use case right. You have a custom scheme and you want to use that scheme in the file path. If that is the case, you need a FileSystem class (e.g. Caching<CustomScheme>FileSystem) that extends CachingFileSystem, typed on your custom file system. Then you need to set the fs.<custom scheme>.impl config to Caching<CustomScheme>FileSystem.
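A minimal sketch of what such a class might look like, assuming a hypothetical scheme "cfs" backed by a hypothetical CustomFileSystem Hadoop FileSystem (the class and scheme names below are illustrative, not part of RubiX; imports from java.net, org.apache.hadoop.conf, and the RubiX core/spi packages are omitted):

// Caching wrapper for a hypothetical custom scheme "cfs".
public class CachingCfsFileSystem extends CachingFileSystem<CustomFileSystem>
{
  @Override
  public String getScheme()
  {
    return "cfs"; // the custom scheme this FileSystem serves
  }

  @Override
  public void initialize(URI uri, Configuration conf) throws IOException
  {
    try {
      // Tell RubiX which cluster manager to use for node discovery.
      initializeClusterManager(conf, ClusterType.PRESTO_CLUSTER_MANAGER);
      super.initialize(uri, conf);
    }
    catch (ClusterManagerInitilizationException ex) {
      throw new IOException(ex);
    }
  }
}

You would then point the configuration at it, e.g. fs.cfs.impl=<your package>.CachingCfsFileSystem (again, hypothetical names).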

Hope this helps. If you want we can setup a call and discuss the procedure. Let me know if this works for you.

Regards,
Abhishek

Manish Malhotra

Apr 10, 2019, 1:11:32 PM
to Abhishek Das, RubiX
Hi Abhishek,

Yeah, that part was clear and I have my custom CachingFS done.
The use case is to run presto-sql or prestodb (both open source versions) with RubiX, and it will use a custom scheme as the underlying storage.

So now I need to start the BookKeeper server with PrestoClusterManager and use the custom CachingFS.

For that:
1. I need to configure the BookKeeper server to use PrestoClusterManager instead of HadoopClusterManager (YARN based).
Is there any specific property to set to use PrestoClusterManager?

2. Also, what I see is that PrestoClusterManager needs to know the Presto master information so that it can get the worker node information.
Since the BookKeeper runs on the coordinator and the workers, how do we configure the BookKeeper daemons to talk to the Presto master/coordinator?

3. To use the custom CachingFS, as you mentioned, we can use fs.rubix.impl=CachingCustomFS.
That part is clear.

Regards,
- Manish





Abhishek Das

Apr 10, 2019, 3:37:26 PM
to Manish Malhotra, RubiX
Hi Manish,

Please find my answers below.

1. I need to configure the BookKeeper server to use PrestoClusterManager instead of HadoopClusterManager (YARN based).
Is there any specific property to set to use PrestoClusterManager?

Your CachingCustomFS should have this code:

@Override
public void initialize(URI uri, Configuration conf) throws IOException
{
  try {
    initializeClusterManager(conf, ClusterType.PRESTO_CLUSTER_MANAGER);
    super.initialize(uri, conf);
  }
  catch (ClusterManagerInitilizationException ex) {
    throw new IOException(ex);
  }
}

This is where we specify which cluster manager to initialize.

2. Also, what I see is that PrestoClusterManager needs to know the Presto master information so that it can get the worker node information.
Since the BookKeeper runs on the coordinator and the workers, how do we configure the BookKeeper daemons to talk to the Presto master/coordinator?

Check the ClusterUtil.getMasterHostName method, where we read the master host name from the master.hostname configuration. The Presto server on every node should have that config set.
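For instance, assuming the daemons read a Hadoop-style configuration file (the exact file depends on your deployment, and the hostname below is a placeholder), the entry on every node would look like:

<property>
  <name>master.hostname</name>
  <value>presto-coordinator.example.com</value>
</property>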

Hope this explains your questions.

Regards,
Abhishek

Manish Malhotra

Apr 12, 2019, 10:27:12 PM
to Abhishek Das, RubiX
Thanks Abhishek,

Please see my comments inline. The good news is that I mostly have things working, though I am still hitting some issues while testing with the presto-sql version. I hope RubiX works with presto-sql as well?

On Wed, Apr 10, 2019 at 12:37 PM Abhishek Das <ad...@qubole.com> wrote:
Hi Manish,

Please find my answers below.

1. I need to configure the BookKeeper server to use PrestoClusterManager instead of HadoopClusterManager (YARN based).
Is there any specific property to set to use PrestoClusterManager?

Your CachingCustomFS should have this code:

@Override
public void initialize(URI uri, Configuration conf) throws IOException
{
  try {
    initializeClusterManager(conf, ClusterType.PRESTO_CLUSTER_MANAGER);
    super.initialize(uri, conf);
  }
  catch (ClusterManagerInitilizationException ex) {
    throw new IOException(ex);
  }
}

This is where we specify which cluster manager to initialize.

Yes, I had similar code to PrestoCachingS3FS, so it initializes the PrestoCM.

2. Also, what I see is that PrestoClusterManager needs to know the Presto master information so that it can get the worker node information.
Since the BookKeeper runs on the coordinator and the workers, how do we configure the BookKeeper daemons to talk to the Presto master/coordinator?

Check the ClusterUtil.getMasterHostName method, where we read the master host name from the master.hostname configuration. The Presto server on every node should have that config set.

Yes, all the hosts will need this variable passed in via some config file.

One question though: does the host that runs the Presto coordinator also have to run the BookKeeper server (master mode) + the local data transfer daemon?
My understanding is that it should run only the BookKeeper server.

Manish Malhotra

Apr 14, 2019, 1:04:43 PM
to Abhishek Das, RubiX
Hi Abhishek,

Thanks. I think it would be helpful if we could meet or have a call; as we move forward, we are facing some issues/questions :)

Putting most of the latest ones here :)

1. For creating the CustomCachingFS, I had to override some of the abstract CachingFileSystem class methods as well, for example getFileStatus, because PrestoCachingS3FS uses PrestoS3FileSystem to handle the S3 scheme for Presto, but for our custom scheme a few things were different, so I had to do it.
I hope that is not a problem; do you see any issue with this approach?

2. "Presto Rubix integration depends on what remote file system you are using. For S3, Presto doesn't allow to override fs.s3.impl configuration."
What does this mean exactly? We don't use the S3 property; ours is a custom scheme, fs.<custom>.impl, so that should be OK?


3. Is there an easier way to test RubiX with Presto locally, apart from running all the daemons separately on my machine?

4. For now I'm testing locally with separate daemons for Presto.
When trying to run the following daemons locally, I'm facing some issues:

a. BookKeeper, master mode
b. BookKeeper, non-master mode
c. local data transfer
d. Presto (local mode: coordinator and worker in the same JVM)
e. Hive metastore service (Thrift server)
f. Derby DB in network mode for the Hive metastore DB

Issue with the BookKeeper daemons:

Not sure if I'm missing something, but this is the case: the BookKeeper master starts at, say, port 8899 based on hadoop.cache.data.bookkeeper.port:

public static int getServerPort(Configuration conf)
{
  return conf.getInt(KEY_SERVER_PORT, DEFAULT_SERVER_PORT);
}

When the BookKeeper worker starts, it first tries to connect to the heartbeat server on that port (using the same KEY_SERVER_PORT variable) and then tries to start its own server using the same variable, so I find it difficult to start both master and worker on one local machine.

Please let me know if it is possible to change some property so both can start locally.

5. If the BookKeeper master is down, I see that the BookKeeper worker doesn't start, as it keeps trying to connect to the master's heartbeat server, and until that succeeds it cannot start its own server.
So if the master heartbeat server is down, we can't start a BookKeeper worker.

I hit this locally: because of the port issue I couldn't start both services, and therefore my queries were always reading data from the remote FS instead of from the local RubiX cache.


thanks,
Manish






