Cross data with S3 and Hive


sambi...@gmail.com

May 30, 2016, 12:40:52 PM
to Crossdata Users
Hi, do you support Crossdata with S3 and Hive out of the box? Or can I use a Hive external table mapped to S3?

What if I need to do multi-step processing, as I generally do in Spark code? For example, I want to iterate over the result of the S3-backed Hive table and, based on some condition, change some data and upsert it to a target store like Cassandra.

Regards
-Sambit

Miguel Angel Fernandez

May 31, 2016, 5:04:01 AM
to Crossdata Users
Hi Sambit,

Crossdata can access S3 data in the same way SparkSQL does. You can map S3 data to a table in the XDContext just as is done with HDFS, simply by changing the URL. For instance:

  • CREATE TABLE csvS3Table USING com.databricks.spark.csv OPTIONS (path "s3n://myBucket/myFile.csv")
  • SELECT * FROM csvS3Table LIMIT 10
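As for the Hive external table mapped to S3: standard Hive DDL lets you point an external table at an S3 location, for instance (the table name, columns and bucket below are placeholders, not anything Crossdata-specific):

```sql
-- Hypothetical example: a Hive external table backed by S3
CREATE EXTERNAL TABLE events (id STRING, ts BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3n://myBucket/events/';
```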


As for the multi-step processing, it depends on how Crossdata is deployed. It can be deployed with a client-server architecture, in which case only SQL sentences can be launched from the driver. However, Crossdata can also be used as a Spark library. In that case you'll be running a Spark driver and, therefore, all the methods of the Spark API will be available (some of them optimized), plus some other methods that Crossdata adds. The easiest way to try Crossdata as a Spark library is via the Spark package:


I hope it helps



Sambit Dixit

Jun 6, 2016, 1:47:10 AM
to Crossdata Users
Thanks Miguel. Can I run a hybrid deployment, i.e. Crossdata server plus Spark library? I'm assuming that if I use the Spark library it will perform native SQL queries against the data sources rather than going through the Spark-based Crossdata context.

What is the recommended approach when I need to fire native queries plus Spark-based operations, mixing streaming and batch data with multi-step processing?

If I use the client-server mode, I have to pull the data to my client layer and then do the multi-step operations there, which is what I want to avoid.

Miguel Angel Fernandez

Jun 6, 2016, 5:07:48 AM
to Crossdata Users
Hi Sambit,

the client-server deployment also sets up the Spark library; that is, every Crossdata server runs a Crossdata Context, so native access is included out of the box. XDContext is an extension of Spark's SQLContext, and it makes use of (and improves) the SparkSQL parser, analyzer and optimizer. Thus, the use of native access is transparent to the user: Crossdata automatically analyzes each query to decide whether it can be resolved natively or not. Hence, native access is available directly from the XDContext.

As mentioned in the previous message, using the Crossdata client restricts you to SQL commands that are executed immediately, so you don't have the chance to do multi-step processing. However, using the Crossdata library you have all the methods of the SQLContext available and can therefore do multi-step processing before pulling the data.

I hope it helps

Sambit Dixit

Jun 6, 2016, 5:23:16 AM
to Crossdata Users
Thanks Miguel, yes it helps. The Crossdata server actually works like a Spark JobServer, right? What's the concurrency limit for executing jobs through the Crossdata server? How many contexts do you create to submit the SQL jobs?

Regards
-Sambit

Miguel Angel Fernandez

Jun 7, 2016, 5:30:19 AM
to Crossdata Users
There are as many contexts as there are Crossdata servers running in the cluster; in that sense, the set of Crossdata servers as a whole acts as a Spark JobServer.

Crossdata servers are built with Akka, so the concurrency limit is related to the heap space of the JVM in which each Crossdata server runs. For instance, you can create ~2.5 million actors to handle query requests and they will occupy about 1 GB of heap. A Crossdata server creates an actor for every query; thus, queries are queued in the Spark cluster, not in the Crossdata cluster.
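A quick back-of-the-envelope check of that figure (pure arithmetic, numbers taken from the paragraph above):

```scala
// ~2.5 million actors fitting in 1 GB of heap implies each query-handling
// actor costs on the order of a few hundred bytes.
def bytesPerActor(heapBytes: Long, actors: Long): Long = heapBytes / actors

val perActor = bytesPerActor(1L << 30, 2500000L)
println(perActor) // 429 -- roughly 430 bytes of heap per queued query
```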

Regards
-Miguel

Sambit Dixit

Jun 7, 2016, 6:47:22 AM
to Crossdata Users
Thanks Miguel. Do you have any benchmark suite for testing the concurrency? That would help me run it in our setup. I'm thinking of using Crossdata as the engine for "Data as a Service", where people can query all types of data. The sources of data could be:

1. Druid - for real-time events data, for event analytics. The access should be native, through the Plyql JDBC interface to Druid. Do we need to create a connector for this? How do I enable a JDBC connector in Crossdata for native access?

2. HBase or ScyllaDB (a Cassandra drop-in) - as a fast-access store. Use native mode only for point and range queries, both on one month of raw events and on the dimension and fact entities derived/prepared from the batch pipeline (Spark/Hadoop). In case anyone wants to run join queries, we can create a view in Crossdata and that should run through Spark.

3. Redshift / Hive - for OLAP queries on dimensions and facts. We may need to write some connectors for this. Any idea how to enable JDBC-based native access to Redshift? Do I need to write a connector for Redshift? Hive queries, I know, we can easily integrate.

4. Cross-data-source queries - create views in Crossdata, for example to join data between Druid + ScyllaDB + Redshift. This should probably be handled through Spark.

5. Interactive ad-hoc analysis on the deep history available on S3 + Hive - we can connect through Crossdata for this and Spark will handle it.

Another thing I want to ask: in our Spark cluster we are using Tachyon, so we store the intermediate processing state in Tachyon with tiered storage (memory -> SSD -> HDD) enabled. Our objective is not to use HDFS when reading data from S3, but rather the in-memory file system Tachyon, for fast access. Do you see any challenges in this setup?

Let me know your thoughts. 

Regards
-Sambit

Miguel Angel Fernandez

Jun 9, 2016, 6:42:55 AM
to Crossdata Users
Hi Sambit,

at the moment there is no benchmark suite, but it is on our roadmap to include one in a future release. In any case, Crossdata's concurrency is inherited from Akka: http://doc.akka.io/docs/akka/2.4.7/intro/why-akka.html

Crossdata makes use of most of Spark Catalyst, but it adds an extra phase in which the Spark LogicalPlan is analyzed to decide whether the query can be resolved natively using the data store's native driver. This analysis is possible because the Crossdata connectors are asked about their capability to resolve queries. To develop a Crossdata connector, there are some interfaces that should be implemented, depending on which Crossdata features the connector needs:


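Purely as an illustration of that capability negotiation (all of the names below are ours, not Crossdata's actual interfaces), the idea looks roughly like this:

```scala
// Hypothetical sketch: a connector advertises which logical operations it can
// resolve natively; the planner falls back to Spark when any operation in the
// plan is unsupported.
sealed trait Operation
case object Project extends Operation
case object Filter  extends Operation
case object Join    extends Operation

trait NativeCapabilities {
  def supports(op: Operation): Boolean
}

// e.g. a key-value store connector that can push down projections and filters
// but not joins
object KeyValueConnector extends NativeCapabilities {
  def supports(op: Operation): Boolean = op match {
    case Project | Filter => true
    case Join             => false
  }
}

// A plan is resolved natively only if every operation in it is supported;
// otherwise it runs through Spark.
def resolvableNatively(plan: Seq[Operation], c: NativeCapabilities): Boolean =
  plan.forall(c.supports)
```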
Once the connector is developed, it should be packaged as a jar file, and the path of that jar either added to the server parameter crossdata-server.config.spark.jars, or added to the XDContext with the addJar method when using Crossdata as a library.
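For the server side, that amounts to a one-line change in the server configuration (the key name is taken from the message above; the jar path, and the exact value syntax, are assumptions on our part):

```
# crossdata-server.conf fragment (HOCON); path is a placeholder
crossdata-server.config.spark.jars = ["/opt/crossdata/connectors/my-connector.jar"]
```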

By the way, information about the Crossdata vs Presto benchmark is already available here:


As for Tachyon, there shouldn't be any problem using it with Crossdata, given its full compatibility with the Spark technology stack.
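In practice that just means using a tachyon:// path when mapping the table, analogous to the s3n:// example earlier in the thread (the host and file name are placeholders; 19998 is Tachyon's default master port):

```sql
CREATE TABLE csvTachyonTable USING com.databricks.spark.csv OPTIONS (path "tachyon://tachyon-master:19998/myFile.csv")
```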

Regards
-Miguel

sambit...@olacabs.com

Jun 9, 2016, 7:20:14 AM
to Crossdata Users
Thanks Miguel. 