How is Google Cloud Dataproc different from Databricks?


Amanpreet Khurana

May 24, 2019, 4:08:17 AM
to Google Cloud Dataproc Discussions
I'm looking for a way to build a real-time natural language processing application with Spark that can provide real-time Q&A to customers. Can this be done easily using Spark on Google Cloud Dataproc?

What is the latency for a job run in Google Cloud Dataproc? Job latency in Databricks is ~10-15 secs, which seems to be mostly Spark initialization time. Do we observe the same time lag in Google Cloud Dataproc?

Dustin Smith

May 24, 2019, 12:23:21 PM
to Google Cloud Dataproc Discussions
Hi Amanpreet,

Thanks for your question.  Sounds like there are several things you're trying to understand, but all centered around your main goal of building an NLP application for real-time customer Q&A. 

  • How is Google Cloud Dataproc different from Databricks?  At its core, Cloud Dataproc is a fully-managed solution for rapidly spinning up Apache Hadoop clusters (which come pre-loaded with Spark, Hive, Pig, etc.), with easy check-box options for including components like Jupyter, Zeppelin, Druid, Presto, etc.  It is very much meant to be a fast, easy, and cost-effective way to do ELT, Data Science, ML model training, etc. using familiar open-source technologies while bypassing the usual time-, money-, and people-intensive requirements of running traditional Hadoop clusters.  Another major benefit is that, being a product inside of Google Cloud, it can take advantage of our petabyte network and customizable VM and storage options, and it integrates directly with our other storage, AI, and analytics products.  Regarding Databricks, their foundational talent and expertise on Apache Spark is undisputed, and they've done an incredible job creating a world-class environment for data scientists (e.g. their notebook user experience is incredible).  So both products are great for what they're built for, and it's not uncommon for companies to use both depending on the use case.
  • Building an NLP app for real-time Q&A with customers:  Can you do this with Cloud Dataproc? Absolutely.  However, one of the advantages I mentioned above is that Dataproc lives inside an ecosystem of other Google Cloud products. Often someone starts out intending to build an application from the ground up using a technology they're well versed in, only to discover that Google Cloud provides a different technology/product/solution that could help build that app, even though it isn't necessarily connected to the technology they're familiar with.  Case in point here: you might want to check out Google Cloud's Natural Language solutions and see if they might help jump-start your project. If you'd like to get more hands-on, there is a really fantastic tutorial you might want to look at: How to build a conversational app using Cloud Machine Learning APIs
  • What is the latency for any job run in Google Cloud Dataproc?  This is highly dependent on how you choose to configure your Dataproc cluster and on other variables.  As I mentioned above, Dataproc gives you the ability to configure your cluster resources (size, speed, location, etc.), so you have a lot of control over how to optimize your clusters to support the application you're building.
There is a really fantastic Dataproc starter tutorial here in case you wanted to take 10 minutes and get a little deeper.
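Since job latency depends so heavily on configuration, the most reliable way to answer the question for a given setup is to measure it. Below is a minimal Python sketch of that idea, not an official Dataproc tool. The names `measure_latency` and `fake_job` are hypothetical and introduced here only for illustration; `fake_job` is a stand-in that you would replace with your own call that submits a job and blocks until it finishes.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(submit_job: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Time `runs` sequential calls to submit_job; return wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        submit_job()  # assumed to block until the job has finished
        samples.append(time.perf_counter() - start)
    return {
        # The first run is reported separately because it often includes
        # one-time startup cost (for example, Spark initialization).
        "first": samples[0],
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_job() -> None:
    # Hypothetical placeholder for a real blocking job submission.
    time.sleep(0.01)


if __name__ == "__main__":
    print(measure_latency(fake_job, runs=3))
```

Comparing the "first" value against the median across a few cluster configurations gives a rough picture of how much of the lag is startup overhead versus the work itself.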

Hope these answers help.  Let me know if I can provide additional resources.

Best,
Dustin

Amanpreet Khurana

May 28, 2019, 3:49:11 AM
to Google Cloud Dataproc Discussions
Thanks Dustin for such detailed information. 

I'll send more questions if I have any.