How do I load Spark-csv package in HUE

217 views

Anandha Loganathan

unread,
Nov 19, 2016, 5:05:13 AM
to Hue-Users

I am trying to load the Spark-csv package in HUE. I tried adding it as a jar file, and while running it throws an exception (screenshot attached).


This is my command in HUE:

dataset = sqlContext.read.format('com.databricks.spark.csv').load("/user/dwuser/v0/*.gz")


It throws an HTTP 500 error: "com.fasterxml.jackson.databind.JsonMappingException: Can not deserialize instance of scala.collection.immutable.List out of VALUE_STRING token\n at [Source: HttpInputOverHTTP@233fc482; line: 1, column: 177] (through reference chain: com.cloudera.livy.server.interactive.CreateInteractiveRequest[\"jars\"])"
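That JsonMappingException says Livy's CreateInteractiveRequest expects "jars" to deserialize into a List, so the 500 is consistent with the jar path being sent as a plain string instead of a JSON array. A minimal sketch of a well-formed session-creation body (the Livy endpoint and exact jar path are assumptions, not confirmed by this thread):

```python
import json

# Livy's CreateInteractiveRequest deserializes "jars" into a Scala List,
# so it must be a JSON array of strings -- a bare string produces the
# JsonMappingException / HTTP 500 quoted above.
payload = {
    "kind": "pyspark",
    "jars": ["/user/anand.ranganathan/spark-csv_2.11-1.5.0.jar"],  # array, not string
}
body = json.dumps(payload)

# POST this body to http://<livy-host>:8998/sessions with
# Content-Type: application/json (8998 is Livy's default port, an assumption).
print(body)
```

If Hue is building this request for you, the equivalent fix is making sure the jar is entered in the notebook's jars setting as a list entry rather than a single free-text string.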


I am not sure if I am doing it the right way. Can anyone help me find a solution to this problem?


Thanks
Anand
Screen Shot 2016-11-19 at 1.09.27 AM.png

Anandha Loganathan

unread,
Nov 19, 2016, 5:09:14 AM
to Hue-Users
I am running it using PySpark Notebook. 

penny chan

unread,
Nov 19, 2016, 7:17:50 AM
to Anandha Loganathan, Hue-Users
I think it is because of the JSON format in your data.

--
You received this message because you are subscribed to the Google Groups "Hue-Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hue-user+unsubscribe@cloudera.org.

AnandaLoganathan

unread,
Nov 19, 2016, 2:37:45 PM
to penny chan, Hue-Users
Hi Penny,

It works fine when we launch PySpark from the command line, but we want to integrate it with HUE and are testing that.

I feel the loading of the spark-csv package is failing. Are there any other configuration settings I need to take care of?



Any help is appreciated. 

Thanks
Anand

Anandha Loganathan

unread,
Nov 19, 2016, 8:30:30 PM
to Hue-Users
This seems to be a known issue; there is already a JIRA open for it.





AnandaLoganathan

unread,
Nov 19, 2016, 9:11:22 PM
to penny chan, Hue-Users
I am getting this error even though I have uploaded the jar file /user/anand.ranganathan/spark-csv_2.11-1.5.0.jar.

Error: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at $iwC$$iwC.<init>(<console>:19)
    at $iwC.<init>(<console>:24)
    at <init>(<console>:26)
    at .<init>(<console>:30)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at com.cloudera.livy.repl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:264)
    at com.cloudera.livy.repl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:264)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at scala.Console$.withOut(Console.scala:126)
    at com.cloudera.livy.repl.SparkInterpreter.executeLine(SparkInterpreter.scala:263)
    at com.cloudera.livy.repl.SparkInterpreter.com$cloudera$livy$repl$SparkInterpreter$$executeLines(SparkInterpreter.scala:238)
    at com.cloudera.livy.repl.SparkInterpreter$$anonfun$execute$1.apply(SparkInterpreter.scala:99)
    at com.cloudera.livy.repl.SparkInterpreter$$anonfun$execute$1.apply(SparkInterpreter.scala:96)
    at com.cloudera.livy.repl.SparkInterpreter.restoreContextClassLoader(SparkInterpreter.scala:279)
    at com.cloudera.livy.repl.SparkInterpreter.execute(SparkInterpreter.scala:96)
    at com.cloudera.livy.repl.Session.executeCode(Session.scala:96)
    at com.cloudera.livy.repl.Session.execute(Session.scala:77)
    at com.cloudera.livy.repl.ReplDriver$$anonfun$handle$1.apply$mcV$sp(ReplDriver.scala:72)
    at com.cloudera.livy.repl.ReplDriver$$anonfun$handle$1.apply(ReplDriver.scala:72)
    at com.cloudera.livy.repl.ReplDriver$$anonfun$handle$1.apply(ReplDriver.scala:72)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
    at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    ... 42 more
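The ClassNotFoundException means the spark-csv classes never made it onto the session's classpath, even though the jar is on HDFS. One common workaround (a sketch under the assumption that you control how the PySpark process is launched; this is not Hue-specific and not confirmed by this thread) is to let Spark resolve the package from Maven itself via --packages:

```python
import os

# Have Spark fetch spark-csv from Maven at startup instead of shipping
# a jar by hand. The coordinate below is the Scala 2.11 build of
# spark-csv 1.5.0 -- adjust it to match your cluster's Scala version.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.5.0 pyspark-shell"
)

# With the package resolved onto the classpath, the original read
# works unchanged:
# dataset = (sqlContext.read
#            .format('com.databricks.spark.csv')
#            .load('/user/dwuser/v0/*.gz'))
```

A Scala-vs-jar mismatch (e.g. a 2.11 jar on a Spark built against 2.10) would produce the same DefaultSource ClassNotFoundException, so checking the Scala version is worth doing either way.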

On Sat, Nov 19, 2016 at 5:37 PM, penny chan <penny...@gmail.com> wrote:
I think you can try a CSV file without JSON format, and then test uploading the CSV file with Scala. I am not sure about PySpark; Scala works fine for me.


Anandha Loganathan

unread,
Nov 21, 2016, 3:08:29 PM
to Hue-Users

Can anyone provide a solution for this? Is this feature supported by HUE? If not, we might have to look for an alternative solution.


I am not sure if my configuration is right, or whether I have to make some other configuration changes.

Thanks in advance.

Romain Rigaux

unread,
Nov 21, 2016, 3:45:26 PM
to Anandha Loganathan, Hue-Users

AnandaLoganathan

unread,
Nov 21, 2016, 4:46:21 PM
to Romain Rigaux, Hue-Users
Thanks Romain for the reply.

Does the JIRA you pointed to also apply to Scala/Spark?

I am trying to run the code using Scala, but I am getting this error:

Error: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.....)


AnandaLoganathan

unread,
Nov 21, 2016, 6:02:09 PM
to Romain Rigaux, Hue-Users
Hey Romain,

Do you have a workaround for loading libraries/dependencies for PySpark?
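One possible workaround (an assumption based on Livy's session-creation API, not something confirmed in this thread) is to declare the dependency in the session request itself via the spark.jars.packages property, so the driver starts with the package already resolved:

```python
import json

# Ask Livy to start the PySpark session with spark-csv pulled from
# Maven through Spark's spark.jars.packages property. The coordinate
# is the Scala 2.11 build of spark-csv 1.5.0 -- adjust as needed.
session_request = {
    "kind": "pyspark",
    "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.11:1.5.0"},
}
body = json.dumps(session_request)

# POST to http://<livy-host>:8998/sessions (default port, an assumption);
# every statement run in the resulting session then sees the package.
print(body)
```

If Hue exposes a "properties" or settings panel for the notebook session, putting the same spark.jars.packages key/value there should have the same effect.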

