Hyperopt Spark dependencies

Joseph Giovanelli

Feb 2, 2021, 12:28:19 PM
to hyperopt-discuss
Could anyone provide me with the dependencies required to run Hyperopt with SparkTrials? The project supposedly provides a script for this (https://github.com/hyperopt/hyperopt/blob/master/download_spark_dependencies.sh), but the file is not actually there.

Thanks

P.S. First I had the error:
  • ModuleNotFoundError: No module named 'pyspark'
Then I installed it via pip install pyspark, but now I get:
  • ModuleNotFoundError: No module named 'ml_pipeline'
This time, however, there is no ml_pipeline package on pip, so it seems to be a dependency problem.
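
For context, my usage of SparkTrials is roughly the sketch below (the objective and search space are placeholders, not my actual code, which lives in a local ml_pipeline module):

from hyperopt import fmin, tpe, hp, SparkTrials

# Placeholder objective -- the real one imports code from my local
# ml_pipeline module, which is what the Spark executors cannot find.
def objective(x):
    return x ** 2

spark_trials = SparkTrials()  # no parallelism given, so Spark's default is used
best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=100,
    trials=spark_trials,
)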

Brent Komer

Feb 2, 2021, 2:07:22 PM
to Joseph Giovanelli, hyperopt-discuss
Hi Joseph,

I haven't used SparkTrials at all, but I found some info that might help:

It looks like that script was removed when upgrading to Spark 3.0.1, since that version no longer requires a manual download.
For reference, here is the file before it was removed, in case you need it for the specific version you are using.

Joseph Giovanelli

Feb 3, 2021, 3:54:32 AM
to hyperopt-discuss
Hi Brent,

Thanks for the reply.

I guess it isn't a dependency problem, then.
I ran the command "python main.py" as is; should I be doing something different?
I checked that the $SPARK_HOME variable is set, but I keep getting the following error:

21/02/03 08:15:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Because the requested parallelism was None or a non-positive value, parallelism will be set to (8), which is Spark's default parallelism (8), or the current total of Spark task slots (8), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
21/02/03 08:15:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 587, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'ml_pipeline'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
        at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
        at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
        at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/02/03 08:15:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ea8b46bfecb0, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
        [same Python traceback and Java stack trace as above]

21/02/03 08:15:44 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job


Joseph Giovanelli

Feb 3, 2021, 10:35:09 AM
to hyperopt-discuss
I solved the problem.
Actually, everything in my environment was fine; the issue was that I hadn't linked my own dependencies. In the end the solution was simple: I ran the main with spark-submit main.py, zipped the other .py files, and passed them with the --py-files option.
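
In case it is useful to anyone else, here is roughly what that looks like (a sketch; deps.zip and the module layout are placeholders for my own files):

# Roughly what I did on the command line:
#   zip -r deps.zip ml_pipeline/ <other local .py modules>
#   spark-submit --py-files deps.zip main.py
#
# A programmatic alternative is to ship the archive from inside main.py,
# so the executors can import ml_pipeline:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hyperopt-sparktrials").getOrCreate()
spark.sparkContext.addPyFile("deps.zip")  # distributes the zip to every executor

# ...then build SparkTrials and call fmin as before.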
