Hyperopt Spark dependencies

Joseph Giovanelli

unread,

Feb 2, 2021, 12:28:19 PM2/2/21

to hyperopt-discuss

Could anyone provide me with the dependencies required to run Hyperopt with SparkTrials? They say they provide a script (https://github.com/hyperopt/hyperopt/blob/master/download_spark_dependencies.sh) but there is nothing.

Thanks

P.S. First I had the error :

ModuleNotFoundError: No module named 'pyspark'

Then I installed it via pip install pyspark, but now I have the error:

ModuleNotFoundError: No module named 'ml_pipeline'

But this time pip install ml_pipeline doesn't exist.

It should be a dependencies problem.

Brent Komer

unread,

Feb 2, 2021, 2:07:22 PM2/2/21

to Joseph Giovanelli, hyperopt-discuss

Hi Joseph,

I haven't used SparkTrials at all, but I found some info that might help:

It looks like that script was removed when upgrading to Spark 3.0.1 since this version doesn't require a manual download

https://github.com/hyperopt/hyperopt/pull/749/files

For reference, here is the file before it was removed (in case you need it for the specific version you are using)

https://github.com/hyperopt/hyperopt/blob/25bbb8a2a0ba0b2efd5fcb60218388fbdccd119e/download_spark_dependencies.sh

--
You received this message because you are subscribed to the Google Groups "hyperopt-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hyperopt-discu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hyperopt-discuss/e3c35da2-fef3-46d3-b51a-10242d4749ccn%40googlegroups.com.

Joseph Giovanelli

unread,

Feb 3, 2021, 3:54:32 AM2/3/21

to hyperopt-discuss

Hi Brent,

thanks for the reply.

Then I guess it wasn't a dependencies problem.

I ran the "python main.py" command as is, maybe I should do something different?

I checked and the $SPARKHOME variable is set, but I keep having the following error:

21/02/03 08:15:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using built[464/1868]

asses where applicable

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Because the requested parallelism was None or a non-positive value, parallelism will be set to (8), which is Spark's de$

ault parallelism (8), or the current total of Spark task slots (8), or 1, whichever is greater. We recommend setting pa$

allelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.

21/02/03 08:15:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]| 0/100 [00:00<?, ?trial/s, best loss=?$

org.apache.spark.api.python.PythonException: Traceback (most recent call last):

File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 587, in main

func, profiler, deserializer, serializer = read_command(pickleSer, infile)

File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command

command = serializer._read_with_length(file)

File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length

return self.loads(obj)

File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads

return pickle.loads(obj, encoding=encoding)

ModuleNotFoundError: No module named 'ml_pipeline'

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)

at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)

at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)

at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)

at scala.collection.Iterator.foreach(Iterator.scala:941)

at scala.collection.Iterator.foreach$(Iterator.scala:941)

at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)

at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)

at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)

at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)

at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)

at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)

at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)

at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)

at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)

at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)

at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)

at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)

at org.apache.spark.scheduler.Task.run(Task.scala:127)

at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

21/02/03 08:15:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ea8b46bfecb0, executor driver): org.apache.sp$

rk.api.python.PythonException: Traceback (most recent call last):