The first cell already failed, with the message:
ImportError: No module named 'sparkdl'
I installed that module as a cluster library and then got:
ImportError: No module named 'keras'
I installed this one as well, and then the same thing happened with tensorflow. At that point I got:
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
I tried different installation orders; in particular, since keras is built on top of tensorflow, I installed tensorflow before keras. Eventually I ended up with a list of all the required modules: sparkdl, tensorflow, tensorflowonspark, tensorframes, kafka, jieba, keras.
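For reference, here is a sketch of how the whole list could be installed from a notebook instead of the cluster library UI, assuming a Databricks runtime that exposes `dbutils.library` (I have not verified this on my cluster, and I am deliberately not pinning versions because I do not know which ones sparkdl expects):

```python
# Sketch only: install the whole package list from the notebook, assuming the
# runtime exposes dbutils.library. No versions are pinned here; the correct
# versions for sparkdl are exactly the part I am unsure about.
packages = ["tensorflow", "tensorframes", "tensorflowonspark",
            "kafka", "jieba", "keras", "sparkdl"]

for pkg in packages:
    dbutils.library.installPyPI(pkg)

# Restart the Python process so the newly installed packages become importable.
dbutils.library.restartPython()
```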
I wish this list were documented somewhere. Even after installing everything, I was still getting an error message:
AttributeError: module 'tensorflow' has no attribute 'Session'
As far as I know, `Session` is a core `tensorflow` API. Googling did not yield a solution for PySpark.
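My guess (and it is only a guess) is that the cluster picked up TensorFlow 2.x, where `Session` was removed from the top-level namespace and only survives under `tf.compat.v1`. This is the kind of check and workaround I had in mind, untested on the cluster:

```python
# Sketch: check which TensorFlow is actually installed and fall back to the
# 1.x-style API if it is a 2.x release (where tf.Session no longer exists).
import tensorflow as tf

print(tf.__version__)

if tf.__version__.startswith("2."):
    # The 1.x behaviour is still reachable through the compat module.
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

sess = tf.Session()  # should no longer raise AttributeError
```

Even if this silences the error, I suspect sparkdl really expects a 1.x TensorFlow, so pinning the version at install time may be the cleaner fix; I would appreciate confirmation.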
I found the Spark Deep Learning repository on GitHub and read the current recommendations, in case the answer is there: https://github.com/databricks/spark-deep-learning/blob/master/README.md
The README advises: "To work with the latest code, Spark 2.3.0 is required and Python 3.6 & Scala 2.11 are recommended". So I would need to create a cluster with these versions, but there is no such option when I create a cluster (see the attached picture); I can only choose Spark 2.4.* or 2.2.*.
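To compare against that recommendation, this is what I would run on the cluster to see which versions I actually have (just printing versions, nothing more):

```python
# Sketch: print the versions available on the cluster, to compare against the
# README's "Spark 2.3.0 / Python 3.6" recommendation.
import sys

print("Spark:", spark.version)            # `spark` is the SparkSession predefined in Databricks notebooks
print("Python:", sys.version.split()[0])
```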
Can somebody please help me?
Best,
Mya