Crossposting from SO for more visibility, since audioread problems seem to come up fairly often for librosa users:
I have previously built pyspark environments using conda to package all dependencies and ship them to the nodes at runtime. Here's how I create the environment:
```
conda/bin/conda create -p conda_env --copy -y python=2 \
    numpy scipy ffmpeg gcc libsndfile gstreamer pygobject audioread librosa

zip -r conda_env.zip conda_env
```

Then, after sourcing conda_env and starting a pyspark shell, I can successfully execute:
```python
import librosa
y, sr = librosa.load("test.m4a")
```

Note that without the environment sourced, this script fails with an error, since ffmpeg/gstreamer are NOT installed locally on my machine.
Submitting a script to the cluster, however, results in a `librosa.load` error that traces back to `audioread`, indicating the backend (either gstreamer or ffmpeg) can no longer be found inside the zipped archive environment. The stack trace is below.
Submit:
```
PYSPARK_PYTHON=./NODE/conda_env/bin/python spark-submit --verbose \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NODE/conda_env/bin/python \
    --conf spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp \
    --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    --conf spark.hadoop.validateOutputSpecs=false \
    --conf spark.driver.cores=5 \
    --conf spark.driver.maxResultSize=0 \
    --master yarn --deploy-mode cluster --queue production \
    --num-executors 20 --executor-cores 5 --executor-memory 40G \
    --driver-memory 20G --archives conda_env.zip#NODE \
    --jars /data/environments/sqljdbc41.jar \
    script.py
```

Trace:
```
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "script.py", line 245, in <lambda>
  File "script.py", line 119, in download_audio
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/NODE/conda_env/lib/python2.7/site-packages/librosa/core/audio.py", line 107, in load
    with audioread.audio_open(os.path.realpath(path)) as input_file:
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/NODE/conda_env/lib/python2.7/site-packages/audioread/__init__.py", line 114, in audio_open
    raise NoBackendError()
NoBackendError
```

My question is: how can I package this conda archive so that librosa (really audioread) is able to find the backend and load .m4a files?
For reference, soundfile can't open the file directly either:

```
>>> import soundfile as sf
>>> y, sr = sf.read("test.m4a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 373, in read
    subtype, endian, format, closefd) as f:
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 1265, in _open
    "Error opening {0!r}: ".format(self.name))
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 1455, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'test.m4a': File contains data in an unknown format.
```
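That soundfile failure is expected: libsndfile (which soundfile wraps) has no AAC/.m4a decoder, so no amount of packaging will make `sf.read` open this file. One workaround, if the `ffmpeg` binary can be made visible on the executors, is to transcode to WAV first and read that instead. A sketch (the `ffmpeg_cmd` / `load_m4a` helpers are my own names, not librosa or soundfile API):

```python
import os
import subprocess
import tempfile


def ffmpeg_cmd(src, dst, sr=22050):
    """Build an ffmpeg command that transcodes `src` to a mono WAV
    at the given sample rate (mirroring librosa.load's defaults)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(sr), "-ac", "1", dst]


def load_m4a(path, sr=22050):
    """Transcode with the ffmpeg binary, then read the WAV with soundfile.
    Assumes `ffmpeg` is on PATH in the worker process."""
    import soundfile as sf
    fd, wav = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        subprocess.check_call(ffmpeg_cmd(path, wav, sr))
        return sf.read(wav)
    finally:
        os.remove(wav)
```

This sidesteps audioread's backend discovery entirely, at the cost of a temporary file per clip.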
I also tried adding the environment's bin directory to PATH at runtime:

```python
import os

envname = "./NODE/conda_env/bin"
path = os.getenv("PATH")
if envname not in path:
    path += os.pathsep + envname
    os.environ["PATH"] = path