Zipping Conda Environment Breaks Librosa's Audioread Backend (Python/Pyspark)


Tim Schmeier

Oct 16, 2017, 12:02:51 PM
to librosa

Crossposting from SO for more visibility; it seems audioread problems have been encountered fairly often by librosa users:



I have previously built pyspark environments using conda to package all dependencies and ship them to all the nodes at runtime. Here's how I create the environment:


`conda/bin/conda create -p conda_env --copy -y python=2  \
numpy scipy ffmpeg gcc libsndfile gstreamer pygobject audioread librosa`

`zip -r conda_env.zip conda_env`


Then, after sourcing conda_env and running the pyspark shell, I can successfully execute:


`import librosa
y, sr = librosa.load("test.m4a")`


Note that without the environment sourced this script results in an error, as ffmpeg/gstreamer are NOT installed locally on my machine.

Submitting a script to the cluster results in a librosa.load error which traces back to audioread, indicating the backend (either gstreamer or ffmpeg) can no longer be found in the zipped archive environment. The stack trace is below:


Submit:


`PYSPARK_PYTHON=./NODE/conda_env/bin/python spark-submit --verbose \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NODE/conda_env/bin/python \
        --conf spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp \
        --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp \
        --conf spark.yarn.executor.memoryOverhead=1024 \
        --conf spark.hadoop.validateOutputSpecs=false \
        --conf spark.driver.cores=5 \
        --conf spark.driver.maxResultSize=0 \
        --master yarn --deploy-mode cluster --queue production \
        --num-executors 20 --executor-cores 5 --executor-memory 40G \
        --driver-memory 20G --archives conda_env.zip#NODE \
        --jars /data/environments/sqljdbc41.jar \
        script.py`


Trace:


`Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "script.py", line 245, in <lambda>
  File "script.py", line 119, in download_audio
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/NODE/conda_env/lib/python2.7/site-packages/librosa/core/audio.py", line 107, in load
    with audioread.audio_open(os.path.realpath(path)) as input_file:
  File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/NODE/conda_env/lib/python2.7/site-packages/audioread/__init__.py", line 114, in audio_open
    raise NoBackendError()
NoBackendError`

My question is: How can I package this conda archive so that librosa (really audioread) is able to find the backend and load .m4a files?

Brian McFee

Oct 16, 2017, 12:53:40 PM
to librosa
Wow, that's a doozy.  Maybe out of scope for librosa (since it's an audioread issue, or maybe even downstream of that!) but since I originally made the audioread feedstock, that's fair game.

A couple of ideas:

  1. Have you tried exporting the conda environment and rebuilding it at the destination, rather than shipping a .zip?  Maybe that's not feasible on your network, but it would help rule out things like architecture mismatch.

  2. You might consider replacing librosa.load (and audioread) calls with pysoundfile.  If you're only using m4a, this ought to work, and the dependencies are much more sane.

Tim Schmeier

Oct 16, 2017, 3:19:14 PM
to librosa
Yeah, there is no doubt this is out of librosa's scope. I just noted that a lot of librosa users have wrestled with audioread before, and maybe someone had a solution. To address your comments:

1. Shipping the .zip works on our cluster; this same approach is currently used successfully by other production pyspark processes. I think this works because our hardware and AMI images are homogeneous.

2. Using pysoundfile was my first inclination, and I repackaged it with its dependencies as a conda tarball, but reading .m4a files results in an unknown format error. Even locally, without distributed computing or conda environment management, I see the following error. I'm unclear on whether libsndfile actually supports decoding this format.

`>>> import soundfile as sf
>>> y, sr = sf.read("test.m4a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 373, in read
    subtype, endian, format, closefd) as f:
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 1265, in _open
    "Error opening {0!r}: ".format(self.name))
  File "/Users/1111631/anaconda/lib/python2.7/site-packages/soundfile.py", line 1455, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'test.m4a': File contains data in an unknown format.`


I suspect this might be a path issue; perhaps the containers can't find ffmpeg.
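One quick way to check that hypothesis would be a small diagnostic run inside a Spark task (a sketch, not from the original thread; `shutil.which` is Python 3, with `distutils.spawn.find_executable` being the Python 2 equivalent on this setup, and the binary names are the usual executables audioread's ffmpeg/gstreamer backends rely on):

```python
import shutil

def backend_binaries_on_path():
    """Report where (if anywhere) each candidate backend binary
    resolves on the current PATH."""
    candidates = ["ffmpeg", "avconv", "gst-launch-1.0"]
    return {name: shutil.which(name) for name in candidates}

# Mapping this over a dummy RDD would show what each executor's PATH
# actually resolves, which can differ from the driver's environment.
print(backend_binaries_on_path())
```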

Tim Schmeier

Oct 17, 2017, 10:34:21 AM
to librosa
This was indeed a path issue. The YARN containers could not find ffmpeg because it was not in the executor's path. This hack fixes the problem, although there may be a better way to do this in spark-env.sh or spark-defaults.conf. Thanks for the help, Brian.


`import os

envname = "./NODE/conda_env/bin"

path = os.getenv("PATH")
if envname not in path:
    path += os.pathsep + envname
    os.environ["PATH"] = path`
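A slightly more defensive variant of the same fix (a sketch; splitting on `os.pathsep` avoids the substring false positive that a plain `envname not in path` check could hit, and it stays idempotent across repeated calls):

```python
import os

def ensure_on_path(directory):
    """Append `directory` to PATH exactly once."""
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in parts:
        # Filter out empty entries left by a blank PATH before rejoining.
        os.environ["PATH"] = os.pathsep.join(p for p in parts + [directory] if p)

ensure_on_path("./NODE/conda_env/bin")
```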
