LocalScheduler may pollute the cwd; fixing addFile()


Josh Rosen

Jan 19, 2013, 11:00:34 PM
to spark-de...@googlegroups.com
Summary: LocalScheduler may pollute the driver's current working directory.

The original implementation of addFile() had a bug that could delete users' code and files from the current working directory.

Spark tasks assume that files added through addFile() will be available in the current working directory.  Because of this, files added through addFile() are copied into the driver's current working directory so that jobs that use addFile() also work with the LocalScheduler.
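To make that assumption concrete, here's a minimal sketch of the pattern (the file name "lookup.txt" and the existing SparkContext `sc` are illustrative, not from an actual job):

    sc.addFile("/path/to/lookup.txt")   // ship the file alongside the job
    sc.parallelize(1 to 100).map { i =>
      // tasks open the file by bare name, assuming it is in the cwd:
      val words = scala.io.Source.fromFile("lookup.txt").getLines().toSeq
      (i, words.size)
    }.collect()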

My pull request fixed the file deletion problem by removing the delete calls and adding checks that prevent an existing local file from being overwritten with different content, roughly along the lines of the sketch below.
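This is a sketch of that kind of check, not the actual patch; it assumes Guava's com.google.common.io.Files helpers, and copyIfSafe is a hypothetical name:

    import java.io.File
    import com.google.common.io.Files

    // Copy only when the destination is absent; refuse to overwrite an
    // existing file whose contents differ.
    def copyIfSafe(src: File, dest: File): Unit = {
      if (!dest.exists()) {
        Files.copy(src, dest)                // Guava: copy src to dest
      } else if (!Files.equal(src, dest)) {  // Guava: byte-for-byte comparison
        throw new IllegalStateException(
          "File " + dest + " exists and does not match the contents of " + src)
      }
      // identical contents already present: nothing to do
    }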

This introduces problems for PySpark users.  For example, if I run the word count example using ./pyspark python/examples/wordcount.py, it will copy wordcount.py into the root Spark directory.  If I then modify the original wordcount.py and re-run the job, it will fail because Spark refuses to overwrite the old copy of wordcount.py.

This may also cause jobs to fail when using an updated version of a JAR.

There's no way to set the current working directory on a per-thread basis (per-thread matters because changing the working directory globally could break other code running in the same JVM).
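For reference, this is the constraint in miniature ("user.dir" is the JVM's only notion of a working directory, and it's process-wide):

    println(System.getProperty("user.dir"))  // one value, shared by all threads
    System.setProperty("user.dir", "/tmp")   // affects java.io.File.getAbsolutePath()...
    // ...but not where new java.io.FileInputStream("foo.txt") actually looks,
    // so mutating it globally risks breaking other code in the same JVM.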

I see two possible fixes:
  • Modify the LocalScheduler to run tasks in a separate JVM.  This might have other benefits, especially for jobs that use the working directory in other ways (e.g. for storing temporary scratch files), but it could make it harder to run local profiling or code coverage tools.
  • Change the API so that files are accessed through a call like SparkFiles.get("my-file-name.txt") (see the sketch after this list).  This breaks backwards compatibility and doesn't solve problems caused by other uses of the current working directory.
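Concretely, a job using the second option might look like this (a sketch only; SparkFiles doesn't exist yet, and the file name and the existing SparkContext `sc` are illustrative):

    sc.addFile("/path/to/lookup.txt")
    sc.parallelize(1 to 100).map { i =>
      val path = SparkFiles.get("lookup.txt")  // resolve via the API, not the cwd
      scala.io.Source.fromFile(path).getLines().size
    }.collect()
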
Any thoughts on which approach we should take?  I'd like to fix this before 0.7 is released, because the current bug will cause complaints and confusion.

- Josh

Matei Zaharia

Jan 20, 2013, 12:30:36 AM
to spark-de...@googlegroups.com
Hey Josh,

I'd rather change the API, because launching a separate JVM for local mode would create nontrivial management problems (e.g. how do we kill it on exit, how do we configure its classpath, etc.) and would make Spark code much more painful to unit test (a 2-3 second startup cost).

I think the right API would be SparkFiles.get, as you said, plus a SparkFiles.getRootDirectory() that returns the root directory where all the files get downloaded. That would make it easier to port existing code that wants a File object.
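For example (a sketch of the proposed API; "lookup.txt" is illustrative):

    import java.io.File

    val root = new File(SparkFiles.getRootDirectory())  // where added files land
    val lookup = new File(root, "lookup.txt")
    // or, for a single file:
    val same = new File(SparkFiles.get("lookup.txt"))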

Matei