SparkFiles.get / sc.addFile changes file permissions


Grega Kešpret

May 29, 2013, 5:49:48 AM
to spark...@googlegroups.com
Hi,
I've noticed that using sc.addFile() together with SparkFiles.get changes the permissions of the file I was working with.

git diff
diff --git a/shared/constants.json b/shared/constants.json
old mode 100644
new mode 100755

It's not a big issue, but I would still like to know why this happens. Thanks.

Josh Rosen

May 29, 2013, 11:11:31 AM
to spark...@googlegroups.com
The actual permissions change is occurring in the Utils.fetchFile() function (just `git grep chmod` to find it); I'm not sure what use-case originally motivated it.

The motivation behind the SparkFiles API was to avoid polluting the driver's current working directory with downloaded files and to fix a bug where calling addFile() on certain files could cause the original files to be deleted after the job completed (see https://github.com/mesos/spark/pull/394 for the relevant discussions).

My motivation was to leave original files untouched by addFile(), so this behavior is a bug that I'd like to fix.

It looks like the problem is that Utils.fetchFile() symlinks local files into the target directory, so the permissions change to the target file affects the original file via the symlink. I think the idea behind the symlink was to avoid an expensive local copy of a large file when adding it. I think we could probably just perform the extra copy for safety, since users should be storing large files in HDFS (which should also work with addFile()). If users add large local files, then the driver will have to broadcast those files to all workers, so the cost of one extra copy shouldn't have a huge relative performance impact. If the file is small, there will be negligible impact. I can take a pass at fixing this later.
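
For anyone who wants to see the effect in isolation, here is a minimal Python sketch of the mechanism. It is not Spark's code: the function name `chmod_through_symlink` and the file names are made up for illustration, and it assumes a POSIX system where `os.chmod` follows symlinks.

```python
import os
import shutil
import stat
import tempfile


def chmod_through_symlink():
    """Hypothetical demo: chmod a+x on a symlink changes the file it points to."""
    workdir = tempfile.mkdtemp()
    try:
        # Stand-in for the user's original file, created with mode 644.
        original = os.path.join(workdir, "constants.json")
        with open(original, "w") as f:
            f.write("{}")
        os.chmod(original, 0o644)
        mode_before = stat.S_IMODE(os.stat(original).st_mode)

        # Stand-in for the symlink placed in the target directory.
        link = os.path.join(workdir, "linked.json")
        os.symlink(original, link)

        # Equivalent of chmod a+x on the link; os.chmod follows the symlink,
        # so the permission bits of the original file are what actually change.
        os.chmod(link, mode_before | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

        mode_after = stat.S_IMODE(os.stat(original).st_mode)
        return mode_before, mode_after
    finally:
        shutil.rmtree(workdir)


if __name__ == "__main__":
    before, after = chmod_through_symlink()
    print(oct(before), oct(after))
```

The original file goes from 644 to 755 even though only the link was passed to chmod, which matches the mode change in the git diff reported above.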


--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Grega Kešpret

May 29, 2013, 11:25:58 AM
to spark...@googlegroups.com
Thanks for the clarification. Now I indeed see the FileUtil.chmod(filename, "a+x") in Utils.scala.

Grega

Reynold Xin

May 29, 2013, 2:50:16 PM
to spark...@googlegroups.com
The reason for the +x is that some files need to be executable once they are passed to the workers, although that should not affect the original file.

I agree that it would be better to not symlink the file locally, and just make a copy to avoid this problem.
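
A rough sketch of the copy-based approach, in Python rather than Scala and not taken from Spark itself: `fetch_local_copy` is a hypothetical helper that copies the file into the target directory and sets the execute bits on the copy only. It assumes a POSIX file system.

```python
import os
import shutil
import stat
import tempfile


def fetch_local_copy(source_path, target_dir):
    """Hypothetical helper: copy the file, then make only the copy executable."""
    target_path = os.path.join(target_dir, os.path.basename(source_path))
    # copyfile writes a separate file, so later chmod calls cannot reach the source.
    shutil.copyfile(source_path, target_path)
    mode = stat.S_IMODE(os.stat(target_path).st_mode)
    os.chmod(target_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target_path


def demo_copy():
    """Return (source mode, copy mode) after fetching a 644 file."""
    src_dir = tempfile.mkdtemp()
    dst_dir = tempfile.mkdtemp()
    try:
        source = os.path.join(src_dir, "constants.json")
        with open(source, "w") as f:
            f.write("{}")
        os.chmod(source, 0o644)

        target = fetch_local_copy(source, dst_dir)
        source_mode = stat.S_IMODE(os.stat(source).st_mode)
        target_mode = stat.S_IMODE(os.stat(target).st_mode)
        return source_mode, target_mode
    finally:
        shutil.rmtree(src_dir)
        shutil.rmtree(dst_dir)


if __name__ == "__main__":
    source_mode, target_mode = demo_copy()
    print(oct(source_mode), oct(target_mode))
```

The source keeps mode 644 while the copy gains the execute bits, at the cost of one extra local copy per added file.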

--
Reynold Xin, AMPLab, UC Berkeley
