Problems executing a shell command from within a PySpark script


Andre S

Jan 30, 2013, 8:54:10 PM
to spark...@googlegroups.com

Hi everyone,

I'm trying to execute a shell command from within the map function of a
Python script (see attached script) and get an Out Of Memory exception,
which is really strange given that I'm only trying to run 'ls', etc. It
would be great if anybody could point me in the right direction. I'm new
to Spark in general and PySpark in particular, so the problem may in
fact be simple.

I'm using mesos-0.9.0-incubating and the spark git repo version from
github.com/mesos/spark.git (last updated yesterday). I also attach my
config files. There is one strange thing when I start up mesos:

andre@MBP:~/mesos$ ./sbin/mesos-start-cluster.sh
Starting mesos-master on 127.0.0.1
ssh 127.0.0.1 /home/andre/mesos/sbin/mesos-daemon.sh mesos-master
</dev/null >/dev/null
/home/andre/mesos/sbin/mesos-daemon.sh: line 7: ulimit: open files:
cannot modify limit: Operation not permitted
Starting mesos-slave on 127.0.0.1
ssh 127.0.0.1 /home/andre/mesos/sbin/mesos-daemon.sh mesos-slave
</dev/null >/dev/null
/home/andre/mesos/sbin/mesos-daemon.sh: line 7: ulimit: open files:
cannot modify limit: Operation not permitted
Everything's started!

But I doubt this is related(?).

It would be really great if someone could point me in the right
direction. It may be that I have done something wrong in the configs or
elsewhere. The strange thing is that most of the examples (Python and
others) I tried ran through without problems.

Cheers,

Andre
stderr
test_command.py
spark-env.sh
stdout
mesos.conf

Matei Zaharia

Feb 1, 2013, 1:47:12 AM
to spark...@googlegroups.com
This might be because subprocess.check_call writes to the program's real stdout stream. In PySpark, we use the Python process's original stdout to write data back to Spark, and redirect sys.stdout to sys.stderr so that your log messages appear in that file. However, subprocess.check_call likely inherits the original stdout file descriptor rather than going through sys.stdout, which would cause it to write the output of ls into the data stream and confuse the Java code in Spark. Is there a way you can make it not write to stdout? (E.g. redirect the output to /dev/null or to sys.stderr.)
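To see this concretely, here's a standalone sketch (no Spark involved) of how a child process inherits the real fd 1 even after sys.stdout has been reassigned:

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryFile() as fake_pipe:
    # Simulate PySpark's worker setup: fd 1 is the data pipe back to Spark,
    # while sys.stdout is re-pointed at sys.stderr for log messages.
    saved_fd = os.dup(1)
    os.dup2(fake_pipe.fileno(), 1)   # pretend fd 1 is Spark's data channel
    old_stdout, sys.stdout = sys.stdout, sys.stderr

    # check_call inherits fd 1 -- it never consults sys.stdout -- so the
    # child's output lands in the "data channel":
    subprocess.check_call(['echo', 'hello'])

    # Undo the simulation and inspect what leaked into the pipe.
    sys.stdout = old_stdout
    os.dup2(saved_fd, 1)
    os.close(saved_fd)
    fake_pipe.seek(0)
    leaked = fake_pipe.read()
```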

Matei
> --
> You received this message because you are subscribed to the Google Groups "Spark Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

Andre S

Feb 1, 2013, 1:55:23 AM
to spark...@googlegroups.com

Hi Matei,

Thanks! Replacing

subprocess.check_call('ls', shell=True)

by

subprocess.check_call('ls', shell=True, stdout=sys.stderr)

seems to have done the trick. Guess I should have tried that earlier.
Thanks again!
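For reference, the relevant bit now looks roughly like this (a simplified sketch, not my actual attached script -- the RDD and function names are made up):

```python
import subprocess
import sys

def run_ls(x):
    # Send the child's output to stderr so it can't end up on the worker's
    # real stdout, which PySpark uses as its data channel back to Spark.
    subprocess.check_call('ls', shell=True, stdout=sys.stderr)
    return x

# used as: sc.parallelize(range(4)).map(run_ls).collect()
```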

Andre

Matei Zaharia

Feb 1, 2013, 2:05:15 AM
to spark...@googlegroups.com
Cool, good to know. It's definitely not obvious. I wonder whether there's another way we can redirect stdout "for good"... doesn't seem easy except maybe by doing it in C.

Matei

Josh Rosen

Feb 1, 2013, 2:07:38 AM
to spark...@googlegroups.com

We could use local sockets instead of pipes, which would eliminate the need to redirect stdout.
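Roughly like this (illustrative names, loopback only -- the JVM side is played here by a plain Python server socket):

```python
import socket

# Sketch of the idea: the "server" (standing in for the JVM) listens on a
# loopback socket and the Python worker connects to it, so the data channel
# no longer occupies the worker's stdout at all.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))            # pick an ephemeral port
server.listen(1)
port = server.getsockname()[1]

worker = socket.create_connection(('127.0.0.1', port))
conn, _ = server.accept()

worker.sendall(b'row-data')              # worker sends results over the socket
received = conn.recv(1024)               # stdout stays free for ls etc.

worker.close()
conn.close()
server.close()
```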

- Josh

Matei Zaharia

Feb 1, 2013, 2:14:02 AM
to spark...@googlegroups.com
I wonder if you can just call freopen through ctypes or something like that. It might make it harder to run on Windows, but it will definitely do the right thing (in that the file descriptor number won't change). I guess sys.stdout is just a reference which may or may not point to FD 1 (or whatever stdout is).

Matei

Charles Reiss

Feb 1, 2013, 2:28:36 AM
to spark...@googlegroups.com
On 1/31/13 11:14 PM, Matei Zaharia wrote:
> I wonder if you can just call freopen through ctypes or something like that.
> It might make it harder to run on Windows, but it will definitely do the right
> thing (in that the file descriptor number won't change). I guess sys.stdout is
> just a reference which may or may not point to FD 1 (or whatever stdout is).


Isn't os.open/os.dup2 enough?
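I.e., something like this (untested sketch):

```python
import os

# Point fd 1 at stderr so anything written to "stdout" by this process or
# any child goes to the stderr log instead of Spark's data pipe.
saved = os.dup(1)          # keep the real stdout around for the data channel
os.dup2(2, 1)              # fd 1 now refers to the same file as fd 2
redirected = os.fstat(1).st_ino == os.fstat(2).st_ino
# ... subprocesses launched here inherit the redirected fd 1 ...
os.dup2(saved, 1)          # restore when writing data back to Spark
os.close(saved)
```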

- Charles


Josh Rosen

Feb 1, 2013, 3:31:11 AM
to spark...@googlegroups.com
Good idea. I submitted a pull request that uses os.fdopen / os.dup2.
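My guess at the shape of the change (not the actual diff):

```python
import os

# Wrap a private dup of the original stdout in a file object for Spark's
# data channel, then alias fd 1 to stderr so nothing else -- including
# subprocesses -- can write into the channel.
spark_out = os.fdopen(os.dup(1), 'wb')  # data channel, safe from children
os.dup2(2, 1)                           # plain stdout now lands in stderr
aliased = os.fstat(1).st_ino == os.fstat(2).st_ino

# (restore fd 1, just for this standalone sketch)
os.dup2(spark_out.fileno(), 1)
spark_out.close()
```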
