Proposal for addition to clojure.java.shell


Marc Limotte

May 24, 2013, 12:39:27 PM
to cloju...@googlegroups.com
This is a proposal for some extensions to clojure.java.shell.  These additions are backward compatible (non-breaking) with existing uses of clojure.java.shell.  I think they cover some very common use-cases that should be simple, but tend to be a bit more of a hassle than required.

We've been using our modified version of shell.clj for over two years, and I think it might be helpful to the community, so I wanted to see if there was any interest in my preparing a Jira and pull-request?  BTW, I did look at some of the third-party, open-source shell libs that already exist, but wasn't happy with how they departed from clojure.java.shell.

A sample modified docstring for 'sh' is below, but as an overview, some goals:
  1. Support streaming for shell commands that return a lot of data, but still have clojure.java.shell manage the connection (i.e. call close)
  2. Optionally accept a collection of Strings as the cmd, instead of requiring the Strings inline.  Otherwise, using a programmatically built command with an option like :dir requires an ugly apply and concat combo.  E.g.:
    (let [cmd (make-file-list-command ...)] (apply sh (concat cmd [:dir home-dir])))
  3. Handle some common stdout/stderr redirect cases.  
    1. E.g. send stderr of the shell command to where stderr of the jvm goes.
    2. I find (sh "bash" "-c" …) to be an ugly workaround; a common case is to run something and direct its output to a file.
  4. A shortcut for using the current env with some additions assoc'ed onto it.
  5. A wrapper abstraction for pipes (more below)
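
To make goal 2 concrete, here is a minimal sketch against the existing clojure.java.shell API. The helper names sh-args and sh-seq are hypothetical (they are not part of clojure.java.shell or of the proposed code); they only show what the apply/concat combo does and how a seq-accepting entry point could hide it.

```clojure
(require '[clojure.java.shell :as shell])

;; Hypothetical helper: flatten a command seq plus trailing options into the
;; flat argument list that clojure.java.shell/sh expects today.
(defn sh-args [cmd & options]
  (vec (concat cmd options)))

;; Hypothetical helper: hide the apply/concat combo behind one call.
(defn sh-seq [cmd & options]
  (apply shell/sh (apply sh-args cmd options)))

(sh-args ["ls" "-l"] :dir "/tmp")
;; => ["ls" "-l" :dir "/tmp"]
```

With such a helper, a programmatically built command could be run as (sh-seq cmd :dir home-dir) without the caller touching apply or concat.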

A modified docstring for proposed sh:

(defn sh
  "Passes the given strings to Runtime.exec() to launch a sub-process.
  
  (sh command* options)
  
  command can either be inline Strings, or a Seq of Strings.

  Options are

  :in      may be given followed by any legal input source for
           clojure.java.io/copy, e.g. InputStream, Reader, File, byte[],
           or String, to be fed to the sub-process's stdin.
  :out     may be given followed by :capture, a File, :pass, :err, or a fn.  For...
           - :capture - the sub-process's stdout will be stored in String or byte array
           as specified by :out-enc.
           - File - the output is written to the file
           - :pass - the sub-process's stdout is passed to the main process stdout.
           - :err - send stdout to stderr
           - fn - The fn is called with the stdout InputStream as an argument (stream is
           closed automatically). The fn is run in a Thread, but sh blocks until the
           Thread completes. This can be used, for example, to filter the stream or to 
           pipe the output to another sh process.
           Defaults to :capture
  :err     same options as :out. Defaults to the same value as :out.
  :in-enc  option may be given followed by a String, used as a character
           encoding name (for example \"UTF-8\" or \"ISO-8859-1\") to
           convert the input string specified by the :in option to the
           sub-process's stdin.  Defaults to UTF-8.
           If the :in option provides a byte array, then the bytes are passed
           unencoded, and this option is ignored.
  :out-enc option may be given followed by :bytes or a String. If a
           String is given, it will be used as a character encoding
           name (for example \"UTF-8\" or \"ISO-8859-1\") to convert
           the sub-process's stdout to a String which is returned.
           If :bytes is given, the sub-process's stdout will be stored
           in a byte array and returned.  Defaults to UTF-8.
  :env     override the process env with a map (or the underlying Java
           String[] if you are a masochist).
  :dir     override the process dir with a String or java.io.File.

  You can bind :env or :dir for multiple operations using with-sh-env
  and with-sh-dir.

  sh returns a map of
    :exit => sub-process's exit code
    :out  => sub-process's stdout (as byte[] or String)
    :err  => sub-process's stderr (String via platform default encoding)"
. . . )



Examples

(println (sh ["ls" "-l"] :out :pass))
(println (sh "ls" "-l" "/no-such-thing"))
(println (sh "ls" :out :err))
(println (sh "java" "..." :env (merge-env {"CLASSPATH" "/tmp/some.jar"}))) 
(println (sh "cat" :in (io/file "/tmp/input") :out (io/file "/tmp/out") :err :pass))
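
The merge-env call in the examples is the shortcut from goal 4. One plausible way to write it is sketched below; this body is an assumption for illustration, not necessarily how the proposed shell.clj implements it.

```clojure
;; Sketch of a merge-env helper: copy the JVM's current environment into a
;; Clojure map and layer the additions on top. This is an assumed
;; implementation, not the one in the proposed shell.clj.
(defn merge-env [additions]
  (merge (into {} (System/getenv)) additions))

(get (merge-env {"CLASSPATH" "/tmp/some.jar"}) "CLASSPATH")
;; => "/tmp/some.jar"
```

The resulting map can be passed as the :env option, since sh already accepts a map there.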

Pipes 

Assuming you don't want to do (sh "bash" "-c" "foo.py | bar.py"), maybe because you want something platform agnostic?  Here is an example that uses the 'fn' option for :out.  (This one is ugly, but it gets better).
 
(deftest test-sh-piped
  ; This is similar to the sh expr `cat <input> | sort`
  (let [input (File/createTempFile "shell-unit" nil)]
    (spit input "b\na\n")
    (with-open [pipe-in (PipedInputStream.)]
      ; Note that we start the second process (sort) first (in a thread), so it is ready
      ; to accept data concurrently. And we block on the generating process, so we
      ; can close the stream it's writing to.  We need to close this stream first, or
      ; the 'sort' process will never finish.
      (let [result (future (sh "sort" :in pipe-in :err :pass))]
        (with-open [pipe-out (PipedOutputStream. pipe-in)]
          (sh "cat" :in input :out pipe-out :err :pass))
        (is= 0 (:exit @result))
        (is= "a\nb\n" (:out @result))))))

That's a lot of work, with lots of opportunity for mistakes, for something that should be simple.  Luckily, the pattern is common to all such pipes, so a wrapper could help.  This could be built on top of the functionality presented above.  There are many possible styles for this API; here are some alternatives:

; Option A - A :pipe option
(sh "echo" :in input :pipe "sort") ; single string command
(sh "echo" :in input :pipe ["wc" "-l"]) ; seq of strings command
(sh "echo" :in input :pipe {:cmd "sort" :env {} :pipe ["wc" "-l"]})  ; Map command w/ sub-options and nested pipes
; TODO a :pipe-err option

; Option B - A macro that redirects the out of one sh form to the in of the next
(sh/pipe
  (sh "echo" :in input)
  (sh "sort"))

; Option C - nested sh/pipe forms as the :out value
(sh/sh "echo" :in input :out (sh/pipe "sort"))
; which could be threaded, but then needs to be read right-to-left
(->> (sh/pipe "sort")
  (sh/sh "echo" :in input :out))

; Option D - nested sh/pipe forms as the :in value
; which reads from the inside out
(sh/sh "sort" :in (sh/pipe "echo" :in input))
; or threaded, which can be read left-to-right
(->> input
  (sh/pipe "echo" :in)
  (sh/sh "sort" :in))




Marc



cees van Kemenade

May 27, 2013, 6:35:57 AM
to cloju...@googlegroups.com
+1 for additional options on clojure.java.shell.

The options for streaming output and piping would be especially useful for me.

Stuart Halloway

May 27, 2013, 10:17:26 PM
to cloju...@googlegroups.com
I have not looked in detail at this proposal, but based on my past experience I would say

1. java.shell is underpowered

2. Now that we have good maven discipline around contribs, I wish that libraries like java.shell weren't in Clojure proper.  We could iterate and ship a lot more quickly if we created a "better shell" contrib instead of working on clojure.java.shell.

Stu


--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Marc Limotte

May 28, 2013, 9:32:14 AM
to cloju...@googlegroups.com
Any opinion on the various pipe options I suggested?  Or alternative solutions?

Marc Limotte

May 28, 2013, 9:39:13 AM
to cloju...@googlegroups.com
Hi Stuart,

As far as I know, there is no java.shell contrib (discounting the old shell-out from clojure-contrib).  Are you suggesting a new repo under https://github.com/clojure?  I'm willing to do that if people think it would be useful.

Marc

cees van Kemenade

May 30, 2013, 5:25:22 AM
to cloju...@googlegroups.com
Hi Marc,

When reading your four options, I guess option A and option D feel most natural to me.
Option C puts the steps from back to front, while threading macros normally process from top to bottom.
Option B does some magic on the original shell commands to change their :out from a value to a stream, which is less intuitive, I would guess.

So I'm left with options A and D. I would guess the real value of piping comes when you can pipe shell output to a Clojure function (in a streaming fashion). In option D this seems to be possible in an idiomatic way. Is this also possible in option A?

Cees.

Marc Limotte

May 30, 2013, 9:12:10 AM
to cloju...@googlegroups.com
D avoids the 'magic' of B, but you have to remember when to use sh/pipe vs. sh/sh.  In the threaded usage of D, you use sh/pipe for all entries except the last.

Piping to a Clojure function is a little different, but would be pretty much the same for A and D.  There are two ways to do it. Either grab :out as a String (or bytes) and then call a fn with the String:

; take :out from the result map first, then split it on a regex
(->> input
  (sh/pipe "echo" :in)
  (sh/sh "sort" :in)
  :out
  (#(s/split % #"\n"))
  last)


Or if you want to stream the data, the fn must accept an InputStream and is used as an arg to sh/sh or sh/pipe:

(->> input
  (sh/pipe "echo" :in)
  (sh/sh "sort" :out (comp last line-seq io/reader) :in))
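
The stream-consuming fn in that example can be tried on its own, without any sub-process. The sketch below feeds it a ByteArrayInputStream as a stand-in for a process's stdout; the names last-line and fake-stdout are made up for illustration.

```clojure
(require '[clojure.java.io :as io])
(import 'java.io.ByteArrayInputStream)

;; The stream-consuming fn from the example: wrap the InputStream in a reader,
;; read it lazily line by line, and keep only the last line.
(def last-line (comp last line-seq io/reader))

;; Stand-in for a sub-process's stdout (here, the sorted output "a\nb\n").
(def fake-stdout (ByteArrayInputStream. (.getBytes "a\nb\n" "UTF-8")))

(last-line fake-stdout)
;; => "b"
```

Because line-seq is lazy and last walks it one element at a time, only the current line needs to be held in memory.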


Those two examples use option D; here's one example for A:

(sh "echo" :in input :pipe {:cmd "sort" :out (comp last line-seq io/reader)}) 

I'm leaning toward D, because A doesn't lend itself well toward threading.

Marc



cees van Kemenade

Jun 1, 2013, 3:54:33 AM
to cloju...@googlegroups.com
Hi Marc,

The way you suggest handling the interleaving of shell commands and Clojure functions would not solve my current use case. This approach:
1. Drains memory when the shell output is a very long sequence
2. Does not allow the Clojure function to be a real part of the asynchronous pipeline (real interleaving of Clojure and shell processing on a line-by-line basis)
In my case I launch a separate JVM to perform a large (memory-intensive) processing task, and I parse the shell output to check the status of the process running in this separate JVM.

In order to handle the above, the Clojure function should receive an input stream (or a lazy sequence of lines) instead of a fully realized :out.

Given how your piped shell works with temporary files on disk, it seems that real interleaving of Clojure and shell script should be possible. Imagine a Clojure function (prototype):
    (defn pipelined-analysis [inp out err]   ....)
where inp, out and err are all byte streams.

In that case it would be possible, under the (magic) option B :-), to write:

(sh/pipe
  (sh "ls" "-alR")
  (pipelined-analysis )
  (sh "sort"))

In this implementation the sh/pipe macro would notice that pipelined-analysis is a Clojure function (not sh) and would launch the function in a separate thread with the following parameters:
- inp:  fed from the temporary output file of the previous stage
- out:  a new temporary file to write the output to; via a cat, this output stream can serve as the input to subsequent shell commands
- err:  direct write access to the error stream, as the error stream bypasses the pipeline

Of course many/most use cases for a shell command assume text, so it would be nice to have a convenience wrapper (sh/text-pipe) that changes the inp byte stream into a line sequence, captures all println output of the function, and writes this to an output stream on a line-by-line basis.
In that case the pipeline would be something like:
(sh/pipe
  (sh "ls" "-alR")
  (sh/text-pipe  (pipelined-text-analysis ))
  (sh "sort"))
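
A rough sketch of what such a text wrapper might look like, using only clojure.java.io and java.io. The name wrap-lines and its contract (a fn from a seq of lines to a seq of lines, rather than captured println output) are assumptions made for illustration; nothing like it exists in clojure.java.shell.

```clojure
(require '[clojure.java.io :as io])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Hypothetical text wrapper: turns a fn from a seq of lines to a seq of lines
;; into a fn of [in out] that works on byte streams.
(defn wrap-lines [f]
  (fn [in out]
    (with-open [rdr (io/reader in)
                wrt (io/writer out)]
      ;; line-seq is lazy, so as long as f is lazy too, lines are read and
      ;; written one at a time instead of being collected first
      (doseq [line (f (line-seq rdr))]
        (.write wrt (str line "\n"))))))

;; Try it on in-memory streams standing in for the neighbouring processes.
(let [in  (ByteArrayInputStream.
            (.getBytes "ok\nERROR one\nok\nERROR two\n" "UTF-8"))
      out (ByteArrayOutputStream.)]
  ((wrap-lines #(filter (partial re-find #"ERROR") %)) in out)
  (.toString out "UTF-8"))
;; => "ERROR one\nERROR two\n"
```

Keeping the wrapped fn a plain seq-to-seq function means it can be tested without any streams at all.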

Now that I understand the opportunities a bit better, I would guess that option B is the most interesting/flexible solution.

I like the idea of this power shell (pipeline) very much, as it offers a neat way to do streaming analysis using tools written in different languages. I guess the plumbing of the following analysis process would be easy to implement in this new shell:
      In a C program: Reading data from a legacy system produces a byte stream |
      In Clojure: Transforming the byte stream into structured data (for example JSON format) |
      In Python:  Applying a data transformation using the SciPy library |
      In Clojure:  Storing or visualizing the outcomes of the analysis.
If all programs stream their input and output (or process on a line-by-line basis), we get an asynchronous pipeline. This would be a really powerful way to use Clojure as your main language while using processing steps written in other languages (almost as if these were Clojure functions).

Cees.

Marc Limotte

Jun 4, 2013, 3:40:50 PM
to cloju...@googlegroups.com
I didn't think of the use case where you are going back and forth between sub-processes and functions.  My proposal only supported streaming through one fn at the end.  I have to admit that would be pretty cool.  Let me see what I can do.

Marc

Anthony Grimes

Jun 4, 2013, 3:43:28 PM
to cloju...@googlegroups.com
Not sure if it's helpful, but I have https://github.com/Raynes/conch for more complicated shell stuff. Might be useful.

Hugo Duncan

Jun 4, 2013, 3:57:35 PM
to cloju...@googlegroups.com
Marc Limotte <msli...@gmail.com> writes:

> I didn't think of the use case where you are going back and forth between
> sub-processes and functions. My proposal only supported streaming through
> one fn at the end. I have to admit that would be pretty cool. Let me see
> what I can do.

This would help with pallet usage too - currently we have a modified
clojure.java.shell that allows reading of the result stream as the
command is running.

[1] https://github.com/pallet/local-transport/blob/develop/src/pallet/shell.clj

Marc Limotte

Jun 6, 2013, 4:11:24 PM
to cloju...@googlegroups.com
Thanks, Anthony.  I saw conch, and it looks good, but I wanted something closer to the clojure.java.shell implementation: backward compatible with clojure.java.shell, and sticking with the idea that it closes all streams that it opens.  I'm also interested in Cees' use-case of interleaving processes and Clojure functions in a pipe (with asynchronous streaming).

Marc

Anthony Grimes

Jun 6, 2013, 4:32:29 PM
to cloju...@googlegroups.com
You can do this kind of piping with conch, fwiw. Not with the same verbatim syntax, but you can do it pretty easily. If streams in conch aren't being properly closed, that's most likely a bug and should be reported on the issue tracker. I'll give you the backwards compatibility thing though, not much I can do about that one. I'm obviously a bigger fan of the Python sh style of doing things. ;)

Cheers!

Marc Limotte

Jun 6, 2013, 4:47:26 PM
to cloju...@googlegroups.com
Cees, Hugo,

I have something up now.  It's well documented in the docstrings, and there are lots of tests which serve as examples.  It supports all the use cases we talked about, including interleaving processes and Clojure functions via asynchronous streams.  Here's an example:

(pipe
  (sh "cat" :in inputf)
  (wrap (fn [in-seq wrt] (->> in-seq count (.println wrt)))
        :in :line-seq :out :writer)
  (sh "sort"))

Or sync:

; count the lines in /tmp/data.txt
(pipe (sh "cat" "/tmp/data.txt")
      (wrap count :in :line-seq :out :forward)
      (sh "cat"))

; 'wrap' is a helper higher-order function.  By default, functions take an InputStream and an OutputStream as args.
; An example, with side-effects from the fn.
(def state (atom 0))
(pipe
  (sh "echo" "sweet home alabama\nno place like home\nmi casa es su casa")
  (fn [in out]
    (with-open [wrt (PrintWriter. out)]
      ; doseq rather than (doall (for ...)), since only the side effects matter
      (doseq [line (->> in io/reader line-seq) :when (re-find #"home" line)]
        (swap! state inc)
        (.println wrt line))))
  (sh "wc" "-l"))

I haven't deployed to Clojars yet; I want to get a little bit of feedback before doing that and sending an announcement.  But the code is here: https://github.com/mlimotte/java.shell2

Marc

cees van Kemenade

Jun 13, 2013, 5:21:58 AM
to cloju...@googlegroups.com

Hi Marc,

I did some tests on clojure.java.shell2. However, the speed penalty of splicing Clojure into the process is significant.

Let me explain my test. I took a log file of 88 MB and tried to filter out of it all lines containing ERROR. The procedure was:

On the bash-shell:
$ time sh -c "cat /tmp/Tdat_eisAnal.log |grep "ERROR" | tee -a /tmp/errs1 | wc"
    154    5180  113686

real    0m1.346s
user    0m0.888s
sys     0m0.852s

When using clojure.java.shell2:
user=> (time (pipe (sh "cat" "/tmp/Tdat_eisAnal.log") (wrap-text-lines #(filter (partial re-find #"ERROR") %)) (sh "tee" "-a" "/tmp/errs") (sh "wc")))
"Elapsed time: 130861.357361 msecs"
[{:exit 0, :out nil, :err ""} {:exit 0, :out nil, :err nil} {:exit 0, :out nil, :err ""} {:exit 0, :out "    154    5180  113686\n", :err ""}]

Both versions produce exactly the same output. However, the Clojure shell is slower by a factor of 10. I expect that the difference in speed is due to either
1. the switching between the shell process and a JVM process, or
2. the intermediate files that are used in clojure.java.shell2 but aren't needed in the bash shell

When doing the same test with a 1 GB file, the time is 1.3 sec in Bash and 131 seconds for clojure.java.shell2, so about 100x slower.

Given the rich functionality you have available in clojure this speed difference does not need to be a show-stopper.

Real streaming of files
If I understand clojure.java.shell2 correctly, it generates intermediate files on disk for each of the pipeline stages. This might cause disk issues for long-running processes that stream high volumes of data, as the temporary space can only be reclaimed after the pipeline finishes. I guess this might be prevented by allocating two in-memory byte buffers (java.nio) that are filled and consumed alternately.
Possibly this would also give a significant speed-up of clojure.java.shell2. But I would say this can wait until the next version. What do others think?

Cees.







Marc Limotte

Jun 13, 2013, 9:52:09 AM
to cloju...@googlegroups.com
Thanks for the feedback, Cees.

Interesting findings.  I need to dig up a nice 100 MB log file to test with.

One thing to note: I pushed up a change yesterday.  If you have the code from before that change, then the :forward-lines option (and therefore wrap-text-lines) would not stream; it would collect all the output and then forward it at once to the next step. This could have a significant impact on memory; I'm not sure what impact it would have on performance.  But you should check `git log` to see which version of the code you tested with.

I'm not sure what you mean by intermediate files.  shell2 uses PipedInputStream/PipedOutputStream and clojure.java.io/copy, so everything is buffered.  In this first version, though, I didn't pay a lot of attention to BufferSize which may be a significant factor here.  Unfortunately, there are several buffers involved: io/copy has a 1k char buffer, BufferedReader has an 8k char buffer, PipedInputStream has a 1k byte buffer.  And I also do a manual flush for the :forward-lines option.

BufferSize should really be tunable.  If you're concerned about throughput, as in your example, then you want a large buffer. On the other hand, if you've got a Clojure program that monitors a log and wants to alert when it recognizes some condition, then you might want a small buffer, so you can see it quickly.  [I wish there was a buffer that would flush when it reaches capacity, but also flush after some period of inactivity, like 1 second?]

Another cause of slowdown when piping back and forth to Clojure functions is that you need to convert bytes to Characters and back again for each transition.  This only applies if you're working with Strings; you can avoid it by writing a fn that works on bytes from the InputStream/OutputStream.

Marc


cees van Kemenade

Jun 13, 2013, 3:32:19 PM
to cloju...@googlegroups.com


My tests used commit 4f0e3b4dfb979cd7..., which includes the :forward-lines option.
If you already use in-memory buffers, then the byte->string->byte translations that you mentioned seem to me the most likely bottleneck
(I would guess preventing this type of transformation is more important than tuning the buffer sizes).

Cees.

Marc Limotte

Jun 13, 2013, 3:59:53 PM
to cloju...@googlegroups.com
I don't think there is anything I can do at the library level to impact the byte->string->byte transformations.  If you have bytes coming in and you want to work with Strings, then it is required.  It is up to the dev to work with bytes if they are concerned about throughput.  Let me know if I'm overlooking something.

Still, I'm not sure that is the only or most significant bottleneck.  I'll try and do some testing of this when I have time.

Marc




cees van Kemenade

Jun 14, 2013, 1:38:56 PM
to cloju...@googlegroups.com
I have to correct my statement.
Processing a 1 GB logfile with an 8k buffer requires roughly 130,000 process switches per stage (1 GB / 8 KB).
This is a serious performance bottleneck.
Given modern memory sizes I would opt for a significantly larger default buffer size.
I will do some performance measurements with larger buffers next week.
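
A quick back-of-the-envelope check of how the number of buffer fills scales with buffer size. It assumes 1 GB means 2^30 bytes and that every full buffer costs one switch, which is a simplification; buffer-fills is a made-up helper name.

```clojure
;; Number of times a buffer of buffer-bytes must be filled to move
;; total-bytes through one pipeline stage (ceiling division).
(defn buffer-fills [total-bytes buffer-bytes]
  (quot (+ total-bytes (dec buffer-bytes)) buffer-bytes))

(def one-gb (* 1024 1024 1024))

(buffer-fills one-gb (* 1 1024))   ;; => 1048576
(buffer-fills one-gb (* 8 1024))   ;; => 131072
(buffer-fills one-gb (* 32 1024))  ;; => 32768
```

Each doubling of the buffer halves the number of fills, so most of the benefit arrives with the first few doublings.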

Cees.

cees van Kemenade

Jun 23, 2013, 3:48:29 PM
to cloju...@googlegroups.com

I needed to dig a bit into the code of clojure.java.shell2 to find the relevant buffers.
The critical buffer seems to be the buffer of the PipedInputStream; an overflow of this buffer causes the thread switch (the thread has to wait until the buffer is emptied).
After some reordering of code I could change the buffer size. When processing the 88 MB logfile using different buffer sizes I get the following results.

buffSize     processing time (sec)
1k           11.4
2k           11.4
4k           4.0
8k           2.7
16k          1.9
32k          1.6
1024k        1.15, 1.17, 1.12

As we would expect, there is a strong speedup when increasing the buffer size a bit, but the curve flattens out as the buffer size increases further.

Recall that the equivalent shell process did the processing in
real    0m1.346s
user    0m0.888s
sys     0m0.852s

So for a buffer size of 16k or 32k, clojure.java.shell2 performs approximately on par with the process in the shell.
I guess a buffer size of 16k or 32k would be a better default value than the current 1k pipe size (buffer size).
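
For anyone who wants to experiment with this outside the library: the pipe size is a constructor argument of java.io.PipedInputStream, and clojure.java.io/copy takes a :buffer-size option. The sketch below is a standalone illustration with in-memory data, not the shell2 code; pump-through-pipe and payload are made-up names.

```clojure
(require '[clojure.java.io :as io])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream
                  PipedInputStream PipedOutputStream])

(defn pump-through-pipe
  "Copies data through a PipedOutputStream/PipedInputStream pair whose
  internal buffer holds pipe-size bytes, and returns the bytes read back."
  [^bytes data pipe-size]
  (let [pipe-in  (PipedInputStream. (int pipe-size))
        pipe-out (PipedOutputStream. pipe-in)
        sink     (ByteArrayOutputStream.)
        ;; Producer runs on another thread; closing pipe-out signals EOF.
        producer (future
                   (with-open [out pipe-out]
                     (io/copy (ByteArrayInputStream. data) out
                              :buffer-size pipe-size)))]
    ;; Consumer runs on the calling thread and drains the pipe into sink.
    (with-open [in pipe-in]
      (io/copy in sink :buffer-size pipe-size))
    @producer
    (.toByteArray sink)))

(def payload (.getBytes (apply str (repeat 10000 "ERROR line\n")) "UTF-8"))

(= (seq payload) (seq (pump-through-pipe payload (* 32 1024))))
;; => true
```

Wrapping the call in (time ...) with different pipe-size values gives a rough feel for the effect, although an in-memory source will not reproduce the measurements above.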

Cees.

Marc Limotte

Jun 24, 2013, 9:45:05 AM
to cloju...@googlegroups.com
Awesome, thanks for taking the time to investigate, Cees.  32k sounds like a good compromise setting.  I wonder how much this might vary from one environment/OS to another?  Setting one specific value will be a lot easier than making it a configurable dynamic var (particularly because the implementation uses futures, so I'd need a way to capture the local thread binding and propagate it to the child threads).

Probably best to implement a single value now and wait to see if there is demand for something more sophisticated.
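
On the dynamic var concern, for what it's worth: since Clojure 1.3, future conveys the calling thread's dynamic bindings to the thread that runs its body, so a binding around the pipeline should be visible inside futures without extra work (threads created by other means would still need something like bound-fn). A minimal sketch, with *pipe-size* as a hypothetical var name:

```clojure
;; Hypothetical dynamic var for the pipe buffer size (name is an assumption).
(def ^:dynamic *pipe-size* 1024)

;; Reads the var from inside a future, i.e. on a different thread.
(defn pipe-size-seen-by-future []
  @(future *pipe-size*))

(binding [*pipe-size* (* 32 1024)]
  (pipe-size-seen-by-future))
;; => 32768

(pipe-size-seen-by-future)
;; => 1024
```

Whether this is enough depends on how the library creates its threads, so it would need checking against the actual implementation.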

Marc

