New string utilities library ready

Stuart Sierra

Aug 19, 2009, 12:45:32 PM
to clo...@googlegroups.com
Hey folks,

clojure.contrib.str-utils is one of the first libs I wrote, and it's
showing its age. I decided to try to start fresh, incorporating some
ideas discussed on the list. In general, I'm trying to provide an
efficient, functional API for string manipulation.

My new attempt is creatively named clojure.contrib.str-utils2.

One big change: you can't (use ...) it. That's because it reuses some
of the names in clojure.core. For example, it defines "take" and
"drop" specifically for strings.
You have to (require '[clojure.contrib.str-utils2 :as s]) then call
functions like s/take, s/drop. If everybody hates this, I'll change
it, but it would require adding prefixes on a bunch of functions.
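
A minimal sketch of why the alias is needed. The namespaces `demo.strs` and `demo.app` are hypothetical, and the `subs`-based definitions are stand-ins for illustration, not the actual str-utils2 source:

```clojure
;; Hypothetical stand-in for a library that reuses core names.
(ns demo.strs
  (:refer-clojure :exclude [take drop]))

(defn take
  "First n characters of s (all of s if it is shorter)."
  [s n]
  (subs s 0 (min n (count s))))

(defn drop
  "s without its first n characters."
  [s n]
  (subs s (min n (count s))))

;; Client code requires the library under an alias, so that
;; clojure.core/take and clojure.core/drop stay available unprefixed.
(ns demo.app
  (:require [demo.strs :as s]))

(s/take "clojure" 3)            ;=> "clo"
(s/drop "clojure" 3)            ;=> "jure"
(apply str (take 3 "clojure"))  ;=> "clo", via the core sequence function
```

With the alias, the string versions and the core sequence versions can be used side by side in the same namespace.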

Many of these functions are much faster than the equivalent using
sequences. For example, str-utils2/escape is 5-10 times faster than
(apply str (map f "foo"))
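
To illustrate the kind of difference being described, here is a sketch of the two approaches. The names `escape-seq`, `escape-sb`, and `html-char` are made up for this example and are not the library's API. No timings are claimed here, and the actual speedup will depend on the input and the JVM.

```clojure
(defn escape-seq
  "Sequence-based version: maps f over the characters, then joins."
  [f s]
  (apply str (map f s)))

(defn escape-sb
  "Loop-based version: appends (f c) for each character c to a single
  StringBuilder, avoiding the intermediate lazy sequence."
  [f ^String s]
  (let [sb (StringBuilder.)]
    (dotimes [i (count s)]
      (.append sb (str (f (.charAt s i)))))
    (str sb)))

(defn html-char
  "Example escaping function: returns a replacement string, or the
  original character when no escaping is needed."
  [c]
  (case c
    \< "&lt;"
    \> "&gt;"
    \& "&amp;"
    c))

(escape-seq html-char "a < b") ;=> "a &lt; b"
(escape-sb html-char "a < b")  ;=> "a &lt; b"
```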

Eventually, I'd like to replace the old clojure.contrib.str-utils.
Let me know what you think.

-Stuart Sierra

Vagif Verdi

Aug 19, 2009, 1:59:23 PM
to Clojure
I've been using str-utils2 for a couple of months now. I don't care
about the old library.

Chouser

Aug 19, 2009, 2:22:25 PM
to clo...@googlegroups.com
On Wed, Aug 19, 2009 at 1:59 PM, Vagif Verdi<vagif...@gmail.com> wrote:
>
> I've been using str-utils2 for a couple of months now. I don't care
> about the old library.

Me too. I think it would be helpful to have a recommended
namespace alias to help keep different people's code a bit
more uniform.

I use (require '[clojure.contrib.str-utils2 :as str2]) for
now and would recommend just 'str' if the lib name changes.

--Chouser

Howard Lewis Ship

Aug 19, 2009, 3:09:47 PM
to clo...@googlegroups.com
Have you considered splitting the str-utils2 into two namespaces, one
that can be imported, and another that needs to be required with a
namespace?
--
Howard M. Lewis Ship

Creator of Apache Tapestry

Stuart Sierra

Aug 19, 2009, 4:17:34 PM
to Clojure
On Aug 19, 3:09 pm, Howard Lewis Ship <hls...@gmail.com> wrote:
> Have you considered splitting the str-utils2 into two namespaces, one
> that can be imported, and another that needs to be required with a
> namespace?

Hi Howard,
Hadn't thought of that, actually. There are 9 conflicts, out of 32
definitions:
take replace drop butlast partition contains? get repeat reverse

But I don't really want to split it into two libraries. It seems to
be generally agreed that (require '[.. :as ..]) is a best practice,
this just helps to encourage that.

-SS

Dan Larkin

Aug 19, 2009, 4:30:03 PM
to clo...@googlegroups.com

On Aug 19, 2009, at 2:22 PM, Chouser wrote:

> I use (require '[clojure.contrib.str-utils2 :as str2]) for
> now and would recommend just 'str' if the lib name changes.


Except, of course, since there is already a str function, 'str' would
be a bad alias.

'strutils' or 'str-utils' sound fine to me, but I'm not so great at
the name game.

I'm in favor of str-utils2 replacing str-utils, though.

Dan

Sean Devlin

Aug 19, 2009, 5:16:16 PM
to Clojure
Stuart,
This is a significant improvement over the original str-utils library,
and goes a long way towards making "string processing kick ass in
Clojure". I like the fact that you made some design decisions for the
library, and did everything you could to stick with them. That makes
the library more predictable.

There are two things I would like to discuss about your library.

First, I would change the names of the functions that collide
with core to str-take, str-drop, etc. It's just as much to type, and
it is safe to use these names. Also, it would make it easier for Rich
to promote the library to the standard lib when it's done.

I suspect I am in the minority with my next concern. The library
takes the string as the first argument, so that it works well with the
-> macro. When I originally wrote my string library, I favored this
type of signature too.

However, over time I found this signature did not work well with my
code. Often I would write something like this

(map (comp (partial map (comp #(str2/drop % 2)
                              #(str2/take % 5)))
           #(str2/split % #"\t"))
     (split a-string #"[\n\r]"))

This felt a little forced, and the methods don't compose very well.
As such, I re-wrote my lib with the string call at the end of the
function. The main reason was I felt that this approach works better
with the partial function. The code above becomes like this.

;Granted, this is still a bit ugly.
;I have some other tricks to clean it up.
;That's for another day.
(map (comp (partial map (comp (partial str2/drop 2)
                              (partial str2/take 5)))
           (partial str2/split #"\t"))
     (split #"[\r\n]" a-string))

Despite these concerns, I'm still excited about the direction you are
headed with this lib. Let me know what you think about these points.

Sean Devlin

Stuart Sierra

Aug 19, 2009, 9:07:48 PM
to Clojure
On Aug 19, 5:16 pm, Sean Devlin <francoisdev...@gmail.com> wrote:
> I suspect I am in the minority with my next concern.  The library
> takes the string as the first argument, so that it works well with the
> -> macro.  When I originally wrote my string library, I favored this
> type of signature too.
>
> However, over time I found this signature did not work well with my
> code.  Often I would write something like this
>
> (map (comp (partial map (comp   #(str2/drop % 2)
>                                 #(str2/take % 5)))
>                 #(str2/split % #"\t"))
>         (split a-string #"[\n\r]"))
>
> This felt a little forced, and the methods don't compose very well.
> As such, I re-wrote my lib with the string call at the end of the
> function.  The main reason was I felt that this approach works better
> with the partial function.

Hi Sean,

Good point. It's always a question whether argument order should favor
"partial" or "->". In clojure.core, the sequence functions generally
put the sequence argument last, while other collection functions (like
conj, assoc) put the collection first.

Anyone else have an opinion on this?

-SS

Stuart Sierra

Aug 19, 2009, 9:26:22 PM
to Clojure
On Aug 19, 5:16 pm, Sean Devlin <francoisdev...@gmail.com> wrote:
> However, over time I found this signature did not work well with my
> code.  Often I would write something like this
>
> (map (comp (partial map (comp   #(str2/drop % 2)
>                                 #(str2/take % 5)))
>                 #(str2/split % #"\t"))
>         (split a-string #"[\n\r]"))
>
> This felt a little forced, and the methods don't compose very well.

On the other hand, here's another way of writing the above, using ->
with some added helpers:

(defn each [coll f]
  (map f coll))

(defmacro each-> [coll & body]
  `(each ~coll (fn [x#] (-> x# ~@body))))

(-> a-string
    (str2/split #"[\n\r]")
    (each-> (str2/split #"\t")
            (each-> (str2/take 5)
                    (str2/drop 2))))
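
For anyone who wants to try this at a REPL, here is a self-contained sketch of the same pipeline. It assumes `clojure.string/split` (which also takes the string first) in place of str2/split, and two hypothetical `subs`-based stand-ins, `str-take` and `str-drop`, in place of str2/take and str2/drop:

```clojure
(require '[clojure.string :as string])

;; Hypothetical string-first stand-ins for str2/take and str2/drop.
(defn str-take [s n] (subs s 0 (min n (count s))))
(defn str-drop [s n] (subs s (min n (count s))))

(defn each [coll f]
  (map f coll))

(defmacro each-> [coll & body]
  `(each ~coll (fn [x#] (-> x# ~@body))))

;; Split into lines, split each line on tabs, then keep
;; characters 2 through 4 of each field.
(def rows
  (-> "abcdefg\thijklmn\nopqrstu"
      (string/split #"[\n\r]")
      (each-> (string/split #"\t")
              (each-> (str-take 5)
                      (str-drop 2)))))
;; rows is a lazy seq equal to [["cde" "jkl"] ["qrs"]]
```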

-SS

Sean Devlin

Aug 19, 2009, 9:54:32 PM
to Clojure
Hmmm... that's pretty clever. Well done.

Well, if we're gonna play golf :)

(def & comp)
(def p partial)

;;I like this because the amount of white spaces tells me something
;;Almost Pythonesque
(map (& (p map (& (p str2/drop 2)
                  (p str2/take 5)))
        (p str2/split #"\t"))
     (split #"[\r\n]" a-string))

Now, I have a really, really, really strong bias towards functional
composition over ->. I think it mostly has to do with re-using
existing functionality in core.

As a second example, I have a filtering function that tests a string
for a prefix.

;;We have a "Smart" part numbering system at my work
;;This uses the format of a part number to test whether it is a
;;designed part.
;;123 parts designate standard hardware
(def designed? (& not #{"123"} (p str2/take 3)))

(filter (& designed? :part-number) db-result)

So, my main point is to favor composition/partial in order to re-use
map, filter, etc.

Okay, enough from me on this.

CuppoJava

Aug 19, 2009, 9:56:02 PM
to Clojure
I'm also looking for a satisfactory answer to this problem.

So far I'm slightly in favor of putting the "data" (ie. the sequence/
collection/object ...) first in the argument list and the "parameters"
following.

This is because there are so many core functions that take a function
and arguments and apply them to some data. These functions always
assume the "data" comes first in the argument list.

If I were to have my way, I would redefine all the clojure.core
functions to assume the "data" is the last argument instead of the
first. (this includes ->) This way they would play nice with both
partial and ->.

I'm not sure if there's a serious drawback to that suggestion that I
haven't noticed though.

-Patrick

Stuart Sierra

Aug 19, 2009, 11:02:55 PM
to Clojure
On Aug 19, 9:56 pm, CuppoJava <patrickli_2...@hotmail.com> wrote:
> If I were to have my way, I would redefine all the clojure.core
> functions to assume the "data" is the last argument instead of the
> first. (this includes ->) This way they would play nice with both
> partial and ->.

That's a really interesting idea. What happens if:

(defmacro last->
  ([x form] (if (seq? form)
              (concat form (list x))
              (list form x)))
  ([x form & more] `(last-> (last-> ~x ~form) ~@more)))
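
A short usage sketch of the macro above with ordinary sequence functions, which take their sequence last. Current versions of clojure.core include `->>`, which threads the value into the last position in the same way, so the second definition is shown for comparison. The names `evens-squared` and `evens-squared-core` are only for this example.

```clojure
(defmacro last->
  ([x form] (if (seq? form)
              (concat form (list x))
              (list form x)))
  ([x form & more] `(last-> (last-> ~x ~form) ~@more)))

;; The threaded value becomes the LAST argument of each form:
;; (map #(* % %) (filter even? (range 10)))
(def evens-squared
  (last-> (range 10)
          (filter even?)
          (map #(* % %))))

;; The same pipeline written with clojure.core's ->> macro.
(def evens-squared-core
  (->> (range 10)
       (filter even?)
       (map #(* % %))))
;; both are equal to [0 4 16 36 64]
```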

-SS

Sean Devlin

Aug 19, 2009, 11:07:04 PM
to Clojure
+1

samppi

Aug 20, 2009, 12:45:31 AM
to Clojure
For me, I'd like it if the core functions had the "data" as the first
argument, but have a special function—I can't come up with a better
name than "partial-2"—so that (partial-2 function opt1 opt2 opt3) is
equivalent to (fn [data] (function data opt1 opt2 opt3)). That way, I
could do things like (map (partial-2 s/split #"\n" 30) vector-of-strs)
without breaking ->.

In fact, something like partial-2 would be useful right now for
functional composition with str-utils2's functions.

John Harrop

Aug 20, 2009, 1:14:32 AM
to clo...@googlegroups.com
On Thu, Aug 20, 2009 at 12:45 AM, samppi <rbys...@gmail.com> wrote:

> For me, I'd like it if the core functions had the "data" as the first
> argument, but have a special function—I can't come up with a better
> name than "partial-2"—so that (partial-2 function opt1 opt2 opt3) is
> equivalent to (fn [data] (function data opt1 opt2 opt3)). That way, I
> could do things like (map (partial-2 s/split #"\n" 30) vector-of-strs)
> without breaking ->.

Is there something wrong with (map #(s/split % #"\n" 30) vector-of-strs)?

The #(...) lambda read-macro seems to me to obviate most needs for partial and partial-2.

Meikel Brandmeyer

Aug 20, 2009, 2:29:04 AM
to Clojure
Hi,

Disclaimer: personal opinion following...

I'm sorry. I don't get the elegance of point-free style.

In mathematics f denotes the function, while f(x) denotes the value f
takes over x. This is actually a nice and easy to understand notation.
But why do I have to clutter my clojure code with `partial`s and
`comp`s because of that? In Haskell, where `partial` is automatic and
`comp` is a dot, this is maybe elegant. But not here.

I don't know, whether this example is contorted on purpose, but I
really had to very slowly step through it to see what's going on.

> (map (comp (partial map (comp #(str2/drop % 2)
>                               #(str2/take % 5)))
>            #(str2/split % #"\t"))
>      (split a-string #"[\n\r]"))

This is almost self-explaining:

(map (fn [part-numbers]
       (map #(-> % (str2/take 5) (str2/drop 2))
            (str2/split part-numbers #"\t")))
     (str2/split a-string #"[\n\r]"))

Maybe the (-> ...) part can be further extracted as `design-id` or so.

Class count 6 vs 2.

In mathematics f and f(x) are (in general) two different things. But
in programming 'x' is also information about the intended purpose. The
`part-numbers` argument of the anonymous function conveys some
information about what's supposed to be in there.

For last->: http://www.mail-archive.com/clo...@googlegroups.com/msg08098.html
(Can someone tell me why the search of ***Google*** groups is so
crappy?)

> Is there something wrong with (map #(s/split % #"\n" 30) vector-of-strs)?
>
> The #(...) lambda read-macro seems to me to obviate most needs for partial
> and partial-2.

+1

Also +1 for keeping the string argument first.

Sincerely
Meikel

Chas Emerick

Aug 20, 2009, 6:11:44 AM
to clo...@googlegroups.com
On Aug 20, 2009, at 2:29 AM, Meikel Brandmeyer wrote:

> Hi,
>
> Disclaimer: personal opinion following...

I think that's all we have when it comes to matters of style :-)

> I'm sorry. I don't get the elegance of point-free style.
>
> In mathematics f denotes the function, while f(x) denotes the value f
> takes over x. This is actually a nice and easy to understand notation.
> But why do I have to clutter my clojure code with `partial`s and
> `comp`s because of that? In Haskell, where `partial` is automatic and
> `comp` is a dot, this is maybe elegant. But not here.
>
> I don't know, whether this example is contorted on purpose, but I
> really had to very slowly step through it to see what's going on.
>
>> (map (comp (partial map (comp #(str2/drop % 2)
>>                               #(str2/take % 5)))
>>            #(str2/split % #"\t"))
>>      (split a-string #"[\n\r]"))
>
> This is almost self-explaining:
>
> (map (fn [part-numbers]
>        (map #(-> % (str2/take 5) (str2/drop 2))
>             (str2/split part-numbers #"\t")))
>      (str2/split a-string #"[\n\r]"))

I agree wholeheartedly with Meikel. -> is very straightforward for me
to understand.

Outside of matters of style, changing the expected arguments of ->
would make at least two things impossible:

- use of variadic fns, e.g. (-> data (my-fn arg1 arg2 ... argN) keys
last)

- use of host platform fns (this is incredibly useful with
ByteBuffers, etc) (-> bytebuffer (.put other-data) .flip (.position
23) (.limit 89))
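
Both cases can be sketched with plain JVM classes. The example below uses `assoc`/`dissoc` as the variadic functions and a `StringBuilder` in place of the ByteBuffer, purely to keep it small. In each step the threaded value has to be the first argument (or the method target), which is what a last-position `->` would rule out.

```clojure
;; Variadic Clojure functions: the threaded map stays first, and any
;; number of further arguments may follow it.
(def m
  (-> {}
      (assoc :a 1 :b 2 :c 3)
      (dissoc :b)))
;; m is {:a 1, :c 3}

;; Host-platform methods: the threaded value is the method's target.
(def reversed
  (-> (StringBuilder. "abc")
      (.append "def")
      .reverse
      str))
;; reversed is "fedcba"
```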

Back to matters of style, and agreeing with John upthread, doesn't #()
provide a superset of partial's functionality, with more 'literal',
easier-to-read code?

Cheers,

- Chas

Michel Salim

Aug 20, 2009, 2:45:16 AM
to clo...@googlegroups.com
On Wed, 2009-08-19 at 23:29 -0700, Meikel Brandmeyer wrote:
> Hi,
>
> Disclaimer: personal opinion following...
>
> I'm sorry. I don't get the elegance of point-free style.
>
> In mathematics f denotes the function, while f(x) denotes the value f
> takes over x. This is actually a nice and easy to understand notation.
> But why do I have to clutter my clojure code with `partial`s and
> `comp`s because of that? In Haskell, where `partial` is automatic and
> `comp` is a dot, this is maybe elegant. But not here.
>
Plus, in Haskell, one can query the type signature of the partial
functions, and try and make sense of what they do. Even there,
points-free code can be hard to read. In a Lisp dialect like Clojure,
it's probably even worse.

Regards,

--
Michel

Stuart Sierra

Aug 20, 2009, 11:26:44 AM
to Clojure
Seems like opinion is pretty evenly divided here. I'll leave the
library as-is for now, give it some time to see how things play out.

In the meantime, as a compromise, I've added str-utils2/partial,
which is like clojure.core/partial for functions that take their
primary argument first.

(str2/partial str2/take 2)
;;=> (fn [s] (str2/take s 2))

Now you can compose these using comp, map, whatever.
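
A minimal sketch of how such a first-argument partial might look and compose. The name `partial-first` and the `subs`-based `str-take`/`str-drop` are hypothetical stand-ins, and the actual str-utils2/partial implementation may differ:

```clojure
;; Hypothetical string-first stand-ins for str2/take and str2/drop.
(defn str-take [s n] (subs s 0 (min n (count s))))
(defn str-drop [s n] (subs s (min n (count s))))

(defn partial-first
  "Returns a fn of one argument x that calls (f x arg1 arg2 ...)."
  [f & args]
  (fn [x] (apply f x args)))

;; comp applies right to left: take 5 characters, then drop 2.
(def middle
  (comp (partial-first str-drop 2)
        (partial-first str-take 5)))

(map middle ["abcdefg" "hijklmn"]) ;=> ("cde" "jkl")
```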

-SS

Bradbev

Aug 20, 2009, 12:20:04 PM
to Clojure
I've never really used the partial function before, but it seems that
it should be fairly easy to write a macro that lets you replace
certain arguments.

Usage would be (partial* fun _data 1 :test _index) returning something
like (fn [_data _index] (fun _data 1 :test _index))
So you get back a 2 arg function with some parameters already filled
in by either constant data or closed over bindings. The macro would
need to use some sort of convention to recognize which symbols are
meant to be used in the new function arglist, I've used _ as the
prefix. I don't have time right now to actually write the macro :)
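
One possible sketch of such a macro, under the convention described above. It only looks at top-level arguments (not nested forms), and the helper name `placeholder?` is an assumption of this example:

```clojure
(defn placeholder?
  "True for symbols whose name starts with an underscore."
  [x]
  (and (symbol? x) (.startsWith (name x) "_")))

(defmacro partial*
  "Expands to a fn whose parameters are the underscore-prefixed
  symbols found among args, in order of first appearance. All other
  args are passed through unchanged."
  [f & args]
  (let [params (vec (distinct (filter placeholder? args)))]
    `(fn ~params (~f ~@args))))

;; Expands to (fn [_data _index] (vector _data 1 :test _index))
(def tagged (partial* vector _data 1 :test _index))

(tagged :d 7) ;=> [:d 1 :test 7]
```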

Brad

Brian Carper

Aug 20, 2009, 2:38:13 PM
to Clojure
On Aug 19, 2:16 pm, Sean Devlin <francoisdev...@gmail.com> wrote:
> First, I would change the names of the functions that collide
> with core to str-take, str-drop, etc.  It's just as much to type, and
> it is safe to use these names.  Also, it would make it easier for Rich
> to promote the library to the standard lib when it's done.

+1

I also think contains? might not be a good name given that it doesn't
do the same thing as clojure.core/contains?. Aside from one being
constant time and one being linear time, you have:

user> (clojure.core/contains? "foobar" "f")
false
user> (clojure.contrib.str-utils2/contains? "foobar" "f")
true

I think this is potentially confusing. This also may be another point
in favor of making the name str-contains?.
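
The difference can be sketched as follows. For a string, clojure.core/contains? treats integer indices as the keys, while a substring test asks a different question. The `str-contains?` below is a hypothetical stand-in built on `java.lang.String`'s `contains` method, not the library's source:

```clojure
;; clojure.core/contains? asks whether a KEY is present. For a
;; string, the keys are its integer indices.
(contains? "foobar" 0)   ;=> true  (index 0 exists)
(contains? "foobar" 6)   ;=> false (indices run from 0 to 5)

;; A substring test, via java.lang.String's contains method.
(defn str-contains?
  "True if s contains the substring sub."
  [^String s ^String sub]
  (.contains s sub))

(str-contains? "foobar" "f")  ;=> true
(str-contains? "foobar" "z")  ;=> false
```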

An str- prefix would also be consistent with the re-* group of
functions that deal with regexes.

Sean Devlin

Aug 22, 2009, 9:21:27 AM
to Clojure
Okay, I'm not sure what the correct thing to do for the entire library
is, but I think I've got a convincing argument for some functions.

The following functions share a name with core functions

butlast
contains?
drop
get
partition
repeat
reverse
take

These functions should follow their corresponding signature in core.
I found it difficult to remember "which way" the signature was
supposed to go, even when writing one post on this list. I suspect
that newcomers would find this difficult as well.

I would make an exception for the following method:
replace

I don't think the signature in core makes sense, because the core fn
takes a map. When I use str-replace, I would like certain guarantees
about the order operations are applied in, and I don't think this is
possible when the data is passed in as a map.
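
To make the ordering concern concrete, here is a sketch that takes the replacements as a vector of pairs, so they are applied in a known order. The name `replace-in-order` is hypothetical, and it uses `clojure.string/replace` rather than anything from str-utils2:

```clojure
(require '[clojure.string :as string])

(defn replace-in-order
  "Applies each [match replacement] pair to s, left to right."
  [s pairs]
  (reduce (fn [acc [match replacement]]
            (string/replace acc match replacement))
          s
          pairs))

;; The same two replacements give different results in each order.
(replace-in-order "aab" [["a" "b"] ["b" "c"]]) ;=> "ccc"
(replace-in-order "aab" [["b" "c"] ["a" "b"]]) ;=> "bbc"
```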

Just another thing to think about.

Sean
