proposed addition to contrib:seq-utils

34 views
Skip to first unread message

Sean Devlin

unread,
Aug 27, 2009, 10:20:26 PM8/27/09
to Clojure Dev
Well, here's part 3 of our epic saga. I've got some functions I'd
like to add to seq-utils. They are designed to operate on a list of
maps, which I will refer to as a table. Table could come from an
RDBMS, REST calls, an ERP System, or an Excel Spreadsheet. The main
idea was to separate the marshaling the data from the set operations.
As the operations support guy at my job, I can't oversell how useful
it is to be able to join two different legacy schemas, or SAP data
with a spreadsheet.

These operations depend heavily on fn-tuple, and the trans closure I
described in my other two threads.

There are two main operations that I'll identify today

*pivoting
*joining (six types)

First, let's consider pivoting. This is best show by example.
Suppose you have a table with the following columns

table:sales
order_id (int)
product_id (int)
product_cost (double)
quantity (double)

Now, we use c.c.sql-utils to query the entire table.

;runs "SELECT * FROM SALES;"
(def sales-table (a-call-to-a-db ...))

How would we get the total sales for each product? Perhaps like this:

(apply merge-with +
(map
#(hash-map
(:product-id %)
(* (:product-cost %) (:quantity %)))
sales-table))

This involves 3 distinct operation

1. (:product-id %)
This is the grouping function. It determines which product the
order is associated with.

2. (* (:product-cost %) (:quantity %))
This is the mapping function. It determines the value for each row.

3. +
This is the reduction function. It combines every item in a group.

We are now ready to define the pivot function

(defn pivot [coll grouping-fn mapping-fn reduce-fn]
(apply merge-with reduce-fn
(map
#(hash-map
(grouping-fn %)
(mapping-fn %))
sales-table)))

This replaces our s-exp above with

(pivot sales-table :product-id (* (:product-cost %) (:quantity %)) +)

Now, it is very easy to change the groupings, mapping, or reductions.
For example, what if we wanted to count the quantity of units sold?
The pivot operation becomes

(pivot sales-table :product-id :quantity +)

Very clean, IMHO.

Now, suppose we wasted to count the number of times each product was
ordered. That is, how many rows is each product_id in? We simply
write the pivot operation as follows

(pivot sales-table :product-id (constantly 1) +)

This frequency pivoting operation shows up often enough that I created
the following alias

(def freq (constantly 1))
(pivot sales-table :product-id freq +)

In fact, c.c.seq-util/frequencies could be written in terms of pivot

(defn frequencies
[coll]
(pivot coll identity freq +))

That's it for pivot. Now, let's consider join operations:

* joins

There are many times when you need to perform a type of join on a list
of data, but it cannot
be performed on the back end. The back end may not support the type
of join, or you may need to join
data from two different back ends (e.g. one SQL and another REST).

In order to solve this problem, this library also includes a set of
functions to perform joins.
Currently the following type of joins are supported

* inner-join (equi, nautural, cross)
* outer-join (left, right, full)

Let's define some terms for our examples.

user=> (def test-left
[{:name "Sean" :age 27}
{:name "Ross" :age 27}
{:name "Brian" :age 22}])

user=> (def test-right
[{:owner "Sean" :item "Beer"}
{:owner "Sean" :item "Pizza"}
{:owner "Ross" :item "Computer"}
{:owner "Matt" :item "Bike"}])

user=> (def test-proj (fn-tuple :name :age :owner :item))

## inner-join (equi)
This is for performing inner joins. The join value must exist in both
lists. This function takes a left collection,
a right collection, and at least one join function. If only one join
function is provided, it is used on both the left & right hand sides.
### Signature

(inner-join left-coll right-coll join-fn)
(inner-join left-coll right-coll left-join-fn right-join-fn)

### Usage

user=> (println (to-tab-str (map test-proj (inner-join test-left test-
right :name :owner))))
Ross 27 Ross Computer
Sean 27 Sean Beer
Sean 27 Sean Pizza
nil

Yup, this behaves properly.

## natural-join (natural)
This performs a natural join on the two collections.
### Signature
(natural-join left-coll right-coll)
### Usage
user=> (println (to-tab-str (map test-proj (natural-join test-left
test-right))))
nil

Hmmm... nothing. This is what we would expect. Let's see if we can
make this work with a mapping operation on the test-right data.

user=> (println (to-tab-str (map test-proj (natural-join test-left
(map (trans :name :owner) test-right)))))
Ross 27 Ross Computer
Sean 27 Sean Beer
Sean 27 Sean Pizza
nil

Perfect.

## cross-join (cross)
### Signature
(cross-join left-coll right-coll)
### Usage
user=> (println (to-tab-str (map test-proj (cross-join test-left test-
right))))
Sean 27 Sean Beer
Sean 27 Sean Pizza
Sean 27 Ross Computer
Sean 27 Matt Bike
Ross 27 Sean Beer
Ross 27 Sean Pizza
Ross 27 Ross Computer
Ross 27 Matt Bike
Brian 22 Sean Beer
Brian 22 Sean Pizza
Brian 22 Ross Computer
Brian 22 Matt Bike
nil

Yup, it's verbose output.

## left-outer-join
This is for performing left outer joins. The join value must exist in
the left hand list. This function takes a left collection,
a right collection, and at least one join function. If only one join
function is provided, it is used on both the left & right hand sides.
### Signature
(left-inner-join left-coll right-coll join-fn)
(left-innner-join left-coll right-coll left-join-fn right-join-fn)

### Usage
user=> (println (to-tab-str (map test-proj (left-outer-join test-left
test-right :name :owner))))
Brian 22 null null
Ross 27 Ross Computer
Sean 27 Sean Beer
Sean 27 Sean Pizza
nil

## right-outer-join
This is for performing right outer joins. The join value must exist
in the right hand list. This function takes a left collection,
a right collection, and at least one join function. If only one join
function is provided, it is used on both the left & right hand sides.
### Signature
(right-inner-join left-coll right-coll join-fn)
(right-innner-join left-coll right-coll left-join-fn right-join-fn)

### Usage

user=> (println (to-tab-str (map test-proj (right-outer-join test-
left test-right :name :owner))))
null null Matt Bike
Ross 27 Ross Computer
Sean 27 Sean Beer
Sean 27 Sean Pizza
nil

## full-outer-join
This is for performing full outer joins. The join value may exist in
either list. This function takes a left collection,
a right collection, and at least one join function. If only one join
function is provided, it is used on both the left & right hand sides.
### Signature
(full-inner-join left-coll right-coll join-fn)
(full-innner-join left-coll right-coll left-join-fn right-join-fn)

### Usage

user=> (println (to-tab-str (map test-proj (full-outer-join test-left
test-right :name :owner))))
Brian 22 null null
null null Matt Bike
Ross 27 Ross Computer
Sean 27 Sean Beer
Sean 27 Sean Pizza
nil

Booya.

Well, that covers everything I wanted to explain with seq-utils. I
know it's a lot. Let me know what you think. Is this something that
should be in contrib?

Stuart Sierra

unread,
Aug 28, 2009, 10:40:16 AM8/28/09
to Clojure Dev
On Aug 27, 10:20 pm, Sean Devlin <francoisdev...@gmail.com> wrote:
> Well, that covers everything I wanted to explain with seq-utils.  I
> know it's a lot.  Let me know what you think.  Is this something that
> should be in contrib?

Yes, but in a separate namespace, something like
clojure.contrib.table.

-SS

Tom Faulhaber

unread,
Aug 28, 2009, 5:49:39 PM8/28/09
to cloju...@googlegroups.com

Very nice stuff.

But I agree with Stuart. Let's give it its own namespace.

Tom

On Aug 28, 2009 7:42 AM, "Stuart Sierra" <the.stua...@gmail.com> wrote:

On Aug 27, 10:20 pm, Sean Devlin <francoisdev...@gmail.com> wrote: > Well, that covers everything I...

Yes, but in a separate namespace, something like
clojure.contrib.table.

-SS

--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subs...

Sean Devlin

unread,
Aug 28, 2009, 5:59:06 PM8/28/09
to Clojure Dev
c.c.table sounds good to me. I'll get working on a patch for the
namespace, as well as some of the other fns I've discussed. I'll need
someone to help get it committed once the patch is up.

For now, you can play with the code here:

http://github.com/francoisdevlin/devlinsf-clojure-utils/tree/master


On Aug 28, 5:49 pm, Tom Faulhaber <tomfaulha...@gmail.com> wrote:
> Very nice stuff.
>
> But I agree with Stuart. Let's give it its own namespace.
>
> Tom
>
> On Aug 28, 2009 7:42 AM, "Stuart Sierra" <the.stuart.sie...@gmail.com>

Sean Devlin

unread,
Aug 28, 2009, 11:15:56 PM8/28/09
to Clojure Dev
Patch is up, including code, tests & docs. See the Clojure Contrib
Assembla ticket #28. Can some help me get the patch in?

Thanks,
Sean

Rich Hickey

unread,
Aug 29, 2009, 8:55:54 AM8/29/09
to cloju...@googlegroups.com
On Fri, Aug 28, 2009 at 5:49 PM, Tom Faulhaber<tomfau...@gmail.com> wrote:
> Very nice stuff.
>
> But I agree with Stuart. Let's give it its own namespace.
>
> Tom
>
> On Aug 28, 2009 7:42 AM, "Stuart Sierra" <the.stua...@gmail.com>
> wrote:
>
> On Aug 27, 10:20 pm, Sean Devlin <francoisdev...@gmail.com> wrote: > Well,
> that covers everything I...
>
> Yes, but in a separate namespace, something like
> clojure.contrib.table.
>

I'd like to ask you all to pause here. Let's not add new libraries to
contrib without asking me first, please. Everyone who is a committer
in contrib designed a new library which I approved, and was given
stewardship over enhancements to that lib. But larger contributions of
what constitute new libs need new approval.

I admit I haven't had time to fully digest Sean's proposed lib yet,
and it may be a fine addition, but some things about it don't seem
quite right yet, and maybe that's just a lack of understanding on my
part.

Rich

Rich Hickey

unread,
Aug 29, 2009, 8:57:15 AM8/29/09
to cloju...@googlegroups.com
On Fri, Aug 28, 2009 at 5:59 PM, Sean Devlin<francoi...@gmail.com> wrote:
>
> c.c.table sounds good to me.  I'll get working on a patch for the
> namespace, as well as some of the other fns I've discussed.  I'll need
> someone to help get it committed once the patch is up.
>
> For now, you can play with the code here:
>
> http://github.com/francoisdevlin/devlinsf-clojure-utils/tree/master
>
>

Where exactly is this library? I'd like to have a look.

Thanks,

Rich

Sean Devlin

unread,
Aug 29, 2009, 9:08:18 AM8/29/09
to Clojure Dev
Check the lib.devlinsf.map-utils namespace. Keep in mind this isn't
100% the same proposal as I made for contrib, because there are a few
functions I was worried wouldn't make the cut.

Also, docs/map-utils.markdown has a lot of information.

There are still a few things not 100% right (namely column
collisions), but it's close.

I'd be glad to answer any specific questions about this or any other
namespaces in my lib.

Sean


On Aug 29, 8:57 am, Rich Hickey <richhic...@gmail.com> wrote:

Sean Devlin

unread,
Sep 10, 2009, 12:33:38 AM9/10/09
to Clojure Dev
I'd like to spend some time discussing indexing collections today,
perhaps this will help everyone understand the join functions better.
It is one of the most important aspects of my join engine. Let's
consider a real world example from my job.

Over the past few weeks I have been trying to coordinate manufacturing
information for my company. As such I have been trying to make sense
of inventory levels as they relate to production. Let's consider the
following inventory table with 4 columns.

:part-num :rev :qty :name
123-01 A 15 Ancient Widget
123-01 B 15 Ancient Widget
123-02 A 0 Old Widget
123-03 A 2 Current Widget
123-03 B 0 Current Widget
... Other Parts ...

Now, the design control systems are very particular. It requires
inventory to match part-number & rev(ision) in order to be used. So,
in order to index this table we would need to group it as follows

;Assume inventory-table stores a list of maps
;Returns an indexed inventory-table
user=>(group-by (juxt :part-num :rev) inventory-table)

Now, in our example let's assume that we need to build the 123-03 B
Widget. If we do a lookup on this we'll find... no inventory.
Great. However, our lead engineer tells us that the 123-03 A is the
exact same part, except it doesn't have a product label on it. It'll
do the exact same job as the B version. Let's make a gross assumption
that all revs are interchangeable. Our grouping function becomes

user=>(group-by :part-num inventory-table)

And, if we investigate the 123-03 part, we find that we have 2 in
inventory. Awesome!

Of course, no story would be complete without a third act. Our boss
comes in, telling us that he needs 25 of 123-03 Widgets immediately.
We tell him we've only got 2 in house, and it will take months to
purchase new material. The boss asks if there is anything we can do.
He mentions that even older versions of the part will be acceptable.

Now, I need to reveal an extra detail about our part numbering
system. As you can see, each part number begins with a 123- prefix.
This is what is known as a "smart part number", in which the prefix
means something. In my real life case, any part with the same prefix
is part of the same design history. The 123-01 part is an ancient
version of the 123-03 part. So, our grouping function will look like
this

;Assume str-utils2 is required as s
user=>(group-by (comp #(s/take % 3) :part-num) inventory-table)

Now, all of 123* parts are grouped together. Fantastic.

However, my boss is still not happy. He hands me the following list
of parts

:pn :qty-req
123-03 25
124-01 20
200-10 5
205-10 10
999-01 1

I have to do determine the availability for all of these parts.
Fortunately, my boss has the same flexibility, and ANY version of the
part will do. So, we can index the
list of "boss parts" as follows

user=>(group-by (comp #(s/take % 3) :pn) boss-parts)

Now that we've determined our indexing fns, we can run the report the
boss needs

;Left outer join w/ the boss parts on the left will
;provide a complete list of stuff the boss needs
user=>(left-outer-join
boss-parts
inventory-table
(comp #(s/take % 3)
:part-num)
(comp #(s/take % 3)
:pn))

I've been calling these operations "approximate joins". I get my boss
the material he wants, and the day is finally saved!

While I am enjoying solving a problem for my boss, some of you may be
wondering something. If the 123* prefix indicates something, why
isn't it in a different field? Isn't
this a case of bad table design/normalization? What kind of database
system does this, anyway? Consider the following points

1 LEGACY DATABASES DO THIS ALL OF THE TIME
2 WORKING WITH LEGACY SYSTEMS PAYS THE BILLS
3 GOTO 2

I will skip my rant about poorly designed schemas & those who design
them, except to say that we are all on mistake away from creating a
terrible design that goes undetected for years. The reality is
developers must work with them constantly. It is difficult if not
impossible to perform the approximate join
with SQL, and that's not always an option in heterogeneous systems
anyway.

Since the mechanics of the joins is abstracted away, all that is
exposed is the application specific indexing. This applies equally to
SQL, REST, Excel, ERP, CSV, and JSON generated data sets. All that is
left of is defining a mapping operation, which every lisper is very,
very familiar with.

<conjecture>
If perfected, I think an out-of-the-box join engine could be a real
selling point for Clojure. No, it does not even come close to the
awesomeness of the STM. However, it is easier to understand, and I
speculate it's closer to a real pain point many Java programmers
have. It could serve as a simple, concrete example of when OO is the
wrong solution, and open minds to listen to the rest of the Clojure
story.

Just another $.02
</conjecture>

Sean

PS - This join proposal, and lots of other code I now write, is not
possible w/o pervasive persistent data structures. I do not know how
I would begin to write this code in non-FP languages. Thanks for
Clojure!
Reply all
Reply to author
Forward
0 new messages