Yahoo Pipes transformation questions

Serge Wroclawski

Apr 5, 2009, 8:35:51 AM

to Clojure Study Group Washington DC

I know we have the wiki for some of this, but I wanted to ask those of
you who've used Yahoo Pipes a few questions.

I've spent a little time looking at some example Pipes as well as
going through a tutorial, and my question is: which transformations
are most commonly used?

It seems to me the transformations I've seen that look like they're
used most commonly are:

sort by (some attribute)
select by (some attribute)
truncate by (number of entries)

Any others that people see in common use?

- Serge

Luke VanderHart

Apr 5, 2009, 3:56:57 PM

to Clojure Study Group Washington DC

I see a lot of Unions, Uniques, and If's.

The beauty of it is that, if implemented as pure lazy seqs, most of
the transformations are absolutely trivial.
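
A minimal sketch of that point, assuming records are plain Clojure maps and a feed is a seq of them. The sample feeds and the :title and :rank keys are invented for illustration; every function used is from clojure.core.

```clojure
;; Hypothetical sample feeds: each record is a map of key/value pairs.
(def feed-a [{:title "b" :rank 2} {:title "a" :rank 1}])
(def feed-b [{:title "c" :rank 3} {:title "a" :rank 1}])

;; Union: concatenate the feeds (lazy).
(def unioned (concat feed-a feed-b))

;; Unique: drop records equal to one already seen (lazy).
(def uniqued (distinct unioned))

;; Select / filter / "if": keep records that satisfy a predicate (lazy).
(def selected (filter #(> (:rank %) 1) uniqued))

;; Sort by an attribute (has to realize the whole seq).
(def sorted (sort-by :rank uniqued))

;; Truncate to the first n entries (lazy).
(def truncated (take 2 sorted))
```

Each step takes a seq and returns a seq, so the steps can be chained in any order a pipe needs.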

-Luke

Serge Wroclawski

Apr 5, 2009, 4:01:18 PM

to clojure-...@googlegroups.com

On Sun, Apr 5, 2009 at 3:56 PM, Luke VanderHart
<luke.va...@gmail.com> wrote:
>
> I see a lot of Unions, Uniques, and If's.

Can you give examples of "ifs"? Is an "if" like a filter?

My implementation questions are mainly around the filters and some of the sorts.

- Serge

Luke VanderHart

Apr 5, 2009, 4:40:33 PM

to Clojure Study Group Washington DC

Yes, I guess filters and "ifs" are the same.

As for implementation... Here's my take on it.

Filters end up being very simple. For each record, if it meets some
condition, pass it on. If not, throw it away.

Sorts are slightly more difficult because they can't be processed on a
per-record basis, and therefore can't really work lazily. In order to
sort a list of records, you have to have the whole list of records to
operate on. But once you do, in a vector or seq, it's pretty
straightforward to sort.
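
The difference can be sketched as follows, assuming an endless stream of made-up records with an :id key. This is only an illustration of the laziness argument, not a proposed design.

```clojure
;; An endless, lazily generated stream of hypothetical records.
(def stream (map (fn [n] {:id n}) (range)))

;; Filtering is per-record, so it composes with an unbounded source:
;; only as many records as `take` asks for are ever examined.
(def first-evens (take 3 (filter #(even? (:id %)) stream)))

;; Sorting needs every record, so the input has to be bounded first.
;; Calling (sort-by :id stream) on the unbounded stream would never return.
(def newest-first (sort-by :id > (take 5 stream)))
```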

-Luke


Serge Wroclawski

Apr 5, 2009, 8:24:45 PM

to clojure-...@googlegroups.com

On Sun, Apr 5, 2009 at 4:40 PM, Luke VanderHart
<luke.va...@gmail.com> wrote:
>
> Yes, I guess filters and "ifs" are the same

I've since found the documentation:

http://pipes.yahoo.com/pipes/docs?doc=modules

This maps basically onto what I thought...

Pipes seems to (and people can correct me if I'm wrong) map generally
onto a set of records, each with a set of key:value pairs, with the
possibility of looping through nested key:value pairs if necessary.

> As for implementation... Here's my take on it.
>
> Filters end up being very simple. For each record, if it meets some
> condition, pass it on. If not, throw it away.

Yes, the trick is that particular data types need some special
attention. For example, let's say I have two data sources: a CSV file
with the date in ISO 8601 format and an RSS 2.0 feed with the date in
RFC 822 format. You want to be able to say "newer" on both date
formats, so you can't simply treat them as text.
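
One way this could look, as a sketch only: parse both formats into the same type and compare the parsed values. It assumes a JVM with java.time (Java 8 or later, which postdates this thread); the function names parse-iso, parse-rfc822, and newer? are made up for the example.

```clojure
(import '(java.time Instant ZonedDateTime)
        '(java.time.format DateTimeFormatter))

;; ISO 8601 timestamp, e.g. from the CSV source.
(defn parse-iso [s]
  (Instant/parse s))

;; RFC 822 style date with a four-digit year, e.g. an RSS 2.0 pubDate.
;; RFC_1123_DATE_TIME is the closest built-in formatter; two-digit years
;; or zone names other than GMT would need extra handling.
(defn parse-rfc822 [s]
  (.toInstant (ZonedDateTime/parse s DateTimeFormatter/RFC_1123_DATE_TIME)))

;; "Newer" works on the parsed values, whatever text format they came from.
(defn newer? [^Instant a ^Instant b]
  (.isAfter a b))

(newer? (parse-rfc822 "Wed, 8 Apr 2009 11:12:28 GMT")
        (parse-iso "2009-04-05T08:35:51Z"))
```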

Similarly you may want to work with geo-encoded data and say "Inside
this box" or "Outside this polygon". This means you need to have an
internal representation of this new data type.
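
For the box case, a rough sketch of such an internal representation, assuming points are maps with :lat and :lon and a box is given by its south-west and north-east corners. The names and the sample coordinates are invented, and a box that crosses the antimeridian is not handled.

```clojure
;; Hypothetical representation: a point is {:lat .. :lon ..}, a box is
;; {:sw point :ne point}.
(defn inside-box? [{:keys [sw ne]} {:keys [lat lon]}]
  (and (<= (:lat sw) lat (:lat ne))
       (<= (:lon sw) lon (:lon ne))))

;; Rough, made-up box for the example.
(def dc-box {:sw {:lat 38.79 :lon -77.12} :ne {:lat 39.00 :lon -76.91}})

;; Usable directly as a filter predicate over geo-tagged records.
(def inside
  (filter #(inside-box? dc-box (:location %))
          [{:name "in"  :location {:lat 38.90 :lon -77.04}}
           {:name "out" :location {:lat 40.71 :lon -74.01}}]))
```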

Or may not... I may be over-thinking this.

Having not worked with Pipes before, I'm getting my head around the
problem space.

- Serge

Luke VanderHart

Apr 8, 2009, 11:12:28 AM

to Clojure Study Group Washington DC

Yes... the question of "data formats" is interesting. Strings are the
obvious and simple case, but as you mention, dates are a perfect
example of something more complex but still very common.

Also there is the difference in the names of keys, not just values.
Say I read in one inputter which labels a field "lastName" and another
which calls it "last_name". But I may want to do a uniqueness filter
in which I treat them as the same.

One solution for both problems is to enforce certain standards across
all components... For example, all inputters which generate a date
field must call it "date" and must send it in timestamp format. But I
really dislike that because you run into problems very quickly with
fields whose semantics are close, but not identical.

A better solution, in my opinion, is to provide transformers to rename
keys & reformat data, and make it the responsibility of the person
designing the pipe flow. If they wanted to compare dates from two
different data sources, they would be responsible for inserting a
"dateParser" transformer component to normalize the data before
feeding it to the comparator transformer.
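
A sketch of what such a "dateParser" component might look like, assuming a transformer is simply a function from a seq of records to a seq of records. The names date-parser and rfc822->instant, the sample records, and the use of java.time (Java 8 or later) are all assumptions made for illustration.

```clojure
(import '(java.time Instant ZonedDateTime)
        '(java.time.format DateTimeFormatter))

(defn date-parser
  "Returns a transformer that replaces the string under key k with the
  value produced by parse-fn."
  [k parse-fn]
  (fn [records]
    (map #(update % k parse-fn) records)))

(defn rfc822->instant [s]
  (.toInstant (ZonedDateTime/parse s DateTimeFormatter/RFC_1123_DATE_TIME)))

;; Two hypothetical sources with different date formats.
(def csv-records [{:title "from csv" :date "2009-04-05T08:35:51Z"}])
(def rss-records [{:title "from rss" :date "Wed, 8 Apr 2009 11:12:28 GMT"}])

;; The pipe designer inserts one parser per source; after that the
;; comparator (here sort-by) sees a single date type.
(def merged
  (sort-by :date
           (concat ((date-parser :date #(Instant/parse %)) csv-records)
                   ((date-parser :date rfc822->instant) rss-records))))
```

The choice of parser stays with whoever wires up the flow, so no source is forced to emit a particular format.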

Thanks,

-Luke


Keith Bennett

Apr 9, 2009, 1:53:44 PM

to Clojure Study Group Washington DC

Luke -

On Apr 8, 11:12 am, Luke VanderHart <luke.vanderh...@gmail.com> wrote:
> Yes... the question of "data formats" is interesting. Strings are the
> obvious and simple case, but as you mention, dates are a perfect
> example of something more complex but still very common.
>
> Also there is the difference in the names of keys, not just values.
> Say I read in one inputter which labels a field "lastName" and another
> which calls it "last_name". But I may want to do a uniqueness filter
> in which I treat them as the same.
>
> One solution for both problems is to enforce certain standards across
> all components... For example, all inputters which generate a date
> field must call it "date" and must send it in timestamp format. But I
> really dislike that because you run into problems very quickly with
> fields whose semantics are close, but not identical.

IMHO, there is no way around enabling the use of arbitrary field
names. If certain fields need to be connected in some way, that
connection should be explicitly specified. We would not want to limit
field names to the English language, and there could be multiple date
fields (e.g. start date and end date). The data type should probably
be specified in some kind of metadata field descriptor; although the
Java class could usually be inferred, it could not be inferred from
null/nil.
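
To make the nil problem concrete, a small sketch assuming a descriptor is just a map from field name to declared type, carried alongside the records. The descriptor shape and the field-type helper are hypothetical.

```clojure
;; A record whose :end-date is missing: the value alone says nothing
;; about the intended type, because (class nil) is nil.
(def record {:start-date (java.util.Date.) :end-date nil})

(class (:start-date record)) ; java.util.Date, inferred from the value
(class (:end-date record))   ; nil, nothing to infer from

;; A hypothetical descriptor. Field names stay arbitrary, and there can
;; be several date fields; the declared type fills in what inference cannot.
(def descriptor
  {:start-date {:type java.util.Date}
   :end-date   {:type java.util.Date}})

(defn field-type
  "Declared type of field k, falling back to the class of its value."
  [descriptor record k]
  (or (get-in descriptor [k :type])
      (class (get record k))))
```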

>
> A better solution, in my opinion, is to provide transformers to rename
> keys & reformat data, and make it the responsibility of the person
> designing the pipe flow. If they wanted to compare dates from two
> different data sources, they would be responsible for inserting a
> "dateParser" transformer component to normalize the data before
> feeding it to the comparator transformer.
>

+1...

Cheers,
Keith

Serge Wroclawski

Apr 9, 2009, 2:17:00 PM

to clojure-...@googlegroups.com

On Wed, Apr 8, 2009 at 11:12 AM, Luke VanderHart
<luke.va...@gmail.com> wrote:

> A better solution, in my opinion, is to provide transformers to rename
> keys & reformat data, and make it the responsibility of the person
> designing the pipe flow. If they wanted to compare dates from two
> different data sources, they would be responsible for inserting a
> "dateParser" transformer component to normalize the data before
> feeding it to the comparator transformer.

I've thought of this problem a little too. Here's my take:

I think ultimately we want to be always oriented toward the output.
Everything up to the output is "necessary work" to arrive at the
output.

That said, I think the easiest way to achieve this would be a
transformer that does associations between key names.

Some data begs to be normalized; time, I think, is one example. Other
data is going to be too difficult to manage that way, and it seems
easier overall to solve with a transformation of "associations", where
you'd provide a mapping between the keys in one pipe input and those
of another, so you'd just have

'pubDate' -> 'publication_date'
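
A minimal sketch of such an association transformer, assuming records are maps with keyword keys. The associate-keys name and the sample records are made up; clojure.set/rename-keys does the actual renaming.

```clojure
(require '[clojure.set :as set])

;; Hypothetical association for one input: its key -> the pipe's key.
(def rss-associations {:pubDate :publication_date})

(defn associate-keys
  "Returns a transformer that renames keys in every record according to
  the association map; keys not mentioned are left alone."
  [associations]
  (fn [records]
    (map #(set/rename-keys % associations) records)))

(def rss-records [{:title "post" :pubDate "2009-04-09"}])
(def csv-records [{:title "post" :publication_date "2009-04-09"}])

;; After the association, records from both inputs line up, so a plain
;; union followed by a uniqueness filter treats them as the same record.
(def merged
  (distinct (concat ((associate-keys rss-associations) rss-records)
                    csv-records)))
```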

How does Pipes handle this BTW?

I don't know how often this problem will actually come up in real life
either, so I'm less inclined to try to solve it in an elegant way (vs
data types, which I feel ought to be supported).

- Serge
