Add a tuple to each group of grouped data

kay

Apr 16, 2012, 1:59:19 PM
to cascading-user
I want to add a tuple to each group. For example, I want to add an
additional category to each group of records (grouped by article_id).

This is what I mean:

Pipe testPipe = new Pipe("article category pipe");
testPipe = new GroupBy(testPipe, new Fields("article_id"));
testPipe = new Every(testPipe, new AddNewCatValueBuffer());

In the AddNewCatValueBuffer(), I want to add the additional tuple for
each group. Is this the right way to do it?

Joris Bontje

Apr 16, 2012, 2:21:54 PM
to cascadi...@googlegroups.com
On Mon, Apr 16, 2012 at 7:59 PM, kay <kchat...@technoratimedia.com> wrote:
> I want to add a tuple to each group. For example, I want to add an
> additional category to each group of records (grouped by article_id).

Where will you be getting the category from?

kay

Apr 16, 2012, 2:23:39 PM
to cascading-user

It's a default category which needs to be added to all articles.
On Apr 16, 11:21 am, Joris Bontje <jo...@bontje.nl> wrote:

Joris Bontje

Apr 16, 2012, 2:30:20 PM
to cascadi...@googlegroups.com
On Mon, Apr 16, 2012 at 8:23 PM, kay <kchat...@technoratimedia.com> wrote:
> It's a default category which needs to be added to all articles.

Take a look at the Insert function, which does just that:
http://www.cascading.org/userguide/html/ch07s04.html

kay

Apr 16, 2012, 2:44:48 PM
to cascading-user
I guess I was not clear in my previous answer. The category value to
be added is a known default, but the tuple has other fields whose
values come from the group. For example, say the tuple has the
following fields: (article_id, site_id, category_id). I group by
article_id (as each article can have several categories). Now, for
each article, I want to add a 'default' category. So, for the tuple to
be added, the article_id and site_id need to come from the other
tuples in that group (literally from any one of them).

On Apr 16, 11:30 am, Joris Bontje <jo...@bontje.nl> wrote:

kay

Apr 16, 2012, 2:51:40 PM
to cascading-user
This is what I am doing, and it works. But I am curious whether I am
missing a better inline function.

Pipe1 = new GroupBy(Pipe1, new Fields(datum1.ARTICLE_ID));
Pipe1 = new Every(Pipe1, new AddOverallCategory(), Fields.RESULTS);

And in the buffer:

public void operate(FlowProcess process, BufferCall<NullContext> bufferCall) {
    Iterator<TupleEntry> iter = bufferCall.getArgumentsIterator();
    datum1 lastArticleDatum = new datum1();

    // Pass every tuple of the group through unchanged and remember the last one.
    // Caution: Cascading may reuse the TupleEntry returned by the iterator, so
    // datum1 should copy the values it needs rather than hold on to the entry.
    while (iter.hasNext()) {
        datum1 siteArticleCat = new datum1(iter.next());
        bufferCall.getOutputCollector().add(siteArticleCat.getTuple());
        lastArticleDatum = siteArticleCat;
    }

    // Add a tuple for the default category, copying the shared fields
    // from the last tuple seen in this group.
    datum1 newsiteArticleCatDatum = new datum1();
    newsiteArticleCatDatum.setArticleZurl(lastArticleDatum.getArticleZurl());
    newsiteArticleCatDatum.setSiteId(lastArticleDatum.getSiteId());
    newsiteArticleCatDatum.setCategoryId(0);

    bufferCall.getOutputCollector().add(newsiteArticleCatDatum.getTuple());
}
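
For readers who want to check the per-group logic outside a Cascading flow, here is a minimal plain-Java sketch of the same idea: pass every row of a group through, then emit one extra row carrying the default category. It does not use the Cascading API. The names DefaultCategorySketch, Row and addDefaultCategory are hypothetical, and the default category id of 0 is taken from the buffer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the per-group logic; none of these names are Cascading API.
class DefaultCategorySketch {

    // Hypothetical stand-in for an (article_id, site_id, category_id) tuple.
    static final class Row {
        final String articleId;
        final String siteId;
        final int categoryId;

        Row(String articleId, String siteId, int categoryId) {
            this.articleId = articleId;
            this.siteId = siteId;
            this.categoryId = categoryId;
        }

        @Override
        public String toString() {
            return "(" + articleId + ", " + siteId + ", " + categoryId + ")";
        }
    }

    static final int DEFAULT_CATEGORY_ID = 0;

    static List<Row> addDefaultCategory(List<Row> rows) {
        // Group by article_id, keeping the groups in first-seen order.
        Map<String, List<Row>> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            groups.computeIfAbsent(row.articleId, k -> new ArrayList<>()).add(row);
        }

        List<Row> out = new ArrayList<>();
        for (List<Row> group : groups.values()) {
            // Pass every existing row of the group through unchanged.
            out.addAll(group);
            // Copy the shared fields from the last row and emit one extra row.
            Row last = group.get(group.size() - 1);
            out.add(new Row(last.articleId, last.siteId, DEFAULT_CATEGORY_ID));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> input = Arrays.asList(
            new Row("a1", "s1", 7),
            new Row("a2", "s9", 3),
            new Row("a1", "s1", 4));
        for (Row row : addDefaultCategory(input)) {
            System.out.println(row);
        }
    }
}
```

Each group gains exactly one row, so the output has as many extra rows as there are distinct article ids.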

kay

Apr 17, 2012, 8:28:26 PM
to cascading-user
Any input on this? Also I have a basic question:

When should I use Fields.REPLACE vs. Fields.RESULTS?

Say I have a datum (datum1, with x fields), and my function produces
datum2 with y fields. How should I call the function: with
REPLACE or with RESULTS?

pipe1 = new Each(pipe1, new somefunction(), Fields.RESULTS); // or Fields.REPLACE?


Thanks.

Chris K Wensel

Apr 17, 2012, 8:36:57 PM
to cascadi...@googlegroups.com

Fields.RESULTS returns only the results of your operation, discarding all the values/fields in the incoming Tuple stream.

Fields.ALL appends the results of your operation to the values/fields in the incoming Tuple stream.

Fields.REPLACE replaces the values of the fields that were arguments in the incoming Tuple stream.

REPLACE is useful if you want to make a String into a Long, re-use the field name, and keep the other fields.

RESULTS, when used with the declared fields Fields.ARGS, behaves like REPLACE in that it keeps the field names of the args, but all the incoming values are discarded.

What you use is a matter of what you want to get back and the level of coupling your operation has to the tuple stream.
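
To make the three selectors concrete, here is a small plain-Java model of the behaviour described above. It is only a sketch and does not call Cascading. The names OutputSelectorSketch, Selector and select are hypothetical, tuples are modelled as ordered maps from field name to value, and for ALL the model assumes the result field names do not collide with incoming ones.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the three output selectors; not Cascading code.
class OutputSelectorSketch {

    enum Selector { RESULTS, ALL, REPLACE }

    // incoming:       the full incoming tuple (field name -> value)
    // argumentFields: the fields that were handed to the operation
    // results:        what the operation emitted (field name -> value)
    static Map<String, Object> select(Selector selector,
                                      Map<String, Object> incoming,
                                      List<String> argumentFields,
                                      Map<String, Object> results) {
        Map<String, Object> out = new LinkedHashMap<>();
        switch (selector) {
            case RESULTS:
                // Keep only what the operation produced.
                out.putAll(results);
                break;
            case ALL:
                // Keep the incoming fields and append the results
                // (assumes the result names are distinct from the incoming names).
                out.putAll(incoming);
                out.putAll(results);
                break;
            case REPLACE:
                // Keep the incoming fields, overwriting the argument fields in place.
                out.putAll(incoming);
                for (String field : argumentFields) {
                    if (!results.containsKey(field)) {
                        throw new IllegalArgumentException("no result for argument field: " + field);
                    }
                    out.put(field, results.get(field));
                }
                break;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("article_id", "a1");
        incoming.put("count", "42");          // a String in the incoming stream

        Map<String, Object> results = new LinkedHashMap<>();
        results.put("count", 42L);            // the operation turned it into a Long

        System.out.println(select(Selector.RESULTS, incoming, Arrays.asList("count"), results));
        System.out.println(select(Selector.REPLACE, incoming, Arrays.asList("count"), results));
    }
}
```

In this model the String-to-Long case keeps article_id and swaps the count value under REPLACE, while RESULTS drops article_id entirely.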

ckw

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com
