MIN / MAX

John Cohen

Dec 12, 2011, 5:51:48 PM

to HBaseHUT - HBase High Update Throughput

Is there any plan to support data aggregation functions, such as MIN and MAX?

Alex Baranau

Dec 12, 2011, 6:03:14 PM

to hbas...@googlegroups.com

Hi John,

Yeah, adding standard "update processors" like MIN, MAX, etc. is the natural next step. It may require some API changes to make that really flexible and usable. I have in mind (and in my notes) some ideas of how best to do that. But for the sake of not pushing my way of thinking, I'd love to hear your ideas first: in what form it would be best to have them in the API. Any use case (with the pseudo-code or code involved) would be great to look at.

Alex.

John Cohen

Dec 12, 2011, 6:06:29 PM

to hbas...@googlegroups.com

Hi Alex,

I'm looking into this subject now. Nothing in code yet. I'll keep you posted.

Alex Baranau

Dec 12, 2011, 6:23:47 PM

to hbas...@googlegroups.com

Adding such standard functions/update processors matches my thinking about the direction in which HBaseHUT should evolve. I will be glad to collaborate with you on that!

Alex.

OG

Dec 13, 2011, 9:46:52 AM

to HBaseHUT - HBase High Update Throughput

+1 for fork, patch, and pull if you can, John! :)

Otis


John Cohen

Dec 13, 2011, 9:51:02 AM

to hbas...@googlegroups.com

I'm looking into a multidimensional structure, OLAP. The problem is that you need to have all the data before running a MIN.

The amount of incoming data is unknown. If there is too much of it we run out of RAM, for example when holding the data in a HashMap. So the question is how to do it without hitting memory limits as the amount of data grows.
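
For MIN and MAX in particular, one way to sidestep the memory limit may be to keep only a running value instead of buffering all the data. A minimal sketch, assuming values arrive one at a time; `RunningMinMax` is a hypothetical name used for illustration, not part of HBaseHUT:

```java
/** Hypothetical accumulator (not HBaseHUT code): tracks MIN and MAX in constant memory. */
final class RunningMinMax {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;

    /** Folds one incoming value into the running aggregates. */
    void accept(double value) {
        if (value < min) min = value;
        if (value > max) max = value;
        count++;
    }

    long count() { return count; }

    /** Smallest value seen so far; fails if nothing has been accepted yet. */
    double min() {
        if (count == 0) throw new IllegalStateException("no values seen");
        return min;
    }

    /** Largest value seen so far; fails if nothing has been accepted yet. */
    double max() {
        if (count == 0) throw new IllegalStateException("no values seen");
        return max;
    }
}
```

Memory use here does not depend on how many values come in. Functions such as median or exact percentiles do not reduce to a single running value this way, so they would need a different approach.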

Alex Baranau

Dec 13, 2011, 10:25:15 AM

to hbas...@googlegroups.com

Aha, looks like this is the same direction I was going to look at. Let me think a bit about how best to plug this into HBaseHUT (or build it on top of HBaseHUT). I will get back to you shortly (today) with suggestions.

Looking forward to collaborating on this with you!

Alex.

P.S. I guess it makes sense to switch to chat to discuss things in detail. Later.

saetaes

May 10, 2012, 2:57:19 PM

to hbas...@googlegroups.com

Just curious since it's been a while since this thread was touched: Has anything been done in regard to aggregation functions? One thing that would be interesting for me personally is calculating the median (or better, percentile) on a set of data.

I'd be willing to help out, if there's an intuitive place for someone without HBaseHUT experience to start.

Mike

Alex Baranau

May 11, 2012, 2:08:03 PM

to hbas...@googlegroups.com

Hi Mike!

Yeah, not much has been done since then, as this part of HBaseHUT wasn't my main focus. It would be really great if you could participate. Please find some thoughts below and let me know what you think.

1. How?

What we need to do for aggregation functions, I think, boils down to the following:

* Implement an UpdateProcessor that can take a list of aggregation functions and apply them to columns

* Implement aggregation functions. I'd start with the basic ones (max/min/avg/sum/count) and then add more complex ones: percentile, distinct count, etc.

Basically, the use case is the following (unless you have something different or more specific in mind). There is a bunch of input records with columns that have, e.g., numeric values, say row_key1=>{column1=5.5, column2=14.3}, row_key2=>{column1=5.5, column2=14.3}. There might be multiple input records with the same key, and the stored data should be updated based on them. For each key we need to keep aggregated values. So:

1) with HBaseHUT we replace updates with appends (simple puts), which makes writing much faster

2) with the help of an aggregation functions library (implementing it is the focus here), the needed aggregates are calculated
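
To make the use case concrete, here is a small in-memory model of keeping aggregates per row key and column as records arrive. It is only a sketch: `KeyedAggregates` is a hypothetical class, it does not touch HBase or HBaseHUT at all, and it relies on the JDK's `java.util.DoubleSummaryStatistics` (Java 8+) for count/sum/min/max/average.

```java
import java.util.DoubleSummaryStatistics;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of per-key aggregation; not HBaseHUT code. */
final class KeyedAggregates {
    // rowKey -> column -> running statistics for that column
    private final Map<String, Map<String, DoubleSummaryStatistics>> stats = new HashMap<>();

    /** Folds one incoming column value into the aggregates kept for its row key. */
    void append(String rowKey, String column, double value) {
        stats.computeIfAbsent(rowKey, k -> new HashMap<>())
             .computeIfAbsent(column, c -> new DoubleSummaryStatistics())
             .accept(value);
    }

    /** Returns the aggregates for a row key and column, or null if none were recorded. */
    DoubleSummaryStatistics get(String rowKey, String column) {
        Map<String, DoubleSummaryStatistics> columns = stats.get(rowKey);
        return columns == null ? null : columns.get(column);
    }
}
```

For example, after `append("row_key1", "column1", 5.5)` and `append("row_key1", "column1", 7.5)`, calling `get("row_key1", "column1")` gives access to `getMin()`, `getMax()`, `getAverage()`, `getSum()` and `getCount()` for that column.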

2. Where to start?

There is an olap-agg branch that was created specifically for this work. Feel free to check it out and start from there.

There's a MaxFunction implementation there and a unit test for it. It isn't quite what we are looking for (we need to change it along the lines of the points above), but it is a good place to start looking. It will give you an idea of HBaseHUT usage.

After a short discussion we will add issue(s) to the GitHub issue tracker.

Thanks again for the interest,

Alex

Otis Gospodnetic

May 14, 2012, 1:14:23 AM

to hbas...@googlegroups.com

Hi Mike,

On Friday, May 11, 2012 2:08:03 PM UTC-4, Alex Baranau wrote:

> Hi Mike!
>
> Yeah, not much has been done since then, as this part of HBaseHUT wasn't my main focus. It would be really great if you could participate. Please find some thoughts below and let me know what you think.
>
> 1. How?
> What we need to do for aggregation functions, I think, boils down to the following:
> * Implement an UpdateProcessor that can take a list of aggregation functions and apply them to columns

Which is here: https://github.com/sematext/HBaseHUT/blob/olap-agg/src/main/java/com/sematext/hbase/hut/UpdateProcessor.java

So I guess:

* subclass UpdateProcessor

* have its constructor take a list of functions, if you want the caller to control which functions are applied

* implement the process method so that it applies the functions to the records passed in the process argument

* populate the UpdateProcessingResult object in the process method

That?

Oh, here is an example:

https://github.com/sematext/HBaseHUT/blob/olap-agg/src/test/java/com/sematext/hbase/hut/TestHBaseHut.java

(look for "UpdateProcessor" there)
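
The steps listed above could look roughly like the sketch below. Note that the types here are simplified stand-ins written for illustration: `AggFunction`, `MaxFn` and `AggregatingProcessor` are hypothetical names, records are modelled as plain maps, and none of this is HBaseHUT's actual UpdateProcessor or UpdateProcessingResult API (see the linked sources for the real signatures).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical aggregation function over the values of one column. */
interface AggFunction {
    String name();
    double apply(List<Double> values);
}

/** Hypothetical MAX implementation of the stand-in interface above. */
final class MaxFn implements AggFunction {
    public String name() { return "max"; }

    public double apply(List<Double> values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) max = Math.max(max, v);
        return max;
    }
}

/** Hypothetical stand-in for a processor that applies a list of functions to each column. */
final class AggregatingProcessor {
    private final List<AggFunction> functions;

    /** The caller decides which functions are applied. */
    AggregatingProcessor(List<AggFunction> functions) {
        this.functions = new ArrayList<>(functions);
    }

    /** Merges records that share a row key into one map of "function(column)" to value. */
    Map<String, Double> process(List<Map<String, Double>> records) {
        // Collect the values of each column across all records.
        Map<String, List<Double>> byColumn = new LinkedHashMap<>();
        for (Map<String, Double> record : records) {
            for (Map.Entry<String, Double> e : record.entrySet()) {
                byColumn.computeIfAbsent(e.getKey(), c -> new ArrayList<>()).add(e.getValue());
            }
        }
        // Apply every configured function to every column.
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : byColumn.entrySet()) {
            for (AggFunction f : functions) {
                result.put(f.name() + "(" + e.getKey() + ")", f.apply(e.getValue()));
            }
        }
        return result;
    }
}
```

Passing the functions through the constructor keeps the processor generic: adding MIN, SUM or a percentile later would only mean adding another `AggFunction` implementation, under the assumptions of this sketch.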

> * Implement aggregation functions. I'd start with the basic ones (max/min/avg/sum/count) and then add more complex ones: percentile, distinct count, etc.

Maybe this Mahout goodness could be used here?

http://search-lucene.com/jd/mahout/math/org/apache/mahout/math/stats/package-summary.html

http://search-lucene.com/jd/mahout/math/org/apache/mahout/math/function/Functions.html

> Basically, the use case is the following (unless you have something different or more specific in mind). There is a bunch of input records with columns that have, e.g., numeric values, say row_key1=>{column1=5.5, column2=14.3}, row_key2=>{column1=5.5, column2=14.3}. There might be multiple input records with the same key, and the stored data should be updated based on them. For each key we need to keep aggregated values. So:
>
> 1) with HBaseHUT we replace updates with appends (simple puts), which makes writing much faster
> 2) with the help of an aggregation functions library (implementing it is the focus here), the needed aggregates are calculated
>
> 2. Where to start?
> There is an olap-agg branch that was created specifically for this work. Feel free to check it out and start from there.
> There's a MaxFunction implementation there and a unit test for it. It isn't quite what we are looking for (we need to change it along the lines of the points above), but it is a good place to start looking. It will give you an idea of HBaseHUT usage.

Ah, good, that's another example! :)

https://github.com/sematext/HBaseHUT/blob/olap-agg/src/main/java/com/sematext/hbase/agg/MaxFunction.java

> After a short discussion we will add issue(s) to the GitHub issue tracker.

I hope this helps, Mike!

Otis

saetaes

May 15, 2012, 10:25:12 AM

to HBaseHUT - HBase High Update Throughput

Thanks everyone for the great pointers! I'll take a look at this
stuff and report back to the list.

Mike

