The question of float/double aggregators

85 views
Skip to first unread message

Roman Leventov

unread,
Mar 2, 2018, 11:23:20 AM3/2/18
to druid-de...@googlegroups.com, bs...@apache.org
Slim, could you please post about it?

Slim Bouguerra

unread,
Mar 2, 2018, 5:03:11 PM3/2/18
to Druid Development

In good old days Druid Double Sum/Min/Max/First/Last Aggregators would aggregate double values during ingestion then this double aggregated value will be stored as a truncated float. Effectively you send to Druid a double number `x` and druid will store it as `(float) x` thus this can save 32 bits at the expense of more error accumulations/confusions.  
PR https://github.com/druid-io/druid/pull/4491 has made the following changes to address the problem.
  1. Added Float Aggregator factory that can be used if the user is okay with 32 bits floating numbers. 
  2. Made Double Aggregator factory use actual 64bits floating point numbers.
  3. To make rolling upgrade migration smooth added a system-wide property that makes new Double Aggregator retain old behavior by default (eg doubles casted to float). This is done here https://github.com/druid-io/druid/pull/4944/
PRs 4491/4944 were released as part of 0.11 thus any cluster running 0.11 is capable of reading segments where Doubles are stored as 64 bits also all the indexing nodes understand float aggregators.
As discussed in  https://github.com/druid-io/druid/issues/4605 the plan was to make Double aggregator use 64bits by default in 0.12. 
As part of the last dev meeting, I proposed to release double as 64bits by default.
FYI I don't want to make this blocker for 0.12 but i will be nice to get this done one day.

In Addition to that @Roman had some concerns about keeping the same names while changing the implementation. He suggested that we add new kind of aggregator and deprecated old ones, @roman feel free to elaborate more on your point of view.

IMO  it will be confusing more than helping the user if we add new aggregators and deprecate the old one. 
In addition, we are not saving the user anything because at the end of the day you have to either move out from the deprecated ones or use the float aggregators instead of double. 

Roman Leventov

unread,
Mar 2, 2018, 5:24:17 PM3/2/18
to druid-de...@googlegroups.com
On Fri, Mar 2, 2018 at 11:03 PM, Slim Bouguerra <slim.bo...@gmail.com> wrote:
IMO  it will be confusing more than helping the user if we add new aggregators and deprecate the old one. 
In addition, we are not saving the user anything because at the end of the day you have to either move out from the deprecated ones or use the float aggregators instead of double. 

My suggestion was to keep deprecated names of forever. Like "floatSum" is a deprecated alias of "sumFloat". It's not a separate aggregator, just an alias, so just minimum code is kept around.

Slim Bouguerra

unread,
Mar 2, 2018, 5:40:16 PM3/2/18
to Druid Development
@roman then am not sure if i do understand your point of view then.
Can you please elaborate more, and let's use doubleSum as the example (floatSum is not the issue here). 
Thanks

Gian Merlino

unread,
Mar 2, 2018, 6:49:16 PM3/2/18
to druid-de...@googlegroups.com
Are we having this discussion on the mailing list or in the github issue? It looks like the same message is posted in both places.

Anyway, I would support a patch to master that simply changed the default for druid.indexing.doubleStorage to "double" and updated documentation accordingly. People that want the old behavior can change it back to "float". In a later version of Druid we can remove the properly completely, but I see no good reason to do it now. We might as well leave it there to make migration easier.

We have informed users that this might happen - our docs say:

> Druid's storage layer uses a 32-bit float representation to store columns created by the 
> doubleSum, doubleMin, and doubleMax aggregators at indexing time. To instead use 64-bit floats
> for these columns, please set the system-wide property `druid.indexing.doubleStorage=double`.
> This will become the default behavior in a future version of Druid.

Gian

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/da5c166f-0906-4263-8331-4027f7cdec5f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Slim Bouguerra

unread,
Mar 2, 2018, 7:00:53 PM3/2/18
to druid-de...@googlegroups.com
On Mar 2, 2018, at 3:48 PM, Gian Merlino <gi...@imply.io> wrote:

Are we having this discussion on the mailing list or in the github issue? It looks like the same message is posted in both places.


@gian Very good question we need to make this more clear when we move to Apache, and we need to be dam serious about it. I personally prefer Github but since @roman posted answer here let’s keep it here. 


Anyway, I would support a patch to master that simply changed the default for druid.indexing.doubleStorage to "double" and updated documentation accordingly. People that want the old behavior can change it back to "float". In a later version of Druid we can remove the properly completely, but I see no good reason to do it now. We might as well leave it there to make migration easier.

@gian sounds like you are suggestion 0.13 ?


We have informed users that this might happen - our docs say:

> Druid's storage layer uses a 32-bit float representation to store columns created by the 
> doubleSum, doubleMin, and doubleMax aggregators at indexing time. To instead use 64-bit floats
> for these columns, please set the system-wide property `druid.indexing.doubleStorage=double`.
> This will become the default behavior in a future version of Druid.

Sounds good to me

Gian

On Fri, Mar 2, 2018 at 2:40 PM, Slim Bouguerra <slim.bo...@gmail.com> wrote:
@roman then am not sure if i do understand your point of view then.
Can you please elaborate more, and let's use doubleSum as the example (floatSum is not the issue here). 
Thanks


On Friday, March 2, 2018 at 2:24:17 PM UTC-8, Roman Leventov wrote:
On Fri, Mar 2, 2018 at 11:03 PM, Slim Bouguerra <slim.bo...@gmail.com> wrote:
IMO  it will be confusing more than helping the user if we add new aggregators and deprecate the old one. 
In addition, we are not saving the user anything because at the end of the day you have to either move out from the deprecated ones or use the float aggregators instead of double. 

My suggestion was to keep deprecated names of forever. Like "floatSum" is a deprecated alias of "sumFloat". It's not a separate aggregator, just an alias, so just minimum code is kept around.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/da5c166f-0906-4263-8331-4027f7cdec5f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to a topic in the Google Groups "Druid Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/druid-development/VEYkW_5dWd8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CACZNdYAoxVQWJ61MeMnceiPwwk1AkacQTRvV6%2BAa5cF2jEWO1Q%40mail.gmail.com.

Gian Merlino

unread,
Mar 2, 2018, 7:34:54 PM3/2/18
to druid-de...@googlegroups.com
Yes I am suggesting 0.13.0.

Gian

To unsubscribe from this group and all its topics, send an email to druid-development+unsubscribe@googlegroups.com.

To post to this group, send email to druid-development@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Slim Bouguerra

unread,
Mar 2, 2018, 7:36:51 PM3/2/18
to druid-de...@googlegroups.com
On Mar 2, 2018, at 4:34 PM, Gian Merlino <gi...@imply.io> wrote:

Yes I am suggesting 0.13.0.

Copy That, 0.13.0 it is!

To unsubscribe from this group and all its topics, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CACZNdYDRtKpR0wDkKeKX3%2Bp7yd694bhbe5WdysQ90cYERwfs%2BQ%40mail.gmail.com.

Roman Leventov

unread,
Mar 2, 2018, 7:54:32 PM3/2/18
to druid-de...@googlegroups.com
To me, Gian's suggestion is ok, if this change is highlighted in bold font as the first item in release notes.

Slim, if you remember from our discussion during the weekly meeting, the main concern for me was change of the behaviour of something, that is not configured in Druid configs, but rather stored somewhere else: database, ZooKeeper, third-party system, etc. Gian's suggestions gives an opportunity to keep that behaviour by applying some Druid configs, that's ok to me.

To unsubscribe from this group and all its topics, send an email to druid-development+unsubscribe@googlegroups.com.

To post to this group, send email to druid-development@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Gian Merlino

unread,
Mar 2, 2018, 8:00:52 PM3/2/18
to druid-de...@googlegroups.com
Yeah, that was the point of keeping the property around. If a cluster operator know they don't want to switch any of their column types, they can set the property and retain the old behavior. I would suggest keeping this option around for a while -- I don't see a reason to get rid of it any time soon.

Gian

Reply all
Reply to author
Forward
0 new messages