A question about numbers and representation


Henrik Lindberg

Sep 1, 2014, 4:55:03 AM
to puppe...@googlegroups.com
Hi,
Recently I have been looking into serialization of various kinds, and
the issue of how we represent and serialize/deserialize numbers has
come up.

TL;DR - I want to specify the max values of integers and floats in the
puppet language, for a number of reasons. Skip the background part and
go straight to "Proposal" and "Questions" if you are already familiar
with serialization formats and the issues around numeric representation.

Background
---
As you may know, Ruby has fluent handling of integers - if a number
would overflow its current representation, a larger one is used
transparently - i.e. from native word size (Fixnum) to arbitrary
precision (Bignum, unlimited). Floating point numbers, in contrast, are
always 64 bit IEEE 754 doubles; arbitrary precision decimals are
available separately via BigDecimal.
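
For example, in an irb session (MRI; the exact promotion boundary
depends on the platform word size):

  x = 10**18       # fits in a native machine word on 64 bit MRI
  x.class          # => Fixnum
  y = x * 10**18   # overflows the native word; silently promoted
  y.class          # => Bignum (arbitrary precision, never overflows)
  1.5.class        # => Float, always a 64 bit IEEE 754 double; it never
                   #    auto-promotes (BigDecimal is a separate library class)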

This is very flexible and helpful most of the time, but it creates
problems when serializing / deserializing. Most serialization formats
simply cannot deal with > 64 bit values as regular numbers. They may do
horrible things like truncating, or substituting the max/min value when
a value is too big, or, for floating point, drastically losing
precision.

YAML
- specifies integers to have arbitrary size, but recommends that an
implementation uses its native integer size. The specification says:
"In some languages (such as C), an integer may overflow the native
type’s storage capability. A YAML processor may reject such a value as
an error, truncate it with a warning, or find some other manner to
round-trip it. In general, integers representable using 32 binary digits
should safely round-trip through most systems.".
http://www.yaml.org/spec/1.2/spec.html

For floating point values, only IEEE 32 bit floats are safe to round-trip.

In other words, it is unspecified... which means a YAML implementation
may silently truncate numbers to the 32 bit max int (2,147,483,647)
when running on a 32 bit machine (a behavior noted as a "gotcha" in
several blog posts - google for it).

JSON
- is similar to YAML in that it specifies a number to be an arbitrary
number of digits and it is thus up to an implementation to bind this to
a representation. It has the same problems as YAML. Notably, if used
with JavaScript which only has Number for both Integer and Real, the
largest integer number is 2^53 (after which it starts to lose precision).
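
The 2^53 boundary is easy to demonstrate with any IEEE 754 double, e.g.
from Ruby:

  # A binary64 value (JavaScript's only Number representation) has a
  # 53 bit significand, so not every integer above 2^53 is representable.
  (2**53).to_f.to_i       # => 9007199254740992  (exact)
  (2**53 + 1).to_f.to_i   # => 9007199254740992  (rounded; the +1 is lost)
  (2**53 + 2).to_f.to_i   # => 9007199254740994  (exact again; only every
                          #    second integer survives at this magnitude)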

MsgPack
- handles 8/16/32/64 bit integers (signed and unsigned) as well as 32
and 64 bit floating point. It does not have built-in BigInteger or
BigDecimal types.

The Puppet Language Specification
---
In the Puppet Language Specification the size and precision of numbers
are currently specified as Ruby numbers (simply because this was
easiest). This is sloppy and leaves edge cases for serialization and
storage of data.

Proposal
========
I would like to cap a Puppet Integer to a 64 bit signed value when used
as a resource attribute, or anywhere in external formats. This means a
value range of -2^63 to 2^63-1 (about 9.2 * 10^18), which is in exabyte
territory (1 EiB = 2^60 bytes).

I would like to cap a Puppet Float to 64 bits (IEEE 754 binary64) when
used as a resource attribute or anywhere in external formats.

With respect to intermediate results, I propose that we specify that
values are of arbitrary size and that it is an error to store a value
that is too big for the typed representation Integer (64 bit signed).
For the Float (64 bit) representation there is no error, but it loses
precision. When specifying an attribute to have the Number type,
automatic conversion to Float (with loss of precision) takes place if an
internal integer value is too big for the Integer representation.

(Note, by default, attributes are typed as Any, which means that they by
default would store a Float if the integer value representation overflows).
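
As a rough sketch of the check I have in mind at the storage boundary
(Ruby; the method name, type symbols, and error type are made up for
illustration and are not existing Puppet API):

  INT64_MIN = -2**63
  INT64_MAX =  2**63 - 1

  # Hypothetical check applied when a value crosses the edge (e.g. is
  # assigned to a resource attribute); intermediate arithmetic stays
  # arbitrary precision.
  def coerce_for_storage(value, declared_type)
    case declared_type
    when :integer
      unless value.between?(INT64_MIN, INT64_MAX)
        raise ArgumentError, "#{value} does not fit a 64 bit signed Integer"
      end
      value
    when :number, :any
      # too big for Integer: fall back to Float, accepting precision loss
      value.between?(INT64_MIN, INT64_MAX) ? value : value.to_f
    when :float
      value.to_f   # 64 bit double; may lose precision but never errors
    end
  end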

Questions
=========
* Is it important that Javascript can be used to (accurately) read JSON
generated by Puppet? (If so, the limit needs to be 2^53 or values lose
precision).

* Is it important in Puppet manifests to handle values larger than
2^63-1 (or smaller than -2^63), and if not, why isn't it sufficient to
use a floating point value (with reduced precision)?

* If you think Puppet needs to handle very large values (yottabyte sized
disks?), should the language have convenient ways of expressing
such values e.g. 42yb ?

* Is it ok to automatically transform to floating point if values
overflow and the type of an attribute is Number (as discussed above)? I
can imagine this making it difficult to efficiently represent an
attribute in a database, and support may vary between different
database engines.

* Do you think it is worth the trouble to add the types BigInteger and
BigDecimal to the type system to allow the representation to be more
precise? (Note that this makes it difficult to use standard number
representation in serialization formats.) This would mean that Number is
not allowed as an attribute/storage type (the user must choose Integer,
Float, or one of the Big... types).

* Do you think it should work as in Ruby? If so, are you ok with
serialization that is non-standard?

- henrik
--

Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/

Trevor Vaughan

Sep 1, 2014, 1:15:51 PM
to puppe...@googlegroups.com
TL;DR: BigInteger/BigDecimal is the "right" thing to do; otherwise cap at the client/server floor.

I have a few thoughts here:

1) I don't like losing precision in any case so a cap makes sense (maybe)

2) If you do cap, would you not want to cap to the lowest of the client or server? I.e. if the client is a 32 bit system and the server is a 64 bit system, you'd cap at 32 bits.

3) There may be cases where someone needs higher precision numbers. I can't think of them off hand, but I can guarantee that they'll happen, so adding BigInteger and BigDecimal is probably a good idea.

4) For any fact that is retrieved that has multiple formats, I would like to see a standard hash with an entry for each size so that it is easier to work with. Sure, right now, I can do variable mangling or post-retrieval math, but it's so very untidy.

disk_size => {
  '/dev/sda' => {
     'B' => 10737418240,
     'kB' => 10485760,
     'MB' => 10240,
     'GB' => 10,
  }
}

But then, how far do you take this? TB, PB? EB.......?
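
For what it's worth, such a hash is cheap to derive from a single byte
count - a quick Ruby sketch, using binary units to match the numbers
above:

  UNIT_SHIFTS = { 'B' => 0, 'kB' => 10, 'MB' => 20, 'GB' => 30 }

  # Derive the per-unit hash from one raw byte count.
  def size_hash(bytes)
    UNIT_SHIFTS.each_with_object({}) do |(unit, shift), h|
      h[unit] = bytes >> shift
    end
  end

  size_hash(10_737_418_240)
  # => {"B"=>10737418240, "kB"=>10485760, "MB"=>10240, "GB"=>10}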

Thanks,

Trevor





--
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
tvau...@onyxpoint.com

-- This account not approved for unencrypted proprietary information --

Henrik Lindberg

Sep 1, 2014, 6:05:07 PM
to puppe...@googlegroups.com
On 2014-09-01 19:15, Trevor Vaughan wrote:
> TL;DR; BigInteger/BigDecimal is the "right" thing to do, otherwise cap
> at the client/server floor.
>
> I have a few thoughts here:
>
> 1) I don't like losing precision in any case so a cap makes sense (maybe)
>
> 2) If you do cap, would you not want to cap to the lowest of the client
> or server? I.e. if the client is a 32 bit system and the server is a 64
> bit system, you'd cap at 32 bits.
>

There is no need to do that - today's systems handle both 32 and 64 bit
values just fine - it's values above the max signed 64 bit int (2^63-1),
and those smaller than -2^63, that cause problems. If you had such
values today, they would not roundtrip through the system.

It does not matter all that much if a 32 bit system has a bit more work
to do when adding 64 bit numbers - the main problems are serialization,
and storage formats for efficient processing at larger scale (where 64
bit systems are indeed used).

> 3) There may be cases where someone needs higher precision numbers. I
> can't think of them off hand, but I can guarantee that they'll happen so
> adding BigInteger and BigDecimal are probably a good idea.
>

I also imagine them being needed - they are needed in various
applications - I just wonder what the need may be in puppet's domain.
(Total sum of disk space in a report?)

Adding them as explicit types will work fine when we do need them BTW,
but it requires a fair amount of work, as there are many touchpoints in
the system that have to deal with them.

> 4) For any fact that is retrieved that has multiple formats, I would
> like to see a standard set of a hash for each size so that it is easier
> to work with. Sure, right now, I can do variable mangling or post
> retrieval math, but it's so very untidy.
>
> disk_size => {
> '/dev/sda' => {
> 'B' => 10737418240,
> 'kB' => 10485760,
> 'MB' => 10240,
> 'GB' => 10,
> }
> }
>
> But then, how far do you take this? TB, PB? EB........?
>
We probably have to stop at Geopbyte since we are back at 'G' :-)

- henrik


Luke Kanies

Sep 1, 2014, 10:18:22 PM
to puppe...@googlegroups.com
On Sep 1, 2014, at 1:54 AM, Henrik Lindberg <henrik....@cloudsmith.com> wrote:

> Hi,
> Recently I have been looking into serialization of various kinds, and the issue of how we represent and serialize/deserialize numbers have come up.
>
> TL;DR - I want to specify the max values of integers and floats in the puppet language for a number of reasons. Skip the background part
> to get to "Questions and Proposal" if you are already familiar with serialization formats, and issues regarding numeric representation.

I’m ok with caps, but I am very much against asking users to know about things like bigint. I’m offended that any language asks humans to handle that kind of complexity these days, and for Puppet’s language to do it just seems like the wrong trade-off in UX and engineering.


--
http://puppetlabs.com/ | http://about.me/lak | @puppetmasterd

Trevor Vaughan

Sep 2, 2014, 12:02:00 AM
to puppe...@googlegroups.com
Indeed. Ideally, it would "just work" and do the automatic conversion internally to the language.

Unfortunately, this may take a lot of tinkering back and forth under the covers. But, I'd certainly love to never worry about typing again.

Now, about those booleans.....

Trevor






--
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
tvau...@onyxpoint.com

markus

Sep 2, 2014, 2:40:09 AM
to puppe...@googlegroups.com

> Most serialization formats can simply not deal with > 64 bit values as
> regular numbers. They may do horrible things like truncation, or use
> the max/min value if a value is too big, or for floating point
> drastically lose precision.

Eh. It's not that the serialization formats can't deal with or do
horrible things to >64 bit values, but that some languages that don't
support high precision numbers implement the formats awkwardly.

This isn't just a problem with serialization, or with >64 bit values.
Languages that don't implement math well tend to have all sorts of
problems.

For example, in javascript (I just tested on nodejs, but others have
similar issues), the 56 bit integer 0xffffffffffffff is even (nodejs
thinks it's 72057594037927940) whereas in ruby, C, python, etc. it's odd
(72057594037927935). Since the latter value is also what you get from
math, I think it would be safe to call the first answer "wrong."
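
The comparison is easy to reproduce from Ruby, which keeps the integer
exact; forcing it through a double shows the value JavaScript actually
stores:

  v = 0xffffffffffffff  # => 72057594037927935 (2^56 - 1), exact in Ruby
  v.odd?                # => true
  v.to_f.to_i           # => 72057594037927936, i.e. 2^56, the nearest double
  # JavaScript holds that same double; the 72057594037927940 it prints is
  # just the shortest decimal string that rounds back to the stored value.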

These sorts of cases abound.

If you're going to treat this as a problem with Puppet's serialization,
by setting caps and having puppet "fail" on results that some other
languages may choke on, I suspect you'll wind up discovering that the
lowest common denominator is uncomfortably low. There's still some code
out there that thinks an integer means 16 bits, but the correct response
is to be cautious of such islands of math-fail, rather than restricting
the rest of the world to their limitations.

A better path would be to 1) make sure puppet's results are
mathematically correct, and 2) warn people about cases where using
code in other languages could lead to problems.

IMNSHO, of course.
-- M




Daniele Sluijters

Sep 2, 2014, 8:01:45 AM
to puppe...@googlegroups.com
> 2) warn people about cases where using code in other languages could lead to problems.

Unless we want to keep track of every language and how the multiple libraries that exist handle serialisation and deserialisation, that's pretty much impossible. "Oh, but this breaks in Scala but works in Swift, but it did weird things in COBOL, but Fortran worked, but my custom QBasic implementation barfed".

Henrik Lindberg

Sep 2, 2014, 10:29:37 AM
to puppe...@googlegroups.com
On 2014-09-02 8:39, markus wrote:
>
>> Most serialization formats can simply not deal with > 64 bit values as
>> regular numbers. They may do horrible things like truncation, or use
>> the max/min value if a value is too big, or for floating point
>> drastically lose precision.
>
> Eh. It's not that the serialization formats can't deal with or do
> horrible things to >64 bit values, but that some languages that don't
> support high precision numbers implement the formats awkwardly.
>
> This isn't just a problem with serialization, or with >64 bit values.
> Languages that don't implement math well tend to have all sorts of
> problems.
>
> For example, in javascript (I just tested on nodejs, but others have
> similar issues), the 56 bit integer 0xffffffffffffff is even (nodejs
> thinks it's 72057594037927940) whereas in ruby, C, python, etc. it's odd
> (72057594037927935). Since the latter value is also what you get from
> math, I think it would be safe to call the first answer "wrong."
>
Above 2^53, JavaScript loses integer precision, since its only number
type is a 64 bit float.

> These sorts of cases abound.
>
> If you're going to treat this as a problem with the puppet's
> serialization, by setting caps and having puppet "fail" on results that
> some other languages may choke on, I suspect you'll wind up discovering
> that the lowest common denominator is uncomfortably low. There's still
> some code out that thinks an integer means 16 bits, but the correct
> response is to be cautious of such islands of math-fail, rather than
> restricting the rest of the world to their limitations.
>
There is naturally a lot of crappy software with problems in this area.
I did look at the main implementations for MsgPack, JSON, and YAML.
Currently, even Puppet itself will not work correctly with values over
2^63-1 or smaller than -2^63 when such values are stored in PuppetDB.
If we were to change that, we would sacrifice performance for every
normal case (where integers fit in 64 bits).

> A better path would be to 1) make sure puppet's results are
> mathematically correct, and 2) warn people about cases where using
> code in other languages could lead to problems.
>
We will do math as well as the underlying implementation language
allows. The value at the edge though, when assigned to an attribute that
will be serialized and stored in the DB, needs to fit an Integer.

We cannot really do much about point 2; it is a language/implementation
thing and we simply cannot look at every such combination.

Ken Barber

Sep 2, 2014, 1:12:45 PM
to puppe...@googlegroups.com
> TL;DR - I want to specify the max values of integers and floats in the
> puppet language for a number of reasons. Skip the background part
> to get to "Questions and Proposal" if you are already familiar with
> serialization formats, and issues regarding numeric representation.

TL;DR: from a PuppetDB perspective this is great; let's talk next steps.

I think this direction is correct, and for PuppetDB we'd like to see
this clarified without a doubt. For those at home, Henrik and I
discussed this at length on #puppet-dev on IRC, and it's a complex
topic that extends beyond just numbers. Other items could do with some
clarification as well: string length, the maximum number of entries in a
hash or map, etc. From our PuppetDB perspective we often have more
concerns about limitations than Puppet does, and establishing clear
constraints will definitely help our designs going forward.

For example, right now we cannot support arbitrary-precision decimals
for structured facts; we only support signed 64 bit big-endian
integers. This was deliberate to some degree, because arbitrary
precision would add complication to the design, but in a way it's really
a bug, and the lack of guidelines does not make it a nice bug to solve.
If the community can make decisions about this stuff (which is why I've
been sitting back and reading the threads quietly before responding) we
can then go about meeting the constraints we define in our designs as
well.

There is a lot to be said, however, for initially defining smaller
constraints rather than larger ones, even if some use-cases might break.
Gone are the days when we can be frivolous and allow unbounded
storage - it hurts performance. In some cases it might not be wise to,
for example, store an ISO image as base64 in a fact string :-). Letting
users do crazy stuff actually hurts them, and it hurts us too, as we
have to develop solutions for edge cases that probably aren't smart
anyway ... taking time away from more critical development items.

If you are okay with it Henrik, I'd like us both to formalize this
"schema" you propose in a design doc somewhere, and the PuppetDB
contributors should be more than happy to dive in, hack it to pieces,
and add our concerns about whatever constraints we propose - i.e. let's
talk next steps and make this real if we can.

ken.

Henrik Lindberg

Sep 2, 2014, 3:28:19 PM
to puppe...@googlegroups.com
That is great.

For the language - the specification is in the puppetlabs git repo, and
more specifically, here is the specification for Integer:
https://github.com/puppetlabs/puppet-specifications/blob/master/language/types_values_variables.md#integer-from-to

And here is Float:
https://github.com/puppetlabs/puppet-specifications/blob/master/language/types_values_variables.md#float-from-to

I logged PUP-3170 to track those.

The lengths of arrays, hashes, and strings are currently not specified.

We should create tickets for those as well (or just expand PUP-3170,
because implementation-wise it is pretty much the same thing; i.e. I
think the constraints should be checked by the type system).

John Bollinger

Sep 2, 2014, 4:23:07 PM
to puppe...@googlegroups.com


On Monday, September 1, 2014 3:55:03 AM UTC-5, henrik lindberg wrote:
Hi,
Recently I have been looking into serialization of various kinds, and
the issue of how we represent and serialize/deserialize numbers has
come up.


[...]
 

Proposal
========
I would like to cap a Puppet Integer to be a 64 signed value when used
as a resource attribute, or anywhere in external formats. This means a
value range of -2^63 to 2^63-1 which is in Exabyte range (1 exabyte = 2^60).

I would like to cap a Puppet Float to be a 64 bit (IEEE 754 binary64)
when used as a resource attribute or anywhere in external formats.

With respect to intermediate results, I propose that we specify that
values are of arbitrary size and that it is an error to store a value


What, specifically, does it mean to "store a value"?  Does that mean to assign it to a resource attribute?

 
that is too big for the typed representation Integer (64 bit signed). For
Float (64 bit) representation there is no error, but it loses
precision.


What about numbers that overflow or underflow a 64-bit Float?

 
When specifying an attribute to have Number type, automatic
conversion to Float (with loss of precision) takes place if an internal
integer number is too big for the Integer representation.

(Note, by default, attributes are typed as Any, which means that they by
default would store a Float if the integer value representation overflows).



And if BigDecimal (and maybe BigInteger) were added to the type system, then I presume the expectation would be that over/underflowing Floats would go there?  And maybe that overflowing integers would go there if necessary to avoid loss of precision?

 
Questions
=========
* Is it important that Javascript can be used to (accurately) read JSON
generated by Puppet? (If so, the limit needs to be 2^53 or values lose
precision).



I think that question is moot.  No matter what, Javascript is limited in that it cannot with full fidelity consume or produce Puppet data having more than 53 bits of numeric precision.  I don't think it helps anyone to project that limitation into Puppet.

 
* Is it important in Puppet Manifests to handle values larger than
2^63-1 (smaller than -2^63), and if not so, why isn't it sufficient to
use a floating point value (with reduced precision).



I am not prepared to offer examples of why Puppet manifests would need to handle more than 63 bits of fixed-point precision, nor even more than 53 bits of floating-point precision.  I am uneasy about pulling back from Puppet's documented greater current capabilities, however.

 
* If you think Puppet needs to handle very large values (yottabyte sized
disks?), should the language have convenient ways of expressing
such values e.g. 42yb ?


I would prefer to avoid adding such expressions, especially if there will not be similar ones all the way down the size scale.  I would not be enthusiastic even with a full range of  such expressions.

 

* Is it ok to automatically do transformation to floating point if
values overflow, and the type of an attribute is Number? (as discussed
above). I can imagine this making it difficult to efficiently represent
an attribute in a database and support may vary between different
database engines.



It is not ok to silently lose precision.  It might be ok to lose precision if doing so is accompanied by a warning.

I'd anyway be inclined to say that the problem here is not so much possible loss of precision as it is specifying the type of the attribute as Number instead of something more specific.  OF COURSE that presents issues for recording the value in a database.

 
* Do you think it is worth the trouble to add the types BigInteger and
BigDecimal to the type system to allow the representation to be more
precise? (Note that this makes it difficult to use standard number
representation in serialization formats). This means that Number is not
allowed as an attribute/storage type (user must choose Integer, Float,
or one of the Big... types).



1) If you have BigDecimal then you don't need BigInteger.

2) Why would allowing one or both of the Bigs prevent Number from being allowed as a serializable type?

The way I see it, if you allow Bigs then Numbers must always be (de)serialized as BigDecimal.  Where you want attributes or other values to be efficiently serializable / indexable / etc. you assign them a narrower type appropriate for that purpose.  If this is too big a challenge for users accustomed to not specifying types, then perhaps the whole type system thing -- cool as it is -- is just not a good fit for Puppet.

3) Do you actually need one or both Bigs as named types in order to allow Big values?  Could it not be that Big values are representable via the Number type, but there is no (other) named numeric type that specifically allows such values?  Since you seem to prefer that users to not work with such values, would that not influence them in that direction?


* Do you think it should work as in Ruby? If so, are you ok with
serialization that is non standard?



I think disallowing Bigs in the serialization formats will present its own problems, only some of which you have touched on so far.  I think the type system should offer opportunities for greater efficiency in numeric handling, rather than serving as an excuse to limit numeric representations.


John

Henrik Lindberg

Sep 3, 2014, 1:40:45 PM
to puppe...@googlegroups.com
On 2014-09-02 22:23, John Bollinger wrote:
>
>
> On Monday, September 1, 2014 3:55:03 AM UTC-5, henrik lindberg wrote:
>
> Hi,
> Recently I have been looking into serialization of various kinds, and
> the issue of how we represent and serialize/deserialize numbers have
> come up.
>
>
> [...]
>
>
> Proposal
> ========
> I would like to cap a Puppet Integer to be a 64 signed value when used
> as a resource attribute, or anywhere in external formats. This means a
> value range of -2^63 to 2^63-1 which is in Exabyte range (1 exabyte
> = 2^60).
>
> I would like to cap a Puppet Float to be a 64 bit (IEEE 754 binary64)
> when used as a resource attribute or anywhere in external formats.
>
> With respect to intermediate results, I propose that we specify that
> values are of arbitrary size and that it is an error to store a value
>
>
>
> What, specifically, does it mean to "store a value"? Does that mean to
> assign it to a resource attribute?

It was vague on purpose since I cannot currently enumerate the places
where this should take place, but I was thinking resource attributes at
least.

>
> that is too big for the typed representation Integer (64 bit signed).
> For
> Float (64 bit) representation there is no error, but it loses
> precision.
>
>
>
> What about numbers that overflow or underflow a 64-bit Float?
>

That would also be an error (when it cannot lose more precision).

> When specifying an attribute to have Number type, automatic
> conversion to Float (with loss of precision) takes place if an internal
> integer number is to big for the Integer representation.
>
> (Note, by default, attributes are typed as Any, which means that
> they by
> default would store a Float if the integer value representation
> overflows).
>
>
>
> And if BigDecimal (and maybe BigInteger) were added to the type system,
> then I presume the expectation would be that over/underflowing Floats
> would go there? And maybe that overflowing integers would go there if
> necessary to avoid loss of precision?
>

If we add them, then the runtime should be specified to gracefully
choose the required size while calculating, and the types Any and
Number mean that they are accepted, but Integer and Float do not
accept them (when they have values that are outside the valid range). (I
have not thought this through completely at this point, I must say.)

> 1) If you have BigDecimal then you don't need BigInteger.
>
True, but BigInteger specifies that a fraction is not allowed.

> 2) Why would allowing one or both of the Bigs prevent Number from being
> allowed as a serializable type?
>
Not sure I said that. The problem is that if something is potentially
Big... then a database must be prepared to deal with it, and that has a
high cost. Specifying that Number means Integer, Float, or a Big type is
perfectly fine.

> The way I see it, if you allow Bigs then Numbers must always be
> (de)serialized as BigDecimal. Where you want attributes or other values
> to be efficiently serializable / indexable / etc. you assign them a
> narrower type appropriate for that purpose. If this is too big a
> challenge for users accustomed to not specifying types, then perhaps the
> whole type system thing -- cool as it is -- is just not a good fit for
> Puppet.
>
Yes, that is how I thought this could work. However, since everything is
basically untyped now (which we translate to the type Any), this means
that PuppetDB must be changed to use BigDecimal instead of integer 64
and float. That is a lose^3: it is lots of work to implement, bad
performance, and everyone needs to type everything.

> 3) Do you actually need one or both Bigs as named types in order to
> allow Big values? Could it not be that Big values are representable via
> the Number type, but there is no (other) named numeric type that
> specifically allows such values? Since you seem to prefer that users to
> not work with such values, would that not influence them in that direction?
>
Possibly. Having Number be concrete and represented as BigDecimal is ok,
it can hold any value described by subclasses.

>
> * Do you think it should work as in Ruby? If so, are you ok with
> serialization that is non standard?
>
>
>
> I think disallowing Bigs in the serialization formats will present its
> own problems, only some of which you have touched on so far. I think
> the type system should offer /opportunities/ for greater efficiency in
> numeric handling, rather than serving as an excuse to limit numeric
> representations.
>

I don't quite get the point here - the proposed cap is not something
that the type system needs. As an example, MsgPack does not have
standard Big types, thus a serialization will need to be special, and it
is not possible to just use something like "readInt" to get what you
know should be an integer value. The other example is PuppetDB, where a
decision has to be made how to store integers: the slower Big types, or
a more efficient 64 bit value? This is not just about storage, but also
about indexing speed and query/comparison - and if some values are
stored as 64 bits and others as a Big type for the same entity, that
would be even slower to query.

So - the idea is to make it safe and efficient for the normal cases.
Only when there is a special case (if indeed we do need the big types)
do we take the less efficient route.

John Bollinger

Sep 3, 2014, 6:35:38 PM
to puppe...@googlegroups.com


On Wednesday, September 3, 2014 12:40:45 PM UTC-5, henrik lindberg wrote:
On 2014-09-02 22:23, John Bollinger wrote:
>
>
> On Monday, September 1, 2014 3:55:03 AM UTC-5, henrik lindberg wrote:
>
>     Hi,
>     Recently I have been looking into serialization of various kinds, and
>     the issue of how we represent and serialize/deserialize numbers have
>     come up.
>
>
> [...]
>
>
>     Proposal
>     ========
>     I would like to cap a Puppet Integer to be a 64 signed value when used
>     as a resource attribute, or anywhere in external formats. This means a
>     value range of -2^63 to 2^63-1 which is in Exabyte range (1 exabyte
>     = 2^60).
>
>     I would like to cap a Puppet Float to be a 64 bit (IEEE 754 binary64)
>     when used as a resource attribute or anywhere in external formats.
>
>     With respect to intermediate results, I propose that we specify that
>     values are of arbitrary size and that it is an error to store a value
>
>
>
> What, specifically, does it mean to "store a value"?  Does that mean to
> assign it to a resource attribute?

It was vague on purpose since I cannot currently enumerate the places
where this should take place, but I was thinking resource attributes at
least.



Surely there is a medium between "vague" and "enumerating all possibilities".  Or in the alternative, a minimum set of places where Big values must be allowed could be given.  Otherwise the proposal is insufficiently defined to reason about, much less implement.

 
>
>     that is too big for the typed representation Integer (64 bit signed).
>     For
>     Float (64 bit) representation there is no error, but it loses
>     precision.
>
>
>
> What about numbers that overflow or underflow a 64-bit Float?
>

 
That would also be an error (when it cannot lose more precision). 


IEEE floating-point underflow occurs not when a number cannot lose more precision, but rather when it is nonzero but so small that it does not have a normalized representation in the chosen floating-point format.  Among IEEE 64-bit doubles, these are nonzero numbers having absolute value less than 2^-1022.  Almost all such subnormal representations can lose more precision in the sense that there are even less precise subnormals, but they already have less precision than is usual for the format.
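
Ruby's Float is the same IEEE 754 binary64, so the boundary is easy to
poke at directly:

  Float::MIN             # => 2.2250738585072014e-308, the smallest
                         #    *normalized* positive double
  tiny = Float::MIN / 2  # nonzero but subnormal: underflow is gradual
  tiny > 0               # => true; not yet a jump to zero
  Float::MIN * Float::EPSILON  # 2^-1074 (~4.9e-324), the smallest
                               # positive subnormal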



>     When specifying an attribute to have Number type, automatic
>     conversion to Float (with loss of precision) takes place


This happens only when the value is "stored", I presume?

 
if an internal
>     integer number is to big for the Integer representation.
>
>     (Note, by default, attributes are typed as Any, which means that
>     they by
>     default would store a Float if the integer value representation
>     overflows).
>
>
>
> And if BigDecimal (and maybe BigInteger) were added to the type system,
> then I presume the expectation would be that over/underflowing Floats
> would go there?  And maybe that overflowing integers would go there if
> necessary to avoid loss of precision?
>

If we add them, then the runtime should be specified to gracefully
choose the required size while calculating


I thought the whole reason for the proposal and discussion was that Ruby already does handle these gracefully, hence Puppet already has Big values.

 
and that the types Any and
Number mean that they are accepted, but Integer and Float do not
accept them (when they have values that are outside the valid range). (I
have not thought this through completely at this point, I must say.)


Clarification: I have no objection to limiting the values allowed for types Integer and Float, as specified in the proposal.  What I am concerned about is Puppet pulling back from supporting the full range of numeric values it supports now (near-arbitrary range and precision).
 

>
> 1) If you have BigDecimal then you don't need BigInteger.
>
True, but BigInteger specifies that a fraction is not allowed.


Supposing that I persuaded you that the type system should include a BigDecimal type in some form, I would be completely satisfied to leave it to you to decide whether it should also include a BigInteger type.

 

> 2) Why would allowing one or both of the Bigs prevent Number from being
> allowed as a serializable type?
>
Not sure I said that. The problem is that if something is potentially
Big... then a database must be prepared to deal with it, and that has a
high cost.


Every Puppet value is potentially a Big now.  What new cost is involved?  I'm having trouble seeing how a database can deal efficiently with Puppet's current implicit typing anyway, Big values notwithstanding.  Without additional type information, it must be prepared for any given value to be a boolean, an Integer, a float, or a 37kb string (among other possibilities).  Why do Big values present an especial problem in that regard?

 
since everything is
basically untyped now (which we translate to the type Any), this means
that PuppetDB must be changed to use BigDecimal instead of integer 64
and float. That is a lose^3: it is lots of work to implement, bad
performance, and everyone needs to type everything.



Well, someone needs to type everything, somehow.  Typing is an inherent aspect of any representation of any value.  Indeed, it is loose to call Puppet values "untyped"; they are definitely typed (try inline_template('<%= type($my_variable) >') some time), but the type is not necessarily known a priori.  It is also loose to call Puppet 3 expressions "untyped" -- it is more precise to say that expressions, including variable dereferences, are implicitly typed.

But yes, for efficient numeric storage representations to be used, the types of the values to be stored must be among those for which an efficient representation is available.  MOREOVER, unless the storage mechanism is prepared to adapt dynamically to the types of the values presented to it, the specific types of those values must be known in advance, and they must be consistent.  In that sense everyone does need to type everything, regardless of whether any Big types are among the possibilities.

If you do suppose a type-adaptive storage mechanism (so that people don't need to type everything) then the mere possibility of Big values does not impose any additional inefficiency.  The actual appearance of Big values might be costly, but if such a value is in fact presented for storage then is it not better to faithfully store it than to fail?

 
> I think disallowing Bigs in the serialization formats will present its
> own problems, only some of which you have touched on so far.  I think
> the type system should offer /opportunities/ for greater efficiency in
> numeric handling, rather than serving as an excuse to limit numeric
> representations.
>

I don't quite get the point here - the proposed cap is not something
that the type system needs. As an example MsgPack does not have standard
Big types, thus a serialization will need to be special and it is not
possible to just use something like "readInt" to get what you know
should be an integer value. The other example is PuppetDB, where a
decision has to be made how to store integers; the slower Big types, or
a more efficient 64 bit value? This is not just about storage, also
> about indexing speed and query/comparison - and if thinking that some
values are stored as 64 bits and other as big type for the same entity
that would be even slower to query for.


As I said already, I am not against the proposed caps.  Rather, I am urging you to not categorically forbid serialization of Big values.

Possibly you could allow serialization to some formats -- such as MsgPack -- to fail on Bigs, but it's not clear to me even in that case why failure/nothing is better than something.  The issue is the data, not the format -- Puppet (currently) supports Big values, so if it needs to serialize values then it needs to serialize Bigs.

As for PuppetDB in particular, you have storage, indexing, and comparison problems (or else fidelity problems) for any number that is not specifically typed Integer or Float.  Number is not specific enough, even without Bigs, and Any certainly isn't.  If PuppetDB is type-adaptive then Bigs shouldn't present any special problem.  If it isn't, then it needs explicit typing (as Integer or Float) for efficiency anyway, so Bigs shouldn't present any special problem.



So - the idea is to make it safe and efficient for the normal cases.
Only when there is a special case (if indeed we do need the big types)
do we take the less efficient route.



Ok, but I don't see how making it an error to "store" a Big value serves that principle.  Safe and efficient for storage of Integers and Floats allows use of native numeric formats; safe and efficient storage for Number or Any does not (even if storing a Big were an error).  Numbers that can be represented only as Bigs will not be typed Integer or Float.  If such numbers (or such formal types) constitute a special case then fine, but let there be a "less efficient route" (with full fidelity) for that.


John

Henrik Lindberg

Sep 3, 2014, 8:37:22 PM
to puppe...@googlegroups.com
Sorry. At this point I am interested in feedback: do we really need Big
types, and how do you deal with this today? Thoughts on where it is
important, edges where values have to be represented in some other form
than a live Ruby object, etc.

Two places are naturally catalogs and facts, but we have many other uses
of the data that is collected - so I can't enumerate them. I am thinking
of anything that is the result of evaluating puppet logic and is stored
or sent somewhere.

> >
> > that is too big for the typed representation Integer (64 bit
> signed).
> > For
> > Float (64 bit) representation there is no error, but it loses
> > precision.
> >
> >
> >
> > What about numbers that overflow or underflow a 64-bit Float?
> >
>
> That would also be an error (when it cannot lose more precision).
>
>
>
> IEEE floating-point underflow occurs not when a number cannot lose more
> precision, but rather when it is nonzero but so small that it does not
> have a normalized representation in the chosen floating-point format.
> Among IEEE 64-bit doubles, these are nonzero numbers having absolute
> value less than 2^-1022. Almost all such subnormal representations
> /can/ lose more precision in the sense that there are even less precise
> subnormals, but they already have less precision than is usual for the
> format.
>
Thanks, I am really not an expert on floating point representation.
Clearly, I am using the wrong terms.

>
>
> > When specifying an attribute to have Number type, automatic
> > conversion to Float (with loss of precision) takes place
>
>
>
> This happens only when the value is "stored", I presume?
>
> if an internal
> > integer number is to big for the Integer representation.
> >
> > (Note, by default, attributes are typed as Any, which means that
> > they by
> > default would store a Float if the integer value representation
> > overflows).
> >
> >
> >
> > And if BigDecimal (and maybe BigInteger) were added to the type
> system,
> > then I presume the expectation would be that over/underflowing
> Floats
> > would go there? And maybe that overflowing integers would go
> there if
> > necessary to avoid loss of precision?
> >
>
> If we add them, then the runtime should be specified to gracefully
> choose the required size while calculating
>
>
>
> I thought the whole reason for the proposal and discussion was that Ruby
> already does handle these gracefully, hence Puppet already has Big values.
>
Yes, the Puppet runtime has this since it is currently written in Ruby.
The specification is not tied to a particular implementation. If you
write a C, Java, or Haskell implementation, what is it required to do...

> and that the types Any and
> Number means that they are accepted, but that Integer and Float does
> not
> accept them (when they have values that are outside the valid
> range). (I
> have not thought this through completely at this point I must say).
>
>
>
> Clarification: I have no objection to limiting the values allowed for
> types Integer and Float, as specified in the proposal. What I am
> concerned about is Puppet pulling back from supporting the full range of
> numeric values it supports now (near-arbitrary range and precision).
>

The problem is that it does not really support them. It is not
specified, it is not tested, and you cannot roundtrip such values
through PuppetDB, nor serialize catalogs containing them with MsgPack.

>
> >
> > 1) If you have BigDecimal then you don't need BigInteger.
> >
> True, but BigInteger specifies that a fraction is not allowed.
>
>
>
> Supposing that I persuaded you that the type system should include a
> BigDecimal type in some form, I would be completely satisfied to leave
> it to you to decide whether it should also include a BigInteger type.
>
:-)

>
> > 2) Why would allowing one or both of the Bigs prevent Number from
> being
> > allowed as a serializable type?
> >
> Not sure I said that. The problem is that if something is potentially
> Big... then a database must be prepared to deal with it and it has a
> high cost.
>
>
>
> /Every/ Puppet value is potentially a Big /now/. What new cost is
> involved? I'm having trouble seeing how a database can deal efficiently
> with Puppet's current implicit typing anyway, Big values
> notwithstanding. Without additional type information, it must be
> prepared for any given value to be a boolean, an Integer, a float, or a
> 37kb string (among other possibilities). Why do Big values present an
> especial problem in that regard?
>
I leave that one for Ken Barber, as I am not 100% sure on the design in
Puppet DB.

> since everything is
> basically untyped now (which we translate to the type Any), this means
> that PuppetDB must be changed to use BigDecimal instead of integer 64
> and float. That is a lose^3: it is lots of work to implement, bad
> performance, and everyone needs to type everything.
>
>
>
> Well, /some/one needs to type everything, somehow. Typing is an
> inherent aspect of any representation of any value. Indeed, it is loose
> to call Puppet values "untyped"; they are definitely typed (try
> inline_template('<%= type($my_variable) %>') some time), but the type is
> not necessarily known /a priori/. It is also loose to call Puppet 3
> expressions "untyped" -- it is more precise to say that expressions,
> including variable dereferences, are implicitly typed.
>
> But yes, for efficient numeric storage representations to be used, the
> types of the values to be stored must be among those for which an
> efficient representation is available. MOREOVER, unless the storage
> mechanism is prepared to adapt dynamically to the types of the values
> presented to it, the specific types of those values must be known in
> advance, and they must be consistent. In that sense everyone *does*
John, thanks for all the thoughts on this topic. There are many valid
points and things to take into consideration. Will have another go with
Ken Barber on how certain things work in Puppet DB.

I plan to come back with a more coherent and complete proposal :-)

Cheers
- henrik

>
> John
>


Ken Barber

Sep 4, 2014, 10:40:43 AM
to puppe...@googlegroups.com
>> > 2) Why would allowing one or both of the Bigs prevent Number from being
>> > allowed as a serializable type?
>> >
>> Not sure I said that. The problem is that if something is potentially
>> Big... then a database must be prepared to deal with it and it has a
>> high cost.
>
>
>
> Every Puppet value is potentially a Big now. What new cost is involved?
> I'm having trouble seeing how a database can deal efficiently with Puppet's
> current implicit typing anyway, Big values notwithstanding. Without
> additional type information, it must be prepared for any given value to be a
> boolean, an Integer, a float, or a 37kb string (among other possibilities).
> Why do Big values present an especial problem in that regard?

So right now, we have alternating native PostgreSQL columns for the
bare types: text, biginteger, boolean, double precision. This provides
us with the ability to use the most optimal index for the type, and of
course to avoid storing any more than we need to. As I mentioned at the
top of the thread, we specifically do not support arbitrary precision
decimals.

As at least one example: in PostgreSQL, once you jump to, say, a numeric
column type, the performance characteristics and index efficiency
change for the worse. The same goes for floats stored in a numeric.
This is why you are usually better off avoiding the conversion until you
overflow and absolutely need a decimal.

ken.

Henrik Lindberg

Sep 4, 2014, 1:40:00 PM
to puppe...@googlegroups.com
Does that mean you (i.e. Puppet DB) are already prepared to handle "any
type", so that if we encode all numbers smaller than int64/binary64 as
such, and only larger values as a Big type, it would work by adding
those as alternative storage forms?

Or do you expect a given named attribute to have a given storage type at
all times?

Ken Barber

Sep 4, 2014, 2:20:20 PM
to puppe...@googlegroups.com
>> So right now, we have alternating native postgresql columns for the
>> bare types: text, biginteger, boolean, double precision. This provides
>> us with the ability to use the most optimal index for the type, and of
>> course avoid storing any more then we need to. As I mentioned at the
>> top of the thread, we specifically do not support arbitrary precision
>> decimals.
>>
>> At least one example in PostgreSQL, once you jump to say, a numeric
>> column type the performance characteristics and index efficiency
>> changes for the worse. Same goes for floats being stored in a numeric.
>> This is why you are better off avoiding the conversion until you
>> overflow and absolutely need a decimal usually.
>>
>> ken.
>>
> Does that mean you (i.e. Puppet DB) are already prepared to handle "anytype"
> so that if we encode all numbers smaller than int64/binary64 as such, and
> only larger values as a BigType then it would work by adding those as
> alternative storage forms?
>
> Or do you expect a given named attribute to have a given storage type at all
> times?

I presume you mean trying to force large numbers into an encoding, and
expecting the edge applications to encode/decode?

Yes, this is possible; however, it doesn't help the PuppetDB API
consumers if, say, a large number is encoded in some format that our
query capabilities do not understand. Operators like > and < won't work
as expected, as a point example, and consumers would be forced to do the
decoding of our responses themselves.

So it's probably better to store numerics (as in arbitrary precision
numbers) as numerics ... and take advantage of the DB's native storage
for that in this case; otherwise we'd have to encode/decode ourselves
in the server (which we haven't done before) ... it's far more optimal
to push this work into PostgreSQL, is my main point. And we could
probably do this; the type in Java becomes a BigDecimal more or less
as it moves out of JSON (and that's also basically the type you get
back via JDBC when you query a numeric, I believe).

I would think if we had transport encoding issues (like msgpack not
supporting a larger type natively) we could decode before we store, I
guess. That's an alternative. It means traversing the tree to find
these cases and modifying them on the float perhaps. Things like
zippers in Clojure make this easier.

Henrik, you keep mentioning msgpack won't support arbitrary storage,
and I can't see it in the spec either. CBOR supports it (Bignums &
Bigfloats); I just wanted to point that out. I've been pondering the
CBOR/msgpack story ever since I saw a split in the community around
these extensive type issues, looking for reasons to use either/or,
that's all. Just something to keep an eye on.

ken.

Ken Barber

Sep 4, 2014, 2:21:47 PM
to puppe...@googlegroups.com
> I would think if we had transport encoding issues (like mspack not
> supporting a larger type natively) we could decode before we store I
> guess. Thats an alternative. It means traversing the tree to find
> these cases and modifying them on the float perhaps. Things like
> zipper in clojure make this easier.

I meant "on the fly" not on the float .... or "as we ingest the data"
is more accurate.

ken.

John Bollinger

Sep 4, 2014, 2:48:35 PM
to puppe...@googlegroups.com


On Thursday, September 4, 2014 9:40:43 AM UTC-5, Ken Barber wrote:
>
> Every Puppet value is potentially a Big now.  What new cost is involved?
> I'm having trouble seeing how a database can deal efficiently with Puppet's
> current implicit typing anyway, Big values notwithstanding.  Without
> additional type information, it must be prepared for any given value to be a
> boolean, an Integer, a float, or a 37kb string (among other possibilities).
> Why do Big values present an especial problem in that regard?

So right now, we have alternating native postgresql columns for the
bare types: text, biginteger, boolean, double precision. This provides
us with the ability to use the most optimal index for the type, and of
course avoid storing any more than we need to. As I mentioned at the
top of the thread, we specifically do not support arbitrary precision
decimals.

At least one example in PostgreSQL, once you jump to say, a numeric
column type the performance characteristics and index efficiency
changes for the worse. Same goes for floats being stored in a numeric.
This is why you are better off avoiding the conversion until you
overflow and absolutely need a decimal usually.



Thanks, Ken.  Could you devote a few words to how PuppetDB chooses which of those alternative columns to use for any particular value, and how it afterward tracks which one has been used?

I'm also curious about whether index efficiency in PostgreSQL (as an example) takes a significant hit just from an index being defined on a Numeric/Decimal column or whether the impact depends strongly on the number of non-NULL values in that column.

Additionally, I'm curious about how (or whether) the alternative column approach interacts with queries.  Do selection predicates against Any type values typically need to consider multiple (or all) of the value columns?


John

Ken Barber

Sep 4, 2014, 3:14:25 PM
to puppe...@googlegroups.com
> Thanks, Ken. Could you devote a few words to how PuppetDB chooses which of
> those alternative columns to use for any particular value, and how it
> afterward tracks which one has been used?

So PuppetDB (in particular fact-contents, the part that stores leaf
values) makes a decision using a very basic forensic function in
Clojure:

https://github.com/puppetlabs/puppetdb/blob/master/src/com/puppetlabs/puppetdb/facts.clj#L114-L124

We store that ID, which really maps to another lookup table (more for
referential integrity purposes than anything). We also use that ID to
make the decision as to which column we use:
value_string/integer/float/boolean/null etc.
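
In spirit (not PuppetDB's actual code - that is the Clojure linked
above), the dispatch looks roughly like this Ruby sketch, with made-up
names:

  # Rough rendering of the kind of leaf-value dispatch facts.clj does;
  # the names and the column mapping are illustrative only.
  def value_type(value)
    case value
    when nil         then :null
    when true, false then :boolean
    when Integer     then :integer  # -> the bigint column
    when Float       then :float    # -> the double precision column
    when String      then :string   # -> the text column
    else raise ArgumentError, "unsupported leaf value: #{value.class}"
    end
  end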

> I'm also curious about whether index efficiency in PostgreSQL (as an
> example) takes a significant hit just from an index being defined on a
> Numeric/Decimal column or whether the impact depends strongly on the number
> of non-NULL values in that column.

It takes a hit because it requires more storage, and I believe in
theory the index optimisation around decimals is different also (but I
don't have the real numbers on that) ... integers are more commonly
optimised in their code base (pg's that is) because of their
wide-spread use in id columns. Again the same is true for smallints
versus bigints. This is of course conjecture without perf numbers to
back it to a degree, but I believe I'm probably correct. Of course
when it comes time to analyse this closer we'll prove it with real
perf tests as we usually do :-).

> Additionally, I'm curious about how (or whether) the alternative column
> approach interacts with queries. Do selection predicates against Any type
> values typically need to consider multiple (or all) of the value columns?

It's kind of interesting, and largely pivots on the operator. For
structured facts and the <, >, <=, >= operators ... we are forced to
interrogate both the integer and float columns (an OR clause in SQL,
basically), because a user would presume that's how it worked. In a way
this is a coercion. If we introduced a decimal, we would have to do
the same again, especially if it was an overflow where its other
related numbers are still integers. In theory (and this needs to be
backed with perf numbers), even while we do this, the decimal column
should in theory be sparser than the integer column (and therefore
quicker overall to traverse), so we could have our cake and eat it too
with just a mild perf hit. If they were all decimals, then we are
locking ourselves into the performance of decimal for all numbers.

Another example ... the ~ operator only works on strings of course,
and in fact benefits from the new trgm indexes we've started to
introduce in 9.3. This wasn't as simple before: the normal index for a
text column actually has a maximum size limit ... this is a fact not
many devs realise. It also isn't used for regexp queries :-). But I
digress ...

So in short, we hide this internal float/integer comparison problem
from the user. In my mind, they are both "numbers" and we try to
expose it that way, but internally they are treated with the most
optimal storage we can provide.

ken.

John Bollinger

Sep 5, 2014, 4:12:04 PM
to puppe...@googlegroups.com
Thanks again, Ken.


On Thursday, September 4, 2014 2:14:25 PM UTC-5, Ken Barber wrote:
> Thanks, Ken.  Could you devote a few words to how PuppetDB chooses which of
> those alternative columns to use for any particular value, and how it
> afterward tracks which one has been used?

So PuppetDB, in particular fact-contents, and the way it stores leaf
values makes a decision using a very basic forensic function in
clojure:

https://github.com/puppetlabs/puppetdb/blob/master/src/com/puppetlabs/puppetdb/facts.clj#L114-L124

We store that ID, which really maps to another lookup table (more for
referential integrity purposes than anything). We also use that ID to
make the decision as to which column we use:
value_string/integer/float/boolean/null etc.



I am not very fluent in Clojure, but it looks like that scheme could easily be extended to support Bigs.

 
> I'm also curious about whether index efficiency in PostgreSQL (as an
> example) takes a significant hit just from an index being defined on a
> Numeric/Decimal column or whether the impact depends strongly on the number
> of non-NULL values in that column.

It takes a hit because it requires more storage, and I believe in
theory the index optimisation around decimals is different also (but I
don't have the real numbers on that) ... integers are more commonly
optimised in their code base (pg's that is) because of their
wide-spread use in id columns. Again the same is true for smallints
versus bigints. This is of course conjecture without perf numbers to
back it to a degree, but I believe I'm probably correct. Of course
when it comes time to analyse this closer we'll prove it with real
perf tests as we usually do :-).



Absolutely nothing beats bona fide tests :-).

 
> Additionally, I'm curious about how (or whether) the alternative column
> approach interacts with queries.  Do selection predicates against Any type
> values typically need to consider multiple (or all) of the value columns?

It's kind of interesting, and largely pivots on the operator. For
structured facts and the <, >, <=, >= operators ... we are forced to
interrogate both the integer and float columns (an OR clause in SQL,
basically), because a user would presume that's how it worked.


I suspected as much.

 
In a way
this is a coercion. If we introduced a decimal, we would have to do
the same again, especially if it was an overflow where its other
related numbers are still integers. In theory (and this needs to be
backed with perf numbers), even while we do this, the decimal column
should in theory be sparser than the integer column (and therefore
quicker overall to traverse), so we could have our cake and eat it too
with just a mild perf hit. If they were all decimals, then we are
locking ourselves into the performance of decimal for all numbers.



I have been supposing that both in Ruby and in PuppetDB, the numeric representation would be the smallest / most efficient one that could accommodate the value without loss of fidelity.  That would mean any additional column supporting Big values would likely be very sparsely populated indeed, and also that PuppetDB could conceivably be clever enough to avoid checking that column at all for some queries.  For example, a query with criteria (val >= 1 and val < 100) excludes all values that would need to be represented in a Big format.  Dunno whether that would help much.

 
Another example ... the ~ operator only works on strings of course,
and in fact benefits from the new trgm indexes we've started to
introduce in 9.3. This wasn't as simple before, the normal index for a
text column actually has a maximum size limit ... this is a fact not
many devs realise. It also isn't used for regexp queries :-). But I
digress ...

So in short, we hide this internal float/integer comparison problem
from the user. In my mind, they are both "numbers" and we try to
expose it that way, but internally they are treated with the most
optimal storage we can provide.



That's pretty much what I hoped and expected.  I am supposing that the same philosophy could be extended to cover Big values without too much difficulty, but I guess the question of how costly that might be would need to be determined by testing.


John
