Hi,
Recently I have been looking into serialization of various kinds, and
the issue of how we represent and serialize/deserialize numbers has
come up.
Proposal
========
I would like to cap a Puppet Integer to be a 64 bit signed value when used
as a resource attribute, or anywhere in external formats. This means a
value range of -2^63 to 2^63-1, which is in exabyte range (1 exabyte = 2^60 bytes).
I would like to cap a Puppet Float to be a 64 bit float (IEEE 754 binary64)
when used as a resource attribute or anywhere in external formats.
With respect to intermediate results, I propose that we specify that
values are of arbitrary size and that it is an error to store a value
that is too big for the typed representation Integer (64 bit signed). For
the Float (64 bit) representation there is no error, but it loses
precision.
When specifying an attribute to have Number type, automatic
conversion to Float (with loss of precision) takes place if an internal
integer number is too big for the Integer representation.
(Note: by default attributes are typed as Any, which means that they
would by default store a Float if the integer value representation overflows.)
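To make the proposed boundaries concrete, here is what they look like in
terms of JVM longs and doubles (plain Clojure used as an executable
analogy - this is not Puppet syntax; the exact Puppet behaviour is what
is proposed above):

  ;; 64 bit signed integer boundaries (what Integer would be capped to):
  Long/MAX_VALUE                  ;=>  9223372036854775807   (2^63 - 1)
  Long/MIN_VALUE                  ;=> -9223372036854775808   (-2^63)

  ;; "error on overflow" vs "promote and lose precision":
  (+ Long/MAX_VALUE 1)            ;; throws ArithmeticException: integer overflow
  (+' Long/MAX_VALUE 1)           ;=> 9223372036854775808N   (arbitrary-size intermediate)
  (double (+' Long/MAX_VALUE 1))  ;=> 9.223372036854776E18   (Float representation, precision lost)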
Questions
=========
* Is it important that Javascript can be used to (accurately) read JSON
generated by Puppet? (If so, the limit needs to be 2^53, or values lose
precision - see the small example after this list.)
* Is it important in Puppet manifests to handle values larger than
2^63-1 (smaller than -2^63), and if so, why isn't it sufficient to
use a floating point value (with reduced precision)?
* If you think Puppet needs to handle very large values (yottabyte sized
disks?), should the language have convenient ways of expressing
such values, e.g. 42yb?
* Is it ok to automatically do a transformation to floating point if
values overflow and the type of an attribute is Number (as discussed
above)? I can imagine this making it difficult to efficiently represent
an attribute in a database, and support may vary between different
database engines.
* Do you think it is worth the trouble to add the types BigInteger and
BigDecimal to the type system to allow the representation to be more
precise? (Note that this makes it difficult to use standard number
representations in serialization formats.) This would mean that Number is
not allowed as an attribute/storage type (the user must choose Integer,
Float, or one of the Big... types).
* Do you think it should work as in Ruby? If so, are you ok with
serialization that is non-standard?
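To make the 2^53 question above concrete (again plain Clojure; a double
behaves the same way a JavaScript number does):

  (long (double 9007199254740991))  ;=> 9007199254740991       (2^53 - 1 round-trips exactly)
  (double 9007199254740993)         ;=> 9.007199254740992E15   (2^53 + 1 rounds back to 2^53)
  ;; i.e. a JavaScript/JSON consumer that parses numbers as doubles cannot
  ;; tell 2^53 and 2^53 + 1 apart.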
On 2014-02-09 22:23, John Bollinger wrote:
>
>
> On Monday, September 1, 2014 3:55:03 AM UTC-5, henrik lindberg wrote:
>
> Hi,
> Recently I have been looking into serialization of various kinds, and
> the issue of how we represent and serialize/deserialize numbers has
> come up.
>
>
> [...]
>
>
> Proposal
> ========
> I would like to cap a Puppet Integer to be a 64 bit signed value when used
> as a resource attribute, or anywhere in external formats. This means a
> value range of -2^63 to 2^63-1, which is in exabyte range (1 exabyte
> = 2^60 bytes).
>
> I would like to cap a Puppet Float to be a 64 bit (IEEE 754 binary64)
> when used as a resource attribute or anywhere in external formats.
>
> With respect to intermediate results, I propose that we specify that
> values are of arbitrary size and that it is an error to store a value
>
>
>
> What, specifically, does it mean to "store a value"? Does that mean to
> assign it to a resource attribute?
It was vague on purpose since I cannot currently enumerate the places
where this should take place, but I was thinking resource attributes at
least.
>
> that is too big for the typed representation Integer (64 bit signed). For
> the Float (64 bit) representation there is no error, but it loses
> precision.
>
>
>
> What about numbers that overflow or underflow a 64-bit Float?
>
That would also be an error (when it cannot lose more precision).
> When specifying an attribute to have Number type, automatic
> conversion to Float (with loss of precision) takes place if an internal
> integer number is too big for the Integer representation.
>
> (Note, by default, attributes are typed as Any, which means that
> they by
> default would store a Float if the integer value representation
> overflows).
>
>
>
> And if BigDecimal (and maybe BigInteger) were added to the type system,
> then I presume the expectation would be that over/underflowing Floats
> would go there? And maybe that overflowing integers would go there if
> necessary to avoid loss of precision?
>
If we add them, then the runtime should be specified to gracefully
choose the required size while calculating, and the types Any and
Number would accept such values, but Integer and Float would not
accept them (when their values are outside the valid range). (I
have not thought this through completely at this point, I must say.)
>
> 1) If you have BigDecimal then you don't need BigInteger.
>
True, but BigInteger specifies that a fraction is not allowed.
> 2) Why would allowing one or both of the Bigs prevent Number from being
> allowed as a serializable type?
>
Not sure I said that. The problem is that if something is potentially
Big... then a database must be prepared to deal with it, and that has a
high cost. Since everything is basically untyped now (which we translate
to the type Any), this means that PuppetDB would have to be changed to use
BigDecimal instead of 64 bit integer and float. That is a lose^3: it is
lots of work to implement, it performs badly, and everyone needs to type
everything.
> I think disallowing Bigs in the serialization formats will present its
> own problems, only some of which you have touched on so far. I think
> the type system should offer /opportunities/ for greater efficiency in
> numeric handling, rather than serving as an excuse to limit numeric
> representations.
>
I don't quite get the point here - the proposed cap is not something
that the type system needs. As an example, MsgPack does not have standard
Big types, so such a serialization would need to be special-cased and it
is not possible to just use something like "readInt" to get what you know
should be an integer value. The other example is PuppetDB, where a
decision has to be made about how to store integers: the slower Big types,
or a more efficient 64 bit value? This is not just about storage, but also
about indexing and query/comparison speed - and if some values are stored
as 64 bits and others as a Big type for the same entity, that would be
even slower to query.
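To illustrate the "readInt" point: with a 64 bit cap a binary format can
use fixed-width reads and writes. This is plain java.io from Clojure, just
to show the shape of the idea - it is not Puppet's or MsgPack's actual
wire format:

  (import '[java.io ByteArrayOutputStream ByteArrayInputStream
                    DataOutputStream DataInputStream])

  ;; write a capped integer as a fixed 8 bytes, read it back with readLong
  (let [buf (ByteArrayOutputStream.)]
    (.writeLong (DataOutputStream. buf) 9223372036854775807)
    (.readLong (DataInputStream. (ByteArrayInputStream. (.toByteArray buf)))))
  ;=> 9223372036854775807

  ;; a BigInteger has no fixed width, so every reader and writer would need
  ;; a special length-prefixed encoding instead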
So - the idea is to make it safe and efficient for the normal cases, and
only when there is a special case (if indeed we do need the Big types)
take the less efficient route.
>
> Every Puppet value is potentially a Big now. What new cost is involved?
> I'm having trouble seeing how a database can deal efficiently with Puppet's
> current implicit typing anyway, Big values notwithstanding. Without
> additional type information, it must be prepared for any given value to be a
> boolean, an Integer, a float, or a 37kb string (among other possibilities).
> Why do Big values present an especial problem in that regard?
So right now, we have alternative native PostgreSQL columns for the
bare types: text, bigint, boolean, double precision. This provides
us with the ability to use the most optimal index for each type, and of
course to avoid storing any more than we need to. As I mentioned at the
top of the thread, we specifically do not support arbitrary precision
decimals.
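To make that concrete, the layout is roughly of this shape - one narrowly
typed column per base type instead of a single catch-all numeric/text
column. This is only a sketch via clojure.java.jdbc; the connection
details, table and column names are illustrative, not PuppetDB's actual
schema:

  (require '[clojure.java.jdbc :as jdbc])

  ;; hypothetical connection details
  (def db {:subprotocol "postgresql"
           :subname     "//localhost/scratch"
           :user        "test"
           :password    "test"})

  (jdbc/execute! db
    ["CREATE TABLE fact_values (
        id             bigserial PRIMARY KEY,
        value_type_id  bigint NOT NULL,
        value_string   text,
        value_integer  bigint,
        value_float    double precision,
        value_boolean  boolean)"])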
To give at least one example in PostgreSQL: once you jump to, say, a
numeric column type, the performance characteristics and index efficiency
change for the worse. The same goes for floats being stored in a numeric.
This is why you are usually better off avoiding the conversion until you
overflow and absolutely need a decimal.
> Thanks, Ken. Could you devote a few words to how PuppetDB chooses which of
> those alternative columns to use for any particular value, and how it
> afterward tracks which one has been used?
So PuppetDB - in particular fact-contents and the way it stores leaf
values - makes that decision using a very basic forensic function in
Clojure:
https://github.com/puppetlabs/puppetdb/blob/master/src/com/puppetlabs/puppetdb/facts.clj#L114-L124
We store that ID, which really maps to another lookup table (more for
referential integrity purposes than anything). We also use that ID to
decide which column we use:
value_string/integer/float/boolean/null etc.
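The real decision function is the facts.clj code linked above; as a
paraphrase only (made-up ids, illustrative column names), its shape is
roughly:

  ;; illustrative paraphrase, not the actual PuppetDB implementation
  (defn value-type-id
    "Return the id of the column family a leaf fact value is stored in."
    [value]
    (cond
      (string? value)           0   ; -> value_string
      (integer? value)          1   ; -> value_integer
      (float? value)            2   ; -> value_float
      (instance? Boolean value) 3   ; -> value_boolean
      (nil? value)              4   ; -> value_null
      :else (throw (IllegalArgumentException.
                     (str "unsupported leaf value: " (pr-str value))))))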
> I'm also curious about whether index efficiency in PostgreSQL (as an
> example) takes a significant hit just from an index being defined on a
> Numeric/Decimal column or whether the impact depends strongly on the number
> of non-NULL values in that column.
It takes a hit because it requires more storage, and I believe in
theory the index optimisation around decimals is different also (but I
don't have the real numbers on that) ... integers are more commonly
optimised in their code base (pg's, that is) because of their
widespread use in id columns. Again, the same is true for smallints
versus bigints. This is of course conjecture to a degree, without perf
numbers to back it, but I believe I'm probably correct. Of course,
when it comes time to analyse this closer we'll prove it with real
perf tests as we usually do :-).
> Additionally, I'm curious about how (or whether) the alternative column
> approach interacts with queries. Do selection predicates against Any type
> values typically need to consider multiple (or all) of the value columns?
It's kind of interesting, and it largely pivots on the operator. For
structured facts and the <, >, <=, >= operators ... we are forced to
interrogate both the integer and float columns (an OR clause in SQL
basically), because a user would presume that's how it worked. In a way
this is a coercion. If we introduced a decimal, we would have to do
the same again, especially if it was an overflow whose other
related numbers are still integers. In theory (and this needs to be
backed with perf numbers), even while we do this, the decimal column
should be sparser than the integer column (and therefore quicker
overall to traverse), so we could have our cake and eat it too with just
a mild perf hit. If they were all decimals, then we are locking
ourselves into the performance of decimal for all numbers.
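For the curious, the shape of such a numeric comparison query is roughly
the following (illustrative table/column names, not PuppetDB's actual
query generator):

  (require '[clojure.java.jdbc :as jdbc])

  ;; a ">" comparison over a fact value has to interrogate both typed
  ;; columns with an OR, since the user just thinks of it as a "number"
  (defn values-greater-than [db n]
    (jdbc/query db
      [(str "SELECT id FROM fact_values "
            "WHERE value_integer > ? OR value_float > ?")
       n n]))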
Another example ... the ~ operator only works on strings of course,
and in fact benefits from the new trgm indexes we've started to
introduce in 9.3. This wasn't as simple before: the normal index for a
text column actually has a maximum size limit ... a fact not
many devs realise. It also isn't used for regexp queries :-). But I
digress ...
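The indexes being referred to are the pg_trgm GIN trigram indexes; roughly
(again with the illustrative table/column names from the earlier sketch):

  (require '[clojure.java.jdbc :as jdbc])

  ;; PostgreSQL 9.3 can use a trigram index like this for regexp (~) matches
  (defn add-trgm-index! [db]
    (jdbc/execute! db ["CREATE EXTENSION IF NOT EXISTS pg_trgm"])
    (jdbc/execute! db ["CREATE INDEX fact_values_string_trgm
                          ON fact_values USING gin (value_string gin_trgm_ops)"]))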
So in short, we hide this internal float/integer comparison problem
from the user. In my mind, they are both "numbers" and we try to
expose them that way, but internally they are treated with the most
optimal storage we can provide.