yeah.
this is also why the explicit-endian variables may be implicitly
misaligned. it would also be possible to specify explicit alignment via
an attribute, but this isn't currently supported.
the implementation generally glosses over whatever the hardware does
(so, it can still do unaligned operations even on targets which only
support aligned loads/stores, by internally using byte operations and
shifts).
currently, my C compiler/VM doesn't do SIMD, but this might get
implemented eventually.
the keywords were '__bigendian' and '__ltlendian'.
ex:
__bigendian int x;
or:
i=*(__bigendian int *)ptr;
possible explicit alignment value:
__declspec(align(2)) __bigendian int x;
or (if another keyword were added):
__align(2) __bigendian int x;
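as a rough sketch of what the compiler might emit for a '__bigendian
int' load on a target without unaligned access support: byte loads plus
shifts, which are both alignment- and endian-agnostic (the function name
here is made up for illustration, not what the compiler actually emits):

```c
#include <stdint.h>

/* illustrative lowering of a '__bigendian int' load: assemble the value
 * from individual bytes, so it works regardless of target endianness or
 * pointer alignment. */
static int32_t load_bigendian_i32(const void *ptr)
{
    const uint8_t *p = (const uint8_t *)ptr;
    return (int32_t)(((uint32_t)p[0] << 24) |
                     ((uint32_t)p[1] << 16) |
                     ((uint32_t)p[2] <<  8) |
                      (uint32_t)p[3]);
}
```

the cost is a few extra instructions per access, which is why a compiler
would only emit this form when the target can't do it natively.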
>> typically, for compile-time stuff, there is a mess of #define's and
>> #ifdef's for figuring out the target architecture and other things,
>> picking appropriate type-sizes and setting values for things like
>> whether or not the target supports misaligned access, ...
>>
>> I guess it could be nicer if more of this were standardized.
>
> Yes, preprocessor metaprogramming is not much better than template
> metaprogramming if to read it. So these things are better to keep
> away from code, in a separate library.
>
usually, this is a 'whatever_conf.h' file which is copied/pasted around,
sometimes with tweaks.
it also does explicit-size types, partly because I often use a compiler
which until very recently did not support 'stdint' stuff...
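a typical fragment of such a header looks something like the following
(names and the exact #ifdef ladder here are illustrative, not the actual
file):

```c
/* sketch of a 'whatever_conf.h' style header: #ifdef's picking
 * explicit-size types and recording target properties. */

#if defined(_MSC_VER)
typedef unsigned __int32 conf_u32;
typedef unsigned __int64 conf_u64;
#else
typedef unsigned int conf_u32;
typedef unsigned long long conf_u64;
#endif

#if defined(__i386__) || defined(__x86_64__) || \
    defined(_M_IX86) || defined(_M_X64)
#define CONF_MISALIGNED_OK 1  /* target tolerates misaligned access */
#else
#define CONF_MISALIGNED_OK 0  /* play it safe: byte loads + shifts */
#endif
```

code elsewhere then checks CONF_MISALIGNED_OK to decide between a direct
load and the byte-at-a-time fallback.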
offsets are used because this saves needing to scan over the payload
lumps and rebuild the table. granted, the cost of this probably wouldn't
be huge, but the table will be needed either way, and if it is already
present in the data, it doesn't need to be built.
note that the file is not read-in/loaded sequentially, but is basically
loaded into the address space and used in-place.
it is Tag/Length/Value, yes.
the format is vaguely akin to if IFF and ASN.1 BER were fused together.
ASN.1 style tags are used mostly for compact structures (with typically
about 2 bytes of overhead), with TWOCC for intermediate structures (4 or
6 bytes overhead), and FOURCC for top-level structures (8 or 12 bytes of
overhead).
* 0x00-0x1F: Public Primitive (Class=0)
* 0x20-0x3F: Public Composite (Class=1)
* 0x40-0x5F: Private Primitive (Class=2)
* 0x60-0x7F: Private Composite (Class=3)
* 0x80-0x9F: Context Primitive (Class=4)
* 0xA0-0xBF: Context Composite (Class=5)
** bit layout ccct-tttt:
*** ccc = 3-bit class, ttttt = 5-bit tag
*** tag=0..30, tag encoded directly
*** tag=31, tag is escape coded.
* 0xC0-0xDF: Reserved
* 0xE0-0xFF: Special Markers
** 0xE0, End Of Data
** 0xE1, len:WORD24
** 0xE2, len:BYTE
*** Context Dependent Untagged Data
** 0xE3, len:WORD24, tag:TWOCC
** 0xE4, len:WORD24, tag:FOURCC
** 0xE5, len:BYTE, tag:TWOCC
** 0xE6, len:WORD56, tag:FOURCC
** 0xE7, len:WORD24, tag:EIGHTCC
** 0xE8, len:WORD24, tag:SIXTEENCC
** 0xE9, len:WORD56, tag:EIGHTCC
** 0xEA, len:WORD56, tag:SIXTEENCC
*** Tagged Markers
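decoding the leading byte per the table above might look roughly like
this (struct and function names are made up for illustration; the real
format obviously has more to it, like reading the escape-coded tags and
the length fields that follow the markers):

```c
#include <stdint.h>

/* illustrative decoder for the leading tag byte:
 * 0x00-0xBF: ccct-tttt (ccc=class 0..5, ttttt=tag, 31=escape coded)
 * 0xC0-0xDF: reserved
 * 0xE0-0xFF: special markers */
typedef struct {
    int cls;         /* class 0..5, or -1 for reserved */
    int tag;         /* direct tag, or the marker byte itself */
    int is_marker;   /* byte was in the 0xE0..0xFF range */
    int tag_escaped; /* tag==31: real tag follows, escape coded */
} TagByte;

static TagByte decode_tag_byte(uint8_t b)
{
    TagByte t = {0, 0, 0, 0};
    if (b >= 0xE0) {          /* special markers */
        t.is_marker = 1;
        t.tag = b;
    } else if (b >= 0xC0) {   /* reserved range */
        t.cls = -1;
    } else {
        t.cls = (b >> 5) & 7; /* ccc: class */
        t.tag = b & 31;       /* ttttt: tag 0..30 direct */
        if (t.tag == 31)
            t.tag_escaped = 1;
    }
    return t;
}
```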
I have used variations on this design for a number of formats (it
actually started out mostly in some of my video codecs).
>>
>> *: the VM needs to be able to keep timing latencies bounded, which
>> basically weighs against doing anything in the VM where the time-cost
>> can't be easily predicted in advance. wherever possible, all operations
>> need to be kept O(1), with the operation either being able to complete
>> in the available time-step (generally 1us per "trace"), or the VM will
>> need to halt and defer execution until later (blocking is not allowed,
>> and any operations which may result in unexpected behaviors, such as
>> halting, throwing an exception, ... effectively need to terminate the
>> current trace, which makes them more expensive).
>>
>> for some related reasons, the VM is also using B-Trees rather than hash
>> tables in a few places (more predictable, if slower, than hashes, but
>> less memory waste than AVL or BST variants). likewise, because of their
>> structure, it is possible to predict in advance (based on the size of
>> the tree) approximately how long it will take to perform the operation.
>
> One interesting tree is splay that often works best in real application.
>
yeah.
splay trees are fast on average, but not necessarily all that
predictable or memory efficient (they are more like AVL in the
memory-efficiency sense, and leave open the possibility of an O(n)
worst case for a single operation).
B-Tree is not necessarily the fastest option, but should be fairly
predictable in this case.
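the predictability claim can be made concrete: for a B-Tree with minimum
degree t, the height is bounded by log_t((n+1)/2), so a worst-case step
count can be computed from the element count alone, before the operation
starts. a sketch (illustrative, not the VM's actual code):

```c
#include <stddef.h>

/* bound a B-Tree operation's cost from the element count alone.
 * with minimum degree t (each non-root node holds >= t-1 keys),
 * height h satisfies h <= log_t((n+1)/2); the loop computes the
 * floor of that logarithm by repeated division. */
static int btree_height_bound(size_t n, int t)
{
    int h = 0;
    size_t m = (n + 1) / 2;
    while (m >= (size_t)t) {
        m /= (size_t)t;
        h++;
    }
    return h;
}
```

the VM could check this bound against the remaining time budget and
defer the operation to the next trace if it can't complete in time.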
like, you really don't want a random edge-case throwing a wrench in the
timing, and fouling up external electronics by not updating the IO pins
at the correct times or something.
never mind that it is questionable to do high-level logic and also deal
with hardware-level IO on the same processor core, but alas... (why have
a separate ARM chip and an MCU, when you can save some money by just
doing everything on the main ARM chip?...).
>>> Above can't be done with embedded system so easily since it can affect price
>>> of unit to organize flashing the very same product differently for each
>>> market targeted. When access is sequential then polished Huffman decoding
>>> does actually rarely affect performance. So I have seen embedded systems
>>> keeping the text dictionaries Huffman encoded all time. If to keep texts
>>> Huffman encoded anyway then UCS-2 or UTF-16 are perfectly fine and there
>>> are no need for archaic tricks like Windows-1252 or Code Page 437.
>>
>> granted, but in this case, it is mostly for string literals, rather than
>> bulk text storage.
>
> The strings of application do not typically form some sort of bulk but a
> sort of dictionary of short texts.
>
not sure I follow.
in the case of the VM, there are basically a pair of string tables, one
for ASCII and UTF-8, and the other for UTF-16. each string is terminated
with a NUL character.
for the ASCII table, string references are given as byte offsets, with
how the string is encoded depending on context.
if C is compiled to the VM, it really doesn't care, since it has good
old 'char *', so will use pointers into the string table.
the script-language does care, but will implicitly declare the type as
part of the process of loading the string into a register (and, in this
case, the VM will remember the type of string).
though, granted, in this VM, the string type will be handled by
implicitly using tag bits in the reference. this allows using a
reference directly to the string-table memory, without needing an
intermediate structure (so, it is sort of like a normal 'char *'
pointer, just with a hidden type-tag in the reference).
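one way such a tagged reference could work (the tag layout and names
here are my own illustration, not the VM's actual encoding): if
string-table entries are at least 4-byte aligned, the low 2 bits of the
reference are always zero and can carry the string type:

```c
#include <stdint.h>

/* hide a string-type tag in the low bits of the reference itself,
 * assuming string-table entries are >= 4-byte aligned so the low
 * 2 bits are free. no intermediate structure is needed; masking
 * the tag off recovers a plain pointer. */
#define STRTAG_ASCII 0u
#define STRTAG_UTF8  1u
#define STRTAG_UTF16 2u

static uintptr_t strref_make(const void *p, unsigned tag)
    { return (uintptr_t)p | (uintptr_t)(tag & 3u); }

static unsigned strref_tag(uintptr_t r)
    { return (unsigned)(r & 3u); }

static const void *strref_ptr(uintptr_t r)
    { return (const void *)(r & ~(uintptr_t)3u); }
```

the C side can ignore the scheme entirely and use raw 'char *' pointers;
only the script-language side ever looks at the tag bits.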
>> Windows-1252 covers most general use-cases for text (and is fairly easy
>> to convert to/from UTF-16, as for most of the range the characters map 1:1).
>> CP-437 is good mostly for things like ASCII art and text-based UIs.
>>
>> for literals, it will be the job of the compiler to sort out which
>> format to use.
>>
>> bulk storage will tend to remain in compressed UTF-8.
>> though a more specialized format could be good.
>>
>> I had good results before compressing short fragments (such as character
>> strings) with a combination of LZ77 and MTF+Rice Coding, which for small
>> pieces of data did significantly better than Deflate or LZMA. however,
>> the MTF makes it slower per-character than a Huffman-based option.
>>
>> basically, options like Deflate or LZMA are largely ineffective for
>> payloads much under 200-500 bytes or so, but are much more effective as
>> payloads get bigger.
>
> Those packing algorithms are all for larger texts. With relatively short
> one-liners one can perhaps make some special sub-string-packing but it
> will be computationally expensive to pack.
>
well, as noted, I had some limited success with MTF+Rice.
though, there isn't much that can be done with short character strings.
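for reference, the MTF stage looks something like the following minimal
sketch (my own illustration of the general technique, not the codec's
actual code). the per-symbol list shuffle is the part that makes it
slower per character than a table-driven Huffman decoder:

```c
/* minimal move-to-front transform: each input byte is replaced by its
 * position in a recency list, then moved to the front of that list.
 * recently-seen bytes become small output values, which a Rice coder
 * can then store in few bits. */
static void mtf_encode(const unsigned char *src, unsigned char *dst, int n)
{
    unsigned char list[256];
    int i, j;
    for (i = 0; i < 256; i++)
        list[i] = (unsigned char)i;  /* identity initial order */
    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        for (j = 0; list[j] != c; j++)
            ;                        /* find current rank of c */
        dst[i] = (unsigned char)j;   /* emit the rank */
        for (; j > 0; j--)
            list[j] = list[j - 1];   /* shift up, move c to front */
        list[0] = c;
    }
}
```

the inner scan-and-shift is O(256) worst case per symbol, versus a
couple of table lookups for Huffman, which matches the speed tradeoff
noted above.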