
"C++: Size Matters in Platform Compatibility"


Lynn McGuire

Aug 16, 2019, 3:04:30 PM
"C++: Size Matters in Platform Compatibility"

https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility

Interesting. Especially on storing Unicode as UTF-8.

Lynn

Jorgen Grahn

Aug 16, 2019, 4:45:58 PM
["Followup-To:" header set to comp.lang.c++.]
Interesting that people still try to store/transfer data as manually
handled binary. He also didn't seem to care about endianness and
similar issues.

I mostly use text, but if I really wanted binary I suppose I'd look
into Google protobuf.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

David Brown

Aug 17, 2019, 9:09:04 AM
On 16/08/2019 21:04, Lynn McGuire wrote:
> "C++: Size Matters in Platform Compatibility"
>
> <https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility>
>

Please put < > marks around links.

>
> Interesting.  Especially on storing Unicode as UTF-8.
>

If you are storing data in binary formats that need to be portable, you
need to define and document the format, making it independent of the
platform. The author here glosses over the important bits - endianness
and alignment (and consequently padding), which he doesn't mention at all.

Time should not be stored in text format - that is very space
inefficient in a binary file, and even if you stick to ISO formats there
are huge numbers of possible variants. Often the most sensible format
is a 64-bit Unix epoch timestamp (happily covering the big bang up to
well past the death of our sun). IEEE double-precision floating point
Unix epoch seconds will last longer than protons, and give you accurate
fractions of a second for more normal timeframes. If these are not
suitable, define a format for your particular requirements.
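As a minimal, hedged sketch of the idea (my own example, not from the article):
pick one byte order, document it, and emit the 64-bit count explicitly rather
than dumping whatever the platform's in-memory layout happens to be:

#include <cstdint>
#include <cstdio>
#include <ctime>

// Serialise a signed 64-bit epoch-seconds value as 8 big-endian bytes.
// Big-endian is simply the documented choice of the format here, not a
// property of the platform doing the writing.
void write_timestamp_be(std::int64_t secs, unsigned char out[8])
{
    std::uint64_t u = static_cast<std::uint64_t>(secs);
    for (int i = 7; i >= 0; --i) {
        out[i] = static_cast<unsigned char>(u & 0xFF);
        u >>= 8;
    }
}

int main()
{
    unsigned char buf[8];
    write_timestamp_be(static_cast<std::int64_t>(std::time(nullptr)), buf);
    for (unsigned char b : buf)
        std::printf("%02X ", b);   // same byte sequence on every platform
    std::printf("\n");
}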

For character data where plain ASCII won't do, UTF-8 is the only sane
choice IMHO. Other formats - UTF-16, UTF-32, wchar_t, etc. - might be
used internally for dealing with historical APIs.

The author is right to use fixed size types for integers. However, it
doesn't make sense to have pointer types at all in a stored file or data
interchange format.


The other alternative is to switch entirely to a text format such as
JSON. It is a lot less efficient to parse in C or C++ (though once we
get introspection in C++, JSON generation and parsing will be a good
deal nicer). But it is easy to pass around between OS's, languages and
applications, easy to handle in most higher level languages, and easy
for debugging.





Geoff

Aug 17, 2019, 3:43:31 PM
On Sat, 17 Aug 2019 15:08:52 +0200, David Brown
<david...@hesbynett.no> wrote:

[snip]

>
>If you are storing data in binary formats that need to be portable, you
>need to define and document the format, making it independent of the
>platform. The author here glosses over the important bits - endianness
>and alignment (and consequently padding), which he doesn't mention at all.
>

He does mention endianness, but declined to discuss it, saying it
needs a separate discussion. You are the second poster here to
overlook that.

[snip]

> Often the most sensible formats
>is 64 bit Unix epoch timestamps (happily covering the big bang up to
>well past the death of our sun). IEEE floating point double Unix epoch
>seconds will last longer than protons, and give you accurate fractions
>of a second for more normal timeframes. If these are not suitable,
>define a format for your particular requirements.

The UNIX epoch-based 64-bit time is good only if you believe the sun
will only last another 2.147 billion years. The epoch time rolls over
in the year 2,147,485,547 on MacOS. The sun is calculated to have
another 5.4 billion years of hydrogen fuel left.

Proton decay, so far, is only theoretical. It has never been observed.

[snip]

David Brown

Aug 18, 2019, 6:32:32 AM
On 17/08/2019 21:43, Geoff wrote:
> On Sat, 17 Aug 2019 15:08:52 +0200, David Brown
> <david...@hesbynett.no> wrote:
>
> [snip]
>
>>
>> If you are storing data in binary formats that need to be portable, you
>> need to define and document the format, making it independent of the
>> platform. The author here glosses over the important bits - endianness
>> and alignment (and consequently padding), which he doesn't mention at all.
>>
>
> He does mention endianess and he declined to discuss it, saying it
> needs a separate discussion. You are the second poster here to
> overlook that.

I said he glossed over it, which he did. I did not say he ignored it
completely - he said it was an issue but did not discuss it further.

>
> [snip]
>
>> Often the most sensible formats
>> is 64 bit Unix epoch timestamps (happily covering the big bang up to
>> well past the death of our sun). IEEE floating point double Unix epoch
>> seconds will last longer than protons, and give you accurate fractions
>> of a second for more normal timeframes. If these are not suitable,
>> define a format for your particular requirements.
>
> The UNIX epoch-based 64-bit time is good only if you believe the sun
> will only last another 2.147 billion years.

292 billion years, by my counting. By that time, any issues with
rollover will be an SEP¹.

> The epoch time rolls over
> in the year 2,147,485,547 on MacOS. The sun is calculated to have
> another 5.4 billion years of hydrogen fuel left.

That sounds like a signed 32-bit integer storing the number of years
(not an unreasonable choice for a split time/date format).

>
> Proton decay, so far, is only theoretical. It has never been observed.

True. And perhaps humankind will have moved to a different, newer
universe by that time. We /could/ standardise on 128-bit second
timestamps, but I think it is important to leave a few challenges to
keep future generations motivated :-)

>
> [snip]
>

¹ <https://en.wikipedia.org/wiki/Somebody_else%27s_problem>

Anton Shepelev

Aug 18, 2019, 8:21:04 AM
Lynn McGuire:

> "C++: Size Matters in Platform Compatibility"
> https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility
> Interesting. Especially on storing Unicode as UTF-8.

My reaction to this article is in fir's manner: highly
skeptical. The author says obvious things that are much
better covered in material on network data transfer, and on
socket programming in particular. He adds nothing new or
original except wrong conclusions, e.g.:

time_t is not guaranteed to be interoperable between
platforms, so it is best to store time as text and
convert to time_t accordingly.

whereas the natural solution is any fixed-width binary
datetime format. Loth to do calendar arithmetic, I should
pass the fields separately, e.g.:

Date: 24 Time: 24
+-----------------+--------------------+
|BC? Year Mon Day | Hours Min Sec 1/128|
| 1 14 4 5 | 5 6 6 7 |
+-----------------+--------------------+

The date and time are separable three-byte blocks. Each
block is a packed structure with the fields interpreted as
big-endian, according to the network byte order.
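A rough sketch of packing the date block in that layout (this is just my
reading of the field widths in the diagram; the function name and argument
ranges are assumptions):

#include <cstdint>

// Pack the 24-bit date block: BC?(1), Year(14), Month(4), Day(5),
// most significant field first, emitted as three big-endian bytes.
void pack_date(bool bc, unsigned year, unsigned month, unsigned day,
               unsigned char out[3])
{
    std::uint32_t v = (static_cast<std::uint32_t>(bc) << 23)
                    | ((year  & 0x3FFFu) << 9)
                    | ((month & 0x0Fu)   << 5)
                    |  (day   & 0x1Fu);
    out[0] = static_cast<unsigned char>(v >> 16);
    out[1] = static_cast<unsigned char>(v >> 8);
    out[2] = static_cast<unsigned char>(v);
}

// The 24-bit time block - Hours(5), Min(6), Sec(6), 1/128ths(7) - packs the
// same way, with shifts of 19, 13, 7 and 0.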

--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Anton Shepelev

Aug 18, 2019, 9:00:07 AM
David Brown:

> However, it doesn't make sense to have pointer types at
> all in a stored file or data interchange format.

He is talking about the use of pointer types as integer
indices, as (he says) WinAPI does with DWORD_PTR.

Ben Bacarisse

Aug 18, 2019, 9:50:31 AM
Anton Shepelev <anto...@gmail.com> writes:

> ... Loth to do calendar arithmetics, I should
> pass the fields separately, e.g.:
>
> Date: 24 Time: 24
> +-----------------+--------------------+
> |BC? Year Mon Day | Hours Min Sec 1/128|
> | 1 14 4 5 | 5 6 6 7 |
> +-----------------+--------------------+
>
> The date and time are separable three-byte blocks. Each
> block is a packed structure with the fields interpreted as
> big-endian, according to the network byte order.

For the price of two bytes more you can make the ordering explicit
and avoid any shifting and masking. Make everything an octet:

century, year in century, month, day, hour, minute, second, 1/128ths

(Dates BC require the first byte to be signed: 6BC is year 94 in century
-1.)

You could, of course, use units of 512 years instead, at the expense of
having a slightly less human-debuggable format. The only gain is doing
a shift rather than a multiply by 100.
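As a sketch, that reading of the format in C++ is just eight one-byte fields
in order (relying on the fact that a struct of single-byte members needs no
padding; the struct and field names are mine):

#include <cstdint>

// One octet per field, in the order given above; no shifting or masking needed.
// The century byte is signed so that dates BC can be represented.
struct DateTime8 {
    std::int8_t  century;         // may be negative for BC dates
    std::uint8_t year_in_century; // 0..99
    std::uint8_t month;           // 1..12
    std::uint8_t day;             // 1..31
    std::uint8_t hour;            // 0..23
    std::uint8_t minute;          // 0..59
    std::uint8_t second;          // 0..60 (allowing a leap second)
    std::uint8_t fraction_128ths; // 0..127
};
static_assert(sizeof(DateTime8) == 8, "single-byte members should need no padding");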

--
Ben.

David Brown

Aug 18, 2019, 10:32:42 AM
On 18/08/2019 14:59, Anton Shepelev wrote:
> David Brown:
>
>> However, it doesn't make sense to have pointer types at
>> all in a stored file or data interchange format.
>
> He is talking about the use of pointer types as integer
> indices, as (he says) WinAPI does with DWORD_PTR.
>

It makes sense to use integer indices in a file format - use a suitable
integer size for the job (using fixed size types, like uint32_t or
int64_t). It doesn't make sense to use pointers, because they depend on
the memory address your data happens to lie at, which will not be the
same when the file or data is read again. So using some sort of
"integer pointer type" (like a pre-C99 uintptr_t alternative) makes
little sense.
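As an illustrative sketch (the type and field names are mine, not the
article's): a record that would hold a pointer in memory holds a fixed-size
index in the stored form, and the index stays valid wherever the table is
loaded:

#include <cstddef>
#include <cstdint>
#include <vector>

// Stored/interchange form of a linked record: the "next" link is an index
// into the record table, not a memory address.
struct NodeRecord {
    std::uint32_t payload;
    std::uint32_t next_index;   // index into the table; 0xFFFFFFFF means "none"
};

constexpr std::uint32_t kNoNode = 0xFFFFFFFFu;

// Walking the chain works no matter what address the table was loaded at.
std::size_t chain_length(const std::vector<NodeRecord>& table, std::uint32_t start)
{
    std::size_t n = 0;
    for (std::uint32_t i = start; i != kNoNode; i = table[i].next_index)
        ++n;
    return n;
}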

Geoff

Aug 18, 2019, 4:10:35 PM
On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
<david...@hesbynett.no> wrote:

>> The UNIX epoch-based 64-bit time is good only if you believe the sun
>> will only last another 2.147 billion years.
>
>292 billion years, by my counting. By that time, any issues with
>rollover will be an SEP¹.
>
>>

Where are you getting that? time_t is the number of seconds since
January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
billion years.

Bart

Aug 18, 2019, 4:15:18 PM
On 18/08/2019 21:10, Geoff wrote:
> On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
> <david...@hesbynett.no> wrote:
>
>>> The UNIX epoch-based 64-bit time is good only if you believe the sun
>>> will only last another 2.147 billion years.
>>
>> 292 billion years, by my counting. By that time, any issues with
>> rollover will be an SEP¹.
>>
>>>
>
> Where are you getting that? time_t is the number of seconds since
> January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
> billion years.
>

No, it's 584 billion years. Presumably 292 billion was due to using it
as a signed value (2**63 rather than 2**64 seconds).

Keith Thompson

Aug 18, 2019, 5:49:54 PM
Geoff <ge...@invalid.invalid> writes:
> On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
> <david...@hesbynett.no> wrote:
>
>>> The UNIX epoch-based 64-bit time is good only if you believe the sun
>>> will only last another 2.147 billion years.
>>
>>292 billion years, by my counting. By that time, any issues with
>>rollover will be an SEP¹.
>
> Where are you getting that? time_t is the number of seconds since
> January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
> billion years.

Yes, it is. A signed 64-bit time_t has a maximum value of 2**63-1.
2**63-1 seconds is just over 292 billion years.

The GNU coreutils "date" command gives an out of range error
for dates past Dec 31 in the year 2147485547, but that's not a
time_t issue. Similarly, the type struct tm has tm_year as an int,
which imposes the same limit on systems where int is 32 bits
(tm_year counts years since 1900, and 2,147,483,647 + 1900 = 2,147,485,547).

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson

Aug 18, 2019, 5:51:44 PM
Keith Thompson <ks...@mib.org> writes:
> Geoff <ge...@invalid.invalid> writes:
>> On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
>> <david...@hesbynett.no> wrote:
>>
>>>> The UNIX epoch-based 64-bit time is good only if you believe the sun
>>>> will only last another 2.147 billion years.
>>>
>>>292 billion years, by my counting. By that time, any issues with
>>>rollover will be an SEP¹.
>>
>> Where are you getting that? time_t is the number of seconds since
>> January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
>> billion years.
>
> Yes, it is. A signed 64-bit time_t has a maximum value of 2**63-1.
> 2**63-1 seconds is just over 292 billion years.

And I meant to mention that the C and C++ standards do not specify
that time_t represents seconds since 1970, though that is the most
common implementation.

David Brown

Aug 19, 2019, 2:43:25 AM
On 18/08/2019 22:10, Geoff wrote:
> On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
> <david...@hesbynett.no> wrote:
>
>>> The UNIX epoch-based 64-bit time is good only if you believe the sun
>>> will only last another 2.147 billion years.
>>
>> 292 billion years, by my counting. By that time, any issues with
>> rollover will be an SEP¹.
>>
>>>
>
> Where are you getting that? time_t is the number of seconds since
> January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
> billion years.
>

time_t doesn't /have/ to be seconds since 1970, but it is a good choice
for the type.

2 ^ 63 = 9.223372037×10¹⁸ seconds
ans / 3600 = 2.562047788×10¹⁵ hours
ans / 24 = 1.067519912×10¹⁴ days
ans / 365.24 = 292279025208.905503028 years
ans / 10⁹ = 292.279025208906 billion years



Alf P. Steinbach

Aug 19, 2019, 3:15:24 AM
On 19.08.2019 08:43, David Brown wrote:
> On 18/08/2019 22:10, Geoff wrote:
>> On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
>> <david...@hesbynett.no> wrote:
>>
>>>> The UNIX epoch-based 64-bit time is good only if you believe the sun
>>>> will only last another 2.147 billion years.
>>>
>>> 292 billion years, by my counting. By that time, any issues with
>>> rollover will be an SEP¹.
>>>
>>>>
>>
>> Where are you getting that? time_t is the number of seconds since
>> January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
>> billion years.
>>
>
> time_t doesn't /have/ to be seconds since 1970,

Doesn't /yet/.

As I recall C++20 standardizes on the conventional Unix epoch for
`<chrono>`, and it would then be weird if `time_t` had a different
epoch, so my best offhand guess is that it nails down that too.


> but it is a good choice for the type.
>
> 2 ^ 63 = 9.223372037×10¹⁸ seconds
> ans / 3600 = 2.562047788×10¹⁵ hours
> ans / 24 = 1.067519912×10¹⁴ days
> ans / 365.24 = 292279025208.905503028 years
> ans / 10⁹ = 292.279025208906 billion years

Oh look! Tabs!


Cheers!,

- Alf

Juha Nieminen

Aug 19, 2019, 3:29:43 AM
In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
> For character data where plain ASCII won't do, utf-8 is the only sane
> choice IMHO. Other formats - utf-16, utf-32, wchar_t, etc., might be
> used internally for dealing with historical API's.

The problem with UTF-8 is that it's quite complicated to parse and interpret,
and is wasteful in terms of space if the majority of the text is non-ascii
(quite typically almost all non-ascii characters take at least 3 bytes to
store as UTF-8.) It's the most efficient (spacewise) if the majority of the
text is ascii, but one still has to be prepared to recover any unicode
characters from it.

UTF-16 is simpler to parse (still not completely straightforward, but
much simpler), and often more space-efficient if the majority of the
text is non-ascii (such as Japanese).

Öö Tiib

Aug 19, 2019, 4:07:12 AM
On Monday, 19 August 2019 09:43:25 UTC+3, David Brown wrote:
>
> ans / 365.24 = 292279025208.905503028 years

A related issue is that a lot of code uses the <ctime> struct tm to
manipulate date-times. In it, the tm_year member is required to be int.
When int happens to have fewer than about 40 (*) bits, that number does
not fit into the member, so such software will be broken in conversions
between time_t and tm.

* - "About 40" is a rough estimate of how many bits that number needs.
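A small illustration of where that bites, using only the standard library
(this assumes a 64-bit time_t; gmtime is specified to return a null pointer
when the time cannot be represented in a struct tm):

#include <cstdio>
#include <ctime>

int main()
{
    // Roughly 10 billion years after the epoch: the seconds fit easily in a
    // 64-bit time_t, but the resulting year (~10^10) cannot fit in an
    // int tm_year on a platform with 32-bit int.
    std::time_t far_future =
        static_cast<std::time_t>(10000000000LL * 31556952LL);

    if (std::tm* t = std::gmtime(&far_future))
        std::printf("year since 1900: %d\n", t->tm_year);
    else
        std::printf("gmtime could not convert this time_t\n");
}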

Paavo Helde

Aug 19, 2019, 5:20:52 AM
On 19.08.2019 10:29, Juha Nieminen wrote:
> In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
>> For character data where plain ASCII won't do, utf-8 is the only sane
>> choice IMHO. Other formats - utf-16, utf-32, wchar_t, etc., might be
>> used internally for dealing with historical API's.
>
> The problem with UTF-8 is that it's quite complicated to parse and intepret,

So it's fortunate that one rarely needs to parse and interpret it. Most
text manipulations - text search, followed by cut, replace, insert -
can be done on UTF-8 just fine without any need to parse it. The most
demanding task would be to extract a random size-limited piece from a
large text, which requires recognising Unicode character borders, but
even that is pretty straightforward.
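For example, finding a safe cut point only needs one property of the
encoding: continuation bytes always look like 10xxxxxx. A sketch (the
function name is mine):

#include <cstddef>
#include <string>

// Largest prefix length <= max_bytes that does not split a UTF-8 character:
// back up while the byte at the cut position is a continuation byte (10xxxxxx).
std::size_t utf8_safe_prefix(const std::string& s, std::size_t max_bytes)
{
    std::size_t n = (max_bytes < s.size()) ? max_bytes : s.size();
    while (n > 0 && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80)
        --n;
    return n;
}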

Of course, at some point the text needs to be parsed and interpreted for
display, but for example the Scintilla text editor control happily
accepts UTF-8 as its input and displays it just fine, no effort on my
part needed. In other places one might need to pre-convert UTF-8 to
platform-specific UTF-16 or UTF-32, but this is also done by library
routines, nothing special.

> and is wasteful in terms of space if the majority of the text is non-ascii
> (quite typically almost all non-ascii characters take at least 3 bytes to
> store as UTF-8.) It's the most efficient (spacewise) if the majority of the
> text is ascii, but one still has to be prepared to recover any unicode
> characters from it.
>
> UTF-16 is simpler to parse (still not completely straightforward, but
> much simpler), and often more space-efficient if the majority of the
> text is non-ascii (such as Japanese).

That is the theory. In practice, I have heard that most Japanese texts
contain a hefty amount of ASCII digits, punctuation, and XML and HTML
tags, so the space benefits are nowhere near as clear.


David Brown

Aug 19, 2019, 5:22:29 AM
On 19/08/2019 09:29, Juha Nieminen wrote:
> In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
>> For character data where plain ASCII won't do, utf-8 is the only sane
>> choice IMHO. Other formats - utf-16, utf-32, wchar_t, etc., might be
>> used internally for dealing with historical API's.
>
> The problem with UTF-8 is that it's quite complicated to parse and intepret,

It is not much worse than UTF-16. In both cases, you have to deal with
multi-unit code points. With UTF-16 you have the added complication of
endianness, which is not an issue with UTF-8.

> and is wasteful in terms of space if the majority of the text is non-ascii
> (quite typically almost all non-ascii characters take at least 3 bytes to
> store as UTF-8.)

No, a fair amount of non-ASCII takes just two bytes. That applies to
most text written in Latin, Cyrillic, Greek, Hebrew and Arabic alphabets
- including accents, combined letters, and other variants. Basically
most European scripts are handled by two bytes of UTF-8, matching the
size of UTF-16.

It is only if you have a lot of CJK characters that UTF-16 is more
efficient than UTF-8.

And the reality of document formats is that they usually contain a large
amount of ASCII characters for tags, codes, formatting, markings, etc.
For many formats, these outweigh text characters by orders of magnitude.
CJK documents are more commonly encoded as UTF-8 rather than UTF-16,
partly because it makes them /smaller/.

Documents are then often packaged together in a lump (along with
pictures or other attachments) and then compressed. With a large enough
sample, compressed Unicode documents will take similar sizes regardless
of the encoding.

UTF-8 is therefore the most space efficient in practice, and easier to
handle than UTF-16.

> It's the most efficient (spacewise) if the majority of the
> text is ascii, but one still has to be prepared to recover any unicode
> characters from it.
>
> UTF-16 is simpler to parse (still not completely straightforward, but
> much simpler), and often more space-efficient if the majority of the
> text is non-ascii (such as Japanese).
>

UTF-16 is a little simpler to parse - if you ignore the issue of
endianness. But I don't think the details make much difference - with
UTF-8 and UTF-16 you need a decoder function that handles multi-unit codes.
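For comparison, the UTF-16 half of such a decoder is small too; a sketch
that skips error handling for unpaired surrogates:

#include <cstddef>

// Decode one code point from a UTF-16 sequence, advancing the index.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)
// combines into a code point in the range U+10000..U+10FFFF.
char32_t utf16_next(const char16_t* units, std::size_t& i)
{
    char16_t u = units[i++];
    if (u >= 0xD800 && u <= 0xDBFF) {
        char16_t lo = units[i++];   // assumes a valid low surrogate follows
        return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
    }
    return u;
}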

Lynn McGuire

Aug 19, 2019, 4:05:13 PM
On 8/17/2019 8:08 AM, David Brown wrote:
> On 16/08/2019 21:04, Lynn McGuire wrote:
>> "C++: Size Matters in Platform Compatibility"
>>
>> <https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility>
>>
>
> Please put < > marks around links.

Why ?

Lynn


Scott Lurndal

Aug 19, 2019, 4:17:23 PM
Because it will prevent many Usenet clients from breaking the URL into
multiple lines and it avoids confusion as to whether adjacent punctuation symbols
are part of the URL or not.

David Brown

Aug 20, 2019, 2:21:57 AM
It also means that the URL will continue to work with many clients, even
if it /is/ broken into multiple lines.

Lynn, I added the marks to your URL when I made the reply. By now, it
will have wrapped to two lines - you can test it and see if it still
works as a link for your client.

Lynn McGuire

Aug 20, 2019, 3:38:52 PM
It does. I use Thunderbird on Windows 7 x64 as both my email and news
client.

Thanks,
Lynn



James Kuyper

Aug 21, 2019, 9:15:57 AM
Doing so is specified by one of the relevant international standards (I
don't remember which one - I last read that specification about 25 years
ago). Standards are often ignored, but applications whose designers did
pay attention to that clause of that standard may handle links enclosed
in <> better than links that are not so delimited.

Bonita Montero

Aug 21, 2019, 11:15:06 AM
I prefer Win32's FILETIME which is a 64 bit integer that
counts the number of 100ns-steps since 1.1.1601 00:00.
Unfortunately that's not portable.
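For what it's worth, converting it to Unix epoch seconds is one subtraction
and one division; the offset below is the well-known count of seconds between
1601-01-01 and 1970-01-01 (the function name is mine):

#include <cstdint>

// Convert a Win32 FILETIME value (100 ns ticks since 1601-01-01 00:00 UTC)
// to Unix epoch seconds; sub-second precision is discarded.
std::int64_t filetime_to_unix_seconds(std::uint64_t filetime_ticks)
{
    constexpr std::uint64_t ticks_per_second = 10000000;        // 100 ns ticks
    constexpr std::int64_t  epoch_diff_secs  = 11644473600LL;   // 1601 -> 1970
    return static_cast<std::int64_t>(filetime_ticks / ticks_per_second)
           - epoch_diff_secs;
}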

Eli the Bearded

Aug 21, 2019, 2:30:44 PM
In comp.lang.c, Bonita Montero <Bonita....@gmail.com> wrote:
> I prefer Win32's FILETIME which is a 64 bit integer that
> counts the number of 100ns-steps since 1.1.1601 00:00.

First thing I think of when I see that is "In what calendar?" Gregorian
transition started in 1582, happened in Britain and its colonies in
1752, and I think you have to get to 1923 for Europe to finish switching
from Julian. (Greece was 1923, Turkey was 1926.)

> Unfortunately that's not portable.

Yes. Julian is still drifting apart at a rate of 1 day per 400 years,
which is around 0.0016 seconds day. That's a big jump in nanoseconds.

Elijah
------
given MS, probably whatever calendar Redmond was on in 1995


Scott Lurndal

Aug 21, 2019, 3:05:14 PM
Starting dates are completely arbitrary, and none is better than any
other. Personally, I find a 64-bit Unix time_t works just fine.


http://www3.sympatico.ca/n.rieck/docs/calendar_time_y2k_etc.html#vax_vms_time

QUESTION:

Why is Wednesday, November 17, 1858 the base time for OpenVMS (VAX VMS)?

ANSWER:

November 17, 1858 is the base of the Modified Julian Day system.

The original Julian Day (JD) is used by astronomers and expressed in
days since noon January 1, 4713 B.C. This measure of time was
introduced by Joseph Scaliger in the 16th century. It is named in
honor of his father, Julius Caesar Scaliger (note that this Julian Day
is different from the Julian calendar named for the Roman Emperor
Julius Caesar!).

Why 4713 BC? Scaliger traced three time cycles and found that they
were all in the first year of their cycle in 4713 B.C. The three
cycles are 15, 19, and 28 years long. By multiplying these three
numbers (15 * 19 * 28 = 7980), he was able to represent any date from
4713 B.C. through 3267 A.D. The starting year was before any
historical event known to him. In fact, the Jewish calendar marks the
start of the world as 3761 B.C. Today his numbering scheme is still
used by astronomers to avoid the difficulties of converting the months
of different calendars in use during different eras.

So why 1858? The Julian Day 2,400,000 just happens to be November 17,
1858. The Modified Julian Day uses the following formula:

MJD = JD - 2,400,000.5

The .5 changed when the day starts. Astronomers had considered it
more convenient to have their day start at noon so that nighttime
observation times fall in the middle. But they changed to conform to
the commercial day.

The Modified Julian Day was adopted by the Smithsonian Astrophysical
Observatory (SAO) in 1957 for satellite tracking. SAO started
tracking satellites with an 8K (non-virtual) 36-bit IBM[R] 704 computer
in 1957, when Sputnik was launched. The Julian day was 2,435,839 on
January 1, 1957. This is 11,225,377 in octal notation, which was too
big to fit into an 18-bit field (half of the IBM standard 36-bit word).
And, with only 8K of memory, no one wanted to waste the 14 bits left
over by keeping the Julian Day in its own 36-bit word. However, they
also needed to track hours and minutes, for which 18 bits gave enough
accuracy. So, they decided to keep the number of days in the left 18
bits and the hours and minutes in the right 18 bits of a word.

Eighteen bits would allow the Modified Julian Day (the SAO day) to
grow as large as 262,143 ((2 ** 18) - 1). From Nov. 17, 1858, this
allowed for seven centuries. Using only 17 bits, the date could
possibly grow only as large as 131,071, but this still covers 3
centuries, as well as leaving the possibility of representing negative
time. The year 1858 preceded the oldest star catalog in use at SAO,
which also avoided having to use negative time in any of the satellite
tracking calculations.

This base time of Nov. 17, 1858 has since been used by TOPS-10,
TOPS-20, and VAX VMS and OpenVMS. Given this base date, the 100
nanosecond granularity implemented within OpenVMS and the 63-bit
absolute time representation (the sign bit must be clear), OpenVMS
should have no trouble with time until:

31-JUL-31086 02:48:05.47

At this time, all clocks and time-keeping operations in OpenVMS will
suddenly stop, as system time values go negative.

Note that the OpenVMS time display and manipulation routines allow for
only 4 digits in the 'YEAR' field. We expect this to be corrected in
a future release of OpenVMS sometime prior to 31-DEC-9999.

Bonita Montero

Aug 21, 2019, 3:26:02 PM
> First thing I think of when I see that is "In what calendar?" Gregorian
> transition started in 1582, happend in Britain and it's colonies in
> 1752, and I think you have to get to 1923 for Europe to finish switching
> from Julian. (Greece was 1923, Turkey was 1926.)

Try to google:
"The FILETIME structure records time in the form of 100-nanosecond
intervals since January 1, 1601. Why was that date chosen?
The Gregorian calendar operates on a 400-year cycle, and 1601 is the
first year of the cycle that was active at the time Windows NT was
being designed. In other words, it was chosen to make the math come
out nicely."

Keith Thompson

Aug 21, 2019, 3:28:15 PM
Eli the Bearded <*@eli.users.panix.com> writes:
> In comp.lang.c, Bonita Montero <Bonita....@gmail.com> wrote:
>> I prefer Win32's FILETIME which is a 64 bit integer that
>> counts the number of 100ns-steps since 1.1.1601 00:00.
>
> First thing I think of when I see that is "In what calendar?" Gregorian
> transition started in 1582, happend in Britain and it's colonies in
> 1752, and I think you have to get to 1923 for Europe to finish switching
> from Julian. (Greece was 1923, Turkey was 1926.)

Fortunately, there's an answer: it refers to 1601-01-01 in the Gregorian
calendar.

<https://devblogs.microsoft.com/oldnewthing/20090306-00/?p=18913>

[...]

Richard Damon

Aug 22, 2019, 10:58:39 PM
On 8/21/19 2:30 PM, Eli the Bearded wrote:

> Yes. Julian is still drifting apart at a rate of 1 day per 400 years,
> which is around 0.0016 seconds day. That's a big jump in nanoseconds.
>
> Elijah
> ------
> given MS, probably whatever calender Redmond was on in 1995
>
>

I thought it was 3 days per 400 years - the multiples of 100 that aren't
multiples of 400 being the leap days the Gregorian calendar omitted.

Richard Damon

Aug 23, 2019, 2:14:28 PM

UTF-16 takes
2 bytes for values 0 - 0xFFFF (the Unicode basic plane)
4 bytes for values 0x10000 - 0x10FFFF (all of current Unicode)

it CAN'T handle higher code points without some other extension, which
would get ugly. That is why Unicode has currently defined that 0x10FFFF
will be the highest code point. I suspect at some point there will be
pressure to change that.

UTF-8 takes
1 byte for values 0 - 0x7F (basically ASCII)
2 bytes for values 0x80 - 0x7FF (most western languages)
3 bytes for values 0x800 - 0xFFFF (rest of the Unicode basic plane, much
of this is the CJK group)
4 bytes for values 0x10000 - 0x10FFFF, but this form can actually reach
0x1FFFFF if they extend Unicode to higher code points

in the extension (and earlier definition)
5 bytes for values 0x200000 - 0x3FFFFFF
6 bytes for values 0x4000000 - 0x7FFFFFFF
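That table translates directly into a small helper (a sketch; the 5- and
6-byte forms belong to the original definition, while current Unicode stops
at 0x10FFFF):

#include <cstdint>

// Number of UTF-8 bytes needed for a code point, per the ranges listed above.
int utf8_length(std::uint32_t cp)
{
    if (cp <= 0x7F)        return 1;
    if (cp <= 0x7FF)       return 2;
    if (cp <= 0xFFFF)      return 3;
    if (cp <= 0x1FFFFF)    return 4;   // covers all of current Unicode (<= 0x10FFFF)
    if (cp <= 0x3FFFFFF)   return 5;   // original definition only
    if (cp <= 0x7FFFFFFF)  return 6;   // original definition only
    return -1;                         // not representable
}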

UTF-16 is more compact only for text that has more characters in the
range 0x800 - 0xFFFF than in the range 0 - 0x7F. This MIGHT hold for
Chinese/Japanese/Korean documents that have minimal markup in them. It
also largely requires that the CJK document has a lot of 'basic' CJK
characters, as one big reason for extending the Unicode standard past 16
bits was that there were a LOT more characters needed than expected to
handle them.

It turns out that UTF-16 was a mistake. It came about because the first
attempts at Unicode thought it could be a 16-bit character set, and
thus would be UCS-2. It only lives on really because Microsoft decided
to use UCS-2 in Windows instead of UTF-8, which admittedly might not
have been a bad decision at the time. When UCS-2 had to become UTF-16 to
handle the full character set, it got the worst of both worlds: it is
still a multi-code-unit character set (though maybe a lot of code can get
away with ignoring that and just not work right with 'esoteric'
characters), it is less compact in the most common cases, and it has a
byte order issue for byte streams (which most data streams are).

Eli the Bearded

Aug 23, 2019, 3:15:50 PM
In comp.lang.c, Richard Damon <Ric...@Damon-Family.org> wrote:
> On 8/21/19 2:30 PM, Eli the Bearded wrote:
>> Yes. Julian is still drifting apart at a rate of 1 day per 400 years,
>> which is around 0.0016 seconds day. That's a big jump in nanoseconds.
> I thought it was 3 days per 400 years, the multiples of 100 which aren't
> multiples of 400 which are the ones the Gregorian Calendar omitted.

Hmmm. Yes. This is why I'm happy to use date libraries instead of
writing my own.

Elijah
------
there's too much to remember

David Brown

Aug 24, 2019, 2:57:02 PM
On 23/08/2019 20:14, Richard Damon wrote:
>
> UTF-16 takes
> 2 bytes for values 0-0xFFFF (Unicode Base Plane)
> 4 bytes for values 0x10000 - 0x10FFFF (All of current Unicode)
>
> it CAN'T handle higher code points without some other extension which
> would get ugly. That is why Unicode has currently defined that 0x10FFFF
> will be the highest code point. I suspect at some point there will be
> pressure to change that.

It is true that UTF-16 is limited to 1,114,111 characters. (I didn't
know that until recently.) However, Unicode says:
<https://www.unicode.org/faq/utf_bom.html#utf16-6>

"""
Q: Will UTF-16 ever be extended to more than a million characters?

A: No. Both Unicode and ISO 10646 have policies in place that formally
limit future code assignment to the integer range that can be expressed
with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e.
other UTFs) can represent larger integers, these policies mean that all
encoding forms will always represent the same set of characters. Over a
million possible codes is far more than enough for the goal of Unicode
of encoding characters, not glyphs. Unicode is not designed to encode
arbitrary data. If you wanted, for example, to give each “instance of a
character on paper throughout history” its own code, you might need
trillions or quadrillions of such codes; noble as this effort might be,
you would not use Unicode for such an encoding. [AF]
"""


>
> UTF-8 takes
> 1 byte for values 0 - 0x7F (basically ASCII)
> 2 bytes for values 0x80 - 0xFFF (most western languages)
> 3 bytes for values 0x1000 - 0xFFFF (Unicode base plane, much of this is
> the CJK group)
> 4 bytes for values 0x10000 - 0x10FFFF, but can actually get to 0x1FFFF
> if they extend unicode to higher code points)
>
> in the extension (and earlier definition)
> 5 bytes for values 0x200000 - 0x3FFFFFF
> 6 bytes for values 0x400000 - 0x7FFFFFFF
>
> UTF-16 is more compact only for code that has more characters in the
> range 0x1000 - 0xFFFF than in the range 0-0x7F. This MIGHT hold for
> Chinese/Japanese/Korean documents that have minimal markup in them. It
> largely also requires that CJK document has a lot of 'basic' CJK
> characters, as one big reason for extending the Unicode standard past 16
> bits was that there were a LOT more characters needed then expected to
> handle them.

Yes. The reality is that there are few documents that fall into this
category, thus it is easier to stick to UTF-8 for all documents.

>
> It turns out that UTF-16 was a mistake. It came about because the first
> attempts at Unicode thought it could be a 16 bit characters set, and
> thus would be UCS-2. It only lives on really because Microsoft decided
> to use UCS-2 in windows instead of UTF-8, which admittedly might not
> have been a bad decision at the time. When UCS-2 had to become UTF-16 to
> handle the full character set, it gets the worse of both sides, it still
> is a multi-code character set (but maybe a lot of code can get away with
> ignoring it and just not work right with 'esoteric' characters), is less
> compact in the most common cases, and has a byte order issue for byte
> streams (which most data streams are).
>

Agreed.

Robert Wessel

Aug 25, 2019, 12:42:26 AM
OTOH, the original Unicode spec also specifically said it was a 16-bit
code. So these things sometimes change.

David Brown

Aug 25, 2019, 6:11:55 AM
But surely now that Unicode has support for Ogham Runes, it /can/ be
carved in stone :-)

You are right, of course - guarantees about future changes are hard to
make. I guess that if Unicode ever starts to encroach on the UTF-16
limits, then UTF-16 will be dropped.

Richard Damon

Aug 25, 2019, 10:33:15 AM
Yes, I know that statement, but I also know that practically, at some
point the Unicode consortium will have to make some tough decisions on
code points, and at some point will renege on the policy that UTF-16 as
it currently exists will be able to access every code point in all
future versions of Unicode. One thing to remember, as was pointed out:
UTF-16 only exists BECAUSE a previous promise was broken, and UCS-2
wasn't good enough anymore for all of Unicode.

One thing that could cause this is if the CJK set gets fixed and removed
from being second-class code points. It may not be well known, but in
the process of developing Unicode, in an attempt to make it work and to
have enough code points, the characters in the three major pictographic
languages were 'unified', and characters that meant roughly the same
thing were combined into a single code point. It really wouldn't be that
different from saying that we don't need a separate code point for the
character Pi, as that is just the way to write P in Greek. (It isn't
just a typography issue, as it isn't just how you write a given
character; they are 'similar' words from distinct languages.) The
unification was accepted as it seemed the only possible option to try to
make Unicode a 16-bit character set, and it did allow the most common
characters to fit in the 16-bit basic plane.

Robert Wessel

Aug 25, 2019, 11:22:12 AM
On Sun, 25 Aug 2019 12:11:46 +0200, David Brown wrote:

The problem is that dropping UTF-16 is a nasty hit to Windows, Java
and JavaScript. Those are not minor users in the world.

I suspect what would happen at that point is that an additional block
of surrogate characters will be assigned, so that we can have both the
surrogate pairs we now have for UTF-16, plus some new surrogate
"triples". You could assign a block of 64 code points from the BMP,
for example (if nothing else, grab some more of the private use area
like they did the last time), and get six extra bits for the "triple"
(or assign two 64-code-point blocks and be able to do "quads" as
well). So the UTF-16 users would get hit with about the same thing
that happened in the transition from UCS-2.

With hindsight*, something like that should have been done the first
time. Maybe have three or four blocks of 512 surrogate code points
(instead of 2x2048), and allow two or three (or four) surrogate
characters in a sequence (not really unlike what was done with
UTF-8).


*Although calling that a lack of foresight may be too kind. How many
times have we been bitten by inadequate address spaces now? 65536,
errr..., 1,114,112 characters ought to be enough for anyone!

David Brown

Aug 25, 2019, 2:27:13 PM
That is interesting information - thanks for posting it.

Personally, I won't shed a tear if UTF-16 disappears. But the number of
characters has to increase by a factor of 8 or so before it becomes
necessary, which could allow for splitting up C, J and K.
