locale was NEVER the complete i18n solution. It provides a few key
features, like the representation of numbers and currency, and does a
fairly decent job at that IF you feed it the information it needs
(which is sometimes tough to come by). Yes, it doesn't handle things
like "I want numbers printed in European format, except for output
intended to be cut and pasted into Excel, which I have set up for a
different format." The programmer needs to figure out which locale to
use for which output.
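As a sketch of that juggling act, one locale for display and the
classic "C" locale for the Excel-bound export (locale names are
platform dependent; "de_DE.UTF-8" is an assumption and may be absent
on a given system):

#include <iostream>
#include <locale>
#include <sstream>

int main() {
    double value = 1234567.89;

    std::ostringstream display;
    try {
        display.imbue(std::locale("de_DE.UTF-8"));  // European grouping/decimal
    } catch (const std::runtime_error&) {
        display.imbue(std::locale::classic());      // fall back if unavailable
    }
    display << std::fixed << value;

    std::ostringstream export_form;
    export_form.imbue(std::locale::classic());      // "C" locale: 1234567.89
    export_form << std::fixed << value;

    std::cout << "display: " << display.str() << '\n'
              << "export:  " << export_form.str() << '\n';
}

Nothing in locale itself decides which stream gets which locale; that
mapping is exactly the part left to the programmer.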
>
> Unicode is formally well-defined; the only issue with it is that it
> has gone from 6.0.0 to 13.0.0 during the last 10 years. With such a
> moving target, it sometimes makes a version indication kind of desirable.
The thing to note is that Unicode has been incredibly backwards
compatible, and most of the 'changes' have been defining what new
code points represent, which requires updating character classification
tables if (and only if) you want to process those new characters
'correctly'. These changes do NOT invalidate any older processing.
>
> My take on any converter/filter is that, depending on the data in its
> input, sometimes it can fully round-trip, sometimes it loses and/or
> adds something, and sometimes it fails to convert.
The 'illegal' Unicode that I have mentioned hasn't changed since the
VERY early days (once Unicode became a 21-bit character set). Valid (in
that sense) UTF-8 should fully round trip to UTF-16 or UCS-4 without any
changes.
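To make the round-trip claim concrete, here is a rough sketch of both
directions for UCS-4 (my code, not a vetted library): the decoder
rejects exactly the long-stable ill-formed cases (overlong forms,
surrogate code points, values past U+10FFFF), and for any input it
accepts, encode_utf8(*decode_utf8(s)) == s.

#include <optional>
#include <string>
#include <vector>

std::optional<std::vector<char32_t>> decode_utf8(const std::string& in) {
    std::vector<char32_t> out;
    for (size_t i = 0; i < in.size();) {
        unsigned char b = in[i];
        int len; char32_t cp;
        if      (b < 0x80) { len = 1; cp = b; }
        else if (b < 0xC2) return std::nullopt;        // stray continuation
                                                       // or overlong lead
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; }
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; }
        else if (b < 0xF5) { len = 4; cp = b & 0x07; }
        else                return std::nullopt;       // beyond U+10FFFF
        if (i + len > in.size()) return std::nullopt;  // truncated sequence
        for (int k = 1; k < len; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return std::nullopt; // bad continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len])             return std::nullopt; // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate
        if (cp > 0x10FFFF)                return std::nullopt; // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}

std::string encode_utf8(const std::vector<char32_t>& cps) {
    std::string out;
    for (char32_t cp : cps) {        // always shortest form, so the
        if      (cp < 0x80)    out += char(cp);       // round trip is exact
        else if (cp < 0x800)  { out += char(0xC0 | (cp >> 6));
                                out += char(0x80 | (cp & 0x3F)); }
        else if (cp < 0x10000){ out += char(0xE0 | (cp >> 12));
                                out += char(0x80 | ((cp >> 6) & 0x3F));
                                out += char(0x80 | (cp & 0x3F)); }
        else                  { out += char(0xF0 | (cp >> 18));
                                out += char(0x80 | ((cp >> 12) & 0x3F));
                                out += char(0x80 | ((cp >> 6) & 0x3F));
                                out += char(0x80 | (cp & 0x3F)); }
    }
    return out;
}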
Yes, an arbitrary string of bytes is very likely to be marked invalid,
and not round trip. There are also some common enough errors that
people make when encoding data that won't round trip if handled right
(a strict application is supposed to mark these cases with the
replacement character, but many will just silently 'fix' them).
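Two concrete examples of such encoding mistakes (my choice of examples;
the post doesn't name any): the overlong NUL of "modified UTF-8", and a
supplementary character sent as a UTF-8-coded surrogate pair (CESU-8).

#include <string>

// Byte sequences for two common encoding mistakes (illustrative,
// not taken from the thread):
const std::string overlong_nul = "\xC0\x80";   // "modified UTF-8" NUL
const std::string cesu8_smiley =               // U+1F600 sent as a
    "\xED\xA0\xBD\xED\xB8\x80";                // UTF-8-coded surrogate
                                               // pair (CESU-8)
// A strict decoder maps each ill-formed subsequence here to U+FFFD; a
// lenient one silently 'fixes' them to U+0000 and U+1F600, after which
// re-encoding produces different bytes and the round trip fails.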
> Best is when it has some fully defined default behavior that
> can be configured, and also that it indicates to the caller whatever
> it did. How to react to each of those cases is then all up to the
> caller. We have C++, so I would love compile-time configurable, but
> dynamic is fine, as the bottleneck is usually the speed of the channel
> or media. I do not understand what is so tricky about it, as I do it
> all the time.
The problem here is that the range of possible desired error recovery is
so broad that it becomes unwieldy to implement.
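Here is a sketch of the compile-time configuration the quoted poster
asks for (all names here are mine, purely illustrative): the converter
takes the recovery strategy as a template parameter, so the choice costs
nothing at run time.

#include <string>

struct ReplacePolicy {
    // Append U+FFFD (as UTF-8) for each ill-formed subsequence.
    static void on_error(std::string& out) { out += "\xEF\xBF\xBD"; }
};
struct ThrowPolicy {
    struct bad_encoding {};
    static void on_error(std::string&) { throw bad_encoding{}; }
};

template <class ErrorPolicy>
std::string sanitize(const std::string& in) {
    std::string out;
    // Real UTF-8 validation elided; bytes >= 0xF5 (never legal in
    // UTF-8) stand in for "ill-formed input" to keep the sketch short.
    for (unsigned char b : in) {
        if (b >= 0xF5) ErrorPolicy::on_error(out);
        else           out += char(b);
    }
    return out;
}

// Usage: sanitize<ReplacePolicy>(raw) or sanitize<ThrowPolicy>(raw).
// Every further recovery strategy (skip, log, truncate, abort, ask the
// caller...) is another policy type, which is exactly how the range of
// desired behaviors becomes unwieldy.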
>
> Filter being "for output": you meant in the sense that the quality of
> its input can be blamed on the programmer? Also in the case of "for
> input" it can be blamed on some programmer ... just that chances are
> the programmer is more anonymous. In both directions it is a bad
> excuse for weak work.
>
Output routines can specify their calling conditions, as the programmer
using them has essentially full control over the data going into them.
Yes, if he doesn't meet the published requirements for the routine, he
can be 'blamed' for it not working.
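As a sketch of that contract style (the routine name and its
precondition are mine, purely illustrative): the routine publishes its
requirement and checks it only in debug builds, because the caller
controls what it receives.

#include <cassert>
#include <string>

// Precondition: `text` contains no embedded NUL bytes. Violating it is
// a caller bug, not an error the routine tries to recover from.
void append_record(std::string& sink, const std::string& text) {
    assert(text.find('\0') == std::string::npos &&
           "caller violated the published precondition");
    sink += text;
    sink += '\n';
}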
Input routines take at least some of their input from sources
potentially outside the control of the programmer using them. That
input may even come from a source that is adversarial to the program.
Specifications for processing inputs can sometimes get quite involved,
especially if the input may come from an untrained user: they must cover
not only the primary expected inputs, but the variations users might
try, as well as safeguards for dealing with hostile input (sometimes you
want to do more than just ignore it).
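A sketch of the opposite contract, an input routine that assumes nothing
about its source (parse_port and its rules are illustrative, not from
the thread): it parses a decimal port number from untrusted text,
rejecting rather than guessing, and reports failure to the caller
instead of 'fixing' the input.

#include <cctype>
#include <optional>
#include <string>

std::optional<int> parse_port(const std::string& s) {
    if (s.empty() || s.size() > 5) return std::nullopt;  // bound the work
    int value = 0;
    for (unsigned char c : s) {
        if (!std::isdigit(c)) return std::nullopt;       // garbage/hostile byte
        value = value * 10 + (c - '0');
    }
    if (value < 1 || value > 65535) return std::nullopt; // range check
    return value;
}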
Yes, if the input comes securely from a trusted source and is known to
be free of errors, you can be a bit more lax with parsing, and perhaps
some of the simple input-processing routines from the standard library
can be used.