Here are some more elements relevant to this discussion:
- I'm 100% in favor of FLTK 1.4 requiring correctly UTF-8 encoded strings
and outputting an error message when it encounters an incorrect string.
It's time to abandon the nightmare of code page tables.
- The intention behind commit 6e5f3f7 was to make the Pango-based FLTK
backend behave as other backends do, which is what we usually aim at.
- All other backends process incorrect UTF-8 strings as though they were
CP-1252 encoded. I agree that betting on CP-1252 whenever a string is
not valid UTF-8 is not safe.
- Before commit 6e5f3f7, Pango drew each incorrect byte of an
input string as character U+FFFD, defined by Unicode as
"Replacement Character used to replace an incoming character
unknown or unrepresentable in Unicode",
and printed an error message to stderr of the form
Pango-WARNING **: 08:54:40: Invalid UTF-8 string passed to pango_layout_set_text()
That is essentially what point (2) above advocates.
- It's worth knowing that currently ALL non-Pango FLTK backends transform ALL
the strings they draw or measure from UTF-8 to either
UTF-16 or UTF-32 before drawing or measuring them [1]. They all ultimately
call fl_utf8decode() for each Unicode character in that process.
They all store the transformed string in a private memory zone
which is passed to the system call that draws or measures the text.
The same memory zone is re-used to store the next drawn or measured string.
- Commit 6e5f3f7 uses the same process except that if the input string
is correctly UTF-8 encoded, no private memory zone is needed: the Pango function
uses the input string itself.
- Here are the options I see for this issue:
(i) leave as before commit 6e5f3f7: the drawn text uses Unicode's Replacement
Character when Pango encounters non-conformant input data; an error
message is output to stderr. Document that the FLTK Pango-based backend
requires UTF-8 conformant strings.
(ii) draw the Replacement Character when appropriate and output an error message
through Fl::warning(). This requires parsing the input string with fl_utf8decode()
or with a slightly lighter procedure, and copying any non-conformant input
string to a private memory zone. Overall, the computational cost would be very similar
to that of commit 6e5f3f7.
(iii) Keep commit 6e5f3f7. The Pango backend would behave as all other FLTK backends
with some computation cost. This commit could be improved a bit for conformant strings
where the 2 full function calls
unsigned codepoint = fl_utf8decode(p, end, &len);
len2 = fl_utf8encode(codepoint, buf4);
can probably be simplified if their goal is only to detect non-conformant bytes.
[1]:
macOS: mac_Utf8_to_Utf16() calls fl_utf8toUtf16() calls fl_utf8decode()
Xft: calls utf8reformat() calls fl_utf8towc() calls fl_utf8decode()
Windows: calls fl_utf8toUtf16() calls fl_utf8decode()
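
To make the "slightly lighter procedure" of options (ii) and (iii) more concrete, here is a minimal standalone sketch of a validation pass that avoids the decode/encode round trip and only copies the string when a non-conformant byte is found. The names utf8_seq_len() and utf8_sanitize() are hypothetical; they are not part of FLTK or Pango. The byte ranges follow the Unicode table of well-formed UTF-8 sequences, and std::string stands in for the private memory zone.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an FLTK function).
// Returns the length (1..4) of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes at that position are not conformant.
std::size_t utf8_seq_len(const std::string& s, std::size_t i) {
  const unsigned char c = static_cast<unsigned char>(s[i]);
  if (c < 0x80) return 1;                 // plain ASCII
  std::size_t len;
  unsigned char lo = 0x80, hi = 0xBF;     // allowed range of the 2nd byte
  if (c >= 0xC2 && c <= 0xDF) {
    len = 2;
  } else if (c >= 0xE0 && c <= 0xEF) {
    len = 3;
    if (c == 0xE0) lo = 0xA0;             // reject overlong encodings
    if (c == 0xED) hi = 0x9F;             // reject UTF-16 surrogates
  } else if (c >= 0xF0 && c <= 0xF4) {
    len = 4;
    if (c == 0xF0) lo = 0x90;             // reject overlong encodings
    if (c == 0xF4) hi = 0x8F;             // reject values above U+10FFFF
  } else {
    return 0;  // stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF
  }
  if (i + len > s.size()) return 0;       // truncated sequence
  const unsigned char c1 = static_cast<unsigned char>(s[i + 1]);
  if (c1 < lo || c1 > hi) return 0;
  for (std::size_t k = 2; k < len; ++k) {
    const unsigned char ck = static_cast<unsigned char>(s[i + k]);
    if (ck < 0x80 || ck > 0xBF) return 0;
  }
  return len;
}

// Hypothetical helper (not an FLTK function).
// Returns true if `in` is conformant UTF-8; `out` is then left untouched and
// the caller can pass `in` itself to the drawing function.
// Otherwise returns false and stores in `out` a copy of `in` in which each
// offending byte is replaced by U+FFFD (bytes EF BF BD).
bool utf8_sanitize(const std::string& in, std::string& out) {
  std::size_t i = 0;
  // Fast path: scan without copying anything.
  while (i < in.size()) {
    const std::size_t len = utf8_seq_len(in, i);
    if (len == 0) break;
    i += len;
  }
  if (i == in.size()) return true;
  // Slow path: keep the valid prefix, then repair the rest byte by byte.
  out.assign(in, 0, i);
  while (i < in.size()) {
    const std::size_t len = utf8_seq_len(in, i);
    if (len == 0) {
      out += "\xEF\xBF\xBD";
      ++i;
    } else {
      out.append(in, i, len);
      i += len;
    }
  }
  return false;
}
```

With this sketch, a caller would draw the input string directly when utf8_sanitize() returns true, and otherwise draw `out` and report the problem (through Fl::warning() in option (ii)). Whether this is measurably cheaper than the fl_utf8decode()/fl_utf8encode() pair would need to be checked.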