RFC: Make Pango-handled text accept legacy CP1252-encoded text


Albrecht Schlosser

Oct 7, 2024, 12:49:00 PM
to fltk.coredev
On 10/7/24 16:32 ManoloFLTK committed:
commit 6e5f3f7ecb3cc0039e113a5e3b2409ba0f7e7cea
Author:     ManoloFLTK <41016272+...@users.noreply.github.com>
AuthorDate: Mon Oct 7 16:20:44 2024 +0200
Commit:     ManoloFLTK <41016272+...@users.noreply.github.com>
CommitDate: Mon Oct 7 16:20:59 2024 +0200

    Make Pango-handled text accept legacy CP1252-encoded text

I'm not happy with this commit for several reasons and I would like to read others' opinions.

(1) The overhead to scan and convert **ALL** strings is not negligible. For all conforming programs this is unnecessary and not "Fast and Light" (FLTK).

(2) Personally I would prefer that non-UTF-8 strings show up as errors rather than being "fixed" silently.

(3) The assumption that all non-conforming strings are encoded in CP-1252 is likely but not safe.


At the very least I would "repair" this by making this string conversion optional. It could be enabled by a (CMake) build option and the default should be disabled.

I assume that this commit resulted from a discussion in fltk.general where a user had issues with non-UTF-8 strings: "utf8 support in fltk-1.4.x" started on Oct 5, 2024.
https://groups.google.com/g/fltkgeneral/c/lyIiUZK13iA/m/tyKqJromAwAJ

However, I think that the introduction of Pango for text rendering is a good starting point for relying on UTF-8 conforming strings and rendering non-conforming strings as errors. This could and should be documented in the 1.4 docs.

I wouldn't mind if "old string rendering" (as done in 1.3) still fell back to "converting" strings as attempted here (as it seems to do, judging from the discussion in fltk.general), but rendering with Pango should display errors in 1.4 and in the future.

Comments and opinions, please. Thanks in advance.

Manolo

Oct 8, 2024, 4:37:10 AM
to fltk.coredev
Here are some more points relevant to this discussion:

- I'm 100% in favor of FLTK 1.4 requiring correctly UTF-8 encoded strings
and outputting an error message when it encounters an incorrect string.
It's time to abandon the nightmare of code page tables.

- The intention behind commit 6e5f3f7 was to make the Pango-based FLTK
backend behave as other backends do, which is what we usually aim at.

- All other backends process incorrect UTF-8 strings as though they were
CP-1252 encoded. I agree that betting on CP-1252 whenever a string is not
valid UTF-8 is not safe.

- Before commit 6e5f3f7, Pango drew each incorrect byte of an
input string as character 0xFFFD, defined by Unicode as
  "Replacement Character used to replace an incoming character
  unknown or unrepresentable in Unicode"
and printed an error message to stderr of the form
  Pango-WARNING **: 08:54:40: Invalid UTF-8 string passed to pango_layout_set_text()
That is essentially what point (2) above advocates.

- It's worth knowing that currently ALL non-Pango FLTK backends transform ALL
the strings they draw (or whose length they measure) from UTF-8 to either
UTF-16 or UTF-32 before drawing or measuring them [1]. They all ultimately
call fl_utf8decode() for each Unicode character in that process.
They all store the transformed string in a private memory zone
which is transmitted to the system call that draws or measures the text.
The same memory zone gets re-used to store the next drawn or measured string.

- Commit 6e5f3f7 uses the same process except that if the input string
is correctly UTF-8 encoded, no private memory zone is needed: the Pango function
uses the input string itself.

- Here are the options I see for this issue, among which we could choose:

(i) leave as before commit 6e5f3f7: the drawn text uses Unicode's Replacement
Character when Pango encounters non-conformant input data; an error
message is output to stderr. Document that the FLTK Pango-based backend
requires UTF-8 conformant strings.

(ii) draw the Replacement Character when appropriate and output an error message
through Fl::warning(). This requires parsing the input string with fl_utf8decode()
or with a slightly lighter procedure, and copying any non-conformant input
string to a private memory zone. Overall, the compute cost will be very similar
to that of commit 6e5f3f7.

(iii) Keep commit 6e5f3f7. The Pango backend would behave as all other FLTK backends
with some computation cost. This commit could be improved a bit for conformant strings
where the 2 full function calls
    unsigned codepoint = fl_utf8decode(p, end, &len);
    len2 = fl_utf8encode(codepoint, buf4);
can probably be simplified if their goal is only to detect non-conformant bytes.


[1]:
macOS: mac_Utf8_to_Utf16() calls fl_utf8toUtf16() calls fl_utf8decode()
Xft: calls utf8reformat() calls fl_utf8towc() calls fl_utf8decode()
Windows: calls fl_utf8toUtf16() calls fl_utf8decode()
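
The zero-copy idea described above (hand a valid UTF-8 string to the renderer untouched, and only fill a private buffer once a non-conformant byte turns up) can be sketched roughly as follows. This is a standalone illustration, not FLTK's code: `utf8_seq_len` and `sanitize_utf8` are hypothetical names, the validity check is a simplified RFC 3629 test rather than fl_utf8decode(), and each invalid byte is replaced with U+FFFD as in option (ii).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence that
// starts at p, or 0 if the bytes at p are not well-formed. Requires p < end.
static std::size_t utf8_seq_len(const unsigned char *p, const unsigned char *end) {
  unsigned char c = p[0];
  if (c < 0x80) return 1;                        // plain ASCII
  std::size_t n = 0;
  unsigned cp = 0, min = 0;
  if (c >= 0xC2 && c <= 0xDF)      { n = 2; cp = c & 0x1F; min = 0x80; }
  else if (c >= 0xE0 && c <= 0xEF) { n = 3; cp = c & 0x0F; min = 0x800; }
  else if (c >= 0xF0 && c <= 0xF4) { n = 4; cp = c & 0x07; min = 0x10000; }
  else return 0;                                 // stray continuation or invalid lead byte
  if (static_cast<std::size_t>(end - p) < n) return 0;  // truncated sequence
  for (std::size_t i = 1; i < n; i++) {
    if ((p[i] & 0xC0) != 0x80) return 0;         // not a continuation byte
    cp = (cp << 6) | (p[i] & 0x3F);
  }
  if (cp < min) return 0;                        // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    // UTF-16 surrogate range
  if (cp > 0x10FFFF) return 0;                   // beyond Unicode
  return n;
}

// Hypothetical helper: returns text that is safe for a UTF-8-only renderer.
// Valid input is returned as-is (no copy, scratch untouched); otherwise the
// repaired text is built in scratch, one U+FFFD per invalid byte.
static const char *sanitize_utf8(const char *s, std::size_t len,
                                 std::string &scratch, std::size_t &out_len) {
  assert(s != nullptr || len == 0);
  const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
  const unsigned char *end = p + len;
  const unsigned char *q = p;
  while (q < end) {                              // scan only, no copying
    std::size_t n = utf8_seq_len(q, end);
    if (n == 0) break;
    q += n;
  }
  if (q == end) { out_len = len; return s; }     // fast path: input was valid
  scratch.assign(s, static_cast<std::size_t>(q - p));  // keep the valid prefix
  while (q < end) {
    std::size_t n = utf8_seq_len(q, end);
    if (n == 0) { scratch += "\xEF\xBF\xBD"; q += 1; }  // U+FFFD in UTF-8
    else { scratch.append(reinterpret_cast<const char *>(q), n); q += n; }
  }
  out_len = scratch.size();
  return scratch.data();
}
```

For conformant strings the cost is a single read-only pass; the buffer is only written when the scan stops early.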


imacarthur

Oct 8, 2024, 5:18:59 AM
to fltk.coredev
On Monday 7 October 2024 at 17:49:00 UTC+1 Albrecht-S wrote:

(1) The overhead to scan and convert **ALL** strings is not negligible. For all conforming programs this is unnecessary and not "Fast and Light" (FLTK).

Though, in effect, we were more or less doing that anyway for fltk-1.3, so it maybe isn't a show stopper...
However, what was a Good Option twenty years ago (trapping for invalid UTF8 text) makes less sense now, I think.
 

(2) Personally I would like more if non-UTF-8 strings were showing errors rather than being "fixed" silently.

Agree.
I'd like to think that a lot of text would now be encoded correctly.
I'd _like_ to think that but I do not actually _believe_ that...


(3) The assumption that all non-conforming strings are encoded in CP-1252 is likely but not safe.

We have actually three compile guards in fltk-1.3:

ERRORS_TO_ISO8859_1 (ON by default in 1.3) - this basically allows anything that looks like an invalid UTF8 codepoint through as a single byte character
ERRORS_TO_CP1252  (ON by default in 1.3) - this ONLY remaps the codes 0x80 to 0x9f (onto utf8 replacements)
STRICT_RFC3629 (OFF by default in 1.3) - this traps for a few Unicode codepoints that are explicitly invalid, and for some surrogate pairs. TBH, we can probably leave this OFF...

So the "main thing" here is the ERRORS_TO_ISO8859_1 flag, perhaps? Rather than the ERRORS_TO_CP1252 per se (though they will interact.)
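
How the two error flags interact can be sketched as below. This is an illustration only, not the code in fl_utf8decode(): `FallbackPolicy` and `decode_invalid_byte` are hypothetical names, and runtime booleans stand in for the compile-time macros. The table is the standard CP-1252 assignment for 0x80 to 0x9F; the five bytes that CP-1252 leaves undefined are mapped to themselves here, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime stand-ins for the compile-time guards.
struct FallbackPolicy {
  bool errors_to_cp1252;     // remap bytes 0x80..0x9F through the CP-1252 table
  bool errors_to_iso8859_1;  // let any other invalid byte through as U+00xx
};

// CP-1252 code points for bytes 0x80..0x9F.
static const std::uint16_t kCp1252[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Code point to use for a byte that could not be decoded as UTF-8.
static std::uint32_t decode_invalid_byte(unsigned char b, const FallbackPolicy &policy) {
  assert(b >= 0x80);  // ASCII bytes are always valid and never reach this path
  if (policy.errors_to_cp1252 && b <= 0x9F) return kCp1252[b - 0x80];
  if (policy.errors_to_iso8859_1) return b;  // ISO-8859-1 byte value equals its code point
  return 0xFFFD;                             // Unicode REPLACEMENT CHARACTER
}
```

With both flags on, byte 0x80 becomes U+20AC (the euro sign) and byte 0xE9 becomes U+00E9; with both off, every undecodable byte becomes U+FFFD.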

 
At the very least I would "repair" this by making this string conversion optional. It could be enabled by a (CMake) build option and the default should be disabled.

Agree, if we keep this at all, it should be optional (compile time or run time? Historically it was only selectable at compile time and AFAIK no one ever changed the defaults...)
I think off-by-default (at least for the pango use case) is OK - this is the first time anyone's reported an issue and (as noted below) the underlying cause in this case was a gettext locale issue rather than a fltk issue as such.

 

I assume that this commit resulted from a discussion in fltk.general where a user had issues with non-UTF-8 strings: "utf8 support in fltk-1.4.x" started on Oct 5, 2024.
https://groups.google.com/g/fltkgeneral/c/lyIiUZK13iA/m/tyKqJromAwAJ

And note that, at root, that issue arose because of an unexpected locale behaviour from gettext - it wasn't really a fltk string handling issue. Once the gettext locale was sorted this problem went away.

 

gnuwimp

Oct 8, 2024, 8:43:39 AM
to fltkc...@googlegroups.com
Problems with configure.
I don't have cairo installed.
./configure -enable-use_std

Compiling drivers/Xlib/Fl_Xlib_Graphics_Driver_font_xft.cxx...
In file included from drivers/Xlib/Fl_Xlib_Graphics_Driver_font_xft.cxx:21:
drivers/Xlib/../Cairo/Fl_Cairo_Graphics_Driver.H:25:10: fatal error:
cairo/cairo.h: No such file or directory
25 | #include <cairo/cairo.h>

Albrecht Schlosser

Oct 8, 2024, 8:46:37 AM
to fltkc...@googlegroups.com
@gnuwimp: Please don't post your question(s) to a thread that deals with
a different topic.
Also, your question is OT in fltk.coredev, please open a new thread in
fltk.general instead.
Thanks.

gnuwimp

Oct 8, 2024, 1:23:03 PM
to fltkc...@googlegroups.com
Ok sorry.
Although it was not a question, more like a bug report.
Commit 6e5f3f7ecb3cc0039e113a5e3b2409ba0f7e7cea which you are talking
about broke building FLTK if you don't have cairo installed.
And it is not caught in configure.

Albrecht Schlosser

Oct 8, 2024, 2:22:47 PM
to fltkc...@googlegroups.com
On 10/8/24 18:00 gnuwimp wrote:
On 10/8/24 14:26 gnuwimp wrote:
Problems with configure.
I don't have cairo installed.
./configure -enable-use_std

Compiling drivers/Xlib/Fl_Xlib_Graphics_Driver_font_xft.cxx...
In file included from drivers/Xlib/Fl_Xlib_Graphics_Driver_font_xft.cxx:21:
drivers/Xlib/../Cairo/Fl_Cairo_Graphics_Driver.H:25:10: fatal error:
cairo/cairo.h: No such file or directory
    25 | #include <cairo/cairo.h>
Ok sorry.
Although it was not a question, more like a bug report.
Commit 6e5f3f7ecb3cc0039e113a5e3b2409ba0f7e7cea which you are talking
about broke building FLTK if you don't have cairo installed.
And it is not caught in configure.

Ah, OK, I didn't notice the relation to this commit, which wasn't obvious from your post. Sometimes we get unrelated posts, as this one seemed to be, because the Google Groups web interface is ... let's say, suboptimal.

Looking closer, it seems wrong that src/drivers/Xlib/Fl_Xlib_Graphics_Driver_font_xft.cxx includes Fl_Cairo_Graphics_Driver.H, which includes other Cairo stuff. Independent of the outcome of this discussion (do we want the string conversion or not?) this needs to be fixed, unless the commit gets reverted. May I ask you to open a GitHub Issue as a bug report? This would be helpful, otherwise this issue might get lost in the discussion. Thanks in advance.

Albrecht Schlosser

Oct 8, 2024, 3:09:57 PM
to fltkc...@googlegroups.com
[Sorry for top posting, this is intentional]

Thanks to Manolo for the detailed information and to Ian for his comments and clarifications as well.

So, yes, it's not easy but we need a decision if we want this string conversion in Pango rendering or not, or if we want to make it optional.

I see Manolo's point to make all platforms identical. However, if we had this silent string conversion already implemented, the user in the mentioned thread would never have found their - as Ian pointed out, gettext and not FLTK related - bug.

It seems to me that users could disable the conversion on all platforms if they defined the two preprocessor macros ERRORS_TO_ISO8859_1 and ERRORS_TO_CP1252 to 0. This would not only disable the legacy (X11/Xft) string conversions on Linux/X11 but also those used on macOS and Windows (where we seem to need to convert to UTF-16 for system API reasons anyway, as Manolo pointed out). We don't know what the results on these platforms would be if we didn't convert non-conforming UTF-8 to something "useful". Would there be hard faults or benign character replacements like we see when using Pango w/o conversion? I don't know; this would need investigation, but now, shortly before the release of 1.4.0, is not the time to change this.

OTOH the usage of Pango (and Wayland, which uses Pango) is new in FLTK 1.4, thus it seems to me that we could accept not converting the strings and live with a documented platform-specific difference. My argument for this difference is that we should IMHO remove these string conversions from all platforms in the future anyway because - as Ian wrote - "what was a Good Option twenty years ago ... makes less sense now". That could (should!) be a valid goal in 1.5. On platforms where we need to convert to UTF-16/32 anyway we could do the replacement with an error character as well.

However I also see the backwards compatibility issue. As I wrote before (elsewhere, not in this thread), my opinion is that porting software from FLTK 1.3 to 1.4 should work flawlessly, and thus removing the "conversions" from the "old code" might yield unwanted effects, hence we shouldn't do it in 1.4.

That said, my conclusion is to let users choose if they need this string conversion. The macros mentioned above have been user changeable since 1.3.4 (by compiler commandlines, w/o editing the library) but they were still hard coded in the library. The new option for Pango related string conversions should be a runtime option such that it can be enabled at runtime per program or maybe as a system or user option. Linux and other distros would thus build the FLTK libs without these Pango string conversions and users (software developers) who need to support ISO-8859-1 or CP-1252 encodings could still use FLTK 1.4 - this should be very rare exceptions.

As opposed to my earlier comment I'd favor an Fl::option(...) setting with default 'OFF'. In FLTK 1.5 we could make the "legacy string conversions" also runtime options and mark them as deprecated. In 1.6 or any later version we could remove these options and conversions completely and require conformant UTF-8 strings everywhere.

My proposal:

If you agree we'd leave Manolo's Pango related string conversions but use (enable) them only if Fl::option(SOME_OPTION_NAME) is true. That's all we need to do (except documentation).
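
The gating proposed here could look roughly like the sketch below. It is standalone and makes several assumptions: `g_legacy_text_conversion` is a hypothetical stand-in for the not-yet-named Fl::option(SOME_OPTION_NAME), `text_for_renderer` is a hypothetical helper, and the actual conversion is passed in as a callback so the sketch does not depend on how commit 6e5f3f7 implements it.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for the proposed runtime option; default is OFF.
static bool g_legacy_text_conversion = false;

// Any function that turns legacy-encoded text into UTF-8.
using LegacyConverter = std::function<std::string(const std::string &)>;

// Returns the text to hand to the UTF-8-only renderer. With the option off,
// the bytes go through untouched and the renderer reports invalid input itself.
static std::string text_for_renderer(const std::string &text,
                                     const LegacyConverter &convert) {
  assert(static_cast<bool>(convert));  // a converter must always be supplied
  if (!g_legacy_text_conversion) return text;
  return convert(text);
}
```

With the default setting the converter is never invoked, so conforming programs pay nothing beyond the flag test.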

Fixing the Cairo dependency introduced in commit 6e5f3f7ecb3c as mentioned by @gnuwimp in this thread is a separate issue.

What do you think?

Manolo

Oct 8, 2024, 6:15:29 PM
to fltk.coredev
I disagree with the proposal to make the string conversion a run-time option, for 3 reasons
of distinct nature:

1) The proposal is based on the premise that the Pango platform incurs a computation overhead
if strings are converted. That's not correct because ALL other platforms do this conversion work.
I have found that the procedure can be made lighter and call only fl_utf8decode() until a
non-conformant byte is found. Only in these rare cases would fl_utf8encode() be called.

2) I find it inappropriate to keep defining new Fl::option(SOME_NAME) settings for microscopic
purposes that are rarely used, if ever.

3) The easy porting of source code from 1.3 to 1.4 is an important feature of 1.4 that requires
the Pango backend to behave as other backends do.

Albrecht Schlosser

Oct 10, 2024, 8:44:52 AM
to fltkc...@googlegroups.com
I withdraw my proposal to make this conversion stuff optional for the sake of a quick release of FLTK 1.4.0. Let's end this discussion for now and revisit this when we start implementing 1.5.0.

@Manolo: please implement the "lighter version that calls only fl_utf8decode() until a non-conformant byte is found".

Honestly, I didn't expect such a strong resistance by Manolo to my (IMHO useful) proposal and I'm disappointed by the lack of comments from other devs except Ian (thanks for that). Under these circumstances it doesn't make sense to vote, hence my withdrawal.

But please see my comments below...


On 10/9/24 00:15 Manolo wrote:
I disagree with the proposal to make the string conversion a run-time option for 3 reasons
of distinct nature:

1) the proposal is based on the premise that the Pango platform incurs a computation overhead
if strings are converted. That's not correct because ALL other platforms do this conversion work.

That's not correct. According to your own (Manolo's) previous list all these platforms NEED to convert the UTF-8 strings to another encoding (UTF-16 or UTF-32). Converting illegal characters to something "legal" under the assumption that the original string is in CP-1252 (or ISO-8859-1) encoding is an accidental byproduct that is enabled or disabled (!) by the two preprocessor macros as pointed out by Ian.

When using Pango this re-encoding is not necessary (because Pango takes UTF-8 as input) nor is it useful if the original string is correctly encoded in UTF-8, and that's the point of making it optional. Note: if the two macros mentioned above were set to 0 this entire string "conversion" would be a no-op anyway but waste resources for scanning and "converting" the string.

The fact that other platforms do some extra work doesn't justify doing such unnecessary extra work on a platform that doesn't need it.


I have found that the procedure can be made lighter and call only fl_utf8decode() until a
non-conformant byte is found. Only in these rare cases would fl_utf8encode() be called.

As I wrote above, please implement and commit this.


2) I find it inappropriate to keep defining new Fl::option(SOME_NAME) settings for microscopic
purposes that are rarely used, if ever.

It's disputable whether this option would be "for microscopic purposes" but such an option should be "rarely used" by definition. It's the purpose of an option that users can enable it only for their specific needs and the default value should be designed so it is good for the majority of users.


3) The easy porting of source code from 1.3 to 1.4 is an important feature of 1.4 that requires
the Pango backend to behave as other backends do.

Well, this is IMNSHO the only valid argument concerning the interpretation of invalid UTF-8 strings, i.e. to make it behave the same as on other platforms.

imacarthur

Oct 10, 2024, 9:05:18 AM
to fltk.coredev
On Thursday 10 October 2024 at 13:44:52 UTC+1 Albrecht-S wrote:
I withdraw my proposal to make this conversion stuff optional for the sake of a quick release of FLTK 1.4.0. Let's end this discussion for now and revisit this when we start implementing 1.5.0.

@Manolo: please implement the "lighter version that calls only fl_utf8decode() until a non-conformant byte is found".


Just a "for the record", but our implementation of fl_utf8decode() would correctly return the Unicode replacement character code as it stands, IF we did not set ERRORS_TO_ISO8859_1 and ERRORS_TO_CP1252 at compile time anyway, so I'm not sure how much there is to do.

If we set ERRORS_TO_ISO8859_1 and ERRORS_TO_CP1252 at compile time, as we currently do, then  fl_utf8decode() will pass bytes that are codepage specific (and invalid UTF8) to pango, and in that case I'm not sure what happens - I assume pango inserts a Unicode replacement character at that point.
So we could just let fl_utf8decode() set the Unicode replacement character anyway?
Then at least the pango and other platforms would all do the same thing, albeit different to what 1.3 does for invalid strings.


For sure there is no way we can make fl_utf8decode() correctly replace the codepage-specific values with the correct UTF8 values - in essence that is what ERRORS_TO_CP1252 does but only for a very small number of common characters; we can't do that for all possible code pages though.

In the end, I think the time is long past for us to be trying to fix bad strings - we should just look for valid UTF8 now.

But if the majority view is to leave it as-is then fair enough, though I think that will end up being more hassle somewhere on down the line...


Manolo

Oct 10, 2024, 12:28:14 PM
to fltk.coredev
On Thursday 10 October 2024 at 14:44:52 UTC+2, Albrecht-S wrote:
I withdraw my proposal to make this conversion stuff optional for the sake of a quick release of FLTK 1.4.0. Let's end this discussion for now and revisit this when we start implementing 1.5.0.

@Manolo: please implement the "lighter version that calls only fl_utf8decode() until a non-conformant byte is found".

Done at 013e939.

The core of my position is that I want all platforms to behave the same way.
I'm not at all against abandoning compatibility with CP1252.
I'm against abandoning it optionally for the Pango backend only, by creating a new Fl::option
that would become obsolete later when we abandon compatibility with CP1252 altogether,
possibly in 1.5.
