fl_call_main.c's WinMain function creates broken UTF-8.

Gonzalo Garramuño

Nov 18, 2023, 1:53:32 PM
to fltkc...@googlegroups.com
I borrowed the code in fl_call_main.c's WinMain function to turn the
wstrings into UTF-8.   My user was testing it with Chinese characters
and reported it did not work properly.

I have changed the code to the following simpler C++ code. Someone with
more knowledge of UTF-8 should address this in FLTK 1.4's code:


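#ifdef _WIN32

#    include <windows.h>
#    include <shellapi.h> // CommandLineToArgvW
#    include <cstdlib>    // free
#    include <cstring>    // _strdup
#    include <string>
#    include <locale>
#    include <codecvt>

// The application's regular entry point, defined elsewhere
extern int main(int argc, char** argv);

// Function to convert a UTF-16 wstring to a UTF-8 string.
// codecvt_utf8_utf16 handles surrogate pairs; codecvt_utf8<wchar_t> only
// covers UCS-2 when wchar_t is 16 bits wide, as it is on Windows.
// Note: <codecvt> is deprecated as of C++17.
std::string wstring_to_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.to_bytes(ws);
}

int WINAPI WinMain(
    HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine,
    int nCmdShow)
{
    int argc = 0;

    // Get the command line as an array of wide strings
    LPWSTR* wideArgv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (!wideArgv)
        return 1;

    // Convert each wide string argument to UTF-8
    char** argv = new char*[argc + 1];
    for (int i = 0; i < argc; ++i)
    {
        argv[i] = _strdup(wstring_to_utf8(wideArgv[i]).c_str());
    }
    argv[argc] = nullptr; // argv is expected to be null-terminated

    // Free the wide string array
    LocalFree(wideArgv);

    int ret = main(argc, argv);

    // Cleanup allocated memory for argv
    for (int i = 0; i < argc; ++i)
    {
        free(argv[i]);
    }
    delete[] argv;

    return ret;
}

#endif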


--
Gonzalo Garramuño
ggar...@gmail.com

melcher....@googlemail.com

Nov 19, 2023, 6:44:41 AM
to fltk.coredev
Thanks, I will try to take a look at this. Meanwhile, I'll make this into a GitHub Issue.

imm

Nov 19, 2023, 6:56:59 AM
to coredev fltk
On Sun, 19 Nov 2023, 11:44 Matt wrote:
Thanks, I will try to take a look at this. Meanwhile, I'll make this into a GitHub Issue.


Matt,

If you do look at this, it's probably better to use our own wc to utf conversion functions, from the fl_utf8 stuff, than to pull in external functions to do the job... Not sure all the compilers like that ..

I never really understood Microsoft's winmain approach, never really "got" it...

--
Ian
From my Fairphone FP3
   

melcher....@googlemail.com

Nov 19, 2023, 7:50:13 AM
to fltk.coredev
https://github.com/fltk/fltk/issues/840

On Sunday, November 19, 2023 at 12:56:59 UTC+1, imacarthur wrote:
If you do look at this, it's probably better to use our own wc to utf conversion functions, from the fl_utf8 stuff, than to pull in external functions to do the job... Not sure all the compilers like that ..

Yes, that will be the easiest way. We already wrap all the file access functions for the same reasons.
 
I never really understood Microsoft's winmain approach, never really "got" it...

There is really nothing to it. Microsoft at the time did not really want to support the command line at all. They decided that having their own version of a `main()` entry point, one that provided the information they needed for launching a GUI program, was the way to go. That's all.

It gets really nasty on Android, where they don't want native apps, so the main entry point is in a Java call. Until some time ago, you could not write a C++ app for Android without writing at least some Java. After tons of complaints by game writers who did not want to install a full Java dev environment just to write five lines of Java code to call their game, Google gave in and added Native Activity. You still need a painful amount of stuff around that to make a working Android app, which is probably the main reason that the FLTK Android port never worked well.

Albrecht Schlosser

Nov 19, 2023, 1:38:09 PM
to fltkc...@googlegroups.com
On 11/19/23 12:56 imm wrote:
Matt,

If you do look at this, it's probably better to use our own wc to utf conversion functions, from the fl_utf8 stuff, than to pull in external functions to do the job... Not sure all the compilers like that ..

Thanks, Ian, for this hint. Meanwhile I took the issue and implemented a solution in commit 7e8994c4a295e8709d4940656248c231de62a8a6.

Regarding our own "Wide Character to UTF-8" functions for Windows: we have several and I still need to investigate details.

I discovered: "we're using WideCharToMultiByte() and MultiByteToWideChar() already in src/Fl_Native_File_Chooser_WIN32.cxx and src/drivers/WinAPI/Fl_WinAPI_System_Driver.cxx" (see GitHub issue #840, https://github.com/fltk/fltk/issues/840#issuecomment-1817928735).

Since this is a Windows-only issue and this specific function (i.e. WinMain() calling main()) is *only* for Visual Studio (!), I decided to use WideCharToMultiByte() in this context. This should be available on all our supported platforms and has already been used for a while...
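
For readers unfamiliar with the API, here is a minimal sketch of what a WideCharToMultiByte() based conversion could look like, assuming the usual two-call pattern (query the size first, then convert). The helper name `wide_to_utf8` is hypothetical, and this is not the code from the commit mentioned above.

```cpp
#include <string>

#ifdef _WIN32
#  include <windows.h>

// Hypothetical helper (not FLTK API): convert a NUL-terminated UTF-16
// string to UTF-8 with WideCharToMultiByte().
std::string wide_to_utf8(const wchar_t* ws)
{
    // First call: ask for the required buffer size in bytes. Passing -1 as
    // the input length makes the count include the terminating NUL.
    int len = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return std::string(); // conversion failed

    // Second call: convert into a buffer of exactly that size.
    std::string out(static_cast<size_t>(len), '\0');
    len = WideCharToMultiByte(CP_UTF8, 0, ws, -1, &out[0], len, NULL, NULL);
    if (len <= 0)
        return std::string();

    out.resize(static_cast<size_t>(len) - 1); // drop the terminating NUL
    return out;
}
#endif // _WIN32
```

In a WinMain() wrapper, such a helper would be called once per entry of the array returned by CommandLineToArgvW().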

I'm not sure what the best solution would be. As Matt wrote: "It sure makes sense to use WideCharToMultiByte() if it is available on all versions of MSWindows. It would replace a whole page of handmade code."

Since we're using it already in FLTK 1.3 it should be safe to use it for all our conversions on the Windows platform.

Is there a reason to believe that "our version" could be faster or better than the MS version?

What do you and others think?

imacarthur

Nov 21, 2023, 4:42:59 AM
to fltk.coredev
On Sunday, 19 November 2023 at 18:38:09 UTC Albrecht Schlosser wrote:

Regarding our own "Wide Character to UTF-8" functions for Windows: we have several and I still need to investigate details.

I discovered: "we're using WideCharToMultiByte() and MultiByteToWideChar() already in src/Fl_Native_File_Chooser_WIN32.cxx and src/drivers/WinAPI/Fl_WinAPI_System_Driver.cxx" (see GitHub issue #840, https://github.com/fltk/fltk/issues/840#issuecomment-1817928735).

Since this is a Windows-only issue and this specific function (i.e. WinMain() calling main()) is *only* for Visual Studio (!), I decided to use WideCharToMultiByte() in this context. This should be available on all our supported platforms and has already been used for a while...

I'm not sure what the best solution would be. As Matt wrote: "It sure makes sense to use WideCharToMultiByte() if it is available on all versions of MSWindows. It would replace a whole page of handmade code."

Since we're using it already in FLTK 1.3 it should be safe to use it for all our conversions on the Windows platform.

Is there a reason to believe that "our version" could be faster or better than the MS version?

What do you and others think?

From memory (I have not checked this recently) there was some issue with conversion "to" and "from" for code points that are "marginal"... 
I'll try and explain that!
I think our conversion routines (which are pretty old now) made some effort to catch some code points that occurred commonly in internet text (like, IIRC, 0xA0 which commonly occur[ed | s] as a non-breaking space in web text) that are (or were) common but strictly speaking were invalid UTF code points... The idea was that it made a lot of pre-existing text Just Work.
So.... I think our functions were coded so that, to a large extent, if you converted "there" and "back again" you ended up with what you started with, whereas for some of the OS provided functions that wasn't always the case.
Or something like that. It was a while ago and I may be talking nonsense, and it may well all have changed in the meantime anyway.
And it may not matter anyway.

That said, another "advantage" of using our own functions is that the conversion will be the same on all hosts (even if it is wrong, at least we'd be consistently wrong!)

Bill Spitzak

Nov 21, 2023, 11:15:21 AM
to fltkc...@googlegroups.com
For conversion from UTF-16 the Windows function is probably safe to use. I would check exactly what it does with "invalid UTF-16", which is where there are "first surrogate" or "last surrogate" codes not next to each other. If this throws an exception, crashes, or truncates the string, then the Windows function cannot be used, as this provides a DoS (denial-of-service) security bug. Ideally what it does is turn these into the matching 3-byte UTF-8 sequence. But it is probably OK if it turns them into other UTF-8 or removes them entirely.
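
The "ideal" handling of unpaired surrogates described above can be sketched portably. This is only an illustration, not the FLTK or the Windows implementation; the function name `utf16_to_utf8_lenient` is made up for this example, and it assumes the input arrives as a `std::u16string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not FLTK or Windows API): UTF-16 to UTF-8.
// A valid surrogate pair becomes one 4-byte sequence; an unpaired
// surrogate is written as its own 3-byte sequence, so the conversion
// never aborts or truncates the string.
std::string utf16_to_utf8_lenient(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // High surrogate directly followed by a low surrogate: combine.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) { // also reached by unpaired surrogates
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch a lone 0xD800 comes out as the three bytes ED A0 80, which is not strictly valid UTF-8 but keeps the rest of the string intact.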

The opposite direction, converting from UTF-8 to UTF-16, is more problematic, but the test is the same. Check what it does with "invalid UTF-8". It cannot crash, throw an exception, or truncate the string. Also check what it does with whatever the from-UTF-16 converter did with the unpaired surrogates. Ideally it turns back into the original (invalid) UTF-16. IMHO the ideal result is to turn the bytes of any "error" into individual characters based on CP1252, that is what the FLTK functions did.
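
A rough sketch of the CP1252 fallback idea follows. It is an illustration written for this discussion, not the FLTK code: the name `utf8_to_codepoints_lenient` is hypothetical, it produces code points rather than UTF-16 to stay short, and it assumes that 3-byte encodings of surrogates are accepted so that they can round-trip.

```cpp
#include <cstddef>
#include <string>

// CP1252 meaning of the bytes 0x80..0x9F. Bytes 0xA0..0xFF map to the
// code point with the same value, as in ISO-8859-1.
static const char32_t cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper: decode UTF-8 into code points. Any byte that does
// not start a well-formed sequence is interpreted as a single CP1252
// character, so nothing is dropped and the string is never truncated.
std::u32string utf8_to_codepoints_lenient(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) { // plain ASCII
            out += static_cast<char32_t>(b);
            ++i;
            continue;
        }
        int extra = 0;            // continuation bytes expected
        char32_t cp = 0, min = 0; // min rejects overlong encodings
        if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }

        bool ok = extra > 0 && i + extra < in.size();
        for (int k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok && (cp < min || cp > 0x10FFFF)) ok = false;

        if (ok) {
            out += cp;
            i += extra + 1;
        } else {
            // Not valid UTF-8 at this position: treat the byte as CP1252.
            out += (b < 0xA0) ? cp1252_high[b - 0x80]
                              : static_cast<char32_t>(b);
            ++i;
        }
    }
    return out;
}
```

A stray 0xA0 byte, for example, decodes to U+00A0 (non-breaking space), and a Latin-1 encoded "café" still yields the expected four characters.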

--
You received this message because you are subscribed to the Google Groups "fltk.coredev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fltkcoredev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/fltkcoredev/9071e549-26a9-4041-b7bc-baf466b06568n%40googlegroups.com.

Albrecht Schlosser

Nov 23, 2023, 1:22:34 PM
to fltkc...@googlegroups.com
Thanks, Bill and Ian, these are good points, and sorry for the late reply.

I think we need to consolidate our UTF-8 to UTF-16 and vice versa conversions. Ideally we would use only one function in each direction.

I remember also (w/o looking at the code) that there were attempts to convert some illegal characters (as mentioned by Ian) as if they were Windows CP-1252 or ISO-8859-1 single byte characters (range 0x80 to 0xff) in our functions, very likely controlled by some compiler macros. If this is true then we should probably not use the Windows functions at all and use ours exclusively.

Regarding cross-platform consistency across hosts: I'm not sure if UTF-16 is used on any host, as far as the FLTK library is concerned. I assume it's only used for Windows hosts but users might use our conversion functions for any input (maybe files) that are encoded in UTF-16 or if they want to output UTF-16.

That said, I'll look into it and take care of the points made by Ian and Bill. Thanks.

Bill Spitzak

Nov 23, 2023, 1:32:23 PM
to fltkc...@googlegroups.com
Though I like converting errors into CP1252 (as it makes old strings still work), it is not vital. What is vital is that the functions don't do anything other than return all the valid characters in the string if there is an encoding error in the middle, and replace the errors with something that is a valid encoding. It is also mildly important that unpaired surrogates (an invalid sequence in UTF-16), if converted to UTF-8 and back, produce the same unpaired surrogates. I think some quick tests of the Windows functions will reveal what happens with them; maybe then they are the only ones used (it would mean no converter functions are in FLTK for non-Windows, but this might be a good idea as there should be no reason to use them).



Albrecht Schlosser

Nov 23, 2023, 1:35:51 PM
to fltkc...@googlegroups.com
On 11/23/23 19:22 Albrecht Schlosser wrote:
> I think we need to consolidate our UTF-8 to UTF-16 and vice versa
> conversions. Ideally we would use only one function in each direction.
>
> ...
>
> I'll look into it and take care of the points made by Ian and Bill.
> Thanks.

Note: I opened GitHub Issue #846 for this, see
https://github.com/fltk/fltk/issues/846.

Greg Ercolano

Nov 23, 2023, 2:28:12 PM
to fltkc...@googlegroups.com

On 11/23/23 10:32, Bill Spitzak wrote:

Though I like converting errors into CP1252 (as it makes old strings still work), it is not vital. What is vital is that the functions don't do anything other than return all the valid characters in the string if there is an encoding error in the middle, and replaces the errors with something that is a valid encoding.

    FWIW in Fl_Terminal I've been using the "upside down question mark" for errors, e.g.

      const char *unknown = "¿";

    That seems to be a popular character in Spanish, and is in the CP-1252 set.
    I think old IBM terminals I used did the same thing for unprintable characters.

    I use it for showing bad ANSI/xterm sequences and unprintable/unsupported
    control characters.

    The Fl_Terminal::show_unknown(true|false) method turns that feature on or off.
    When it's off, it shows nothing instead of that character.
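
    The on/off behaviour described above can be illustrated with a small sketch. This is not Fl_Terminal's code: the function `mark_unprintable` and its `show_unknown` parameter are hypothetical, and it only looks at ASCII control characters.

```cpp
#include <string>

// Hypothetical sketch, not Fl_Terminal's implementation: replace ASCII
// control characters (other than tab and newline) with a placeholder,
// or drop them entirely when show_unknown is false.
std::string mark_unprintable(const std::string& in, bool show_unknown)
{
    static const char unknown[] = "\xC2\xBF"; // "¿" (U+00BF) as UTF-8
    std::string out;
    for (unsigned char c : in) {
        bool control = (c < 0x20 && c != '\t' && c != '\n') || c == 0x7F;
        if (!control)
            out += static_cast<char>(c);
        else if (show_unknown)
            out += unknown;
    }
    return out;
}
```

    Writing the placeholder as an escaped UTF-8 byte sequence avoids depending on the encoding of the source file.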

