
"Doing UTF-8 in Windows" by Mircea Neacsu


Lynn McGuire

Feb 19, 2020, 4:35:17 PM
"Doing UTF-8 in Windows" by Mircea Neacsu
https://www.codeproject.com/Articles/5252037/Doing-UTF-8-in-Windows

"This is (yet another!) article on how to handle UTF-8 encoding on a
platform that still encourages the UTF-16 encoding. I am also providing
a small library for this purpose. The code works, it is clean, easy to
understand and small."

"This is an implementation of the solution advocated in the UTF-8
Everywhere manifesto. I would strongly encourage you to go read the
whole document to get indoctrinated ☺."
http://utf8everywhere.org/

We are finally moving our software to UTF-8. It is horrendous so far.

Lynn

Jorgen Grahn

Feb 19, 2020, 5:16:19 PM
On Wed, 2020-02-19, Lynn McGuire wrote:
...
> We are finally moving our software to UTF-8. It is horrendous so far.

Can you expand on that? E.g. moving from what?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Lynn McGuire

Feb 19, 2020, 5:26:44 PM
On 2/19/2020 4:16 PM, Jorgen Grahn wrote:
> On Wed, 2020-02-19, Lynn McGuire wrote:
> ...
>> We are finally moving our software to UTF-8. It is horrendous so far.
>
> Can you expand on that? E.g. moving from what?
>
> /Jorgen

ASCII. Our Windows user interface has 450,000 lines of code in C++.
Our Calculation Engine has 700,000 lines of F77 and 10,000+ lines of C
and C++.

Lynn




Jorgen Grahn

Feb 19, 2020, 5:48:54 PM
On Wed, 2020-02-19, Lynn McGuire wrote:
> On 2/19/2020 4:16 PM, Jorgen Grahn wrote:
>> On Wed, 2020-02-19, Lynn McGuire wrote:
>> ...
>>> We are finally moving our software to UTF-8. It is horrendous so far.
>>
>> Can you expand on that? E.g. moving from what?
>>
>> /Jorgen
>
> ASCII.

Then you're already doing UTF-8! (Only half-joking.)

> Our Windows user interface has 450,000 lines of code in C++.
> Our Calculation Engine has 700,000 lines of F77 and 10,000+ lines of C
> and C++.

I guess this is a lot of work or a little, depending on how much that code cares
about the actual contents of strings.

Lynn McGuire

Feb 19, 2020, 6:06:20 PM
On 2/19/2020 4:48 PM, Jorgen Grahn wrote:
> On Wed, 2020-02-19, Lynn McGuire wrote:
>> On 2/19/2020 4:16 PM, Jorgen Grahn wrote:
>>> On Wed, 2020-02-19, Lynn McGuire wrote:
>>> ...
>>>> We are finally moving our software to UTF-8. It is horrendous so far.
>>>
>>> Can you expand on that? E.g. moving from what?
>>>
>>> /Jorgen
>>
>> ASCII.
>
> Then you're already doing UTF-8! (Only half-joking.)
>
>> Our Windows user interface has 450,000 lines of code in C++.
>> Our Calculation Engine has 700,000 lines of F77 and 10,000+ lines of C
>> and C++.
>
> I guess this is much work or little, depending on how much that code cares
> about the actual contents of strings.
>
> /Jorgen

Anything that calls the Win32 API or opens a file ...

Lynn


Christian Gollwitzer

Feb 20, 2020, 1:51:01 AM
Am 20.02.20 um 00:06 schrieb Lynn McGuire:
> Anything that calls the Win32 API or opens a file ...
>

Migrate to Linux?

SCNR

But it's true - fopen("blöä€λκΠни", "r") simply works as expected on any
modern Linux system.

Öö Tiib

Feb 20, 2020, 3:07:25 AM
On Thursday, 20 February 2020 08:51:01 UTC+2, Christian Gollwitzer wrote:
> Am 20.02.20 um 00:06 schrieb Lynn McGuire:
> > On 2/19/2020 4:48 PM, Jorgen Grahn wrote:
> >> On Wed, 2020-02-19, Lynn McGuire wrote:
> >>> On 2/19/2020 4:16 PM, Jorgen Grahn wrote:
> >>>> On Wed, 2020-02-19, Lynn McGuire wrote:
> >>>> ...
> >>>>> We are finally moving our software to UTF-8.  It is horrendous so far.
> >>>>
> >>>> Can you expand on that?  E.g. moving from what?
> >>>>
> >>>> /Jorgen
> >>>
> >>> ASCII.
> >>
> >> Then you're already doing UTF-8!  (Only half-joking.)
> >>
> >>> Our Windows user interface has 450,000 lines of code in C++.
> >>> Our Calculation Engine has 700,000 lines of F77 and 10,000+ lines of C
> >>> and C++.
> >>
> >> I guess this is much work or little, depending on how much that code
> >> cares
> >> about the actual contents of strings.
> >
> > Anything that calls the Win32 API or opens a file ...
> >
>
> Migrate to Linux?

That will only open a new (likely smaller) market and will not
resolve the issue in Lynn's current, Windows-using market.

A Windows-API-calling code base of the described size (together
with unit tests and other support code) can take tens
of man-years to migrate to Mac and/or Linux. So a budget of a couple
of million is needed just to start.

Lynn McGuire

Feb 20, 2020, 2:22:37 PM
None of my customers are running Linux on their desktops. All are
businesses running Windows. But, who knows what the future will bring.

Lynn

Lynn McGuire

Feb 20, 2020, 2:23:54 PM
Thanks for understanding the issues of working in a very small shop. We
actually want to go to the cloud but that is a total disaster for us.

Lynn

Bart

Feb 20, 2020, 2:46:07 PM
I was last writing commercial software in the 90s, to run on Windows
because that was what everybody had. Windows PCs could also be bought
anywhere off-the-shelf, ready to go, and with a vast number of
peripherals available, that could be trusted to just work.

I had no idea at the time where one would even buy a Linux PC.

Perhaps one in a thousand asked about running on Linux, but it wasn't
worth it to us to do anything about that. Some had success though
emulating Windows on Macs.

20-25 years on, I'm not sure that much has changed; I still have no idea
where one would buy an actual Linux desktop PC. (Tablets and phones are
a completely different market, not really suitable for the kind of
software we wrote.)

Even if Linux PCs were everywhere - each one is running a different
version of Linux. Each could have a different processor. How do you even
distribute binaries on such systems? That is also a different world.

It's only on newsgroups like this that Windows developers are in a minority.

David Brown

Feb 20, 2020, 3:49:07 PM
Usually you don't buy a Linux PC. You buy a PC, and install Linux on
it. Some PCs can be bought without an OS, and a few specialist
places sell them with Linux installed already (usually Ubuntu). Often
you get the option of no OS or your preference of Linux from sites that
let you build your own configuration. The same applies to machines you
build yourself from parts. And of course, any company or organisation
buying in enough quantity gets to make their own choices.

But for the solid majority of cases, you get a PC with Windows and
either dual-boot or scrub Windows, and install Linux yourself.
Economics is a weird thing - it is cheaper for most manufacturers to
sell a PC with Windows than one without, even when they have to pay for
the Windows license. If they sell you Windows, they offset the cost of
the license by installing crapware - time-limited versions of MS Office,
internet "security" software, and all the other limited and demo
software that can take hours to clear out. These are adverts, and the
manufacturer makes money from the deal. So they lose out if you want a
system without Windows.

Installing a common Linux system - such as Ubuntu, Red Hat, or Mint (my
personal favourite for desktops) is generally a simple business, and I
find it takes a lot less time than getting many Windows systems
installed from their hard disk, configured (with all the painful process
of persuading it to work without giving Microsoft the keys to your
life), and removing the crapware. And then you have to start installing
a proper browser or two, email program, office package, compiler,
editor, and so on - things that come out of the box on any Linux
desktop. Windows and Linux each have their advantages and
disadvantages, and each is far from perfect - but there is no doubt in
my mind that Linux is normally a great deal faster to go from
out-of-the-box to usable system.

You worry about different versions of Linux on different systems - that
is a valid concern. For most software it is not a big issue, but it is
certainly quite an effort if you want your software to look good on a
range of systems - people can use very different desktops, for example.
(Equally, for some kinds of software it can be difficult supporting all
the different Windows versions around.)

Linux desktops will invariably be x86 systems, and all but the oldest
will be 64-bit - just as in the Windows world. CPUs are only a concern
if you want to take advantage of special features, the latest SIMD
instructions, and so on - again, just like for Windows.

The solid majority of user desktop systems are Windows, with Macs in a
low second place, and Linux behind. For laptops, ChromeOS is a growing
market share - it is Linux. But most programs written for it will be
higher level languages or web applications, and work on any system.

On servers, especially more powerful ones, Linux is a lot more common.
I don't think we've used Windows on a new server since the turn of the
century.

You see Linux on the desktop or laptop in more specialist use. You see
it a lot in software development - it is simply a hugely more efficient
environment for most software development tasks. For anyone involved in
IT or networking, Linux is the natural choice for anything except
perhaps managing Windows servers. High power software - simulations,
modelling, big data, high-end CAD, etc., is often done on *nix of some kind.

As for people in a newsgroup like this, I think a lot of it is a
generation thing. Much of the "old guard" started with Unix systems
(some were pre-Unix), and many will have moved to Linux. The middle
group will mostly be from a Windows-dominated age, while newer
programmers are on Linux again.

Lynn McGuire

Feb 20, 2020, 9:38:55 PM
I started writing Fortran IV code on a Univac 1108 in 1975.

Lynn


Alf P. Steinbach

Feb 20, 2020, 9:40:47 PM
Uhm, half a year after Windows finally got support for UTF-8 as process
ANSI codepage, you start rewriting your code base to replace calls of
the ordinary ASCII based functions with ditto UTF-8 ones.

That's sort of counter-productive, in-the-wrong-place-at-the-wrong-time.

- Alf

Cholo Lennon

Feb 20, 2020, 10:35:18 PM
On 2/20/20 4:45 PM, Bart wrote:
> On 20/02/2020 19:22, Lynn McGuire wrote:
>> On 2/20/2020 12:50 AM, Christian Gollwitzer wrote:
>
>>> Migrate to Linux?
>>>
>>> SCNR
>>>
>>> But it's true - fopen("blöä€λκΠни", "r") simply works as expected on
>>> any modern Linux system.
>>
>> None of my customers are running Linux on their desktops.  All are
>> businesses running Windows.  But, who knows what the future will bring.
>
> I was last writing commercial software in the 90s, to run on Windows
> because that was what everybody had. Windows PCs could also be bought
> anywhere off-the-shelf, ready to go, and with a vast number of
> peripherals available, that could be trusted to just work.
>
> I had no idea at the time where one would even buy a Linux PC.
>
> Perhaps one in a thousand asked about running on Linux, but it wasn't
> worth it to us to do anything about that. Some had success though
> emulating Windows on Macs.
>
> 20-25 years on, I'm not sure that much has changed; I still have no idea
> where one would buy an actual Linux desktop PC. (Tablets and phones are
> a completely different market, not really suitable for the kind of
> software we wrote.)
>

There are many places, here is one of them:
https://slimbook.es/en/

> Even if Linux PCs were everywhere - each one is running a different
> version of Linux. Each could have a different processor. How do you even
> distribute binaries on such systems? That is also a different world.
>

Well, most systems (especially desktops) are x86/x86_64; for those you
have many alternatives, like static linking, Docker, Flatpak,
AppImage, Snap, etc.

I worked several years in the telecom industry (before docker started to
simplify our lives, and Linux took the crown against Windows/Solaris on
that domain). Our (very) complex server side applications ran on Windows
(32/64 bits), Linux (different versions of RHEL 32/64 bits) and Solaris
Intel/Sparc. It was a great effort to have (design/build/maintain) a C++
cross-platform code base, and also to deploy those applications (using
home-made deployer scripts; so we distributed our binaries using special
scripts), but it wasn't impossible. Yeah, it wasn't impossible, but
costly, so costly in terms of time and money that I spent my final years
in that industry writing new software in Java (Business decisions) and
maintaining "legacy" C++ code.

--
Cholo Lennon
Bs.As.
ARG

Cholo Lennon

Feb 20, 2020, 10:56:22 PM
Wow, I used F77 (a Microsoft variant) at university in 1993. It was
already ancient at that time, with poor control statements. I have some
scientist friends who still use it for calculations/simulations, but I
can't believe that it is still present in commercial applications :-D

Lynn McGuire

Feb 20, 2020, 11:37:10 PM
Many commercial applications. Our software was first released in 1969.
Any calculational software of that vintage is Fortran.

Lynn

Lynn McGuire

Feb 20, 2020, 11:38:41 PM
Nope. We are converting our data storage to UTF-8. Our Win32 calls are
being converted to UTF-16. There is no UTF-8 Win32 API.

Lynn
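
To illustrate the boundary described above (UTF-8 in storage, UTF-16 at the Win32 call), here is a minimal sketch of the conversion step. The name utf8_to_utf16 is hypothetical and does not come from the thread or from the article's library. The decoder is hand-rolled in standard C++ so that it compiles anywhere; real Win32 code would more likely call MultiByteToWideChar with CP_UTF8 and pass the result to the W functions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
// Portable stand-in for MultiByteToWideChar(CP_UTF8, ...).
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence has.
        if      (lead < 0x80)           { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogate code points, and values past U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Pure ASCII input passes through unchanged, one code unit per byte, which is why code that never looks inside its strings needs little work.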

Alf P. Steinbach

Feb 21, 2020, 12:12:31 AM
Oh that's good. I misunderstood what you're doing, it wasn't all that clear.


>  There is no UTF-8 Win32 API.

If you mean your company is not using it, that's one thing.

If you mean there's no such, that's incorrect.

When you set the process codepage to UTF-8, then the ANSI API is a
UTF-8 based API. The most notable function that has no ANSI wrapper, the
command line parsing function, isn't needed because the `main` arguments
are then UTF-8 (with the compilers I tried, VC and g++). And from what
I've seen use of the API for command line parsing is exceedingly rare.

- Alf

Cholo Lennon

Feb 21, 2020, 12:17:45 AM
Wow again. I received support tickets for my own creations after more
than 10 years in production or other tickets after more than 15 years in
production, but 1969! that year is way ahead of my "personal records",
actually I wasn't born :-O

Clearly software will survive us :-P

Christian Gollwitzer

Feb 21, 2020, 1:20:40 AM
Am 21.02.20 um 06:12 schrieb Alf P. Steinbach:
> When you set the process codepage to UTF-8, then the ANSI API is an
> UTF-8 based API. The most notable function that has no ANSI wrapper, the
> command line parsing function, isn't needed because the `main` arguments
> are then UTF-8 (with the compilers I tried, VC and g++). And from what
> I've seen use of the API for command line parsing is exceedingly rare.

Hold on. Does it mean, that there is a single line like this

#ifdef WINDOWS
magic_set_locale_toUTF8();
#endif

which I can put into my program so that it works the same as on Linux,
i.e. std::ifstream(argv[1]) will work?

Christian
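
For concreteness, the portable code in question might look like the sketch below. The name read_first_line is hypothetical. The function hands a UTF-8 narrow path straight to std::ifstream. That is expected to work on Linux as is; on Windows it assumes the process's active code page is UTF-8, otherwise the narrow path is interpreted in the current ANSI code page.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: read the first line of the file named by a
// UTF-8 encoded narrow path. Returns false if the file cannot be opened.
bool read_first_line(const std::string& utf8_path, std::string& line)
{
    std::ifstream in(utf8_path);   // narrow path goes to the runtime unchanged
    if (!in)
        return false;
    std::getline(in, line);
    return true;
}
```

In a program this would be called as read_first_line(argv[1], line), with no platform-specific branches in the source.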

Alf P. Steinbach

Feb 21, 2020, 3:20:09 AM
Almost.

It's not done in source code (at least not until Microsoft documents the
internals), but via a resource embedded in the executable, or by
configuring Windows.

And it only works on Windows versions since May 2019 (Windows 10
version 1903): it's not been backported.

Also, heads-up: while Visual C++ 2019's runtime library has been updated
so that the C level locale (`setlocale` and friends) supports UTF-8,
this is not so with MinGW g++ 9.2. With that compiler `main` arguments
work, and also UTF-8 file paths, because those just come from or are
passed to ANSI API functions, but the standard library's conversion
machinery doesn't work.

Hopefully the gcc/MinGW folks will get on top of this Real Soon Now.

---

/Configuring Windows/

Change the item ACP to value 65001, in the semi-documented registry key
Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

---

/Use a resource embedded in executable/

Example application manifest (don't add whitespace around "UTF-8"):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 message box" version="1.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage
                xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>

Resource file that embeds it, called "resources.rc" in following example
build commands:

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"


Example build commands with g++:

[G:\code\minimal_gui\binaries]
> set CFG=-std=c++17 -Wall -pedantic-errors

[G:\code\minimal_gui\binaries]
> g++ %CFG% ..\minimal.cpp -c -o minimal.o

[G:\code\minimal_gui\binaries]
> windres ..\resources.rc -o resources.o

[G:\code\minimal_gui\binaries]
> g++ minimal.o resources.o -mwindows

- Alf
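
Whether either setting actually took effect can be checked at run time. The sketch below is one way to do that, with hypothetical helper names. On Windows it asks GetACP() for the process's active ANSI code page; on other platforms it simply reports 65001, which is an assumption made only so the sketch compiles everywhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: the code page used by the narrow ("ANSI") API.
unsigned active_code_page()
{
#ifdef _WIN32
    return GetACP();   // 65001 when the manifest or registry setting is active
#else
    return 65001;      // assumption: non-Windows builds treat narrow strings as UTF-8
#endif
}

// Hypothetical helper: 65001 is the Windows code page number for UTF-8 (CP_UTF8).
bool is_utf8_code_page(unsigned code_page)
{
    return code_page == 65001;
}
```

A program could call is_utf8_code_page(active_code_page()) at startup and fall back to explicit UTF-16 conversion, or refuse to run, when it returns false.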

Ralf Fassel

Feb 21, 2020, 4:07:23 AM
* "Alf P. Steinbach" <alf.p.stein...@gmail.com>
| On 21.02.2020 07:20, Christian Gollwitzer wrote:
| > Hold on. Does it mean, that there is a single line like this
| >
| > #ifdef WINDOWS
| > magic_set_locale_toUTF8();
| > #endif
| >
| > which I can put into my program so that it works the same as on
| > Linux, i.e. std::ifstream(argv[1]) will work?
| >
>
| Almost.
>
| It's not done in source code (at least not until Microsoft documents
| the internals), but via a resource embedded in the executable, or by
| configuring Windows.
>
| And it only works on Windowses since May 2019: it's not been backported.

Do you have a MSDN reference handy where one could read up on this
topic?

TNX
R'

Alf P. Steinbach

Feb 21, 2020, 4:35:15 AM
The little documentation I know of is unfortunately both incomplete and
super vague, but such as it is:

<url:
https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page#set-a-process-code-page-to-utf-8>

This is fast becoming off-topic for clc++.

However, I first learned about this feature here; I don't recall who.


- Alf

Thiago Adams

Feb 22, 2020, 6:22:32 PM
On Friday, February 21, 2020 at 6:35:15 AM UTC-3, Alf P. Steinbach wrote:
> On 21.02.2020 10:07, Ralf Fassel wrote:
> > * "Alf P. Steinbach"
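In case someone wants to print UTF-8 characters on the Windows console:

UINT oldcp = GetConsoleOutputCP();  // remember the current output code page
SetConsoleOutputCP(CP_UTF8);        // switch console output to UTF-8 (65001)

// Pass the text as an argument, not as the format string.
// (Up to C++17 a u8"" literal is const char[]; in C++20 it is char8_t[].)
printf("%s", u8"maçã");

SetConsoleOutputCP(oldcp);          // restore the previous code page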




Öö Tiib

Feb 22, 2020, 10:53:13 PM
Thanks, I'll try when I get something windows ... possibly Monday. ;)