
Hello World in Russian

Joseph Hesse

Jan 28, 2022, 10:51:28 PM
In the following program I have written "hello world" in Russian, copied
from the internet. I don't understand Russian, but I understand that its
alphabet is different from the English alphabet. The program is
supposed to output the original message and then the characters in the
original message. My question is: why does the output differ from the
text in the source?
Thank you,
Joe

$ more Hello.cpp
#include <iostream>
using namespace std;

int main()
{
    const wchar_t russian[] = L"Привет мир"; // Russian for "hello, world"

    wcout << russian << endl;

    for (const wchar_t &x : russian)
        wcout << x << L", ";
    wcout << endl;
}
$ g++ -std=c++14 Hello.cpp
$ ./a.out
Privet mir <== original message
P, r, i, v, e, t, , m, i, r, , <== wchars in original message
$

James Kuyper

Jan 28, 2022, 11:27:21 PM
On 1/28/22 22:51, Joseph Hesse wrote:
> In the following program, I have written "hello world", copied from
> the internet, in Russian. I don't understand Russian but I understand
> their alphabet is different from the English alphabet. The program is
> supposed to output the original message and the characters in the
> original message. My question is why are they different?
> Thank you,
> Joe
>
> $ more Hello.cpp
> #include <iostream>
> using namespace std;
>
> int main()
> {
> const wchar_t russian[] = L"Привет мир"; //Russian for "hello, world"

I do understand Russian (a little), and that looks correct to me.

> wcout << russian << endl;
>
> for(const wchar_t &x : russian)
> wcout << x << L", ";
> wcout << endl;
> }
> $ g++ -std=c++14 Hello.cpp
> $ ./a.out
> Privet mir <== original message
> P, r, i, v, e, t, , m, i, r, , <== wchars in original message
> $

The problem is that the behavior of your program depends upon the
locale. The default locale is the "C" locale, which doesn't support
those Russian characters. How those characters are interpreted is
implementation-defined. I find it amusing that they get
misinterpreted as Latin letters with the same sounds.

I added two lines to your program:

#include <clocale>

and as the first line of main():

std::setlocale(LC_CTYPE, "");

"" is the name of the native locale, what that means is up to your
implementation. On my desktop machine, it's the locale identified by the
LANG environment variable. I've currently got that set to en_US.UTF-8.
Because that locale supports UTF-8, it in particular supports those
Russian letters, and I get the following output:

Привет мир
П, р, и, в, е, т, , м, и, р, ,

You'll need to determine how the native locale is set on your machine,
and what the list of supported locales is, and then you'll have to find
out which one(s) support the characters you want to print.
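
Put together, the change described above might look like the sketch below. The helper names `select_ctype_locale`, `print_letters` and `hello_in_russian` are made up for illustration; only `std::setlocale`, `LC_CTYPE` and the wide stream come from the thread. It assumes the locale is selected before anything is written to `std::wcout`, and that the source file is saved as UTF-8.

```cpp
#include <cassert>
#include <clocale>
#include <iostream>

// Hypothetical helper: selects the locale used for character conversion.
// Passing "" asks for the native locale (on many Unix systems taken from
// LANG or the LC_* variables). Returns false if the locale is unknown.
bool select_ctype_locale(const char* name = "")
{
    return std::setlocale(LC_CTYPE, name) != nullptr;
}

// Prints the message, then each wide character followed by ", ".
// Unlike a range-for over the array, this stops before the terminating L'\0'.
void print_letters(const wchar_t* message)
{
    std::wcout << message << L'\n';
    for (const wchar_t* p = message; *p != L'\0'; ++p) {
        std::wcout << *p << L", ";
    }
    std::wcout << L'\n';
}

// Usage sketch: call this first thing in main().
inline void hello_in_russian()
{
    if (!select_ctype_locale("")) {
        // Fall back to the "C" locale, which every implementation provides.
        const bool fallback_ok = select_ctype_locale("C");
        assert(fallback_ok);
        (void)fallback_ok;
    }
    print_letters(L"Привет мир");
}
```

Whether the Cyrillic letters come out intact still depends on what the native locale is, as noted above; the sketch only makes the selection explicit and checkable.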

Andrey Tarasevich

Jan 29, 2022, 1:42:48 AM
Um... How did you manage to make your program output `L"Привет мир"` as

Privet mir

?

--
Best regards,
Andrey Tarasevich


Mut...@dastardlyhq.com

Jan 29, 2022, 5:21:45 AM
Just use UTF8 and you won't need to worry about wide char nonsense. Just
use ordinary chars and let the terminal deal with translating the encoding.
On *nix anyway. Windows is probably somewhat more primitive in its approach.


Alf P. Steinbach

Jan 29, 2022, 5:49:08 AM
I reproduced your result in Ubuntu.

The reason this happens is:

* The wide streams translate to/from the external byte-oriented encoding.
* With no locale specified the external encoding wcout knows about is
the one in the default "C" locale, namely pure ASCII.
* It uses a translation where symbols are mapped to phonetically similar
characters in the result encoding.

Generally it's a pretty ungood idea to use the wide streams. Since they
don't work in modern Windows (they get confused about the external
encoding, in all implementations) they now have the direct opposite
effect of the intended one. Namely, they make the code non-portable.

The modern way to do Russian or whatever output is to assume UTF-8:

#include <iostream>
using namespace std;

auto main() -> int
{
    const auto& russian = "Привет мир"; // Russian for "hello, world".
    cout << russian << endl;
}

This works when it's built and invoked correctly, but unfortunately
there's no error when it's not done correctly.

---

The general UTF-8 assumption can, however, easily be checked for the first
two of the points below, about assuming UTF-8

* as the source encoding,
* as the C++ literal constants encoding (the "execution character set"), and
* as the external environment's byte oriented encoding.

The following code statically asserts that the two first points hold:

#include <iostream>
using namespace std;

constexpr auto utf8_is_the_execution_character_set()
    -> bool
{
    constexpr auto& slashed_o = "ø";
    return (sizeof( slashed_o ) == 3 and slashed_o[0] == '\xC3' and
        slashed_o[1] == '\xB8');
}

static_assert(
    utf8_is_the_execution_character_set(),
    "The execution character set must be UTF-8 (e.g. MSVC option \"/utf-8\")."
    );

auto main() -> int
{
    const auto& russian = "Привет мир"; // Russian for "hello, world".
    cout << russian << endl;
}


You have to ensure the third aspect, that UTF-8 is the external
encoding, in some way. In Linux it's the default. In Windows it can be
ensured /for output/ by setting the console window's active codepage to
65001, e.g. via a command such as `chcp 65001`, or programmatically.
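
A minimal sketch of the programmatic route, assuming the Windows API function `SetConsoleOutputCP` with `CP_UTF8` (code page 65001). The helper names `use_utf8_console_output` and `print_utf8_line` are hypothetical. On other systems the helper does nothing and simply assumes a UTF-8 terminal.

```cpp
#include <cassert>
#include <iostream>

#ifdef _WIN32
#include <windows.h>   // SetConsoleOutputCP, CP_UTF8
#endif

// Hypothetical helper: asks the console to interpret output bytes as UTF-8.
// On Windows this sets the active output code page to 65001 (CP_UTF8);
// elsewhere it is a no-op that assumes the terminal already expects UTF-8.
bool use_utf8_console_output()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// Writes one UTF-8 encoded line with ordinary narrow output.
// Returns false if the console could not be switched.
bool print_utf8_line(const char* utf8_text)
{
    assert(utf8_text != nullptr);   // precondition
    if (!use_utf8_console_output()) {
        return false;
    }
    std::cout << utf8_text << '\n';
    return true;
}
```

A call such as `print_utf8_line("Привет мир")` then relies on the first two points (UTF-8 source and execution character set) being ensured separately, as in the static assertion above.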

---

Unfortunately, still as of Windows 11 console windows do not support
input of UTF-8, other than the ASCII subset. :(

You can however use various 3rd party libraries to do portable UTF-8 input.

Those libraries include Boost Nowide and my own still under construction
library code, currently called Kickstart. At the start of the Boost
adoption process, when Nowide had been formally adopted but not yet
released with Boost, the library still got Windows CRLF endings wrong,
so it's perhaps not the highest quality library in the world. And
Kickstart has been changing, e.g. filenames and file locations, about
every time I've looked at it, in spite of my claims that /now/ it's
pretty stable, so probably neither of them are suitable for commercial code.

The common theme is that these libraries require you to /explicitly/ use
their UTF-8 input, instead of just overriding the behavior of `std::cin`
or `stdin`. I believe because there are so many parties, including
Microsoft, doing that that it gets real messy. I did that once, in an
intended library, but it turned out that Microsoft changed things,
introduced new sabotage-like stuff, faster than I could keep up.

And unfortunately the people in charge of the standardization of C++
just plain refuse to fix things, such as e.g. an ensured UTF-8 mode of
the console i/o, and instead insist on re-educating C++ programmers into
using modern `char8_t` instead of `char`. Like re-educating
heterosexuals into adopting modern trans-sexual behavior. To this end in
C++20 they broke all code that made use of critical functionality of
`std::filesystem::path` (critical for Windows use, that is), and they
were very much made aware of that and refused to reconsider, so it's
very political, after 25 years, which means it will never get fixed.


- Alf

David Brown

Jan 29, 2022, 7:02:10 AM
That is the Latin alphabet transliteration of the Cyrillic letters. As
to why that conversion was done, I expect it is something to do with
locales and internationalisation libraries. It doesn't work for all
non-ASCII characters, which is hardly surprising. A quick test with
Norwegian letters "å", "ø" and "æ" resulted in "?", "?" and "ae".

The simple answer, at least in the *nix world, is to drop "wchar_t"
completely and forget that disaster ever existed.


#include <iostream>
using namespace std;

int main()
{
    const char russian[] = "Привет мир"; // Russian for "hello, world"

    cout << russian << endl;

    // Each char is one UTF-8 byte, so this loop walks bytes, not whole letters.
    for (const char &x : russian)
        cout << x << ", ";
    cout << endl;
}

As long as you have UTF8 locales and fonts (and you'd need an ancient
setup, or a very limited one, not to have them), you'll get the right
output.
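
One caveat with the `char` version: a range-for over a UTF-8 `char` array visits bytes, and each Cyrillic letter occupies two bytes, so a separator printed after every `char` lands in the middle of letters. A minimal sketch of grouping the bytes per code point first is below. `split_utf8` is a hypothetical helper, and it assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits a UTF-8 string into one std::string per code point.
// A byte of the form 10xxxxxx is a continuation byte and is appended to the
// current piece; any other byte starts a new piece.
// Assumes valid UTF-8; no validation is attempted.
std::vector<std::string> split_utf8(const std::string& text)
{
    std::vector<std::string> pieces;
    for (const char ch : text) {
        const auto byte = static_cast<unsigned char>(ch);
        const bool is_continuation = (byte & 0xC0) == 0x80;
        if (is_continuation && !pieces.empty()) {
            pieces.back() += ch;
        } else {
            pieces.push_back(std::string(1, ch));
        }
    }
    return pieces;
}

// Usage sketch: the first two letters of the greeting are four bytes.
inline void split_utf8_demo()
{
    const std::string two_letters = "\xD0\x9F\xD1\x80";   // UTF-8 bytes of "Пр"
    assert(two_letters.size() == 4);

    const std::vector<std::string> pieces = split_utf8(two_letters);
    assert(pieces.size() == 2);
    assert(pieces[0] == "\xD0\x9F");
}
```

Printing each piece followed by ", " then keeps every letter intact. Note that this splits by code point, which is not always the same as what a reader perceives as one character (combining marks, for instance).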


What I have learned from all this, is that "мир" ("mir") means "world"
as well as "peace". I find that an interesting linguistic titbit.

Alf P. Steinbach

Jan 29, 2022, 8:12:48 AM
Correct.

The wide streams must translate to ASCII in that situation with no
selected locale, i.e. with the default "C" locale.


> It doesn't work for all
> non-ASCII characters, which is hardly surprising. A quick test with
> Norwegian letters "å", "ø" and "æ" resulted in "?", "?" and "ae".
>
> The simple answer, at least in the *nix world, is to drop "wchar_t"
> completely and forget that disaster ever existed.

Agreed, including the qualification.


> #include <iostream>
> using namespace std;
>
> int main()
> {
> const char russian[] = "Привет мир"; //Russian for "hello, world"

I would just write

const auto& russian = "Привет мир"; //Russian for "hello, world"

... preserving the length information, and not relying on the compiler
to optimize away the copying, where the technically redundant `const` is
for clarity and supports fast visual scanning of code.

In particular this supports forming a `constexpr std::string_view`.
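
A small sketch of what that buys, assuming C++17. `literal_view` and `greeting` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Binding by reference keeps the array type const char[N], so the length
// (including the terminating '\0') is part of the type.
constexpr auto& greeting = "hello";
static_assert(sizeof(greeting) == 6, "5 characters plus the terminating null");

// Hypothetical helper: a view over a string literal, without the terminator.
// The length comes from the array type, so nothing is scanned or copied.
template <std::size_t N>
constexpr std::string_view literal_view(const char (&text)[N])
{
    return std::string_view(text, N - 1);
}

static_assert(literal_view(greeting).size() == 5, "length known at compile time");

// Usage sketch with the greeting from the thread.
inline void literal_view_demo()
{
    const auto& russian = "Привет мир";           // reference to a char array
    const std::string_view view = literal_view(russian);
    assert(view.size() == sizeof(russian) - 1);   // a byte count, not a letter count
}
```

With `const char russian[] = ...` the same helper works too, but the array is then a separate copy of the literal.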


> cout << russian << endl;
>
> for (const char &x : russian)
> cout << x << ", ";
> cout << endl;
> }
>
> As long as you have UTF8 locales and fonts (and you'd need an ancient
> setup, or a very limited one, not to have them), you'll get the right
> output.
>
>
> What I have learned from all this, is that "мир" ("mir") means "world"
> as well as "peace". I find that an interesting linguistic titbit.

And, a space station. :-o

I think in these troublesome times (North Stream 2 etc., not to mention
Hunter Biden's non-doings in Ukraine) it's worth mentioning, also a
little off-topic, that, according to Wikipedia, Russia supported the
American Revolution, and not just by later selling them Alaska. If not
for Russia the American Revolution would IMO have failed, i.e., Russia
helped to /create/ the United States. The first quoted sentence below is
at best opinion presented as fact (unlike my
opinion-presented-as-opinion), using words with negative connotations
(again unlike mine), and it says a lot that it's there uncontested in
Wikipedia, but anyway:


<<
Catherine the Great, a Russian empress who ruled from 1762–1796, played
a modest role in the American Revolutionary War through her politicking
with other European heads of state. Initially, the tsarina took a keen
interest in the American struggle because it affected "English and
European politics" and frankly believed that Britain was to blame for
the conflict. She held a negative opinion of King George and his
diplomats, often treating them with contempt. Nonetheless, the British
crown still formally requested 20,000 troops in 1775 and sought an
alliance.[20] She refused both pleas. Upon Spain's entry into the war,
Britain once again turned to the Russian Empire, but this time, the
English hoped for naval support. Catherine II once again ignored the
British requests.

Perhaps Catherine the Great's greatest diplomatic contribution came from
the creation and proclamation of the First League of Armed Neutrality in
1780. This declaration of armed neutrality had several stipulations, but
three crucial ones: first, "that neutral ships may freely visit the
ports of belligerent Powers;" second, "that the goods of belligerent
Powers on neutral ships are permitted to pass without hindrance, with
the exception of war contraband;" and, third, "under the definition of a
blockaded port falls only a port into which entry is actually hampered
by naval forces." Most European nations agreed to these terms, but
Britain refused to recognize the arrangement because it undermined the
blockade, its most effective military strategy. After establishing a
league of neutral parties, Catherine the Great attempted to act as a
mediator between the United States and Britain by submitting a ceasefire
plan. During her attempts at mediation, though, the Battle of Yorktown
thwarted any hope of a peaceful and diplomatic solution to the American
Revolutionary War.

These negotiations were accompanied by political intrigue. In 1780,
during the period of Catherine II's mediating, Britain attempted to
bribe the Russian Empire into an alliance. London offered St. Petersburg
the island of Menorca if the Russians would agree to join the British in
the war. Despite the economic boost such an acquisition offered,
Catherine the Great refused this bribe and utilized it as an opportunity
to make George III a laughing–stock of the European powers.
>>


- Alf

Paavo Helde

Jan 29, 2022, 10:39:50 AM
29.01.2022 12:48 Alf P. Steinbach kirjutas:

> At the start of the Boost
> adoption process, when Nowide had been formally adopted but not yet
> released with Boost, the library still got Windows CRLF endings wrong,
> so it's perhaps not the highest quality library in the world.

Nowadays even Notepad has learned to cope with LF line breaks, so there
is essentially no reason any more to use CRLF anywhere at all. As has
happened with UTF-8, Microsoft will finally be forced to give up its
stubbornness and join the sane world.

The whole notion of "text files" is seriously out of date. What's the
point of having the file content in memory differ from the content on
disk? Especially considering memory mapping, HTTP packets, cloud storage
etc.

I predict the same will happen with backslashes as well, but this might
take another 20 years. All the backslash madness in Windows exists just to
avoid typing a single space after the command name; it's about time to
get rid of it.

David Brown

Jan 29, 2022, 11:34:26 AM
I read your other post after I had written this one - it was very
informative.

>
>
>>  It doesn't work for all
>> non-ASCII characters, which is hardly surprising.  A quick test with
>> Norwegian letters "å", "ø" and "æ" resulted in "?", "?" and "ae".
>>
>> The simple answer, at least in the *nix world, is to drop "wchar_t"
>> completely and forget that disaster ever existed.
>
> Agreed, including the qualification.
>
>
>> #include <iostream>
>> using namespace std;
>>
>> int main()
>> {
>>     const char russian[] = "Привет мир";  //Russian for "hello, world"
>
> I would just write
>
>     const auto& russian = "Привет мир";  //Russian for "hello, world"
>
> ... preserving the length information, and not relying on the compiler
> to optimize away the copying, where the technically redundant `const` is
> for clarity and supports fast visual scanning of code.
>
> In particular this supports forming a `constexpr std::string_view`.
>

Sure. I was just making the minor modification to the original to
remove the wide characters.

>
>>     cout << russian << endl;
>>
>>     for (const char &x : russian)
>>       cout << x << ", ";
>>     cout << endl;
>> }
>>
>> As long as you have UTF8 locales and fonts (and you'd need an ancient
>> setup, or a very limited one, not to have them), you'll get the right
>> output.
>>
>>
>> What I have learned from all this, is that "мир" ("mir") means "world"
>> as well as "peace".  I find that an interesting linguistic titbit.
>
> And, a space station. :-o

Yes - that's why I know it means "peace".

(I'm snipping the history - while I enjoy history, discussing it can run
the risk of mixing it with politics and getting into very off-topic
discussions.)

Alf P. Steinbach

Jan 30, 2022, 3:19:16 AM
On 29 Jan 2022 16:39, Paavo Helde wrote:
> 29.01.2022 12:48 Alf P. Steinbach kirjutas:
>
>> At the start of the Boost adoption process, when Nowide had been
>> formally adopted but not yet released with Boost, the library still
>> got Windows CRLF endings wrong, so it's perhaps not the highest
>> quality library in the world.
>
> Nowadays even Notepad has learned to cope with LF linebreaks, so there
> is essentially no reason any more to use CRLF anywhere at all.

Oh, just all the internet protocols, like NNTP. ;-)


> As it has
> happened with utf-8, Microsoft will be finally enforced to give up its
> stubbornness and join the sane world.
>
> The whole notion of "text files" is seriously out of date. What's the
> point of having the file content in memory differ from the content on
> disk? Especially considering memory mapping, HTTP packets, cloud storage
> etc.

My opinion is that C (and hence also C++) text mode is an abomination
that should never have been introduced, and that in addition, given that
it was introduced, it's designed in a stupid way with the data
conversion applied underneath the buffer level so that one can't get a
clear view of the raw data.

As an example, the design means that Unix `cat` can't be faithfully
implemented in Windows using only standard C or C++, which IMO is extreme.
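
For contrast, copying bytes faithfully between streams that the program opens itself is straightforward in standard C++, provided files are opened with `std::ios::binary`. The limitation presumably concerns the already-open standard streams, whose mode cannot be switched portably. A minimal sketch, with `copy_bytes` as a hypothetical helper:

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies everything from `in` to `out` without interpreting the data.
// For files, both streams should have been opened with std::ios::binary,
// otherwise a text-mode implementation may still translate line endings.
void copy_bytes(std::istream& in, std::ostream& out)
{
    char buffer[4096];
    // read() fails on the final short block, so gcount() is checked as well.
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0) {
        out.write(buffer, in.gcount());
    }
}

// Usage sketch with in-memory streams, which never translate anything.
inline void copy_bytes_demo()
{
    const std::string raw = "line one\r\nline two\n";
    std::istringstream in(raw);
    std::ostringstream out;
    copy_bytes(in, out);
    assert(out.str() == raw);   // CR LF and bare LF both survive untouched
}
```

With `std::ifstream in(path, std::ios::binary)` the same function gives a byte-exact copy of a named file; it is `std::cin` and `std::cout` that remain in whatever mode the implementation chose.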

That said, from a C++ programming perspective data in memory is usually
statically typed, while data on disk is untyped or effectively
dynamically typed. In memory one knows that that thing is UTF-8 encoded
text. On disk one doesn't know and must assume, where in Windows such
assumptions can be partially checked (a good thing) via a UTF-8 BOM.


> I predict the same will happen with backslashes as well, but this might
> take another 20 years. All the backslash madness in Windows is about to
> avoid typing a single space after the command name, it's about time to
> get rid of this.

The Windows API level supports forward slashes, as did DOS before, and
especially for `#include` directives they should be used, not backslash.

Why the Windows shells and applications generally don't support them is
a mystery.

In some cases the fiction about what's allowed is imperfectly
implemented. I remember in the 1990's (when I still worked) I had some
fun demonstrating to colleagues how to completely and utterly hide some
data on disk, using commands like


<<
[C:\root\temp]
> type nul >poem.txt

[C:\root\temp]
> echo "Very important secret!" > poem.txt:secret

[C:\root\temp]
> dir | find "poem"
30 Jan 2022 09:14 0 poem.txt

[C:\root\temp]
> find /v "" < poem.txt:secret
"Very important secret!"
>>


The file doesn't need to be empty, it can e.g. contain an actual poem if
one feels like security by obscurity is a great thing.

This is just a bug in cmd.exe where it fails to check that the file name
is "allowed" for ordinary users, so one is able to specify an internal
NTFS stream. :-)


- Alf

Marcel Mueller

Jan 30, 2022, 6:09:28 AM
Am 29.01.22 um 16:39 schrieb Paavo Helde:
> Nowadays even Notepad has learned to cope with LF linebreaks, so there
> is essentially no reason any more to use CRLF anywhere at all. As it has
> happened with utf-8, Microsoft will be finally enforced to give up its
> stubbornness and join the sane world.

I wonder whether this will ever happen.
But many implementations are tolerant of different line-ending encodings.

> The whole notion of "text files" is seriously out of date. What's the
> point of having the file content in memory differ from the content on
> disk? Especially considering memory mapping, HTTP packets, cloud storage
> etc.

In HTTP LF w/o CR is not officially supported. ;-)

> I predict the same will happen with backslashes as well, but this might
> take another 20 years. All the backslash madness in Windows is about to
> avoid typing a single space after the command name, it's about time to
> get rid of this.

ntoskrnl, as well as its predecessor os2krnl, has been able to deal with
'/' as a path separator for a long time too. Basically the same as with
line endings: they are tolerant.
But the command-line parsers of many programs cannot handle this, since
they use '/' as an escape character to denote an option. This makes the
use of forward slashes awkward.

And although UTF-8 is quite common nowadays it raises several problems
in certain situations. E.g. database fields with restricted length
accept different string lengths depending on the number of characters
with longer UTF-8 encoding used. No user will ever understand this.
In Chinese and several other "non-ASCII" languages the UTF-8 encoding is
furthermore less compact than UCS-2. So the encoding issues will persist too.
And well the fact that on Unix-like OSes file names are just binary
blobs rather than a string with a known encoding raises further problems.


Marcel

Ben Bacarisse

Jan 30, 2022, 7:12:48 AM
"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:

> My opinion is that C (and hence also C++) text mode is an abomination
> that should never have been introduced, and that in addition, given
> that it was introduced, it's designed in a stupid way with the data
> conversion applied underneath the buffer level so that one can't get a
> clear view of the raw data.
>
> As an example, the design means that Unix `cat` can't be faithfully
> implemented in Windows using only standard C or C++, which IMO is
> extreme.

I assume you are talking about the cases where cat defaults to reading
stdin and/or writing stdout. If so, it could be argued that it's not the
fault of the C and C++ standards, but more the fault of the
implementations not providing a useful freopen function.

But then maybe freopen simply can't be implemented in Windows for some
mysterious reason I don't get.

--
Ben.

Paavo Helde

Jan 30, 2022, 7:26:30 AM
30.01.2022 13:09 Marcel Mueller kirjutas:
>
> In HTTP LF w/o CR is not officially supported. ;-)

Fixed encoding is better than random encoding. In HTTP headers one must
indeed use CR LF. But the (text) files themselves are sent as HTTP
bodies and there is zero reason to convert them to some other
representation than they have on the server disk.

>
>> I predict the same will happen with backslashes as well, but this
>> might take another 20 years. All the backslash madness in Windows is
>> about to avoid typing a single space after the command name, it's
>> about time to get rid of this.
>
> ntosknrl as well as its predecessor os2knrl can deal with '/' as path
> separator for a long time too. Basically the same as with line ending:
> they are tolerant.
> But the command line parsers of may programs can not handle this since
> the use '/' as escape character to denote an option. This makes the use
> of forward slash unhandy.

The slash can be still used as an option, there is no need to change
that. One just needs to start requiring that options be separated by spaces,
something that a sane person has always done anyway, so that

dir C:/B

would list the file B in the root folder and

dir C: /B

would list all files in cwd of C: in bare format.
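
The proposed rule can be sketched as a tiny tokenizer. This is not how cmd.exe actually parses its command line; `classify` and `split_on_spaces` are hypothetical helpers that only illustrate the suggestion that '/' introduces an option when it starts a space-separated token.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

enum class TokenKind { option, path };

// Hypothetical rule following the proposal: a token is an option only if it
// *starts* with '/'. A '/' inside a token is an ordinary path separator.
TokenKind classify(const std::string& token)
{
    return (!token.empty() && token.front() == '/') ? TokenKind::option
                                                     : TokenKind::path;
}

// Splits the argument part of a command line on whitespace.
std::vector<std::string> split_on_spaces(const std::string& arguments)
{
    std::vector<std::string> tokens;
    std::istringstream stream(arguments);
    for (std::string token; stream >> token; ) {
        tokens.push_back(token);
    }
    return tokens;
}

// Usage sketch with the two `dir` examples.
inline void classify_demo()
{
    // "C:/B" stays one token and is a path.
    const std::vector<std::string> joined = split_on_spaces("C:/B");
    assert(joined.size() == 1);
    assert(classify(joined[0]) == TokenKind::path);

    // "C: /B" is a path followed by an option.
    const std::vector<std::string> separated = split_on_spaces("C: /B");
    assert(separated.size() == 2);
    assert(classify(separated[0]) == TokenKind::path);
    assert(classify(separated[1]) == TokenKind::option);
}
```

Under this rule a bare `/B` is always an option, so a file named B in the root of the current drive would need some prefix to be addressed as a path.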

>
> And although UTF-8 is quite common nowadays it raises several problems
> in certain situations. E.g. database fields with restricted length
> accept different string lengths depending on the number of characters
> with longer UTF-8 encoding used. No user will ever understand this.

In real world UCS-2 is rarely used. Windows SDK, Java et al are using
UTF-16, which has the same string length problem. And UCS-4 is
definitely wasting space.

> In Chinese and several other "non-ASCII" languages the UTF-8 encoding is
> furthermore less compact that UCS2.

I have heard this claim is not supported by actual data. In real usage
Chinese is most often heavily interspersed with punctuation, numbers and
English keywords so that there is little or no benefit in using UCS2
over UTF-8.
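
The arithmetic behind this can be sketched as follows. The sketch makes no claim about real-world text; it only counts bytes per code point, treating UCS-2 as identical to UTF-16 for characters in the Basic Multilingual Plane. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode one Unicode code point in UTF-8.
constexpr std::size_t utf8_bytes(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: 2 inside the Basic Multilingual Plane,
// 4 (a surrogate pair) above it.
constexpr std::size_t utf16_bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
};

// Sums the encoded size of every code point in the text.
EncodedSizes encoded_sizes(const std::u32string& text)
{
    EncodedSizes total{0, 0};
    for (const char32_t cp : text) {
        total.utf8 += utf8_bytes(cp);
        total.utf16 += utf16_bytes(cp);
    }
    return total;
}

// Usage sketch: pure CJK favours UTF-16, ASCII-heavy text favours UTF-8.
inline void encoded_sizes_demo()
{
    const EncodedSizes cjk = encoded_sizes(U"\u4E2D\u6587");     // two CJK characters
    assert(cjk.utf8 == 6 && cjk.utf16 == 4);

    const EncodedSizes mixed = encoded_sizes(U"id=42 \u4E2D");   // six ASCII, one CJK
    assert(mixed.utf8 == 9 && mixed.utf16 == 14);
}
```

So each CJK character costs one extra byte in UTF-8, while each ASCII character saves one; which encoding ends up smaller depends on the mix.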

> And well the fact that on Unix-like OSes file names are just binary
> blobs rather than a string with a known encoding raises further problems.

This has actually helped to standardize them all to UTF-8 in practice.

Alf P. Steinbach

Jan 30, 2022, 8:02:37 AM
I guess you're talking about using standard C++ code with some system
specific knowledge such as how to specify the standard input stream as a
file name.

To reopen the standard input so that it connects to the original source,
which might be a pipe or a file or the console, one needs to (1) identify
that source, and (2) open that source, possibly after closing the
original connection. Neither is feasible in general, even (AFAIK) in
Unix environment. But one might attempt to sidestep (1) by using a
filename that in the relevant OS denotes standard input.

A demonstration of an approach that fails is not a proof that no
solution exists, e.g. failure to open a door doesn't prove that the door
is stuck. Maybe the person just failed to consider using the doorknob,
dragged the door instead of pushing, failed to note that it opens
sideways, didn't swipe the id card, deliberately made it look as if the
door didn't open, or something. But anyway I cooked up some code:


#include <stdlib.h> // EXIT_...
#include <stdio.h>

#include <stdexcept> // std::runtime_error, std::exception
#include <string> // std::string
using namespace std;

#ifdef PORTABLE
constexpr bool portable = true;
#else
constexpr bool portable = false;
#endif

auto fail( const string& s ) -> bool { throw runtime_error( s ); }

void cpp_main()
{
    if constexpr( portable ) {
        freopen( nullptr, "rb", stdin ) or fail( "freopen(stdin) failed" );
        freopen( nullptr, "wb", stdout ) or fail( "freopen(stdout) failed" );
    } else {
        freopen( "conin$", "rb", stdin ) or fail( "freopen(stdin) failed" );
        freopen( "conout$", "wb", stdout ) or fail( "freopen(stdout) failed" );
    }
}

auto main() -> int
{
    fprintf( stderr, "%s\n", (portable? "Portable code" : "Windows-specific code") );
    try {
        cpp_main();
        return EXIT_SUCCESS;
    } catch( const exception& x ) {
        fprintf( stderr, "!%s\n", x.what() );
    }
    return EXIT_FAILURE;
}


It fails both as portable code and as Windows-specific code (here
"cl" is the Visual C++ compiler):


[C:\root\temp]
> cl x.cpp
x.cpp

[C:\root\temp]
> x
Windows-specific code

[C:\root\temp]
> echo blah | x
Windows-specific code
The process tried to write to a nonexistent pipe.

[C:\root\temp]
> cl x.cpp /D PORTABLE
x.cpp

[C:\root\temp]
> x
Portable code
!freopen(stdin) failed


So.

- Alf

Richard Damon

Jan 30, 2022, 8:59:18 AM
On 1/30/22 7:26 AM, Paavo Helde wrote:

> The slash can be still used as an option, there is no need to change
> that. One just needs to start to demand to separate options by spaces,
> something what a sane person has always done anyway, so that
>
> dir C:/B
>
> would list the file B in the root folder and
>
> dir C: /B
>
> would list all files in cwd of C: in bare format.
>

and what should dir /B do?

Both path and options are optional.


When it was decided to use / as the options separator, it essentially
forced the use of some other character for the directory separator in
that context.

In hindsight, maybe it was the wrong choice, but that was a LONG time ago.

But remember, at that point in time filenames were 8.3 ASCII only, with
limited special characters. Using - in the file name as a 'word'
separator wasn't uncommon, so - was an allowed filename character, so
not good for the option leadin character, that would make dir -B
ambiguous, so using / made some sense.

Ben Bacarisse

Jan 30, 2022, 9:46:12 AM
"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:

> On 30 Jan 2022 13:12, Ben Bacarisse wrote:
>> "Alf P. Steinbach" <alf.p.s...@gmail.com> writes:
>>
>>> My opinion is that C (and hence also C++) text mode is an abomination
>>> that should never have been introduced, and that in addition, given
>>> that it was introduced, it's designed in a stupid way with the data
>>> conversion applied underneath the buffer level so that one can't get a
>>> clear view of the raw data.
>>>
>>> As an example, the design means that Unix `cat` can't be faithfully
>>> implemented in Windows using only standard C or C++, which IMO is
>>> extreme.
>> I assume you are talking about the cases where cat defaults to reading
>> stdin and/or writing stdout, If so, it could be argued that it's not the
>> fault of the C and C++ standards, but more the fault of the
>> implementations not providing a useful freopen function.
>> But then maybe freopen simply can't be implemented in Windows for some
>> mysterious reason I don't get.
>
> I guess you're talking about using standard C++ code with some system
> specific knowledge such as how to specify the standard input stream as
> a file name.

No. I was just talking about using freopen.

> To reopen the standard input so that it connects to the original
> source, which might a pipe or a file or the console, one needs to (1)
> identify that source, and (2) open that source, possibly after closing
> the original connection.

I don't see why you need to do that. freopen (were it fully supported)
would allow a program to change the mode of stdin and stdout without
knowing anything about the sources.

<snip code>
> Portable code
> !freopen(stdin) failed
>
> So.

Not sure what the "so" is about. You appeared to be blaming the C and
C++ standard for not allowing cat to be written on Windows. I blame the
Windows standard library for not supporting freopen.

--
Ben.

Paavo Helde

Jan 30, 2022, 11:21:46 AM
30.01.2022 15:59 Richard Damon kirjutas:
> On 1/30/22 7:26 AM, Paavo Helde wrote:
>
>> The slash can be still used as an option, there is no need to change
>> that. One just needs to start to demand to separate options by spaces,
>> something what a sane person has always done anyway, so that
>>
>> dir C:/B
>>
>> would list the file B in the root folder and
>>
>> dir C: /B
>>
>> would list all files in cwd of C: in bare format.
>>
>
> and what should dir /B do?

In 'dir /B', /B would be an option. To specify a file, one would need to
use 'dir C:/B' for example. That's the same as needing to use special
syntax like 'ls ./-filename' in Unix for filenames starting with a dash.

>
> Both path and options are optional.
>
>
> When it was decided to use / as the options separator, it essentially
> forced the use of some other character for the directory separator in
> that context.

IIRC, / as the option separator came from an OS (CP/M?) which did not
support subdirectories, so there was no problem at that point.

>
> In hindsight, maybe it was the wrong choice, but that was a LONG time ago.
>
> But remember, at that point in time filenames were 8.3 ASCII only, with
> limited special characters. Using - in the file name as a 'word'
> separator wasn't uncommon, so - was an allowed filename character, so
> not good for the option leadin character, that would make dir -B
> ambiguous, so using / made some sense.

Even though '-' is allowed in file names, I believe there has never been
many people wanting to *start* their filenames with it.

Also, '-' is allowed in Unix file names as well, and they can somehow
cope with this, see the ./-filename example above.

Of course it is hard to foresee all future implications of current
decisions. But when it becomes clear some decision was wrong, it should
be corrected, with some gradual switch-over period if needed. For
example, the Windows Explorer and standard file dialogs could easily
have a regime to display file paths with forward slashes. This would
make my life a bit easier, for copy-pasting file paths from-to Cygwin
bash terminal or program code. I also keep hearing Windows has a Linux
inside nowadays, I am sure it is also using forward slashes.

Richard Damon

Jan 30, 2022, 12:56:15 PM
On 1/30/22 11:21 AM, Paavo Helde wrote:
> 30.01.2022 15:59 Richard Damon kirjutas:
>> On 1/30/22 7:26 AM, Paavo Helde wrote:
>>
>>> The slash can be still used as an option, there is no need to change
>>> that. One just needs to start to demand to separate options by
>>> spaces, something what a sane person has always done anyway, so that
>>>
>>> dir C:/B
>>>
>>> would list the file B in the root folder and
>>>
>>> dir C: /B
>>>
>>> would list all files in cwd of C: in bare format.
>>>
>>
>> and what should dir /B do?
>
> In 'dir /B', /B would be an option. To specify a file, one would need to
> use 'dir C:/B' for example. That's the same as needing to use special
> syntax like 'ls ./-filename' in Unix for filenames starting with a dash.

But that isn't the same!!!

It might need to be dir A:/B or dir B:/B or D:/D, the omission of the
drive is an intentional feature.
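
A minimal sketch of the rule proposed in the quoted text, in C++, may make the disagreement concrete. Assume each space-separated token is classified on its own: a token that starts with `/` is an option, anything else is a path. `ArgKind` and `classify` are hypothetical names for illustration only; this is not how cmd.exe actually parses its command line.

```cpp
#include <string>

// Hypothetical classification of one space-separated token of a
// "dir"-style command line under the proposed rule.
enum class ArgKind { Option, Path };

ArgKind classify(const std::string& token) {
    // A leading '/' marks an option; everything else is treated as a path.
    if (!token.empty() && token[0] == '/') {
        return ArgKind::Option;
    }
    return ArgKind::Path;
}
```

Under this rule `C:/B` is a path and `/B` is an option, so a root-relative path with no drive letter cannot be written directly, which is the objection raised here.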

>
>>
>> Both path and options are optional.
>>
>>
>> When it was decided to use / as the options separator, it essentially
>> forced the use of some other character for the directory separator in
>> that context.
>
> IIRC, / as the option separator came from an OS (CP/M?) which did not
> support subdirectories, so there was no problem at that point.

Right, and backwards compatibility can be a bitch.

I thought CP/M did eventually add sub-directories as file systems got
bigger, they just weren't that needed initially on the small 8" floppy
with only 1/4 MB of storage.

Also, when CP/M started, Unix was a small niche OS so compatibility with
it wasn't that important, at that time there were LOTS of different and
incompatible systems in use.

>
>>
>> In hindsight, maybe it was the wrong choice, but that was a LONG time
>> ago.
>>
>> But remember, at that point in time filenames were 8.3 ASCII only,
>> with limited special characters. Using - in the file name as a 'word'
>> separator wasn't uncommon, so - was an allowed filename character, so
>> not good for the option leadin character, that would make dir -B
>> ambiguous, so using / made some sense.
>
> Even though '-' is allowed in file names, I believe there have never been
> many people wanting to *start* their filenames with it.

It was a common character to put things at the top of the alphabetical
sort list. -README.TXT was at one point a very common file name.

>
> Also, '-' is allowed in Unix file names as well, and they can somehow
> cope with this, see the ./-filename example above.

Yes, and ./ always works to indicate a file in the current directory
(but does normally kill path lookups if applicable for the file). The
problem is that adding C: as a prefix doesn't always work, and definitely
wouldn't on the early systems, where it would more likely have been A:
(but might have been B:).

>
> Of course it is hard to foresee all future implications of current
> decisions. But when it becomes clear some decision was wrong, it should
> be corrected, with some gradual switch-over period if needed. For
> example, the Windows Explorer and standard file dialogs could easily
> have a regime to display file paths with forward slashes. This would
> make my life a bit easier, for copy-pasting file paths from-to Cygwin
> bash terminal or program code. I also keep hearing Windows has a Linux
> inside nowadays, I am sure it is also using forward slashes.

Note that the operating system itself supports using either / or \ as the
path separator, and I thought there was even a system option (just not
set by default) to change the option character from / to -, so you CAN
do what you want, it just isn't that way by default.
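
For the copy-paste problem mentioned in the quoted text, a small helper can do the conversion in program code. This is only a sketch: `to_forward_slashes` is a hypothetical name, and it assumes an ordinary drive-letter or relative path. Paths with the `\\?\` prefix are, as far as I know, passed through without separator normalization, so they should not be converted this way.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: return a copy of a Windows-style path with every
// backslash replaced by a forward slash.
std::string to_forward_slashes(std::string path) {
    std::replace(path.begin(), path.end(), '\\', '/');
    return path;
}
```

For example, `to_forward_slashes("C:\\root\\temp\\x.cpp")` yields `C:/root/temp/x.cpp`.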

James Kuyper

Jan 30, 2022, 2:02:42 PM
No, freopen doesn't require a file name, or any of those other things.

"If filename is a null pointer, the freopen function attempts to change
the mode of the stream to that specified by mode , as if the name of the
file currently associated with the stream had been used. It is
implementation-defined which changes of mode are permitted (if any), and
under what circumstances." (C standard, 7.21.5.4p3).

If the implementation you're using fails to support changing text mode
to binary mode, that's hardly the fault of the C standard, that's up to
the implementation.

James Kuyper

Jan 30, 2022, 2:17:16 PM
On 1/29/22 17:50, Andrey Tarasevich wrote:
> On 1/29/2022 9:19 AM, James Kuyper wrote:
[Note: I accidentally posted that message to comp.lang.c rather than
comp.lang.c++, breaking the thread. Andrey's response was also on that
newsgroup]
>> On 1/29/22 07:01, David Brown wrote:
>> ...
>>> What I have learned from all this, is that "мир" ("mir") means "world"
>>> as well as "peace". I find that an interesting linguistic titbit.
>>
>> Which means that if a Russian leader says "Я хочу мир.", it's ambiguous
>> whether he's saying he wants peace, or he wants the world.
>
> Not really.
>
> The subject for "хотеть" can be placed in Accusative or Genitive case,
> depending on the nature of the subject. This is the case for many words
> that can be used to refer to a specific object or to some sort of
> non-quantifiable concept or resource ("peace", "water", "fish" etc.).
>
> "I want peace" would normally require "мир" in Genitive case, which is
> "мира": "Я хочу мира".
>
> If you attempt to use "Я хочу мир" in this meaning, people would
> normally understand you properly (given enough context), but you'll
> still be perceived as an "untrained" speaker (e.g. child, uneducated,
> non-native, etc.)
>
> "I want [the] world" would take "мир" in Accusative case, which is still
> "мир": "Я хочу мир".
>
> And this latter version still doesn't sound right, since literary norms
> usually require an extra verb specifying/qualifying the desire. Not "Я
> хочу мир", but "Я хочу владеть миром" ("I want to own the world") or "Я
> хочу править миром" ("I want to rule the world") etc.

I studied Russian for two years, about four decades ago, and never made
any practical use of the language. My Russian skills are sufficiently
rusty that I actually used translate.google.com to double-check before
posting. It translated "I want peace." and "I want the world." the same
way, which is a hint why translate.google.com shouldn't be used for
serious translation efforts.
However, I vaguely remember asking my Russian teacher about this same
point when I first learned about this word, and I don't remember her
answer mentioning any of the details that you just mentioned. I think
she just said something about being able to figure out the intended
meaning from context.

> I think that "I want [the] world" has the same feel in English as well.
> It doesn't sound quite right without extra specification.



Alf P. Steinbach

Jan 30, 2022, 3:53:43 PM
On 30 Jan 2022 15:45, Ben Bacarisse wrote:
>
> Not sure what the "so" is about. You appeared to be blaming the C and
> C++ standard for not allowing cat to be written on Windows. I blame the
> Windows standard library for not supporting freopen.

No, I used it as an example of a consequence of the unreasonable design.

With a more sane design with access to the raw bytes, the problem
wouldn't have existed, regardless of buggy or perfect `freopen`.

Still you have a point in that I gave that example based on just old
memories. So now I added code to actually copy input to output (getchar
+ putchar). Still fails in Windows with both Visual C++ and MinGW g++
for both portable and Windows-specific `freopen` args, but works in WSL
Ubuntu for portable:


[C:\root\temp]
> bash
alf@Alf-Windows-PC:/mnt/c/root/temp$ rm *.exe
alf@Alf-Windows-PC:/mnt/c/root/temp$ g++ -std++17 x.cpp
g++: error: unrecognized command line option ‘-std++17’; did you mean
‘-std=c++17’?
alf@Alf-Windows-PC:/mnt/c/root/temp$ g++ -std=c++17 x.cpp
alf@Alf-Windows-PC:/mnt/c/root/temp$ echo blah | ./a.out
Windows-specific code
!freopen(stdin) failed
alf@Alf-Windows-PC:/mnt/c/root/temp$ g++ -std=c++17 x.cpp -D PORTABLE
alf@Alf-Windows-PC:/mnt/c/root/temp$ echo blah | ./a.out
Portable code
blah


That's sort of a Pyrrhic victory for the standard library design, that
this thing that could save the day in Windows or at least be claimed to
be the intended way to do things, works in Ubuntu where it's not needed.

Anyway. :-)


- Alf [too lazy to fix the Ubuntu prompt]

Ben Bacarisse

Jan 30, 2022, 4:00:47 PM
"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:

> On 30 Jan 2022 15:45, Ben Bacarisse wrote:
>> Not sure what the "so" is about. You appeared to be blaming the C and
>> C++ standard for not allowing cat to be written on Windows. I blame the
>> Windows standard library for not supporting freopen.
>
> No, I used it as an example of a consequence of the unreasonable
> design.

Yes, and my point was that the inability to write "cat" on Windows is
not just down to the CR/LF translation design. A mechanism was added to
mitigate some of the problems, but it's not been taken up. There is
blame to share around.

--
Ben.

Andrey Tarasevich

Jan 30, 2022, 11:17:44 PM
On 1/30/2022 11:17 AM, James Kuyper wrote:
> However, I vaguely remember asking my Russian teacher about this same
> point when I first learned about this word, and I don't remember her
> answer mentioning any of the details that you just mentioned. I think
> she just said something about being able to figure out the intended
> meaning from context.

Depends on how advanced your classes were and how "old-school" your
teacher was. This is one of those intricate topics where, firstly, the
choice between Accusative and Genitive is not always clearly defined.
And secondly, modern users often tend to simplify the usage and just opt
for Accusative in all cases.

See for example

http://www.abc-russian.com/2016/09/or.html

or section "Russian Genitive Case: Abstract Or Indefinite Objects" here:
https://storylearning.com/learn/russian/russian-tips/russian-genitive-case

Mut...@dastardlyhq.com

Jan 31, 2022, 5:23:40 AM
On Sun, 30 Jan 2022 12:12:33 +0000
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:
>
>> My opinion is that C (and hence also C++) text mode is an abomination
>> that should never have been introduced, and that in addition, given
>> that it was introduced, it's designed in a stupid way with the data
>> conversion applied underneath the buffer level so that one can't get a
>> clear view of the raw data.
>>
>> As an example, the design means that Unix `cat` can't be faithfully
>> implemented in Windows using only standard C or C++, which IMO is
>> extreme.
>
>I assume you are talking about the cases where cat defaults to reading
>stdin and/or writing stdout, If so, it could be argued that it's not the
>fault of the C and C++ standards, but more the fault of the
>implementations not providing a useful freopen function.

I see no reason why cat would need freopen because it probably uses low level
I/O anyway and doesn't care where its stdin is coming from if it has to read
from it. Other utilities (e.g. ls) OTOH will use isatty() on stdout to see if
it's connected to a terminal or a pipe/file and act accordingly (ls only formats
its output if it's going to a terminal).
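
A minimal sketch of that check, assuming a POSIX system: `isatty()` and `fileno()` come from POSIX, not from standard C or C++, and `is_terminal` is a hypothetical wrapper name.

```cpp
#include <stdio.h>    // FILE, fileno (POSIX extension to <stdio.h>)
#include <unistd.h>   // isatty (POSIX, not standard C++)

// True if the stream is connected to a terminal, false for a pipe or file.
bool is_terminal(FILE* stream) {
    return isatty(fileno(stream)) != 0;
}
```

A utility in the style described would call `is_terminal(stdout)` once at startup and pick column formatting only when it returns true.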

Manfred

Jan 31, 2022, 10:46:05 AM
On 1/30/2022 9:18 AM, Alf P. Steinbach wrote:
> In some cases the fiction about what's allowed is imperfectly
> implemented. I remember in the 1990's (when I still worked) I had some
> fun demonstrating to colleagues how to completely and utterly hide some
> data on disk, using commands like

Nice trick!

Scott Lurndal

Jan 31, 2022, 11:11:40 AM
Mut...@dastardlyhq.com writes:
>On Sun, 30 Jan 2022 12:12:33 +0000
>Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:
>>
>>> My opinion is that C (and hence also C++) text mode is an abomination
>>> that should never have been introduced, and that in addition, given
>>> that it was introduced, it's designed in a stupid way with the data
>>> conversion applied underneath the buffer level so that one can't get a
>>> clear view of the raw data.
>>>
>>> As an example, the design means that Unix `cat` can't be faithfully
>>> implemented in Windows using only standard C or C++, which IMO is
>>> extreme.
>>
>>I assume you are talking about the cases where cat defaults to reading
>>stdin and/or writing stdout, If so, it could be argued that it's not the
>>fault of the C and C++ standards, but more the fault of the
>>implementations not providing a useful freopen function.
>
>I see no reason why cat would need freopen because it probably uses low level
>I/O anyway and doesn't care where its stdin is coming from if it has to read
>from it.

cat(1) uses read(2)/write(2) unless -v is specified, in which case it uses
getc(3).

int
cat(fi, fname)
    FILE *fi;
    char *fname;
{
    register int fi_desc;
    register int nitems;

    fi_desc = fileno(fi);

    /*
     * While not end of file, copy blocks to stdout.
     */

    while ((nitems=read(fi_desc,buffer,BUFSIZ)) > 0) {
        if ((errnbr = write(1,buffer,(unsigned)nitems)) != nitems) {

...

vcat(fi)
    FILE *fi;
{
    register int c;

    while ((c = getc(fi)) != EOF)
    {
        /*
         * For non-printable and non-cntrl chars, use the "M-x" notation.
         */
        if (!ISPRINT(c, wp) &&
            !iscntrl(c) && !ISSET2(c) && !ISSET3(c))
        {
            putchar('M');
            putchar('-');
            c-= 0200;
        }
...
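
For readers unfamiliar with the "M-x" notation, here is a rough modern C++ sketch of the per-byte mapping. It is not a translation of the excerpt above: it ignores the locale-dependent `ISPRINT`/`ISSET2`/`ISSET3` checks and assumes plain ASCII rules, with newline and tab passed through unchanged. `visible` is a hypothetical name.

```cpp
#include <string>

// Render one byte the way a "cat -v"-style filter might, under ASCII rules:
// bytes with the high bit set get an "M-" prefix and lose that bit,
// control characters become ^X, and DEL becomes ^?.
std::string visible(unsigned char c) {
    if (c == '\n' || c == '\t') {
        return std::string(1, static_cast<char>(c));  // passed through as-is
    }
    std::string out;
    if (c >= 0x80) {
        out += "M-";
        c = static_cast<unsigned char>(c - 0x80);  // same as c -= 0200
    }
    if (c < 0x20) {
        out += '^';
        out += static_cast<char>(c + 0x40);  // 0x01 -> 'A', and so on
    } else if (c == 0x7f) {
        out += "^?";
    } else {
        out += static_cast<char>(c);
    }
    return out;
}
```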

Ben Bacarisse

Jan 31, 2022, 11:13:13 AM
Mut...@dastardlyhq.com writes:

> On Sun, 30 Jan 2022 12:12:33 +0000
> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:
>>
>>> My opinion is that C (and hence also C++) text mode is an abomination
>>> that should never have been introduced, and that in addition, given
>>> that it was introduced, it's designed in a stupid way with the data
>>> conversion applied underneath the buffer level so that one can't get a
>>> clear view of the raw data.
>>>
>>> As an example, the design means that Unix `cat` can't be faithfully
>>> implemented in Windows using only standard C or C++, which IMO is
>>> extreme.
>>
>>I assume you are talking about the cases where cat defaults to reading
>>stdin and/or writing stdout, If so, it could be argued that it's not the
>>fault of the C and C++ standards, but more the fault of the
>>implementations not providing a useful freopen function.
>
> I see no reason why cat would need freopen because it probably uses low level
> I/O anyway and doesn't care where its stdin is coming from if it has to read
> from it.

"Low level I/O" is not part of standard C, nor (as far as I know)
standard C++. The issue was: "`cat` can't be faithfully implemented in
Windows using only standard C or C++".

--
Ben.

Mut...@dastardlyhq.com

Jan 31, 2022, 11:37:16 AM
On Mon, 31 Jan 2022 16:11:23 GMT
sc...@slp53.sl.home (Scott Lurndal) wrote:
>Mut...@dastardlyhq.com writes:
>>On Sun, 30 Jan 2022 12:12:33 +0000
>>Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>>"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:
>>>
>>>> My opinion is that C (and hence also C++) text mode is an abomination
>>>> that should never have been introduced, and that in addition, given
>>>> that it was introduced, it's designed in a stupid way with the data
>>>> conversion applied underneath the buffer level so that one can't get a
>>>> clear view of the raw data.
>>>>
>>>> As an example, the design means that Unix `cat` can't be faithfully
>>>> implemented in Windows using only standard C or C++, which IMO is
>>>> extreme.
>>>
>>>I assume you are talking about the cases where cat defaults to reading
>>>stdin and/or writing stdout, If so, it could be argued that it's not the
>>>fault of the C and C++ standards, but more the fault of the
>>>implementations not providing a useful freopen function.
>>
>>I see no reason why cat would need freopen because it probably uses low level
>>I/O anyway and doesn't care where its stdin is coming from if it has to read
>>from it.
>
>cat(1) uses read(2)/write(2)

As I suspected. You don't want the overhead of the higher level I/O functions
for system utilities.

>unless -v is specified, in which case it uses
>getc(3).
>
>int
>cat(fi, fname)
>FILE *fi;
>char *fname;
>{

It's been a while since I've seen K&R style C in the wild.


Mut...@dastardlyhq.com

Jan 31, 2022, 11:39:57 AM
On Mon, 31 Jan 2022 16:12:58 +0000
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>Mut...@dastardlyhq.com writes:
>
>> On Sun, 30 Jan 2022 12:12:33 +0000
>> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>>"Alf P. Steinbach" <alf.p.s...@gmail.com> writes:
>>>
>>>> My opinion is that C (and hence also C++) text mode is an abomination
>>>> that should never have been introduced, and that in addition, given
>>>> that it was introduced, it's designed in a stupid way with the data
>>>> conversion applied underneath the buffer level so that one can't get a
>>>> clear view of the raw data.
>>>>
>>>> As an example, the design means that Unix `cat` can't be faithfully
>>>> implemented in Windows using only standard C or C++, which IMO is
>>>> extreme.
>>>
>>>I assume you are talking about the cases where cat defaults to reading
>>>stdin and/or writing stdout, If so, it could be argued that it's not the
>>>fault of the C and C++ standards, but more the fault of the
>>>implementations not providing a useful freopen function.
>>
>> I see no reason why cat would need freopen because it probably uses low level
>
>> I/O anyway and doesn't care where its stdin is coming from if it has to read
>> from it.
>
>"Low level I/O" is not part of standard C, nor (as far as I know)

So? Very little is part of standard C if you want to be pedantic and all *nix's
implement open(), read(), write() etc. Unless you thought I meant something
else by low level.

>standard C++. The issue was: "`cat` can't be faithfully implemented in
>Windows using only standard C or C++".

I imagine a number of unix utilities are difficult or impossible to implement
properly on Windows.

Andrey Tarasevich

Jan 31, 2022, 1:10:55 PM
On 1/30/2022 12:18 AM, Alf P. Steinbach wrote:
>
> This is just a bug in cmd.exe where it fails to check that the file name
> is "allowed" for ordinary users, so one is able to specify an internal
> NTFS stream. :-)
>

But... How is it a bug? Command-line stream access syntax is documented
by Microsoft.

Or do you think that "ordinary users" should not be allowed to access
NTFS streams? If so, why?

Paavo Helde

Jan 31, 2022, 1:20:28 PM
31.01.2022 18:11 Scott Lurndal kirjutas:
> int
> cat(fi, fname)
> FILE *fi;
> char *fname;
> {
> register int fi_desc;
> register int nitems;

'register' has been removed from the C++ language and K&R declarations
have never been part of it. Just sayin... ;-)

Alf P. Steinbach

Jan 31, 2022, 1:26:01 PM
The forward slash support at the API level is also documented by
Microsoft. <url:
https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilea#parameters>

But the Windows user interface in general does not permit forward slashes
as path component separators, and in general it does not permit naming
of NTFS streams.


[C:\root\temp]
> type poem.txt:secret
The filename, directory name, or volume label syntax is incorrect.


Which is the reason why I didn't use the `type` command for display.

Since this is inconsistent with the redirection operators in the same
command interpreter, there is necessarily a bug /somewhere/. It could be
that all the places that refuse such names are the ones that are buggy,
and the redirection operators are the ones that have correct filename
checking. Or it could be that the two single instances of allowing this
pattern are the buggy ones, and all the rest correct, as per intent.

You're right, however, that when I wrote that it was the redirection
operators, "this", I couldn't know that with more than 99.999983%
confidence, but on the third hand, adding weasel language just for that
very remote possibility would IMHO be absurd. ;-)

- Alf

Ben Bacarisse

Jan 31, 2022, 6:48:21 PM
Why are you asking me? Ask Alf why he thinks it matters! He brought it
up, I just brought up the standard C "solution".

--
Ben.

Vir Campestris

Feb 7, 2022, 4:06:34 PM
On 30/01/2022 17:56, Richard Damon wrote:
> Right, and backwards compatibility can be a bitch.
>
> I thought CP/M did eventually add sub-directories as file systems got
> bigger, they just weren't that needed initially on the small 8" floppy
> with only 1/4 MB of storage.
>
> Also, when CP/M started, Unix was a small niche OS so compatibility with
> it wasn't that important, at that time there were LOTS of different and
> incompatible systems in use.

I don't think CP/M ever supported directories.

It evolved into Concurrent DOS, and that supported MS's FAT filesystem.
It may have supported others, but it's a long time ago...

Wiki tells me it evolved further after I went in a different direction.

Andy