I've done a micro-review of Boost.Process; here are the small
issues I've found:
- stream buffer implementation:
underflow()
You need to check whether errno is EINTR:
returning -1 with EINTR is not a real error, and thus
it should retry.
sync()
Same as in underflow() - check for EINTR and retry. But there
is also the problem that you do not check that you have fully
written the data.
For example, when I write out << std::flush, sync() is called
and I expect all the data to be written to the device; so
if the return value is less than the size of the buffer you should
retry and write again until the buffer is empty, you get an error, or EOF.
- Windows and Unicode.
You are using CreateProcessA. I would recommend always using the
wide API and converting narrow strings to wide, similarly to what
boost::filesystem::v3 does: for example, when the global
locale has a UTF-8 facet, you would convert narrow strings to wide
and run that.
Notes:
1. You can also always assume that strings under Windows are UTF-8
and always convert them to wide strings before system calls.
This is, I think, the better approach, but it is different from
what most of Boost does.
2. I do not recommend adding a wide API - it makes the code much
uglier; rather, convert narrow strings to wide strings before the
system call.
- It would be a very good addition to implement full support for putback.
Additional point:
-----------------
I've noticed that you planned asynchronous notification and so on,
but I think it is quite important to add a feature that provides
the ability to wait for multiple processes to terminate,
with a timeout.
It can be done using sigtimedwait/sigwait with a signal handler
assigned for SIGCHLD.
Artyom
P.S.: Good luck with the review; the library looks very nice overall.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I'm not sure if the Boost.Process code base has been updated, but a
change in Boost.System in v1.44 means that occurrences of
boost::system::system_category
need to be replaced by
boost::system::system_category()
because the argument type to boost::system::system_error changed in
v1.44.
Just something I found when we upgraded to v1.45 recently.
Erik
[...]
> Notes:
>
> 1. You can also always assume that strings under windows are UTF-8
> and always convert them to wide string before system calls.
>
> This is I think better approach, but it is different from what
> most of boost does.
[...]
An interesting thought... I developed a set of ASCII/UTF-8/16/32
classes for my company not too long ago, and I became fairly familiar
with the UTF-8 encoding scheme. There was only one issue that stopped
me from assuming that all std::string values are UTF-8-encoded: what if
the string *isn't* meant to be UTF-8 encoded, and contains characters
with the high bit set?
There's nothing technically stopping that from happening, and there's
no way to determine with complete certainty whether even a string that
seems to be valid UTF-8 was intended that way, or whether the UTF-8-like
characters are really meant as their high-ASCII values.
Maybe you know something I don't, that would allow me to change it? I
hope so, it would simplify some of the code greatly.
--
Chad Nelson
Oak Circle Software, Inc.
I've wanted to talk about this for a loooooong time,
but never got around to it.
-------------------------------------------------
Proposal Summary:
=================
- We need to treat std::string and char const * as
UTF-8 strings on Windows and drop support for the
so-called ANSI API.
- Optional but recommended:
Deprecate wide strings as an unportable API.
Basics:
========
There is a big difference in handling Unicode between the Windows
and POSIX platform APIs. It can be summarized as follows:
OS                Modern Unix     Modern Windows
------------------------------------------------------------
char string:      UTF-8           Obsolete ANSI codepage (like 1251)
wchar_t string:   UTF-32          UTF-16
OS Native API:    char            wchar_t
Common encoding:  UTF-8           UTF-16
Unicode Support   Modern Unix     Modern Windows
------------------------------------------------------------
char API          Full Unicode    Not supported
wchar_t API       Does not exist  Full
Bottom line:
You can't open or delete a file in a cross-platform way!
Suggestion:
===========
Char Strings
------------
- Under POSIX platforms:
Treat them as byte sequences in the current locale,
and by default assume they are UTF-8, because:
a) The default locale on most OSs is a UTF-8 locale
b) The POSIX API does not care about encodings, so
even if the locale is not UTF-8 you can still
do everything right
- Under the Windows platform:
a) Treat them as UTF-8 strings; convert them to
UTF-16 just before accessing system services.
b) Never use the ANSI API, always use the Wide API. It is
the default internal encoding anyway.
Wide Strings:
-------------
- Deprecate them, unless you have something tied
to the Windows system API.
a) They are not portable: no OS (except Windows)
uses wide strings in its API.
b) They are not well defined: they may be UTF-16 or UTF-32.
For more details read:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
What problems would this solve for us?
======================================
1. All standard APIs support Unicode naturally, as they
are supposed to.
- Want to open boost::filesystem::fstream?
- Want to pass parameters to other process?
- Want to display message?
- Want to read XML or JSON?
All work with Unicode by default because:
a) It is Unicode by default on Unix
b) They are mapped to the wide API on
Windows.
2. Portable programs no longer need to worry about
setting standard locale facets, etc.
Programs become much more portable.
3. Fewer bugs related to Unicode handling.
Artyom
----- Original Message ----
> Chad Nelson <chad.thec...@gmail.com>
Most platforms have a notion of a 'default' encoding. On Linux, this is
usually UTF-8 but isn't guaranteed to be. On Windows this is the active
code page (i.e. *not* UTF-8) for char and UCS-2 for wchar_t.
The safest approach (and the one taken by the STL and Boost) is to assume
the strings are in the OS's default encoding unless explicitly known to be
otherwise. This means you can pass these strings around freely without
worrying about their encoding because, eventually, they get passed to an OS
call which knows how to handle them.
Alternatively, if you need to manipulate the string you can use the OS's
character conversion functions to take your default-encoding string,
convert it to something specific, manipulate the result and then convert it
back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte
with the CP_ACP flag.
HTH
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
Hi Artyom,
> [...]I've noticed that you planned asynchronous notification and so on
> but I think it is quite important to add feature that provide
> an ability to wait for multiple processes to terminate
> and have timeout.
thanks for your micro-review! I'll comment on your notes over the weekend.
But which version of Boost.Process did you use? I ask because meanwhile we
have added support for asynchronous operations in Boost.Process. You can
download the latest version from
http://www.highscore.de/boost/gsoc2010/process.zip and find the
documentation at http://www.highscore.de/boost/gsoc2010/.
Boris
By googling, the latest version I found was from 04/2009. Guess I can use
this new one, then.
Cheers,
Greg
On Thu, Jan 13, 2011 at 3:24 PM, Boris Schaeling <bo...@highscore.de> wrote:
> On Thu, 13 Jan 2011 15:35:53 +0100, Artyom <arty...@yahoo.com> wrote:
>
> Hi Artyom,
>
> [...]I've noticed that you planned asynchronous notification and so on
>>
>> but I think it is quite important to add feature that provide
>> an ability to wait for multiple processes to terminate
>> and have timeout.
>>
>
> thanks for your micro review! I'll comment on your notes on the weekend.
> But which version of Boost.Process did you use? I wonder as meanwhile we
> have support for asynchronous operations in Boost.Process. You can download
> the latest version from http://www.highscore.de/boost/gsoc2010/process.zip and find the documentation at
Two problems with this approach:
- Even if the encoding under POSIX platforms is not UTF-8, you will
still be able to open files, close them, stat them and do any
other operations regardless of encoding, as the POSIX API is
encoding-agnostic; this is why it works well.
- Under Windows, on the other hand, you CANNOT do everything with narrow
strings. For example, you can't create the file "שלום-سلام-pease-Мир.txt"
using the char * API. And this has very bad consequences.
> This means you can pass these strings around freely without
> worrying about their encoding because, eventually, they get passed to an OS
> call which knows how to handle them.
You can't under Windows... the "ANSI" API is limited.
>
> Alternatively, if you need to manipulate the string you can use the OS's
> character conversion functions to take your default-encoding string,
> convert it to something specific, manipulate the result and then convert it
> back. On Windows you would use MultibyteToWideChar/WideCharToMultibyte
> with the CP_ACP flag.
>
The CP_ACP flag can never be 65001 (UTF-8), so basically you are stuck with
the same problem.
> HTH
>
> Alex
>
>
See my mail with wider description.
Artyom
On Thu, Jan 13, 2011 at 8:21 PM, Artyom <arty...@yahoo.com> wrote:
> Hello All,
>
> I wanted to talk about it for a loooooong time.
> however never got there.
>
> -------------------------------------------------
>
>
> Proposal Summary:
> ===================
>
> - We need to treat std::string, char const * as
> UTF-8 strings on Windows and drop a support of
> so called ANSI API.
>
> - Optuional but recommended:
>
> Deprecate wide strings as unportable API.
Fully agree. Two years ago I would very probably have been advocating
some kind of TCHAR/wxChar/QChar/whatever-like character type
switching, but since then I've spent a lot of time developing portable
GUI applications and found out the hard way that it is better
to dump all the ANSI CPXXXX / UTF-XY encodings, stick
to UTF-8, and defer the conversion to whatever the native API
uses until you make the actual call.
a) UTF-16 is OK in principle, but many implementations are not:
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
b) UTF-32 is basically a waste of memory for most localizations.
>
[snip]
>
> Suggestion:
> ===========
>
> Char Strings
> ------------
>
> - Under POSIX platform:
>
> Treat them as byte sequences with current locale,
> by default assume that they are UTF-8 as:
>
> a) Default Locale on most OSs is UTF-8 locale
> b) POSIX API does not care about encodings
> Even if the locale is not UTF-8 you still
> can do anything right as
>
> - Under Windows platform:
>
> a) Treat them as UTF-8 strings, convert them to
> UTF-16 just before accessing system services.
> b) Never use ANSI API always use Wide API. It is
> anyway default internal encoding.
>
>
> Wide String:
> ------------
>
> - Deprecate them, unless you have something tied
> to Windows system API.
+1, IMO having two APIs that are not seamlessly interchangeable
in the code (at least not without macro trickery) is useless.
[snip]
>
> What problem this would solve for us?
> =====================================
>
> 1. All standard API support Unicode naturally as it
> supposed to be.
>
> - Want to open boost::filesystem::fstream?
> - Want to pass parameters to other process?
> - Want to display message?
> - Want to read XML or JSON?
>
> All works with Unicode by default because:
>
> a) It is Unicode by default on Unix
> b) Because they are mapped to wide API on
> Windows.
>
> 2. Portable program should no longer worry about
> setting standard locale facets, etc.
>
> The program becomes much more portable.
>
> 3. Fewer bugs related to Unicode handling.
>
> Artyom
>
+1, but from my experience it is easier said than done.
My knowledge of Unicode and UTF-8 is little more than
superficial and I haven't done a lot of char-by-char manipulation,
but to do what you are proposing we need at least some
straightforward (and efficient) way to convert the native
strings to the required encoding at the call site.
I'm not trying to nitpick on anyone's implementation of
a Unicode library here, but having to instantiate ~10
transcoding-related classes just to call ShellExecuteW
is not my idea of straightforward. :)
[snip]
BR, Matus
This isn't a problem, right? This is exactly why it _does_ work :D Assume
the strings are in OS-default encoding, don't mess with them, hand them to
the OS API which knows how to treat them.
> - Under Windows, on the other hand you CAN NOT do everything with narrow
> strings. For example you can't create file "שלום-سلام-pease-Мир.txt"
> using char * API. And this has very bad consequences.
This is indeed true. I was just describing the situation where the string
came from the result of one call and was being passed around. If you want
to manipulate the strings, things become more tricky.
> > This means you can pass these strings around freely without
> > worrying about their encoding because, eventually, they get passed to an OS
> > call which knows how to handle them.
>
> You can't under Windows... "ANSI" API is limited.
You've missed where I said "pass these strings around". I'm not suggesting
you can change them. But you can take a narrow string returned by an OS
call and pass it to another OS call without any problems.
> > Alternatively, if you need to manipulate the string you can use the OS's
> > character conversion functions to take your default-encoding string,
> > convert it to something specific, manipulate the result and then convert it
> > back. On Windows you would use MultibyteToWideChar/WideCharToMultibyte
> > with the CP_ACP flag.
I omitted one important caveat here: if you manipulate the string once
you've converted it to UTF-16, you may not be able to convert it back to
the default encoding losslessly. For example, as in your string above, if
you take the original string in Arabic, up-convert it and append a Russian
word, you can't blindly convert this back, as the default encoding may not
be able to represent these two character sets simultaneously.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
> > On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote:
...
> > Two problems with this approach:
> >
> > - Even if the encoding under POSIX platforms is not UTF-8 you will
> > be still able to open files, close them, stat on them and do any
> > other operations regardless encoding as POSIX API is encoding
> > agnostic, this is why it works well.
>
> This isn't a problem, right? This is exactly why it _does_ work :D
> Assume
> the strings are in OS-default encoding, don't mess with them, hand them to
> the OS API which knows how to treat them.
It doesn't always work. On Mac OS X, the paths must be UTF-8; the OS isn't
encoding-agnostic, because the HFS+ file system stores file names as UTF-16
(much like NTFS). You can achieve something similar on Linux by mounting a
HFS+ or NTFS file system; the encoding is then specified at mount time and
should also be observed. Of course, file systems that store file names as
arbitrary null-terminated byte sequences are typically encoding-agnostic.
For my own code, I've gradually reached the conclusion that I should always
use UTF-8 encoded narrow paths. This may not be feasible for a library (yet)
because people still insist on using other encodings on Unix-like OSes,
usually koi8-r. :-) I'm anxiously awaiting the day everyone in the
Linux/Unix world will finally switch to UTF-8 so we can be done with this
question once and for all.
I'm not an expert, so take this with a grain of salt. But couldn't it
just as easily be said that UTF-8 is a waste of CPU? There are a
number of operations that are constant-time if you can assume a fixed
size for a character but that, I would think, would have to be linear
for UTF-8 - for example, accessing the Nth character.
-1
I'm opposed to this strategy simply because it differs from the way
existing libraries treat narrow strings. Not least the STL. If you open
an fstream with a narrow filename, for instance, this isn't treated as a
UTF-8 string. It's treated as being in the local codepage.
What the Visual Studio implementation of the STL actually does is pretty
much the same as how Boost.Filesystem v3 treats paths:
It uses mbstowcs_s to convert the narrow string to the wchar_t form and
then uses _wfsopen to open the file. Importantly, mbstowcs_s treats the
narrow string as being in the local codepage which on Windows _won't_ be
UTF-8. If you tried to open an fstream by handing it a UTF-8 encoded
string, you would end up with severe problems.
For shits and giggles I tried to open a std::fstream with
"שלום-سلام-pease-Мир.txt" as the filename. What it ends up doing is
creating a file called "×©×œ×•× -سلام-pease-Мир.txt"!
While this behaviour isn't great, it is standard. I don't think we should
make boost produce UTF-8 narrow string on Windows. A programmer would
expect to be able to take such a string and pass it to STL functions. As
you can see, that wouldn't work.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
Presumably, Mac OS X returns paths in the same encoding as it expects to
receive them? So just passing them around and eventually back to the OS
will always work regardless of encoding?
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
Yes, in principle, but:
- you rarely, if ever, need to access the Nth character;
- waste of space is also a waste of CPU due to more cache misses;
- UTF-8 has the nice property that you can do things with a string without
even decoding the characters; for example, you can sort UTF-8 strings as-is,
or split them on a specific (7 bit) character, such as '.' or '/'.
Typically, UTF/UCS-32 is only needed as an intermediate representation in
very few places, the rest of the strings can happily stay UTF-8.
Yes, it does (because it's UTF-8). It doesn't work on Windows in general -
if the file name contains characters that can't be represented in the
default code page, they are replaced by something else, typically '?',
sometimes the character without the acute mark. Either way, the name can no
longer be used to refer to the original file.
It differs from them because it's right, and existing libraries are wrong.
Unfortunately, they'll continue being wrong for a long time, because of this
same argument.
> Alexander Lamaison wrote:
>> Presumably, Mac OS X returns paths in the same encoding as it expects to
>> receive them? So just passing them around and eventually back to the OS
>> will always work regardless of encoding?
>
> Yes, it does (because it's UTF-8). It doesn't work on Windows in general -
> if the file name contains characters that can't be represented in the
> default code page, they are replaced by something else, typically '?',
> sometimes the character without the acute mark. Either way, the name can no
> longer be used to refer to the original file.
Only if you modify the string! Windows can't give you a narrow string in
the first place that it can't accept back. Even if you up-convert it to
something like UTF-16 but don't modify it, you should always be able to
down-convert back to the default codepage if you didn't modify the string.
Actually it has already happened:
1. All modern Linux distributions come with UTF-8 locales by default
2. FreeBSD uses UTF-8 locales by default
3. OpenSolaris uses UTF-8 locales by default
4. Mac OS X uses UTF-8 locales by default
Of course users can define other locales, but that is another story.
Artyom
Does the "right" strategy come with some policies/practices that can
allow it to coexist with the existing "wrong" libraries? If so, I'm
all +1 on it.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
First of all, neither in C++03 nor in C++0x can you
open a file stream with a wide file name. MSVC provides a
non-standard extension, but it does not exist in other compilers
like GCC/MinGW.
So using C++ you can't open a file called "שלום-سلام-pease-Мир.txt"
under Microsoft Windows.
You can use an OS-level API like _wfopen to do the job using a wide
string. But you can't do this in C++. Period.
The idea is the following:
1. Provide replacements for system libraries that actually
use text and relate to it as text in some encoding.
For the STL and the standard C library this would be the filesystem API,
so you need to provide something like boost::filesystem::fstream.
2. Make all Boost libraries use the Wide API only and never call the ANSI API.
3. Treat narrow strings as UTF-8 and convert them to wide prior to system calls.
>
> While this behaviour isn't great, it is standard.
>
If the standard is bad and leads to unportable, platform-incompatible
code, it should not be used!
You can always provide a fallback like boost::utf8_to_locale_encoding
if you have to use ANSI API.
But generally you should just use something like boost::utf8_to_utf16
and always call Wide API.
You must not use ANSI API under Windows.
Artyom
Unfortunately not. A library that requires its input paths to be UTF-8
always gets bug reports from users who are accustomed to using another
encoding for their narrow strings. There is plenty of precedent they can use
to justify their complaint.
Or maybe even boost::utf8_to<TCHAR/wxChar/QChar/...>()
>
> You must not use ANSI API under Windows.
>
> Artyom
>
>
Matus
It can, and it does. You can have a file whose name can't be represented as
a narrow string in the current code page. If you use an "ANSI" function to
get its name, you can only receive an approximation of its real name. If you
use the "wide" function, you get its real name.
I don't see the problem you cited as an answer to my question. Let me
try asking it differently: how do I program in an environment that has
both "right" and "wrong" libraries?
Also, is there any use in trying to get the difference into the type
system, e.g. by using some kind of wrapper over std::string that gives
it a distinct "utf-8" type?
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
The situation is abysmal, I grant you that.
> The idea is following:
>
> 1. Provide replacement for system libraries that actually
> use text and relate to it as text in some encoding.
>
> For STL and standard C library it would be filesystem API.
>
> So you need to provide something like boost::filesystem::fstream
+1. Done already, I believe :)
> 2. Make all boost libraries use Wide API only and never call ANSI API.
+1
> 3. Treat narrow strings as UTF-8 and convert then to wide prior system calls.
This is the part I have a problem with: interpreting it as UTF-8 _by
default_. Unless the programmer reads the docs really well, they would
most likely expect to be able to take a narrow string as returned by other
libraries and pass it straight to Boost libraries without first having to
convert it.
Boost.Filesystem v3 allows you to specify that the incoming string is UTF-8
encoded but doesn't _default_ to that. Is this insufficient?
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
> Alexander Lamaison wrote:
>> Windows can't give you a narrow string in the first place that it can't
>> accept back.
>
> It can, and it does. You can have a file whose name can't be represented as
> a narrow string in the current code page. If you use an "ANSI" function to
> get its name, you can only receive an approximation of its real name. If you
> use the "wide" function, you get its real name.
Ok, I see what you mean. The way I was looking at it, the string Windows
gave you was wrong, but correctly encoded :P
I would love to see something like this because as things stand it is far
too easy to forget that narrow strings (and wide strings for that matter)
aren't all alike and often need converting even when the character width
doesn't change.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
GetProcessId(HANDLE) is used on Windows in boost::process::process; this
has a minimum system requirement of Windows XP with SP1. This should at
least be noted in the documentation.
Jeff
"The basic principle to remember is: The position of characters in
the Unicode code charts does not specify their sorting weight."
-- http://unicode.org/reports/tr10/#Introduction
Any application that requires you to present a sorted list of
strings to a user pretty much requires a collation algorithm; in that
sense, the usefulness of the above mentioned property of UTF-8 is
limited.
Again, sorry if I'm stating the obvious here. I've had to bring up
that argument in character encoding related discussions more than
once, and it's become a bit of a knee-jerk response by now ;)
For the application discussed, i.e. for passing strings to OS APIs,
this really doesn't matter, though. Where it does matter slightly is when
deciding whether or not to use UTF-8 internally in your application.
The UCA maps code points to collation elements, or strings into lists
of collation elements, and then binary-sorts those collation element
lists instead of the original strings. My guess would be that using
UCS/UTF-32 for that is likely to be cheaper, though I haven't actually
run any comparisons here. If anyone has, I'd love to know.
All of this is mostly an aside, I guess :)
Jens
--
1.21 Jiggabytes of memory should be enough for anybody.
> Please excuse me if I'm stating the obvious, but I feel I should mention
> that binary sorting is not collation.
Yes, you're right. Sorting (lexicographically) UTF-8 strings as sequences
of 8-bit unsigned integers gives the same result as sorting their UCS-32
equivalents as sequences of 32-bit unsigned integers.
There's really no good answer to that; it's, basically, a mess. You could
use UTF-8 everywhere in your code, pass that to "right" libraries as-is, and
only pass wchar_t[] to "wrong" libraries and the OS. This doesn't work when
the "wrong" libraries or the OS don't have a wide API though. And there's no
standard way of being wrong; some libraries use the OS narrow API, some
convert to wchar_t[] internally and use the wide API, using a variety of
encodings - the OS default (and there can be more than one), the C locale,
the C++ locale, or a global encoding that can be set per-library. It's even
more fun when supposedly portable libraries use different decoding
strategies depending on the platform.
> Also, is there any use in trying to get the difference into the type
> system, e.g. by using some kind of wrapper over std::string that gives it
> a distinct "utf-8" type?
This could help; a hybrid right+wrong library probably ought to be able to
take either utf8_string or non_utf8_string, with the latter using
who-knows-what encoding. :-)
The "bite the bullet" solution is just to demand "right" libraries and use
UTF-8 throughout.
OK, thanks. Consider me +1 on whatever you recommend.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
John,
As I understand it, the choice is between UTF-8 and UTF-16, since UTF-32
is a waste of memory. Given that, there is never a fixed size for a
character, and hence no constant-time access - both UTF-8 and UTF-16 are
variable-length encodings of UTF-32.
Alexander Churanov
Yes, my comment was in response to a comment about UTF-32 as
pertaining to an internal encoding. I'd only use UTF-16 if the APIs I
used required it, and the conversion could be done at the interface
(for example in a facade). What interests me is if there's a good
reason to use UTF-8 internally and give UTF-32 the same treatment as
UTF-16, or vice versa. I do find the simplicity of a fixed-width
encoding alluring.
By the way, I disagree with Peter's assessment that, "you rarely, if
ever, need to access the Nth character," but I will gladly cede that
this depends on your problem domain.
It obviously depends on the problem domain :-) but, when talking about
Unicode, you can't reliably access the Nth character, in general, even with
UCS-32. (As far as I know.)
IIUC you can't assume a fixed size for a character even with UTF-32. In
UTF-32 only _code points_ have fixed size, yet one character may be
composed of several code points, e.g. a Latin letter followed by a
diacritical mark, making up one character
(http://en.wikipedia.org/wiki/Combining_character).
Best regards,
Robert
I stand corrected. This sort of the thing is the reason I start with
disclaimers like, "I'm not an expert, so take this with a grain of
salt."
Anyhow, thanks for the info.
Patrick
No. The nth code point is 4n bytes from the beginning of the string,
but characters may be made of a combination of adjacent code points.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
Patrick
No,
the Nth Unicode code point is at the nth position, not a character.
For example, the word "שָלוֹם" has 4 characters, "שָ", "ל", "וֹ", "ם", and 6
code points: ש ָ ל ו ֹ מ
where two of the code points are diacritic marks.
Boost.Locale has a special character iterator to handle characters for
this purpose, and it works on characters, not code points.
See:
http://cppcms.sourceforge.net/boost_locale/html/tutorial.html#8e296a067a37563370ded05f5a3bf3ec
Artyom
Combining old libraries with new ones:
======================================
It would be simple to combine a library that
uses old policies with new ones.
namespace boost {
    std::string  utf8_to_ansi(std::string const &s);
    std::string  ansi_to_utf8(std::string const &s);
    std::wstring utf8_to_wide(std::string const &s);
    std::string  wide_to_utf8(std::wstring const &s);
}
- If it supports wide strings, call boost::utf8_to_wide
**under the Windows platform** and nothing is lost.
- If it supports only narrow strings:
a) If it is encoding-agnostic, like some unit test
that only opens files with ASCII names,
then you can safely ignore the issue and pass UTF-8 strings
as ASCII and ASCII as UTF-8, since ASCII is a subset of it.
b) Otherwise:
1. File a bug with the library owner about not supporting
Unicode strings under Windows.
2. Use utf8_to_ansi/ansi_to_utf8 to pass strings
to this library under Windows.
Current State of Using Wide/ANSI API in Boost:
==============================================
I did a small search to find which libraries use which API:
Following use both types of API:
-------------------------------
thread
asio
system
iostreams
regex
filesystem
According to the new policy they should replace
the ANSI API with the wide API, converting between UTF-8 and UTF-16.
Following libraries use only ANSI API
--------------------------------------
interprocess
spirit
test
random
They should replace their ANSI API with the wide one,
with simple utf8_to_wide/wide_to_utf8 glue.
Following libraries use STL functions that are not Unicode-aware under Windows
------------------------------------------------------------------------------
std::fstream
- Serialization
- Graph
- wave
- datetime
- property_tree
- program_options
fopen
- gil
- spirit
- python
- regex
Need to replace with something like:
boost::fstream
and
boost::fopen
that work with UTF-8 under windows.
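The idea behind such a boost::fopen can be sketched in a few lines (names are mine, not a real Boost API): on Windows widen the UTF-8 name and call _wfopen, on POSIX pass the bytes through untouched.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical sketch of a fopen that accepts UTF-8 file names on every
// platform. Error handling for the conversion is omitted.
std::FILE *utf8_fopen(char const *name, char const *mode)
{
#ifdef _WIN32
    // Widen the UTF-8 name and mode, then use the wide CRT call.
    wchar_t wname[MAX_PATH], wmode[32];
    MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, MAX_PATH);
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 32);
    return _wfopen(wname, wmode);
#else
    // POSIX: the name is a null-terminated byte sequence, passed as-is.
    return std::fopen(name, mode);
#endif
}
```

This is essentially what Artyom's "nowide" library (mentioned below in the thread) does.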
The rest of the libraries seem to be encoding-agnostic.
I used the version from there:
http://www.boost.org/community/review_schedule.html
boost::filesystem::fstream uses a wide string under Windows afaik (assuming
it can detect that you're using an STL implementation which has wide-string
overloads -- aka Dinkumware). However there's still the problem that if
you're using MinGW (or some other non-MSVC toolset that doesn't use a recent
Dinkumware STL implementation) then it will drop back to a narrow string and
we're back where we started again...
> It would be simple to combine a library that
> uses old policies with new ones.
>
> namespace boost {
> std::string utf8_to_ansi(std::string const&s);
> std::string ansi_to_utf8(std::string const&s);
> std::wstring utf8_to_wide(std::string const&s);
> std::string wide_to_utf8(std::wstring const&s);
> }
ANSI doesn't really mean much.
It's purely a windows thing.
utf8_to_locale, which would take a std::locale object, would make more
sense.
> boost::filesystem::fstream uses a wide string under Windows afaik (assuming
> it can detect that you're using an STL implementation which has wide-string
> overloads -- aka Dinkumware). However there's still the problem that if
> you're using MinGW (or some other non-MSVC toolset that doesn't use a recent
> Dinkumware STL implementation) then it will drop back to a narrow string and
> we're back where we started again...
Yes, I know this; that is why a boost::fstream written
over C stdio.h should be provided.
I once wrote a small "nowide" library that does this and calls
_wfopen under Windows (available in MinGW), and I actually use it in
CppCMS's booster library, which makes my life much simpler.
> >
> > namespace boost {
> > std::string utf8_to_ansi(std::string const&s);
> > std::string ansi_to_utf8(std::string const&s);
> > std::wstring utf8_to_wide(std::string const&s);
> > std::string wide_to_utf8(std::wstring const&s);
> > }
>
> ANSI doesn't really mean much.
> It's purely a windows thing.
>
> utf8_to_locale, which would take a std::locale object, would make more
> sense.
>
1. std::locale-based conversion using a std::codecvt facet strongly depends
on the current implementation, and that is a bad point to start from.
2. These utf8_to_ansi and the reverse should not be used outside the Windows
scope, where ANSI means the narrow Windows API (a.k.a. the ANSI API).
3. Under non-Windows platforms it shouldn't do anything to strings and should
pass them as-is, since the native POSIX API is narrow, not wide.
Artyom
It is "reasonably reasonable" to assume the wide character locale is
UTF-16 or UTF-32.
Some IBM mainframes are the only ones where this is not the case as far
as I know.
Therefore you can portably convert a locale to UTF-8 by using
std::codecvt<char, wchar_t> to convert it to UTF-16 or UTF-32,
converting that UTF-16 to UTF-32 if needed, then convert it back to UTF-8.
That's, of course, not exactly very efficient, especially when you're
unable to pipeline those conversions.
> 2. These utf8_to_ansi and backwards should not be used outside windows scope,
> where ANSI means
> narrow windows API (a.k.a. ANSI API)
Good code is code that doesn't expose platform-specific details.
The name ANSI is so bad (it means American National Standards Institute,
even though Windows locales have nothing to do with that body) that I'd
rather not put that in any function I'd use in real code.
> 3. Under non-Windows platforms it shouldn't do anything to strings and
> should pass them as-is, since
>
> the native POSIX API is narrow, not wide.
Yet you still need to convert between UTF-8 and the POSIX locales.
Even if most recent POSIX systems use UTF-8 as their locale, there is no
guarantee of that.
Indeed, quite a few still run in latin-1.
> > as is as
> >
> > native POSIX api is narrow and not wide.
>
> Yet you still need to convert between UTF-8 and the POSIX locales.
> Even if most recent POSIX systems use UTF-8 as their locale, there is no
> guarantee of that.
> Indeed, quite a few still run in latin-1.
>
No, you don't need to convert UTF-8 to the "locale's" encoding, as char* is the
native system API, unlike on Windows. So you don't need to mess around with
encodings at all unless you deal with text-related stuff like, for example,
collation.
The **only** problem is the badly designed Windows API that makes it
impossible to write cross-platform code.
So the idea is that when we are on Windows we treat "char *" as UTF-8 and
call the wide API after converting from UTF-8.
There is no problem with this.
As long as all libraries use the same policy there would be no issues
with using Unicode any more.
The problem is not locales, encodings or other stuff; the problem
is that the Windows API does not allow you to use "char *"-based
strings fully, as it does not support UTF-8, and platform-independent
programming becomes a total mess.
Artyom
> At Fri, 14 Jan 2011 17:50:02 +0200,
> Peter Dimov wrote:
>>
>> Unfortunately not. A library that requires its input paths to be
>> UTF-8 always gets bug reports from users who are accustomed to using
>> another encoding for their narrow strings. There is plenty of
>> precedent they can use to justify their complaint.
>
> I don't see the problem you cited as an answer to my question. Let me
> try asking it differently: how do I program in an environment that has
> both "right" and "wrong" libraries?
>
> Also, is there any use in trying to get the difference into the type
> system, e.g. by using some kind of wrapper over std::string that gives
> it a distinct "utf-8" type?
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning
one type to another automatically converts it to the target type during
the copy. (Converting to ascii_t will throw an exception if a resulting
character won't fit into eight bits.)
Each type has an internal storage type as well, based on the character
size (ascii_t and utf8_t use std::string, utf16_t uses 16-bit
characters, etc). You can access the internal storage type using
operator* or operator->. For a utf8_t variable 'v', for example, *v
gives you the UTF-8-encoded string.
An std::string is assumed to be ASCII-encoded. If you really do have
UTF-8-encoded data to get into the system, you either assign it to a
utf8_t using operator*, or use a static function utf8_t::precoded.
std::wstring is assumed to be utf16_t- or utf32_t-encoded already,
depending on the underlying character width for the OS.
A function is simply declared with parameters of the type that it
needs. You can call it with whichever type you've got, and it will be
auto-converted to the needed type during the call, so for the most part
you can ignore the different types and use whichever one makes the most
sense for your application. I use utf8_t as the main internal string
type for my programs.
For portable OS-interface functions, there's a typedef (os::native_t)
to the type that the OS's API functions need. For Linux-based systems,
it's utf8_t; for Windows, utf16_t. There's also a typedef
(os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
I'm not sure there's a need for that.
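To give a flavour of the design described above, here is my own minimal sketch, not Chad's actual code (class and member names are illustrative): each encoding is a distinct type, a converting constructor re-encodes on assignment, and the narrowing conversion throws when a character won't fit.

```cpp
#include <stdexcept>
#include <string>

// Thrown when a code point cannot be represented in the target type.
struct will_not_fit : std::runtime_error {
    will_not_fit() : std::runtime_error("character will not fit") {}
};

struct utf32_t {
    std::u32string data;
    explicit utf32_t(std::u32string s) : data(std::move(s)) {}
    std::u32string const &operator*() const { return data; }
};

struct ascii_t {
    std::string data;
    // Converting constructor: copying a utf32_t into an ascii_t
    // re-encodes, throwing if a code point exceeds eight bits.
    ascii_t(utf32_t const &u)
    {
        for (char32_t cp : *u) {
            if (cp > 0xFF)
                throw will_not_fit();
            data += static_cast<char>(cp);
        }
    }
    std::string const &operator*() const { return data; }
};
```

The real toolkit adds utf8_t/utf16_t with full re-encoding between all four types, but the type-safety idea is the same: the encoding lives in the type, so the compiler selects the right conversion.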
There are some parts of the code that could use polishing, but I like
the overall design, and I'm finding it pretty easy to work with. Anyone
interested in seeing the code?
--
Chad Nelson
Oak Circle Software, Inc.
>> Yet you still need to convert between UTF-8 and the POSIX locales.
>> Even if most recent POSIX systems use UTF-8 as their locale, there is no
>> guarantee of that.
>> Indeed, quite a few still run in latin-1.
>>
>
>
> No, you don't need to convert UTF-8 to the "locale's" encoding, as char* is
> the native system API, unlike on Windows. So you don't need to mess around
> with encodings at all unless you deal with text-related stuff like, for
> example, collation.
I'm not sure I follow. If you pass a UTF-8 encoded string to a POSIX OS
that uses a non-UTF character set, how is the OS meant to interpret that?
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
> On Fri, 14 Jan 2011 10:59:09 -0500
> Dave Abrahams <da...@boostpro.com> wrote:
>
>> Also, is there any use in trying to get the difference into the type
>> system, e.g. by using some kind of wrapper over std::string that gives
>> it a distinct "utf-8" type?
>
> The system I'm now using for my programs might interest you.
>
> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
... snip
> There are some parts of the code that could use polishing, but I like
> the overall design, and I'm finding it pretty easy to work with. Anyone
> interested in seeing the code?
Yes please! This sounds roughly like the solution I'd been imagining
where, for instance, boost::filesystem::path has string and wstring
constructors that work as they do now but also has path(utf8_string)
constructors that must be called like this:
std::string system_encoded_text = some_non_utf8_aware_library_call();
filesystem::path utf8_file_path(boost::utf8_string(system_encoded_text));
The utf8_string class would do the conversion from the system encoding.
By a null-terminated byte sequence I mean that if your locale is UTF-8
and there is a file with the name "\xFF\xFF.txt", which is clearly not UTF-8,
you can open it, remove it, and do almost anything with it.
It is locale-agnostic (unless it is a very specific language-related API like
strcoll).
Artyom
> On Sat, 15 Jan 2011 10:08:22 -0500, Chad Nelson wrote:
>
>> On Fri, 14 Jan 2011 10:59:09 -0500
>> Dave Abrahams <da...@boostpro.com> wrote:
>>
>>> Also, is there any use in trying to get the difference into the type
>>> system, e.g. by using some kind of wrapper over std::string that
>>> gives it a distinct "utf-8" type?
>>
>> The system I'm now using for my programs might interest you.
>>
>> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
>
> ... snip
>
>> There are some parts of the code that could use polishing, but I like
>> the overall design, and I'm finding it pretty easy to work with.
>> Anyone interested in seeing the code?
>
> Yes please!
http://www.oakcircle.com/toolkit.html
I've released it under the Boost license, so anyone may use it as they
wish.
I think one part of os.cpp uses a class from the library that I didn't
include, but it's minor and can easily be replaced with one of the
Boost.Random classes instead. Everything else should work stand-alone.
It's pretty well documented, but ask me if you have any questions.
> This sounds roughly like the solution I'd been imagining where, for
> instance, boost::filesystem::path has string and wstring contructors
> that work as they do now but also has path(utf8_string) constructors
> that must be called like this:
>
> std::string system_encoded_text = some_non_utf8_aware_library_call();
> filesystem::path
> utf8_file_path(boost::utf8_string(system_encoded_text));
>
> The utf8_string class would do the conversion from the system
> encoding.
That's how I designed it. :-)
> No, you don't need to convert UTF-8 to the "locale's" encoding, as char* is
> the native system API, unlike on Windows. So you don't need to mess around
> with encodings at all unless you deal with text-related stuff like, for
> example, collation.
POSIX system calls expect the text they receive as char* to be encoded
in the current character locale.
To write cross-platform code, you need to convert your UTF-8 input to
the locale encoding when calling system calls, and convert text you
receive from those system calls from the locale encoding to UTF-8.
(Note: this is exactly what gtkmm::ustring does)
Windows is exactly the same, except it's got two sets of locales and two
sets of system calls.
The wide character locale is more interesting since it is always UTF-16,
so the conversion you have to do is only between UTF-8 and UTF-16, which
is easy and lossless.
Likewise, you could also choose to use UTF-16 or UTF-32 as your internal
representation rather than UTF-8. The choice is completely irrelevant
with regard to providing a uniformly encoded interface regardless of
platform.
> The problem is not locales, encodings or other stuff, the problem
> is that Windows API does not allow you to use "char *" based
> string fully as it does not support UTF-8
The actual locale used by the user is irrelevant.
Again, as I said earlier, the fact that UTF-8 is the most common locale
on Linux but is not available on Windows shouldn't affect the way the
system works.
A lot of Linux systems use a Latin-1 locale, and your approach will
simply fail on those systems.
> and platform independent
> programming becomes total mess.
So your technique for writing independent code is relying on the user to
use an UTF-8 locale?
Care to submit it for review?
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
> POSIX system calls expect the text they receive as char* to be encoded in
> the current character locale.
No, POSIX system calls (under most Unix OSes, except on Mac OS X) are
encoding-agnostic, they receive a null-terminated byte sequence (NTBS)
without interpreting it. On Mac OS X, file paths must be UTF-8. Locales are
not considered.
> To write cross-platform code, you need to convert your UTF-8 input to the
> locale encoding when calling system calls, and convert text you receive
> from those system calls from the locale encoding to UTF-8.
This is one possible way to do it (blindly using UTF-8 is another). Strictly
speaking, under an encoding-agnostic file system, you must not convert
anything to anything because this may cause you to irretrievably lose the
original path. For display purposes, of course, you have to pick an encoding
somehow. There is no "current" character locale on Unix, by the way, unless
you count the environment variables. The OS itself doesn't care.
Using the current C locale (LANG=...) allows you to display the file names
the same way the 'ls' command does, whereas using UTF-8 allows your user to
enter file names which are not representable in the LANG locale.
> Windows is exactly the same, except it's got two sets of locales and two
> sets of system calls.
Nope. It doesn't have two sets of locales.
> So your technique for writing independent code is relying on the user to
> use an UTF-8 locale?
More or less. The code itself doesn't depend on the user locale, it always
works, but to see the actual names in a terminal, you need an UTF-8 locale.
This is now the recommended setup on all Unix OSes.
A very nice and useful utility. Anyway, I'll share some comments, just in case you want to hear some. ;-)
"Be warned, if you try to convert a UTF-coded value to ASCII, each decoded
character must fit into an unsigned eight-bit type. If it doesn't, the library
will throw an \c oakcircle::unicode::will_not_fit exception."
I think that exception is not always appropriate. A better solution would be a policy-based class design or additional conversion
function accepting an error policy. This way the user could tell the converter to use some "similarly looking" or "invalid"
character instead of throwing when exact conversion is not possible.
"Note that, like pointers, they can hold a null value as well, created by passing
\c boost::none to the type's constructor or setting it equal to that value."
I don't feel the interface with pointer semantics is the most suitable here. Are there any practical advantages from being able to
have a null string? Even if so, one could use an actual pointer or boost::optional anyway.
Moreover, it would be nice if the proper encoding of the underlying string was the classes' invariant. Currently the classes cannot
guarantee this because they allow for direct access to the value which may be freely changed by the user with no respect to the
encoding.
Best regards,
Robert
If so (and this is what I see in code) ASCII is misleading.
It should be called Latin1/ISO-8859-1 but not ASCII.
> An std::string is assumed to be ASCII-encoded. If you really do have
> UTF-8-encoded data to get into the system, you either assign it to a
> utf8_t using operator*, or use a static function utf8_t::precoded.
> std::wstring is assumed to be utf16_t- or utf32_t-encoded already,
> depending on the underlying character width for the OS.
This is a very bad assumption. To be honest, I've written lots
of code with direct UTF-8 strings in it (the Boost.Locale tests)
and it worked perfectly well with the MSVC, GCC and Intel
compilers (as long as I work with char *, not L"") and this works
fine all the time.
It is a bad assumption; the string should be treated as a byte string,
which may or may not be UTF-8.
There are two cases where we need to care about strings and encodings:
1. We handle human language or text - collation, formatting, etc.
2. We want to access the Windows wide API, which is not locale-agnostic.
>
> For portable OS-interface functions, there's a typedef (os::native_t)
> to the type that the OS's API functions need. For Linux-based systems,
> it's utf8_t; for Windows, utf16_t. There's also a typedef
> (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
> I'm not sure there's a need for that.
>
When you work with Linux and Unix you should not change the encoding at all.
There were discussions about it. For example, the following code:
#include <fstream>
#include <string>
#include <cstdio>
#include <cassert>

int main()
{
    {
        std::ofstream t("\xFF\xFF.txt");
        if(!t) {
            // Not a valid name on this OS - Mac OS X
            return 0;
        }
        t << "test";
        t.close();
    }
    {
        std::ifstream t("\xFF\xFF.txt");
        std::string s;
        t >> s;
        assert(s == "test");
        t.close();
    }
    std::remove("\xFF\xFF.txt");
}
This is valid code and works regardless of the current locale on POSIX
platforms.
Using your API it would fail, as it holds some assumptions about the encoding.
> There are some parts of the code that could use polishing, but I like
> the overall design, and I'm finding it pretty easy to work with. Anyone
> interested in seeing the code?
IMHO, I don't think that inventing new strings or new text
containers is the way to go. std::string is perfectly fine as long
as you code in a consistent way.
Artyom
> [...]- stream buffer implementation:
Thanks, I'll update the code! If you have any ideas what a test case
could look like to verify the fix, let me know (given that there is no
mocking framework in Boost yet).
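For reference, the shape of the fix Artyom describes for sync() might look like this on POSIX (a sketch only - function and variable names are mine, not Boost.Process's):

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Retry on EINTR and keep writing until the whole buffer has been
// flushed or a real error occurs - the two problems noted in the review.
ssize_t write_all(int fd, char const *buf, std::size_t n)
{
    std::size_t written = 0;
    while (written < n) {
        ssize_t r = ::write(fd, buf + written, n - written);
        if (r < 0) {
            if (errno == EINTR)
                continue;   // interrupted by a signal: not an error, retry
            return -1;      // genuine error: report it
        }
        written += static_cast<std::size_t>(r);
    }
    return static_cast<ssize_t>(written);
}
```

The same pattern (check errno == EINTR, retry) applies to the read() call in underflow().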
> - Windows and Unicode.
> [...] 1. You can also always assume that strings under windows are
> UTF-8
> and always convert them to wide string before system calls.
> This is I think better approach, but it is different from what
> most of boost does.
>
> 2. I do not recommend adding wide API - makes the code much uglier,
> rather convert normal strings to wide strings before system calls.
I'd appreciate a Boost-wide solution or guideline. And I think this thread
has already turned into a Unicode discussion? :) The interface of
Boost.Process in that aspect has definitely evolved without clear
direction - I think neither I nor other Boost.Process developers spent
time trying to solve this problem in this library.
> - It may be very good addition to implement full support of putback.
If you have a patch just drop me a mail! :)
> [...]P.S.: Good luck with the review library looks overall very nice.
Thanks! I have some more patches waiting to be applied. There will be
definitely another update (of the implementation only) before the review
starts.
Boris
> At Sun, 16 Jan 2011 09:58:00 -0500,
> Chad Nelson wrote:
>>
>> http://www.oakcircle.com/toolkit.html
>>
>> I've released it under the Boost license, so anyone may use it as
>> they wish.
>
> Care to submit it for review?
Have you looked at the code? ;-) Seriously, I don't think it's anywhere
near Boost-quality. There are at least a few changes I'd want to make
before I'd consider it even marginally ready, and I can't take the time
away from paying work right now to do them.
However, if someone else wants to run with it, I'm willing to donate
what I've got, and help with the work.
>> From: Chad Nelson
>> http://www.oakcircle.com/toolkit.html
>>
>> I've released it under the Boost license, so anyone may use it as
>> they wish.
>
> A very nice and useful utility. Anyway, I'll share some comments, just
> in case you want to hear some. ;-)
I'm always interested in comments -- thanks!
> "Be warned, if you try to convert a UTF-coded value to ASCII, each
> decoded character must fit into an unsigned eight-bit type. If it
> doesn't, the library will throw an \c oakcircle::unicode::will_not_fit
> exception."
>
> I think that exception is not always appropriate. A better solution
> would be a policy-based class design or additional conversion function
> accepting an error policy. This way the user could tell the converter
> to use some "similarly looking" or "invalid" character instead of
> throwing when exact conversion is not possible.
And if I were going to submit it for review, that's exactly what I'd
want too. That code was written solely for my own use, or other
programmers working with my company's code later, despite how the
documentation makes it look.
> "Note that, like pointers, they can hold a null value as well, created
> by passing \c boost::none to the type's constructor or setting it equal
> to that value."
>
> I don't feel the interface with pointer semantics is the most suitable
> here. Are there any practical advantages from being able to have a
> null string?
Nope. That's there solely so that certain functions can use it to return
an error value, using the same semantics as Boost.Optional, without
explicitly wrapping it in a Boost.Optional. If I were going to submit
it for review, I'd probably remove that completely.
> Even if so, one could use an actual pointer or boost::optional anyway.
I did use Boost.Optional at first, but for my code, I found it easier to
built that into the classes.
> Moreover, it would be nice if the proper encoding of the underlying
> string was the classes' invariant. Currently the classes cannot
> guarantee this because they allow for direct access to the value which
> may be freely changed by the user with no respect to the encoding.
As I said, this was written solely for my company's code. I know how to
ensure that changes to the internal data are consistent with the type,
and the design ensures that doing so is awkward enough to make people
scrutinize the code doing it carefully, so a code-review should catch
any problems easily. But again, if I were to submit it to Boost, I'd
likely change that first.
I'd also want to add full string emulation. Right now it only partly
emulates a string, and for any real work you're likely to need to
access the internal data.
>> The system I'm now using for my programs might interest you.
>>
>> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
>> Assigning one type to another automatically converts it to the
>> target type during the copy. (Converting to ascii_t will throw an
>> exception if a resulting character won't fit into eight bits.)
>>
>
> If so (and this is what I see in code) ASCII is misleading.
> It should be called Latin1/ISO-8859-1 but not ASCII.
Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little
awkward to type. ;-) As I've said, this code was written solely for my
company, I'd make a number of changes if I were going to submit it to
Boost.
>> An std::string is assumed to be ASCII-encoded. If you really do
>> have UTF-8-encoded data to get into the system, you either assign it
>> to a utf8_t using operator*, or use a static function
>> utf8_t::precoded. std::wstring is assumed to be utf16_t- or
>> utf32_t-encoded already, depending on the underlying character
>> width for the OS.
>
> This is very bad assumption. To be honest, I've written lots of code
> with direct UTF-8 strings in it (Boost.Locale tests) and this worked
> perfectly well with MSVC, GCC and Intel compilers (as long as I work
> with char * not L"") and this works file all the time.
>
> It is bad assumption, the encoding should be byte string which may be
> UTF-8 or may be not.
But if you assigned that byte string to a utf*_t type, how would you
treat it? I had to either make some assumption, or disallow assigning
from an std::string and char* entirely. And it's just too convenient to
use those assignments, for things like constants, to give that up.
The way I designed it, you're supposed to feed it only ASCII (or
Latin-1, if you prefer) text when you make an assignment that way. If
you have some differently-coded text, you'd feed it in through another
class, one that knows its coding and is designed to decode to UTF-32
the way that utf8_t and utf16_t are, so that the templated conversion
functions know how to handle it.
> There are two cases we need to treat strings and encoding:
>
> 1. We handle human language or text - collation, formatting etc.
> 2. We want to access Windows Wide API that is not locale agnostic.
I'm not sure where you're coming from. Those are two broad categories
of uses for that code, but arguably not the only two.
>> For portable OS-interface functions, there's a typedef
>> (os::native_t) to the type that the OS's API functions need. For
>> Linux-based systems, it's utf8_t; for Windows, utf16_t. There's
>> also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t
>> on Windows, but I'm not sure there's a need for that.
>>
>
> When you work with Linux and Unix at all you should not change
> encoding. There were discussions about it. [...] Using your API it
> would fail as it holds some assumptions on encoding.
Why would you feed "\xFF\xFF.txt" into a utf8_t type, if you didn't
want it encoded as UTF-8? If you have a function that requires some
different encoding, you'd use that encoding instead. For filenames,
you'd treat the strings entered by the user or obtained from the file
system as opaque blocks of bytes.
In any case, all modern Linux OSes use UTF-8 by default, so I haven't
seen any need to worry about other forms yet. I'm not even sure how I'd
tell what code-page a Linux system is set to use, so far I've never
needed to know that. Though if a Russian customer comes along and tells
me my code doesn't work right on his Linux system, I'll re-think that.
>> There are some parts of the code that could use polishing, but I
>> like the overall design, and I'm finding it pretty easy to work
>> with. Anyone interested in seeing the code?
>
> IMHO, I don't think that inventing new strings or new text containers
> is a way to go. std::string is perfectly fine as long as you code in
> consistent way.
I have to respectfully disagree. std::string says nothing about the
encoding of the data within it. If you're using more than one type of
encoding in your program, like Latin-1 and UTF-8, then using
std::strings is like using void pointers -- no type safety, no way to
automate conversions when necessary, and no way to select overloaded
functions based on the encoding. A C++ solution pretty much requires
that they be unique types.
> On Sun, 16 Jan 2011 12:56:23 -0800 (PST)
> Artyom <arty...@yahoo.com> wrote:
>
>>> The system I'm now using for my programs might interest you.
>>>
>>> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
>>> Assigning one type to another automatically converts it to the
>>> target type during the copy. (Converting to ascii_t will throw an
>>> exception if a resulting character won't fit into eight bits.)
>>>
>>
>> If so (and this is what I see in code) ASCII is misleading.
>> It should be called Latin1/ISO-8859-1 but not ASCII.
>
> Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little
> awkward to type. ;-) As I've said, this code was written solely for my
> company, I'd make a number of changes if I were going to submit it to
> Boost.
I'm a little concerned by this talk of ASCII and Latin1. When, say, utf8_t
is given a char* does it not treat it as OS-default encoded rather than
ASCII/Latin1? I've skimmed the code but haven't managed to work out how the
classes treat this case.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
> On Sun, 16 Jan 2011 21:41:25 -0500, Chad Nelson wrote:
>
>>> If so (and this is what I see in code) ASCII is misleading.
>>> It should be called Latin1/ISO-8859-1 but not ASCII.
>>
>> Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a
>> little awkward to type. ;-) As I've said, this code was written
>> solely for my company, I'd make a number of changes if I were going
>> to submit it to Boost.
>
> I'm a little concerned by this talk of ASCII and Latin1. When, say,
> utf8_t is given a char* does it not treat it as OS-default encoded
> rather than ASCII/Latin1? I've skimmed the code but haven't managed to
> work out how the classes treat this case.
Right now, the utf*_t classes assume that any std::string fed directly
into them is meant to be translated as-is. It's assumed to consist of
characters that should be directly encoded as their unsigned values.
That works perfectly for seven-bit ASCII text, but may be problematic
for values with the high-bit set.
I've done some research, and it looks like it would require little
effort to create an os::string_t type that uses the current locale, and
assume all raw std::strings that contain eight-bit values are coded in
that instead.
Design-wise, ascii_t would need to change slightly after this, to throw
on anything that can't fit into a *seven*-bit value, rather than
eight-bit. I'll add the default-character option to both types as well,
and maybe make other improvements as I have time.
With this change, the os::native_t typedef would either be completely
redundant or simply wrong, so I'll remove it.
I should be able to find the time for that sometime this week, if all
goes well.
Artyom, since you seem to have more experience with this stuff than I,
what do you think? Would those alterations take care of your objections?
I'm not sure about the os namespace ;) What about just calling it native_t
like your other class but in the same namespace as utf8_t etc.
> Design-wise, ascii_t would need to change slightly after this, to throw
> on anything that can't fit into a *seven*-bit value, rather than
> eight-bit. I'll add the default-character option to both types as well,
> and maybe make other improvements as I have time.
Sounds good.
> Artyom, since you seem to have more experience with this stuff than I,
> what do you think? Would those alterations take care of your objections?
Also, Artyom's Boost.Locale does very sophisticated encoding conversion but
the unicode conversions done by utf*_t look (scarily?) small. Do they do
as good a job or should these classes make use of the conversions in
Boost.Locale?
Unfortunately this is not the correct approach either.
For example, why do you think it is safe to pass the ASCII subset of UTF-8
to a current non-UTF-8 locale?
For example, Shift-JIS, which is in use in the Windows/ANSI API, has a
different subset in the 0-127 range - it is not ASCII!
Also, if you want to use a std::codecvt facet...
don't rely on them unless you know where they come from!
1. By default they are a no-op - in the default C locale.
2. Under most compilers they are not implemented properly.
OS \ Compiler   MSVC   GCC    SunOS/stlport   SunOS/standard
-------------------------------------------------------------
Windows         ok     none   -               -
Linux           -      ok     ?               ?
Mac OS X        -      none   -               -
FreeBSD         -      none   -               -
Solaris         -      none   buggy!          ok-but-non-standard
Bottom line: don't rely on the "current locale" :-)
>
> Artyom, since you seem to have more experience with this stuff than I,
> what do you think? Would those alterations take care of your objections?
>
The rule of thumb is the following:
- When you handle strings as text storage, just use std::string
- When you do a system call:
  a) on POSIX - pass it as-is
  b) on Windows - convert from UTF-8 and call the wide API
- When handling text as text (i.e. formatting, collation etc.),
  use a good library.
I would strongly recommend reading Pavel Radzivilovsky's answer
on Stack Overflow:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375
And he is a hard-core Windows programmer, designer, architect and
developer - and still he chose UTF-8!
The problem is that the issue is so complicated that there is only
one way to make it both absolutely general and right: decide what
you are working with and stick with it.
In the CppCMS project I work on (and I developed Boost.Locale
because of it) I stick with UTF-8 by default and use plain
std::string - it works like a charm.
Inventing "special unicode strings or storage" does not
improve anybody's understanding of Unicode, nor does it improve
its handling.
Best,
Artyom
>> I've done some research, and it looks like it would require little
>> effort to create an os::string_t type that uses the current locale,
>> and assume all raw std::strings that contain eight-bit values are
>> coded in that instead.
>
> I'm not sure about the os namespace ;) What about just calling it
> native_t like your other class but in the same namespace as utf8_t
> etc.
If os::native_t were still going to be around, I wouldn't want
something that potentially confusing. But since it's going away, I see
no problem with that. I've updated my notes.
>> Artyom, since you seem to have more experience with this stuff than
>> I, what do you think? Would those alterations take care of your
>> objections?
>
> Also, Artyom's Boost.Locale does very sophisticated encoding
> conversion but the unicode conversions done by utf*_t look (scarily?)
> small. Do they do as good a job or should these classes make use of
> the conversions in Boost.Locale?
They should probably use Boost.Locale. I just haven't looked at it yet.
I'll check it out when I get some time to dig into that project again,
likely later this week.
>> Artyom, since you seem to have more experience with this stuff than I,
>> what do you think? Would those alterations take care of your objections?
>>
>
<snip>
> I would strongly recommend to read the answer of Pavel Radzivilovsky
> on Stackoverflow:
>
>
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375
>
I just want to say, as a note of encouragement, that I would
absolutely *love* to see these problems addressed in Boost by people
who really know this domain (and those willing to listen and work with
them). I'm excited by the direction of this thread!
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
> The problem is that the issue is so complicated that there is only
> one way to make it both absolutely general and right: decide what
> you are working with and stick with it.
>
> In the CppCMS project I work on (and I developed Boost.Locale
> because of it) I stick with UTF-8 by default and use plain
> std::string - it works like a charm.
>
> Inventing "special unicode strings or storage" does not
> improve anybody's understanding of Unicode, nor does it improve
> its handling.
I don't understand how it could possibly not help. If I see an API
function call_me(std::string arg), I know next to nothing about what
it's expecting from the string (except that, by convention, it tends to
mean 'string in the OS-default encoding'). If I see
call_me(boost::utf8_t arg), I know *exactly* what it's after. Further,
assuming I know what format my own strings are in, I know how to
provide it with what it expects.
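As a minimal illustration of that type-safety argument, a toy utf8_t
wrapper might look like this (my own sketch - the names and interface
here are illustrative, not those of the actual proposed classes):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// A thin wrapper whose type documents the encoding invariant.
class utf8_t {
public:
    // Explicit construction: raw std::strings can't slip in unnoticed,
    // so the caller is forced to assert "this really is UTF-8".
    explicit utf8_t(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string &bytes() const { return bytes_; }
private:
    std::string bytes_;   // class invariant: holds valid UTF-8
};

// The signature now documents exactly what the function expects.
std::size_t call_me(const utf8_t &arg) { return arg.bytes().size(); }
```

The point is purely the type barrier: a plain std::string no longer
converts silently into the parameter.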
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
>> I've done some research, and it looks like it would require little
>> effort to create an os::string_t type that uses the current locale,
>> and assume all raw std::strings that contain eight-bit values are
>> coded in that instead.
>>
>> Design-wise, ascii_t would need to change slightly after this, to
>> throw on anything that can't fit into a *seven*-bit value, rather
>> than eight-bit. I'll add the default-character option to both types
>> as well, and maybe make other improvements as I have time.
>
> Unfortunately this is not the correct approach either.
>
> For example, why do you think it is safe to pass the ASCII subset of
> UTF-8 to the current non-UTF-8 locale?
>
> For example, Shift-JIS, which is in use in the Windows/ANSI API, has
> a different subset in the 0-127 range - it is not ASCII!
Ah, I wasn't aware that there were character sets that redefined
0..127. That does change things a bit.
> Also if you want to use std::codecvt facet...
> Don't rely on them unless you know where they come from!
>
> 1. By default they are noop - in the default C locale
>
> 2. Under most compilers they are not implemented properly. [...]
I was planning to use MultiByteToWideChar and its opposite under
Windows (which presumably would know how to translate its own code
pages), and mbsrtowcs and its ilk under POSIX systems (which apparently
have been well-implemented for at least seven versions under glibc [1],
though I can't tell whether eglibc -- the fork that Ubuntu uses -- has
the same level of capabilities).
[1]: <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
> Bottom line: don't rely on the "current locale" :-)
I hadn't wanted to add a dependency on ICU or iconv either. Though I may
end up having to for the program I'm currently developing, on at least
some platforms.
> [...] I would strongly recommend to read the answer of Pavel
> Radzivilovsky on Stackoverflow:
>
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375
>
> And he is a hard-core Windows programmer, designer, architect and
> developer - and still he chose UTF-8!
Thanks, I'm familiar with it. In fact, reading that was one of the
reasons that I started developing the utf*_t classes, so that I *could*
keep strings in UTF-8 while still keeping track of the ones that aren't.
> The problem is that the issue is so complicated that there is only
> one way to make it both absolutely general and right: decide what
> you are working with and stick with it.
>
> In the CppCMS project I work on (and I developed Boost.Locale
> because of it) I stick with UTF-8 by default and use plain
> std::string - it works like a charm.
To each his own. :-)
> Inventing "special unicode strings or storage" does not improve
> anybody's understanding of Unicode, nor does it improve its handling.
We'll have to agree to disagree there. The whole point to these classes
was to provide the compiler -- and the programmer using them -- with
some way for the string to carry around information about its encoding,
and allow for automatic conversions between different encodings. If
you're working with strings in multiple encodings, as I have to in one
of the programs we're developing, it frees up a lot of mental stack
space to deal with other issues.
> On Mon, 17 Jan 2011 10:09:13 -0800 (PST), Artyom wrote:
>
>> [...] Inventing "special unicode strings or storage" does not improve
>> anybody's understanding of Unicode, nor does it improve its handling.
>
> I don't understand how it could possibly not help. If I see an api
> function call_me(std::string arg) I know next to nothing about what
> it's expecting from the string (except that by convention it tends to
> mean 'string in OS-default encoding'). If I see call_me(boost::utf8_t
> arg), I know *exactly* what it's after. Further, assuming I know what
> format my own strings are in, I know how to provide it with what it
> expects.
+1. +100. :-) That's exactly what I was aiming for. And as an added
bonus, if you've got a string type that can translate itself to
utf32_t, then it doesn't matter what kind of string the function wants
because the classes can handle the conversions themselves.
However, after looking into the matter further for this discussion, I
see that he does have a valid point about locales and various encodings.
My classes definitely don't handle those well enough yet, and the
program we're currently developing (which I'm not at liberty to discuss
until it's released) will almost certainly need that.
I really wanted to avoid a dependency on the ICU library or anything
similar if at all possible, but it looks like it might be
inevitable. :-(
> [...] I just want to say, as a note of encouragement, that I would
> absolutely *love* to see these problems addressed in Boost by people
> who really know this domain (and those willing to listen and work with
> them). I'm excited by the direction of this thread!
It's starting to look like I'll have to improve the utf*_t classes
anyway, for my company's current project. I'm almost certain that we'll
be happy to donate the results to the Boost library.
> I really wanted to avoid a dependency on the ICU library or anything
> similar if at all possible, but it looks like it might be
> inevitable. :-(
You may well find that you can ;) Artyom's latest work on Boost.Locale
allows you to select from a range of different backends giving varying
levels of locale support. ICU gives the 'best' results but for my project
Swish, for instance, I didn't need any of these advanced features so I just
use the Win32 backend. This uses the Windows API to do the conversions
etc. and freed me from the beast that is ICU.
You should read the documentation of call_me (*). Yes, I know that in the
real world the documentation often doesn't specify an encoding (worse - the
encoding varies between platforms and even versions of the same library),
but if the developer of call_me hasn't bothered to document the encoding of
the argument, he won't bother to use a special UTF-8 type for the argument,
either. :-)
(*) And the documentation should either say that call_me accepts UTF-8, or
that call_me is encoding-agnostic, that is, it treats the string as a byte
sequence.
I can think of one reason to use a separate type - if you want to overload
on encoding:
void f( latin1_t arg );
void f( utf8_t arg );
In most such cases that spring to mind, however, what the user actually
wants is:
void f( string arg, encoding_t enc );
or even
void f( string arg, string encoding );
In principle, as Chad Nelson says, it's useful to have separate types if the
program uses several different encodings at once, fixed at compile time. I
don't consider such a way of programming a good idea though. Strings should
be either byte sequences or UTF-8; input can be of any encoding, possibly
not known until runtime, but it should always be either processed as a byte
sequence or converted to UTF-8 as a first step.
Regarding the OS-default encoding - if, on Windows, you ever encounter or
create a string in the OS default encoding, you've already lost - this code
can't be correct. :-)
DISCLAIMER: I have almost no experience with the details of this
stuff. I only know a few general things about programming (fewer
every day).
I think the reason to use separate types is to provide a type-safety
barrier between your functions that operate on utf-8 and system or
3rd-party interfaces that don't or may not. In principle, that should
force you to think about encoding and decoding at all the places where
it may be needed, and should allow you to code naturally and with
confidence where everybody is operating in utf8-land. The typical
failures I've seen, where there is no such mechanism (e.g. in Python
where there's no static typing), are caused because programmers lose
track of whether what they're handling is encoded as utf-8 or not.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
UTF-8 allows the use of char * for type erasure for strings, much like
void * allows that in general. Using C++ type tags to discriminate
between different data pointed by void pointers is mostly redundant
except when type safety is postponed until run-time; and that's only
marginally safer than using string tags.
Emil Dotchevski
Reverge Studios, Inc.
http://revergestudios.com/reblog/index.php?n=ReCode.ReCode
OK...
First of all, I'd suggest taking a look at this code:
What you will see is how painfully hard it is to use these functions
right if you want to support things like skipping or replacing invalid
characters.
So if you use them, use them with SUPER care, and don't forget that
there are changes between Windows XP and below and Windows Vista
and above - to make your life even more interesting (a.k.a. miserable)
> and mbsrtowcs and its ilk under POSIX systems (which apparently
> have been well-implemented for at least seven versions under glibc [1],
> though I can't tell whether eglibc -- the fork that Ubuntu uses -- has
> the same level of capabilities).
>
> [1]: <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
No... no..
This is not the way to go.
For example, what would be the result of:

#include <wchar.h>

int main()
{
    wchar_t wide[32];
    const char *src = "שלום";
    mbstate_t state = {0};
    size_t size = mbsrtowcs(wide, &src,
                            sizeof(wide)/sizeof(wide[0]), &state);
    /* ??? */
}

when the current system locale is, let's say, en_US.UTF-8?
The resulting size would be (size_t)-1, indicating an error, because
the program is still running in the default "C" locale.
You first need to call:
setlocale(LC_ALL, "");
(declared in <locale.h>) to set up the default locale; only then would
mbsrtowcs work.
And how do you think the code below would work after that call to
setlocale(...)?
FILE *f=fopen("point.csv","w");
fprintf(f,"%f,%f\n",1.3,4.5);
fclose(f);
What would be the output? Would it succeed in creating a correct CSV?
Answer: it depends on the locale. In some locales,
like ru_RU.UTF-8 or Russian_Russia, the output would be
"1,3,4,5" and not the expected "1.3,4.5".
Nice, isn't it?! And believe me, 99.9% of developers would have
a hard time understanding what is wrong with this code.
You can't use these functions!
Also, there is another problem: what is the "current locale" on the
current OS?
- Is it defined by the global OS settings of the environment
  variables LC_ALL, LC_CTYPE or LANG?
- Is it defined by the environment variables LC_ALL, LC_CTYPE or
  LANG in the current user's environment?
- Is it defined by the environment variables LC_ALL, LC_CTYPE or
  LANG in the current process environment?
- Is it the locale defined by
  setlocale(LC_ALL,"My_Locale_Name.My_Encoding")?
- Is it the locale defined by
  std::locale::global(std::locale("My_Locale_Name.My_Encoding"))?
All answers are correct, and users would probably expect each one of
them to work.
Don't bother trying to detect, or convert to, the "current locale" on
a POSIX system - it is something that can change easily, or may not
even be defined at all!
>
> I hadn't wanted to add a dependency on ICU or iconv either. Though I may
> end up having to for the program I'm currently developing, on at least
> some platforms.
>
Under Unix it is more than justified to use iconv - it is a standard
POSIX API; in fact, on Linux it is part of libc, while on some other
platforms it may be an independent library (as on FreeBSD).
Actually, Boost.Locale uses iconv by default under Linux,
as it is a better API than ICU's (and faster, because it does not
require passing through UTF-16).
>
> We'll have to agree to disagree there. The whole point to these classes
> was to provide the compiler -- and the programmer using them -- with
> some way for the string to carry around information about its encoding,
> and allow for automatic conversions between different encodings.
This is a totally different problem. If so, you need a container like this:
class specially_encoded_string {
public:
    std::string encoding() const
    {
        return encoding_;
    }
    std::string to_utf8() const
    {
        return convert(content_, encoding_, "UTF-8");
    }
    void from_utf8(std::string const &input)
    {
        content_ = convert(input, "UTF-8", encoding_);
    }
    std::string const &raw() const
    {
        return content_;
    }
private:
    std::string encoding_; /// <----- VERY IMPORTANT
                           /// may have values such as: ASCII, Latin1,
                           /// ISO-8859-8, Shift-JIS or Windows-1255
    std::string content_;  /// <----- The raw string
};
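The class above relies on a convert() helper to do the actual
transcoding. A possible sketch of it using POSIX iconv (the error
handling and buffer size here are my own assumptions, not Artyom's or
Boost.Locale's actual implementation):

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Transcode 'input' from encoding 'from' to encoding 'to' via iconv.
std::string convert(const std::string &input,
                    const std::string &from, const std::string &to)
{
    iconv_t cd = iconv_open(to.c_str(), from.c_str());
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported encoding");

    std::string out;
    char *src = const_cast<char *>(input.data());
    size_t src_left = input.size();
    char buf[256];
    while (src_left > 0) {
        char *dst = buf;
        size_t dst_left = sizeof(buf);
        size_t res = iconv(cd, &src, &src_left, &dst, &dst_left);
        out.append(buf, dst - buf);   // keep whatever was produced
        if (res == (size_t)-1 && errno != E2BIG) {
            // EILSEQ/EINVAL: invalid or truncated input sequence
            iconv_close(cd);
            throw std::runtime_error("conversion failed");
        }
        // E2BIG just means the output buffer filled; loop again
    }
    iconv_close(cd);
    return out;
}
```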
Creating an "ascii_t" container, or anything that does not carry the
REAL encoding name with it, would lead to bad things.
> If
> you're working with strings in multiple encodings, as I have to in one
> of the programs we're developing, it frees up a lot of mental stack
> space to deal with other issues.
The best way is to convert the input encoding to an internal one on
input, use it, and convert it back on output.
I have written several programs that use different encodings:
1. BiDiTeX: LaTeX + BiDi for Hebrew - converts the input encoding to
   UTF-32 and then converts it back on output
2. CppCMS: it allows using non-UTF-8 encodings, but the encoding
   information is carried with a std::locale::codecvt facet I created,
   and the encoding/locale is bound to the current request/response
   context. Each user input (and BTW output as well) is validated -
   for example, an HTML form validates the input encoding by default.
These are my solutions to my real problems.
What you suggest is misleading and not well defined.
Best Regards,
> I think the reason to use separate types is to provide a type-safety
> barrier between your functions that operate on utf-8 and system or
> 3rd-party interfaces that don't or may not. In principle, that should
> force you to think about encoding and decoding at all the places where
> it may be needed, and should allow you to code naturally and with
> confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in
UTF-8 land though. It's a bit like defining a separate integer type for
nonnegative ints for type safety reasons - useful in theory, but nobody does
it.
If you're designing an interface that takes UTF-8 strings, it still may be
worth it to have the parameters be of a utf8-specific type, if you want to
force your users to think about the encoding of the argument each time they
call one of your functions... this is a legitimate design decision. If
you're in control of the whole program, though, it's usually not worth it -
you just keep everything in UTF-8.
I thought the point of using different types was instead of tagging a
string with an encoding name. In other words, a utf8_t would always hold a
std::string content_ in UTF-8 format.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
> Dave Abrahams wrote:
>
>> I think the reason to use separate types is to provide a type-safety
>> barrier between your functions that operate on utf-8 and system or
>> 3rd-party interfaces that don't or may not. In principle, that should
>> force you to think about encoding and decoding at all the places where
>> it may be needed, and should allow you to code naturally and with
>> confidence where everybody is operating in utf8-land.
>
> Yes, in principle. It isn't terribly necessary if everybody is operating in
> UTF-8 land though.
Which is exactly why it's necessary: everybody _isn't_ operating in UTF-8
land.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
I'm addressing this problem:
> The whole point to these classes was to
> provide the compiler -- and the programmer
> using them with some way for the string to
> carry around information about its encoding
i.e. sometimes a string should come with its encoding.
The point is that if the encoding you are using is not
the **default** encoding in your program (i.e. UTF-8),
then you may need to add an encoding "tag" to the text;
otherwise just use UTF-8 with std::string.
Artyom
> > UTF-8 land though.
>
> Which is exactly why it's necessary: everybody _isn't_ operating in UTF-8
> land.
>
The problem is that you need to pick some encoding,
and UTF-8 is the most universal and useful.
Otherwise you should:
1. Reinvent the string
2. Reinvent standard library to use new string
3. Reinvent 1001 other libraries to use the new
string.
It is just neither feasible nor necessary.
Artyom
Yes, that's exactly my point, although this isn't a property of UTF-8;
it's a more general thing. In a dynamic language like Python
everything is type-erased.
> Using C++ type tags to discriminate
> between different data pointed by void pointers is mostly redundant
Exactly. I'm suggesting, essentially, to avoid the use of void
pointers except where you're forced to, at the boundaries with
"legacy" interfaces.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
But they won't be. That's not today's reality.
> It's a bit like defining a separate integer type for nonnegative
> ints for type safety reasons - useful in theory, but nobody does it.
I refer you to Boost.Units
> If you're designing an interface that takes UTF-8 strings,
...as we are...
> it still may be worth it to have the parameters be of a
> utf8-specific type, if you want to force your users to think about
> the encoding of the argument each time they call one of your
> functions...
Or, you may want to use a UTF-8 specific type to force users of legacy
char* interfaces (and ourselves) to think about decoding each time
they call a legacy char* interfaces.
> this is a
> legitimate design decision. If you're in control of the whole program,
> though, it's usually not worth it - you just keep everything in UTF-8.
By definition, since we're library designers, we don't have said
control. And people *will* be using whatever Boost does with "legacy"
non-UTF-8 interfaces.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
+1 for every point.
Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
Why is that a problem?
> Otherwise you should:
>
> 1. Reinvent the string
My idea is that you just wrap it.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
> On Mon, 17 Jan 2011 18:47:04 -0500, Chad Nelson wrote:
>
>> I really wanted to avoid a dependency on the ICU library or anything
>> similar if at all possible, but it looks like it might be inevitable. :-(
>
> You may well find that you can ;) Artyom's latest work on
> Boost.Locale allows you to select from a range of different backends
> giving varying levels of locale support. ICU gives the 'best'
> results but for my project Swish, for instance, I didn't need any of
> these advanced features so I just use the Win32 backend. This uses
> the Windows API to do the conversions etc. and freed me from the
> beast that is ICU.
Oh! I didn't realize that, thanks for the information!
In that case, what would people say to not having any conversion code
in the Unicode strings stuff at all (other than between the different
UTF-* codings, and maybe to and from ASCII for convenience), and relying
on Boost.Locale for that? Then the trade-offs are up to the developer
using each.
I'll have to see how painless I can make the boundaries between them.
>> From: Alexander Lamaison <aw...@doc.ic.ac.uk>
>>
>>> Yes, in principle. It isn't terribly necessary if everybody is
>>> operating in UTF-8 land though.
>>
>> Which is exactly why it's necessary: everybody _isn't_ operating in
>> UTF-8 land.
>>
>
> The problem is that you need to pick some encoding and UTF-8 is the
> most universal and useful.
I'll second that. Little wasted space, no byte-order problems, and very
easy to work with (finding the first byte of a character, for instance,
is child's play).
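That "child's play" property follows from UTF-8's self-synchronizing
design: continuation bytes always match the bit pattern 10xxxxxx. A
sketch of finding the first byte of a character (illustrative code of
mine, not from the discussed classes):

```cpp
#include <cstddef>
#include <string>

// Return the index of the first byte of the UTF-8 character that
// contains position i: back up while the byte is a continuation
// byte, i.e. while (byte & 0xC0) == 0x80.
std::size_t char_start(const std::string &s, std::size_t i)
{
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
        --i;
    return i;
}
```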
> Otherwise you should:
>
> 1. Reinvent the string
Or at least wrap it. ;-)
> 2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to
the standard library are needed when you switch from char strings to
char16_t strings to char32_t strings -- the standard library, designed
around the idea of iterators, is mostly type-agnostic.
The utf*_t types provide fully functional iterators, so they'll work
fine with most library functions, so long as those functions don't care
that some characters are encoded as multiple bytes. It's just the ones
that assume that a single byte represents all characters that you have
to replace, and you'd have to replace those regardless of whether you're
using a new string type or not, if you're using any multi-byte encoding.
> 3. Reinvent 1001 other libraries to use the new string.
Again, seldom necessary. Just use a type system that can translate
between your internal coding and what the library wants, at the
boundaries. If the other library you want to use can't handle
multi-byte encodings, you'd have to modify or reinvent it anyway.
> It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not
is up to the individual developer, but the type-safety it offers is
more in line with the design philosophy of C++ than using std::string
for everything. I hate to harp on the same tired example, but why do
you really need any pointer type other than void*? It's the same idea.
I don't think the string classes should implement _any_ of the conversions
themselves but should delegate them all to Boost.Locale. However, they
should look like they're doing the conversions by hiding the Boost.Locale
aspect from the caller as much as possible.
>>> Also if you want to use std::codecvt facet...
>>> Don't relay on them unless you know where they come from!
>>>
>>> 1. By default they are noop - in the default C locale
>>>
>>> 2. Under most compilers they are not implemented properly. [...]
>>
>> I was planning to use MultiByteToWideChar and its opposite under
>> Windows (which presumably would know how to translate its own code
>> pages),
>
> Ok...
>
> 1st of all I'd suggest to take a look on this code:
>
> http://cppcms.svn.sourceforge.net/viewvc/cppcms/boost_locale/trunk/libs/locale/src/encoding/wconv_codepage.hpp?revision=1462&view=markup
Pretty convoluted.
> What you will see is how painfully hard it is to use these functions
> right if you want to support things like skipping or replacing
> invalid characters.
Sorry for the cheap shot, but: it's Microsoft. I *expect* it to be
painful to use, from long experience. ;-)
> So if you use it, use it with SUPER care, and don't forget that
> there are changes between Windows XP and below and Windows Vista
> and above - to make your life even more interesting (a.k.a. miserable)
As you might have seen in an earlier reply this morning, I didn't
realize that it wasn't irretrievably tied to ICU; now that I know, I'd
be completely happy letting Boost.Locale handle the code-page stuff.
[...]
>> We'll have to agree to disagree there. The whole point to these
>> classes was to provide the compiler -- and the programmer using
>> them -- with some way for the string to carry around information
>> about its encoding, and allow for automatic conversions between
>> different encodings.
>
> This is a totally different problem. If so, you need a container like this:
>
> class specially_encoded_string {
[...]
> private:
> std::string encoding_; /// <----- VERY IMPORTANT
> /// may have values such as: ASCII, Latin1,
> /// ISO-8859-8, Shift-JIS or Windows-1255
> std::string content_; /// <----- The raw string
> }
If you want arbitrary encodings, yes. If you only want a subset of the
possible encodings -- such as ASCII and the three main UTF types --
then all you need is some way to convert to and from the OS-specific
encodings.
> Creating an "ascii_t" container, or anything that does not carry the
> REAL encoding name with it, would lead to bad things.
Certainly, if you tried to use it for stuff that isn't really in that
encoding. It wasn't meant for that.
>> If you're working with strings in multiple encodings, as I have to
>> in one of the programs we're developing, it frees up a lot of mental
>> stack space to deal with other issues.
>
> The best way is to convert the input encoding to an internal one on
> input, use it, and convert it back on output.
I agree, for the most part. But if a large something comes in encoded
with a particular coding, why waste a possibly-significant amount of
processor time immediately recoding it to your internal format if you
don't know that you're going to need to do anything with it? Or if it
might well just be going out in that same external format again,
without needing to be touched? Much better to hold onto it in whatever
format it comes in, and only recode it when you need to, in my
opinion -- if you can easily keep track of what format it's in, anyway.
> [...] 2. CppCMS: it allows using non-UTF-8 encodings, but the
> encoding information is carried with a std::locale::codecvt facet I
> created, and the encoding/locale is bound to the current
> request/response context. [...]
That sounds an awful lot like having a new string type that carries
around its encoding. ;-)
> These are my solutions of my real problems.
> What you suggest is misleading and not well defined.
I can see that parts of it are certainly not well defined yet, but I
believe it's a fixable problem.
> On Tue, 18 Jan 2011 09:51:12 -0500, Chad Nelson wrote:
>> In that case, what would people say to not having any conversion code
>> in the Unicode strings stuff at all (other than between the different
>> UTF-* codings, and maybe to and from ASCII for convenience), and
>> relying on Boost.Locale for that? Then the trade-offs are up to the
>> developer using each.
>
> I don't think the string classes should implement _any_ of the
> conversions themselves but should delegate them all to Boost.Locale.
> However, they should look like they're doing the conversions by hiding
> the Boost.Locale aspect from the caller as much as possible.
Why delegate them to another library? Those classes already have
efficient, flexible, and correct iterator-based template code for the
conversions between the UTF-* types. I'd rather just farm out the stuff
that those types are weak at, like converting to and from
system-specific locales.
If they can do that, that's great! The conversion code was so short that I
assumed it wasn't a full, complete conversion algorithm. After all,
something the size of ICU is apparently necessary for full Unicode support!
Please forgive my scepticism :P
>On Tue, 18 Jan 2011 10:54:57 -0500, Chad Nelson wrote:
>
>> Why delegate them to another library? Those classes already have
>> efficient, flexible, and correct iterator-based template code for the
>> conversions between the UTF-* types. I'd rather just farm out the
>> stuff that those types are weak at, like converting to and from
>> system-specific locales.
>
> If they can do that, that's great! The conversion code was so short
> that I assumed it wasn't a full, complete conversion algorithm.
They're complete and accurate. The algorithms aren't overly complex;
they just translate between different forms of the exact same data,
after all.
> After all, something the size of ICU is apparently necessary for full
> Unicode support!
>
> Please forgive my scepticism :P
Of course! :-) It's an understandable confusion; full Unicode support
involves a *lot* more than what those classes handle, or are meant to.
They should be, though. As a practical matter, the difference between
taking/returning a string and taking/returning a utf8_t is that the latter
forces people to write an explicit conversion. This penalizes people who are
already in UTF-8 land, because it forces them to use utf8_t( s, encoding_utf8 )
and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true
that for people whose strings are not UTF-8, forcing those explicit
conversions may be considered a good thing. So it depends on what your goals
are. Do you want to promote the use of UTF-8 for all strings, or do you want
to enable people to remain in non-UTF-8-land?
> > It's a bit like defining a separate integer type for nonnegative
> > ints for type safety reasons - useful in theory, but nobody does it.
>
> I refer you to Boost.Units
I'm sure that there are many libraries that use units in their interfaces; I
just haven't heard of them. :-)
There's also the additional consideration of utf8_t's invariant. Does it
require valid UTF-8? One possible specification of fopen might be:
FILE* fopen( char const* name, char const* mode );
The 'name' argument must be UTF-8 on Unicode-aware platforms and file
systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte
sequence on encoding-agnostic platforms and file systems such as Linux and
Solaris, but UTF-8 is recommended.
On Windows, the UTF-8 sequence may be invalid due to the presence of UTF-16
surrogates encoded as single code points, but such use is discouraged.
Patrick
Boost, as the cutting-edge C++ library, should try to enforce new standards
and not dwell on old and obsolete ones. Today everybody is (maybe slowly)
moving towards UTF-8, and creating a new UTF-8 string class/wrapper that
nobody uses, IMO, only encourages continued use of the old ANSI encodings.
Maybe a better course of action would be to create an ansi_str_t with encoding
tags for the legacy ANSI-encoded strings, which could be obsoleted
in the future, and use std::string as the default class for UTF-8 strings.
We will have to make this transition at some point anyway, so why not do it now?
my 0.02€
regards
Matus
+1
>
> There's also the additional consideration of utf8_t's invariant. Does it
> require valid UTF-8? One possible specification of fopen might be:
>
> FILE* fopen( char const* name, char const* mode );
>
> The 'name' argument must be UTF-8 on Unicode-aware platforms and
> file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an
> arbitrary byte sequence on encoding-agnostic platforms and file
> systems such as Linux and Solaris, but UTF-8 is recommended.
>
+1 As well.
Also I would like to add a small note on general C++ language design:
don't pay for what you don't use.
And 95% of all uses of strings are encoding-agnostic!
Artyom