Thanks,
Mattia
> Hi all, is there a C++ function similar to isspace that can handle
> w_chars? Does the regex library handle w_chars?
Yes, there is a template function declared in <locale> and named
std::isspace, curiously enough.
There is no regex library in the official C++ standard yet, I think. The
Boost regex library is fully templated and ought to support wchar_t as
well, but I have not tried this. According to Boost documentation one needs
a separate ICU library for full Unicode support though.
hth
Paavo
Well, take a look at my snippet:
std::ifstream infile(argv[1]);
std::string s;
while (getline(infile, s))
{
    s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
    std::cout << s;
}
Using <locale> on VC++2008 I got an error reporting that std::isspace
expects 2 arguments, and I still don't know whether a file containing
Unicode characters can be handled correctly.
The regex library I referred to is the one in the new C++0x draft.
Mattia
> > > Hi all, is there a C++ function similar to isspace that
> > > can handle w_chars? Does the regex library handle
> > > w_chars?
> > Yes, there is a template function declared in <locale> and
> > named std::isspace, curiously enough.
> > There is no regex library in the official C++ standard yet, I
> > think. The Boost regex library is fully templated and ought
> > to support wchar_t as well, but I have not tried this.
> > According to Boost documentation one needs a separate ICU
> > library for full Unicode support though.
> Well, take a look at my snippet:
> std::ifstream infile(argv[1]);
> std::string s;
> while (getline(infile, s))
> {
> s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
> std::cout << s;
> }
> Using locale on VC++2008 I've got an error reporting that
> std::isspace expects 2 arguments,
That's because std::isspace in <locale> requires two arguments: the
character to be tested, and the locale.
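For illustration, a minimal sketch of one way to adapt the two-argument
form for use with remove_if, wrapping it in a small function object (the
name IsSpaceIn is made up, not from any post here):

#include <algorithm>
#include <locale>
#include <string>

struct IsSpaceIn
{
    std::locale loc;
    explicit IsSpaceIn(const std::locale& l) : loc(l) {}
    bool operator()(char c) const
    { return std::isspace(c, loc); }  // the two-argument <locale> version
};

void strip_spaces(std::string& s)
{
    // Remove every character classified as whitespace in the global locale.
    s.erase(std::remove_if(s.begin(), s.end(), IsSpaceIn(std::locale())),
            s.end());
}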
> and I still don't know whether a file containing Unicode characters
> can be handled correctly.
The functions in <locale> are pretty useless, since they only
handle single byte characters. The "approved" solution is to
read into a wstring using wifstream (imbued with the
appropriate locale), and use isspace (again with the appropriate
locale) on the wchar_t in the wstring.
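For illustration, a minimal sketch of that approach, assuming a
Unix-style locale name ("en_US.UTF-8") is installed on the system; error
handling omitted:

#include <fstream>
#include <locale>
#include <string>

int count_spaces(const char* filename)
{
    const std::locale loc("en_US.UTF-8");  // system-specific name
    std::wifstream in;
    in.imbue(loc);               // imbue before opening, so the codecvt
    in.open(filename);           // facet decodes the bytes from the start
    int n = 0;
    std::wstring line;
    while (std::getline(in, line))
        for (std::wstring::size_type i = 0; i < line.size(); ++i)
            if (std::isspace(line[i], loc))   // wide-character test
                ++n;
    return n;
}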
--
James Kanze
Ok, well, suppose I want to use UTF-8 encoding, how do I specify it
using locale? And where can I find a list of the possible locale
encoding configurations (e.g. if I wanted to correctly decode a web
page by just parsing the first bytes looking for 'charset')?
Thanks, Mattia
With UTF-8 one is using char, not wchar_t. Note that if char is a signed
type, then one must take care to cast char to unsigned char in places
where a non-negative value is expected.
For historical reasons the locale and encoding stuff has been mixed up.
Are you more interested in locales or in encodings? Locales affect such
things as the character representing the decimal point in numbers, the
look of dates, whether V and W are sorted together or separately, and
whether Cyrillic characters are considered alphabetic characters or not.
Encoding is a completely different business, specifying for example how
those Cyrillic characters are encoded in the binary data, if at all.
If you just want to translate between different encodings, then you do
not need any locale stuff at all. When a web page comes in, you do not
know if the decimal point used in numbers therein is a dot or a comma,
for example, so strictly speaking you cannot set the correct locale for
processing the page. What you can do is to look at BOM markers and the
charset declaration, and to translate the file from its charset to the
encoding you are using internally, for example. For that, again, no
locales are needed; instead one needs some kind of system-specific code
or another library like iconv.
> using locale? And where can I find a list of the possible locale
> encoding configurations (e.g. if I wanted to correctly decode a web
> page by just parsing the first bytes looking for 'charset')?
http://www.iana.org/assignments/character-sets
But you don't want to deal with this by yourself. Use a library like
iconv.
hth
Paavo
Ok, so suppose I want to split a Russian text into words, and the basic
method looks at every character in order to decide whether a space has
been found. What do you suggest?
If you mean space as ASCII character 32, then I would use the text
encoded in UTF-8 and compare each byte with ' '.
However, if you mean any whitespace, then I would start by finding out
at the unicode.org site whether there are any non-ASCII whitespace
characters defined in the standard Russian locale. If there are, and
wchar_t on the given
platform is wide enough to represent all of them in a single wchar_t,
then I could encode the text as UTF-16 or UTF-32 as appropriate for
wchar_t on the given platform and use std::isspace<wchar_t>() with the
Russian locale.
Or I could keep the text in UTF-8 and use my own custom function for
checking for whitespace, testing directly for all the Unicode whitespace
characters listed in
http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29;
this seems to me much less error-prone than worrying about whether the
Russian locale and std::isspace work correctly on all platforms.
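As a rough sketch of that custom function (illustrative only, assumes
well-formed UTF-8; the whitespace list below should be checked against
the current Unicode data):

#include <string>

bool is_unicode_space(unsigned long cp)
{
    // Unicode whitespace code points, per the list referenced above.
    switch (cp) {
    case 0x09: case 0x0A: case 0x0B: case 0x0C: case 0x0D:
    case 0x20: case 0x85: case 0xA0: case 0x1680:
    case 0x2028: case 0x2029: case 0x202F: case 0x205F: case 0x3000:
        return true;
    default:
        return cp >= 0x2000 && cp <= 0x200A;  // EN QUAD .. HAIR SPACE
    }
}

// Decode the UTF-8 sequence starting at s[i] and advance i past it.
unsigned long next_code_point(const std::string& s,
                              std::string::size_type& i)
{
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80)
        return c;                             // plain ASCII byte
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    unsigned long cp = c & (0x3F >> extra);   // payload bits of lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}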
hth
Paavo
> Ok, well, suppose I want to use UTF-8 encoding, how do I
> specify it using locale? And where can I find a list of the
> possible locale encoding configurations (e.g. if I wanted to
> correctly decode a web page by just parsing the first bytes
> looking for 'charset')?
There are no standard names for locales -- you'll have to read
your system documentation. Posix defines a standard *format*
for names under Unix systems. But you'll still have to read the
documentation to see what is present, *and* what the default
encoding is, since if UTF-8 is the default, it may not be
present in the name. (Actually, I can't find a definition of
this format in the Posix standard. But it is common to Solaris,
HP-UX, AIX and Linux, at least, and seems to be at least a de
facto standard. The problem is that it doesn't necessarily
represent the default encoding, so UTF-8 might be "en_US.utf8"
or "en_US", the latter only if the default encoding is UTF-8.)
--
James Kanze
[...]
> > Ok, well, suppose I want to use UTF-8 encoding, how do I specify it
> With UTF-8 one is using char, not wchar_t. Note that if char
> is a signed type, then one must take care to cast char to
> unsigned char in places where a non-negative value is
> expected.
He didn't make clear whether he meant internal or external
encoding. One can use UTF-8 externally (and probably should for
any new projects), and still use wchar_t and UTF-16 or UTF-32
internally.
> For historical reasons the locale and encoding stuff has been mixed up.
The reasons aren't just historical. Functions like isalpha have
to know the encoding if they are to work. Logically, of course,
locale and encoding are, or should be, two completely separate
concepts, but practically, at the technical level, that would
mean specifying both a locale and an encoding for things like
isalpha. (Note that the design of <locale> leaves a bit to be
desired here, since it links isalpha purely to the ctype facet;
logically, it should depend on both ctype and codecvt.
Practically, however, I'll admit that I wouldn't like to
implement a design that handled this correctly.)
> Are you more interested in locales or in encodings? Locales
> affect such things as the character representing the
> decimal point in numbers, the look of dates, whether V and
> W are sorted together or separately, and whether Cyrillic
> characters are considered alphabetic characters or not.
> Encoding is a completely different business, specifying for example
> how those Cyrillic characters are encoded in the binary data,
> if at all.
The character encoding does affect whether isalpha(0xE9) should
return true (ISO 8859-1) or false (UTF-8).
> If you just want to translate different encodings, then you do
> not need any locale stuff at all. When a web page comes in,
> you do not know if the decimal point used in numbers therein
> is a dot or a comma, for example, so strictly speaking you
> cannot set the correct locale for processing the page. What
> you can do is to look at BOM markers and charset encoding, and
> to translate the file from its charset to the encoding you are
> using internally, for example. For that, again no locales are
> needed; instead one needs some kind of system-specific
> code or another library like iconv.
Strictly speaking, when a web page comes in, you don't even know
how comma or dot are encoded in it. In practice, all of the
codesets used in web pages have the first 128 values in common.
And the header should be written using just those values until
it reaches the point where it specifies the encoding. (Also
in practice, a lot of headers don't bother to specify the
encoding, so it's worthwhile to develop some pragmatic
heuristics to guess it. If the data starts with a BOM, then
it's Unicode, and the BOM will allow you to determine the
format. If the data contains 0's in the first four bytes, it's
almost certainly some format of UTF-16 or UTF-32, and you can
determine which by the number and position of the zeros.
Otherwise, I'd treat it as undetermined ASCII-based until I
encountered a byte value larger than 127---if that byte value
was part of a legal UTF-8 code, I'd shift to UTF-8, otherwise to
ISO-8859-1, but that's really just a guess.)
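A sketch of that guessing game in code (simplified, names illustrative;
the UTF-32 BOMs are tested first because the UTF-32LE BOM starts with
the same two bytes as the UTF-16LE one):

#include <cstddef>
#include <string>

std::string guess_encoding(const unsigned char* p, std::size_t n)
{
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00
               && p[2] == 0xFE && p[3] == 0xFF) return "UTF-32BE";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE
               && p[2] == 0x00 && p[3] == 0x00) return "UTF-32LE";
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    for (std::size_t i = 0; i < n && i < 4; ++i)
        if (p[i] == 0)              // zeros near the start suggest a
            return "UTF-16/32?";    // BOM-less UTF-16 or UTF-32
    return "ASCII?";   // later: UTF-8 vs ISO-8859-1, decided at the
                       // first byte >= 0x80
}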
> > using locale? And where can I find a list of the possible
> > locale encoding configurations (e.g. if I wanted to correctly
> > decode a web page by just parsing the first bytes looking for
> > 'charset')?
> http://www.iana.org/assignments/character-sets
But that doesn't tell you what the name of the locale on your
system might be.
--
James Kanze
[...]
> > Ok, so suppose I want to split a Russian text into words, and
> > the basic method looks at every character in order to decide
> > whether a space has been found. What do you suggest?
> If you mean space as ASCII character 32, then I would use the text
> encoded in UTF-8 and compare each byte with ' '.
> However, if you mean any whitespace, then I would start by
> finding out at unicode.org site if there are any non-ASCII
> whitespace characters defined in the standard Russian locale.
> If there are, and wchar_t on the given platform is wide enough
> to represent all of them in a single wchar_t, then I could
> encode the text as UTF-16 or UTF-32 as appropriate for wchar_t
> on the given platform and use std::isspace<wchar_t>() with the
> Russian locale.
> Or I could keep the text in UTF-8 and use my own custom
> function for checking for whitespace, testing directly
> for all the Unicode whitespace characters listed in
> http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29;
> this seems to me much less error-prone
> than worrying about whether the Russian locale and std::isspace
> work correctly on all platforms.
FWIW: I have code floating around which implements all of the
isxxx functions for UTF-8, using tables which are generated
automatically from the UnicodeData.txt file. It's in my TODO
list to get it up at my site, but I'm still really in the
process of moving and getting reestablished in a new job in a
new city in a new country (on a new computer as well), so I
probably won't be getting around to it very soon.
--
James Kanze
> AFAIK, C90 defines a locale by the name of "C",
> which should also be visible from C++.
And Posix defines "POSIX". Neither of which is really useful
for anything.
--
James Kanze
Ok, so I think I will open my file specifying UTF-8
encoding, but how can I do that in C++?
You can open it as a narrow stream and read in as binary UTF-8, or
(maybe) you can open it as a wide stream and get an automatic translation
from UTF-8 to wchar_t. The following example assumes that you have a file
test1.utf8 containing valid UTF-8 text. It reads the file in as a wide
stream and prints out the numeric values of all wchar_t characters.
#include <iostream>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wifstream is;
    const std::locale filelocale("en_US.UTF8");
    is.imbue(filelocale);
    is.open("test1.utf8");
    std::wstring s;
    while (std::getline(is, s)) {
        for (std::wstring::size_type j = 0; j < s.length(); ++j) {
            std::cout << s[j] << " ";
        }
        std::cout << "\n";
    }
}
(Tested on Linux with a recent gcc, I am not too sure if this works on
Windows. First, wchar_t in MSVC is too narrow for real Unicode, at best
one might get UTF-16 as a result.)
hth
Paavo
Out of curiosity, I tested this also on Windows with MSVC9, and as
expected it did not work: the locale construction immediately threw an
exception (bad locale name). Nor did any alternatives work
("english.UTF8", ".UTF8", ".utf-8", ".65001").
Thus, if one wants any portability, it seems the best approach currently
is still to read in binary UTF-8 and perform any needed conversions by
hand.
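For what it's worth, the binary-read part is trivial; a minimal sketch
(converting the bytes by hand afterwards is the real work):

#include <fstream>
#include <string>

std::string read_utf8_file(const char* name)
{
    // Read raw bytes; the std::string content is then UTF-8.
    std::ifstream in(name, std::ios::in | std::ios::binary);
    std::string all, line;
    while (std::getline(in, line)) {   // '\n' is a single byte in UTF-8,
        all += line;                   // so getline is safe on UTF-8 text
        all += '\n';
    }
    return all;
}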
Paavo
Under Windows, you have to use
const std::locale filelocale("English_Australia.1252")
according to http://docs.moodle.org/en/Table_of_locales;
I've tested it in VC++08 and it works. Any suggestions on how to handle
the dualism?
Thanks, Mattia
[...]
> >> Ok, so I think that I will open my file specifying to use UTF-8
> >> encoding, but how can I do it in C++?
> > You can open it as a narrow stream and read in as binary
> > UTF-8, or (maybe) you can open it as a wide stream and get
> > an automatic translation from UTF-8 to wchar_t. The
> > following example assumes that you have a file test1.utf8
> > containing valid UTF-8 text. It reads the file in as a wide
> > stream and prints out the numeric values of all wchar_t
> > characters.
> > #include <iostream>
> > #include <fstream>
> > #include <locale>
> > #include <string>
> > int main() {
> >     std::wifstream is;
> >     const std::locale filelocale("en_US.UTF8");
The above line supposes 1) that you're on a Unix platform
(because it uses the Unix conventions for naming locales), and
2) that the "en_US.UTF8" locale has been installed---under that
name. (I've worked on a lot of systems where this was not the
case.)
> >     is.imbue(filelocale);
> >     is.open("test1.utf8");
> >     std::wstring s;
> >     while (std::getline(is, s)) {
> >         for (std::wstring::size_type j = 0; j < s.length(); ++j) {
> >             std::cout << s[j] << " ";
> >         }
> >         std::cout << "\n";
> >     }
> > }
> > (Tested on Linux with a recent gcc, I am not too sure if
> > this works on Windows. First, wchar_t in MSVC is too narrow
> > for real Unicode, at best one might get UTF-16 as a result.)
UTF-16 is "real Unicode". Just like UTF-8.
> Out of curiosity, I tested this also on Windows with MSVC9, and
> as expected it did not work: the locale construction
> immediately threw an exception (bad locale name). Nor did
> any alternatives work ("english.UTF8", ".UTF8", ".utf-8",
> ".65001").
That's because Windows uses different conventions for naming
locales. (Windows Vista and later claims that names conforming
to RFC 4646 are used, see
http://msdn.microsoft.com/en-us/library/dd373814%28VS.85%29.aspx.
Except that RFC 4646 doesn't seem to contain information
concerning the character encoding. I'm guessing that Windows
would use the code page for this---65001 for UTF-8. But I don't
know how it has to be added to the "en-US".)
> Thus, if one wants any portability it seems the best approach
> currently is still to read in binary UTF-8 and perform any
> needed conversions by hand.
It should be sufficient to find out how the different locales are
named for each system, and read this information in from some
sort of configuration file.
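For illustration, a sketch of that idea: try a list of candidate locale
names (which would come from the configuration file) until one of them
constructs; the names shown are just examples:

#include <locale>
#include <stdexcept>

std::locale find_locale(const char* const* names, int n)
{
    for (int i = 0; i < n; ++i) {
        try {
            return std::locale(names[i]);
        } catch (const std::runtime_error&) {
            // not installed under this name; try the next one
        }
    }
    return std::locale::classic();   // fall back to "C"
}

// const char* candidates[] = { "en_US.UTF-8", "en_US.utf8", ".65001" };
// std::locale loc = find_locale(candidates, 3);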
--
James Kanze
> On Jan 31, 10:39 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
>> Paavo Helde <myfirstn...@osa.pri.ee> wrote
>> in news:Xns9D116950C4paavo256@2
Did you actually test the results? It seems this is reading the UTF-8
in unaltered, so there is no point in using a wide stream in the first
place.
Paavo
Well, yeah, although when using an example file like
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt with
plain std::string, std::ifstream and std::cout everything works fine;
if I put the 'w' in front of all these types the output fails,
producing:
UTF-8 encoded sample plain-text file
Γ
Why??
Because C++ does not convert from UTF-8 to UTF-16 just like that.
UTF-8 fits into std::string. std::wstring is UTF-16 when
sizeof(wchar_t) is 2 and UTF-32 when sizeof(wchar_t) is 4. The support
for character portability is weak in the STL, not sure why. Also,
POSIX functions do not help much, since most implementations were made
before Unicode was defined.
If you really want to convert, then use the platform's support. Most
platforms support Unicode (for example MultiByteToWideChar() in
Windows). If you want a portable solution, then use a library that can
provide the conversions, like ICU. http://site.icu-project.org/
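For the Windows-specific route, a sketch using MultiByteToWideChar
(compiles only on Windows; error checking omitted):

#include <windows.h>
#include <string>

std::wstring utf8_to_wide(const std::string& s)
{
    if (s.empty()) return std::wstring();
    // First call computes the required length in wchar_t units.
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                static_cast<int>(s.size()), NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(),
                        static_cast<int>(s.size()), &w[0], n);
    return w;
}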
You might want to use a custom codecvt facet with that; otherwise you
are just reading bytes and converting them to wchar_t, which is not
what you want, I don't think.
/Leigh
"Leigh Johnston" <le...@i42.co.uk> wrote in message
news:PdmdneKVNOF6s_vW...@giganews.com...
Never mind; if the locale is correct it should do the multibyte
conversion correctly.
> On Jan 31, 2:17 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
>> gervaz <ger...@gmail.com> wrote
>> in news:a5a4ece2-5b9d-4846-a818-9de61c13065...@r24g2000yqd.googlegroups.com:
Because codepage 1252 has nothing to do with UTF-8.
BTW, in Windows, I would also not rely too much on what you see on the
console. That's why I printed out only numeric wchar_t values in the
earlier example.
Paavo
Worrying? "I don't support doing analysis of Russian text on a
platform with broken Russian locales" sounds like something you can
happily say.
/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Ok, to summarize things learned so far:
UTF-8 can be handled by simply using std::string (hence char);
UTF-16 and UTF-32 are handled by std::wstring and wchar_t, but not
reliably, because the type's size is implementation-specific.
Now, something like:
std::ifstream is;
const std::locale filelocale("Russian_Russia.1251");
is.imbue(filelocale);
is.open(argv[1]);
std::string s;
while (std::getline(is, s))
{
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        std::cout << *it;
        if (std::isspace(*it, filelocale))
            std::cout << "space found!" << std::endl;
    }
    std::cout << std::endl;
}
Works if we give it a Russian text as input (although the cout isn't
able to correctly display the Russian characters).
If we are under Linux, can something like
std::locale filelocale;
try
{
    filelocale = std::locale("Russian_Russia.1251");
}
catch (...)
{
    try
    {
        filelocale = std::locale("ru_utf8");
    }
    catch (...)
    {
        throw;
    }
}
work? Any suggestions? (I don't even know the specific exception that
has to be caught. Just experimenting...)
Thanks, Mattia
> On Fri, 2010-01-29, Paavo Helde wrote:
> ...
>> Or I could keep the text in UTF-8 and use my own custom function for
>> checking for whitespace, testing directly for all the Unicode
>> whitespace characters listed in
>> http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29; this
>> seems to me much less error-prone than worrying about whether the
>> Russian locale and std::isspace work correctly on all platforms.
>
> Worrying? "I don't support doing analysis of Russian text on a
> platform with broken Russian locales" sounds like something you can
> happily say.
Happily to whom? My boss? Or the customer? (Happily, I myself don't
have any of those problems; the only locale I have needed so far is
"C", to override strange formatting caused by default locales.
Everything just works in UTF-8.)
BTW, Windows supports Russian locales quite well, I believe. The
problem is that it does not support locales with UTF-8 encodings (which
are the only ones that make sense, in my (limited) experience).
Cheers
Paavo
You can't *always* say it, but you cannot always bend over backwards
either in an effort to please some minority. Your posting seemed to
imply that you should always implement your own logic, in case some
target platform happens to be broken.
Only if the boss/customer is willing to pay for the extra work ;-) Or
if I happen to want to do that anyway, for whatever reasons!
In the concrete case above the choice is easy: writing a whitespace
finder once for all languages recognized by the Unicode consortium is
much easier than struggling with Russian-Hebrew-unknown locales on all
kinds of known and unknown platforms, so I would never consider the
latter.
For other problems, YMMV.
Cheers
Paavo