How much sophistication do you need? CStringW::Tokenize can split a string
up into tokens based on separator characters. It doesn't handle quoted
parameters, however.
Or, you could just convert to Unicode and call CommandLineToArgvW...
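As a rough illustration of the limitation mentioned above, the sketch below contrasts a plain separator split with a simplified quote-aware split. It is written in portable C++ rather than with CString, the function names (split_on_spaces, split_with_quotes, demo_split) are hypothetical, and the quote handling deliberately ignores the backslash rules that CommandLineToArgvW applies.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits on spaces only, the way a plain separator-based tokenizer would.
std::vector<std::string> split_on_spaces(const std::string& s) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : s) {
        if (c == ' ') {
            if (!cur.empty()) { out.push_back(cur); cur.clear(); }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

// Simplified quote-aware split (assumption: no backslash escapes).
// A double quote toggles "inside quotes" and is dropped from the output;
// spaces inside quotes are kept as part of the argument.
std::vector<std::string> split_with_quotes(const std::string& s) {
    std::vector<std::string> out;
    std::string cur;
    bool in_quotes = false;
    bool have_token = false;   // lets "" produce an empty argument
    for (char c : s) {
        if (c == '"') {
            in_quotes = !in_quotes;
            have_token = true;
        } else if (c == ' ' && !in_quotes) {
            if (have_token) { out.push_back(cur); cur.clear(); have_token = false; }
        } else {
            cur += c;
            have_token = true;
        }
    }
    if (have_token) out.push_back(cur);
    return out;
}

// Usage check: the quoted parameter survives only in the second version.
void demo_split() {
    const std::string line = "a \"b c\" d";
    assert(split_on_spaces(line).size() == 4);    // a, "b, c", d
    assert(split_with_quotes(line).size() == 3);  // a, b c, d
}
```

Anything that needs the real Windows quoting and backslash behaviour would still have to go through CommandLineToArgvW or an equivalent parser.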
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
In MFC there is the CCommandLineInfo class.
--
David Wilkinson
Visual C++ MVP
|How much sophistication do you need? CStringW::Tokenize can split a string
|up into tokens based on separator characters. It doesn't handle quoted
|parameters, however.
I want the handling of quoted args.
|Or, you could just convert to Unicode and call CommandLineToArgvW...
... and convert the individual args back to multibyte. I did exactly that,
imitating the allocation scheme (with LocalAlloc) of CommandLineToArgvW so that
a single LocalFree() will free the pointers and the strings. My allocation
assumes that the individual MBCS args will have the same lengths as their wide
counterparts (because I allocated before converting the strings back to
multibyte). Is there a way to *guarantee* that my assumption is valid?
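One way to sidestep the length assumption is to size the block from the byte strings that will actually be stored, that is, to convert first and allocate afterwards. The sketch below shows only the single-block layout (pointers first, then the strings) in portable C++ with malloc/free standing in for LocalAlloc/LocalFree; pack_argv and demo_pack_argv are hypothetical names. In a Win32 version, WideCharToMultiByte can be called with an output size of 0 to obtain the required byte count before allocating.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Packs the arguments into one malloc'd block: args.size() pointers first,
// followed by the NUL-terminated strings. The size is computed from the
// strings being stored, so no wide-vs-multibyte length assumption is needed.
char** pack_argv(const std::vector<std::string>& args) {
    std::size_t needed = args.size() * sizeof(char*);
    for (const std::string& a : args)
        needed += a.size() + 1;                 // bytes plus terminating NUL

    // malloc(0) may return a null pointer, so request at least one byte.
    char** ptrs = static_cast<char**>(std::malloc(needed ? needed : 1));
    if (!ptrs) return nullptr;

    char* dst = reinterpret_cast<char*>(ptrs + args.size());
    for (std::size_t i = 0; i < args.size(); ++i) {
        ptrs[i] = dst;
        std::memcpy(dst, args[i].c_str(), args[i].size() + 1);
        dst += args[i].size() + 1;
    }
    return ptrs;
}

// Usage check: one free() releases the pointers and the strings together.
void demo_pack_argv() {
    char** argv = pack_argv({"one", "two words"});
    assert(argv != nullptr);
    assert(std::strcmp(argv[1], "two words") == 0);
    std::free(argv);
}
```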
--
- Vince
||Or, you could just convert to Unicode and call CommandLineToArgvW...
|
| ... and convert the individual args back to multibyte. I did exactly that,
|imitating the allocation scheme (with LocalAlloc) of CommandLineToArgvW so that
|a single LocalFree() will free the pointers and the strings. My allocation
|assumes that the individual MBCS args will have the same lengths as their wide
|counterparts (because I allocated before converting the strings back to
|multibyte). Is there a way to *guarantee* that my assumption is valid?
Never mind. Instead of MultiByteToWideChar and WideCharToMultiByte, I just did
the conversions by assignment. That will ensure the strings are the same
length, that the characters in the parsed string are exactly the same as in the
original string (and it doesn't mess with '"' and '\\' which are important to
CommandLineToArgvW).
--
- Vince
Well, this is getting a bit off track of your original query, but you might
consider whether this is the time to convert your whole app to Unicode.
There are distinct advantages to doing so, including a slight performance
boost.
|Well, this is getting a bit off track of your original query, but you might
|consider whether this is the time to convert your whole app to Unicode.
|There are distinct advantages to doing so, including a slight performance
|boost.
I usually write everything in Unicode. But the project in question is a plugin
DLL for a MBCS app. At plugin init time, the host app passes a single (MBCS)
string, a user parameter. I wanted to parse it like a command line to give the
user greater flexibility (namely quoted strings being a single arg) and to allow
me to use a normal process_argv routine. I came up with this, which works well
(and lacks error checking). It works just like CommandLineToArgvW.
CHAR** WINAPI MBStringToMBArgv(LPSTR str, INT *pargc)
{
    // alloc memory for a wide version of str
    LPWSTR wstr = (LPWSTR) LocalAlloc(LMEM_FIXED,
        (lstrlenA(str) + 1) * sizeof(WCHAR));
    // "copy" str to wstr
    WCHAR *wp = wstr;
    while ( *wp++ = *str++ );
    // parse wstr
    WCHAR **wargv = CommandLineToArgvW(wstr, pargc);
    // cleanup
    LocalFree(wstr);
    // determine memory needed for argv
    size_t needed = *pargc * (sizeof(CHAR*) + 1); // ptrs and NULs
    for ( INT i=0; i<*pargc; i++ ) // add arg lengths
        needed += lstrlenW(wargv[i]);
    // allocate memory for ptrs and args
    LPBYTE argv = (LPBYTE) LocalAlloc(LMEM_FIXED, needed);
    // fill the pointers and strings
    CHAR **ptrs = (CHAR**) argv;
    CHAR *parg = (CHAR*) ((CHAR**) argv + *pargc);
    for ( INT i=0; i<*pargc; i++ )
    {
        ptrs[i] = parg;
        wp = wargv[i];
        while ( *parg++ = (CHAR) *wp++ );
    }
    // cleanup
    LocalFree(wargv);
    // when done with it use LocalFree() on the returned pointer
    return (CHAR**) argv;
}
--
- Vince
You should use a proper multibyte to wide-character conversion function so
you don't do the wrong thing if someone sends you a multibyte string
with a multibyte character.
|>         // "copy" str to wstr
|>         WCHAR *wp = wstr;
|>         while ( *wp++ = *str++ );
|
|You should use a proper multibyte to wide-character conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.
I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
and wind up where you started.
--
- Vince
If not, it would be a reportable bug. Do you have any evidence to
suggest that wouldn't work? If so, let's report it.
|> I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
|> and wind up where you started.
|
|If not, it would be a reportable bug. Do you have any evidence to
|suggest that wouldn't work? If so, let's report it.
The file 00ff.bin contains each byte, 0~255. CP 875 is Greek. This code gives
the results below it.
    BYTE before[256], after[256];
    for ( INT i=0; i<256; i++ )
        before[i] = (BYTE) i;
    WCHAR wbuf[256];
    // convert all 256 byte values to wide characters and back again
    MultiByteToWideChar(875, 0, (CHAR*) before, 256, wbuf, 256);
    WideCharToMultiByte(875, 0, wbuf, 256, (CHAR*) after, 256, NULL, NULL);
    // print every byte that did not survive the round trip
    for ( INT i=0; i<256; i++ )
    {
        if ( before[i] != after[i] )
            printf("%u %u\n", before[i], after[i]);
    }
220 63
225 63
236 63
237 63
252 63
253 63
--
- Vince
|>         // "copy" str to wstr
|>         WCHAR *wp = wstr;
|>         while ( *wp++ = *str++ );
|
|You should use a proper multibyte to wide-character conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.
Do you think CommandLineToArgvW cares about that?
--
- Vince
|Those codepoints are not valid in CP875. Of course you can only expect
|roundtrip if the original string is a valid MBCS string for its codepage
|to begin with. In case you are wondering, 63 is the code for question
|mark '?'.
Oddly, if I use MB_ERR_INVALID_CHARS in MultiByteToWideChar it still succeeds.
Going back, with WideCharToMultiByte and WC_ERR_INVALID_CHARS, it fails.
I don't want to be the policeman. Do you think my method of simple assignment
to convert CHAR <-> WCHAR will foul up CommandLineToArgvW? If the user provides
garbage, I figure he'll get garbage back.
|But I'm not sure why you _need_ a roundtrip. You say your plugin is
|Unicode, except for this one parameter string. So you only need to
|convert it one way, so that everything is now Unicode, right?
I said my plugin was MBCS (as is the hosting app). It uses no C library
functions. I could convert it to Unicode but I'd find myself calling "A"
functions most of the time anyway.
--
- Vince
http://www.ascii.ca/ebc875.htm
Those codepoints are not valid in CP875. Of course you can only expect
roundtrip if the original string is a valid MBCS string for its codepage
to begin with. In case you are wondering, 63 is the code for question
mark '?'.
But I'm not sure why you _need_ a roundtrip. You say your plugin is
Unicode, except for this one parameter string. So you only need to
convert it one way, so that everything is now Unicode, right?
--
With best wishes,
Igor Tandetnik
With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925
|No. But if the current system codepage is in fact CP1253 (Windows
|codepage for Greek), and the caller did want to pass some Greek
|characters to you, you will silently convert them to accented latin
|characters that just happen to have the same codes in Latin-1 aka
|ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
|U+00FF correspond to, for historical reasons).
|
|For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
|CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A WITH
|ACUTE.
What's the problem? When I convert each Unicode argv back to MBCS with
    while ( *p++ = (CHAR) *wp++ );
won't it go back to 193 (and again be interpreted as GREEK CAPITAL LETTER
ALPHA)? I don't think CommandLineToArgvW cares whether it's GREEK CAPITAL
LETTER ALPHA or LATIN CAPITAL LETTER A WITH ACUTE. I'm assuming
CommandLineToArgvW only **interprets** whitespace, backslashes, and
double-quotes.
--
- Vince
No. But if the current system codepage is in fact CP1253 (Windows
codepage for Greek), and the caller did want to pass some Greek
characters to you, you will silently convert them to accented latin
characters that just happen to have the same codes in Latin-1 aka
ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
U+00FF correspond to, for historical reasons).
For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A WITH
ACUTE.
In other words, your technique only works correctly if you are sure the
incoming string consists entirely of plain vanilla ASCII-7 characters
(codepoints 0 through 127).
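To make the GREEK CAPITAL LETTER ALPHA example concrete, here is a small portable sketch. The helper names are hypothetical, and the tiny CP1253 mapping covers only bytes 0xC1..0xD1 (Alpha through Rho); a real conversion would call MultiByteToWideChar with the proper code page rather than a hand-written range. It uses unsigned char so that the byte widens to 0x00C1 as described above; with a signed char the value would sign-extend instead.

```cpp
#include <cassert>
#include <cstdint>

// Widening by plain assignment: the byte value becomes the code point,
// which is the Latin-1 reading of that byte.
std::uint16_t widen_by_assignment(unsigned char b) {
    return b;
}

// Hypothetical helper covering only CP1253 bytes 0xC1..0xD1, which map to
// U+0391..U+03A1 (GREEK CAPITAL LETTER ALPHA..RHO). Returns 0 otherwise.
std::uint16_t widen_cp1253_alpha_to_rho(unsigned char b) {
    if (b >= 0xC1 && b <= 0xD1)
        return static_cast<std::uint16_t>(0x0391 + (b - 0xC1));
    return 0;
}

// Narrowing by truncation, as a (CHAR) cast does.
unsigned char narrow_by_truncation(std::uint16_t w) {
    return static_cast<unsigned char>(w);
}

// Usage check for byte 0xC1.
void demo_alpha() {
    const unsigned char alpha = 0xC1;
    assert(widen_by_assignment(alpha) == 0x00C1);        // A WITH ACUTE
    assert(widen_cp1253_alpha_to_rho(alpha) == 0x0391);  // GREEK ALPHA
    // Truncation undoes assignment, but not a real code page conversion.
    assert(narrow_by_truncation(widen_by_assignment(alpha)) == alpha);
    assert(narrow_by_truncation(0x0391) != alpha);
}
```

The byte survives an assign-then-truncate round trip, but the intermediate wide string does not contain the character the caller meant, which matters as soon as anything interprets that wide string.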
Ah, I didn't realize you were going to Unicode and back. Anyway, you'd
still have problems with true double-byte encodings, like Chinese BIG-5
or Japanese Shift-JIS. In these encodings, some characters are
represented by two bytes, called lead byte and trailing byte. Lead byte
always has high bit set, but trailing byte could have any value at all,
including values that just happen to be the same as ASCII codes for
space, backslash or double quote.
Your naive algorithm will convert such double-byte character to two
independent Unicode codepoints. The codepoint corresponding to the
trailing byte could then be interpreted by CommandLineToArgvW as a
separator. As a result, a) some parameter will be broken up in the
middle, and b) when your algorithm converts back from Unicode to MBCS,
you'll end up with a lead byte not followed by a trailing byte (or
followed by an unrelated ASCII character that will be misinterpreted as
a trailing byte).
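A small sketch of the trailing-byte problem, assuming Shift-JIS (code page 932) with lead-byte ranges 0x81..0x9F and 0xE0..0xFC. The katakana character encoded as bytes 0x83 0x5C has a trailing byte equal to the ASCII backslash, so a byte-by-byte scan sees a backslash that is not really there. The function names are hypothetical, and the scan is a simplified stand-in for what a lead-byte-aware parser does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Lead-byte test for Shift-JIS (assumed ranges: 0x81..0x9F and 0xE0..0xFC).
bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts every 0x5C byte, ignoring character boundaries.
std::size_t count_backslashes_naive(const std::string& s) {
    std::size_t n = 0;
    for (char c : s)
        if (c == '\\') ++n;
    return n;
}

// Counts 0x5C only where it is a whole character: after a lead byte, the
// next byte is skipped because it belongs to the same character.
std::size_t count_backslashes_dbcs(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b) && i + 1 < s.size()) {
            ++i;                 // skip the trailing byte
            continue;
        }
        if (b == 0x5C) ++n;
    }
    return n;
}

// Usage check with the two-byte character 0x83 0x5C.
void demo_sjis() {
    const std::string so("\x83\x5C");
    assert(count_backslashes_naive(so) == 1);  // false positive
    assert(count_backslashes_dbcs(so) == 0);   // no real backslash
}
```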
Yes, I see. STDARGV.C deals with this (if (_ismbblead(c)) ...). Do you think
that I could (possibly with some effort) include STDARGV.C in my project and use
its parse_cmdline()?
--
- Vince
If it cares about getting the right answer, I would think it would care about
having the correct input. Only if all the characters are ASCII can you do the
conversion in a simple character-by-character manner.
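A minimal sketch of a guard for that condition: check that every byte is in the 0..127 range before taking the character-by-character shortcut. The name is_plain_ascii is hypothetical; when it returns false, the string would need a real code page conversion (for example MultiByteToWideChar) instead.

```cpp
#include <cassert>
#include <string>

// Returns true only if every byte is 7-bit ASCII (0..127), which is the
// case where widening and narrowing by simple assignment is safe.
bool is_plain_ascii(const std::string& s) {
    for (unsigned char b : s)
        if (b > 0x7F) return false;
    return true;
}

// Usage check: quotes and backslashes are fine, a high byte is not.
void demo_ascii() {
    assert(is_plain_ascii("-o \"my file\" C:\\temp"));
    assert(!is_plain_ascii("\xC1"));
}
```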