CommandLineToArgvA?

Vincent Fatica

Jun 3, 2009, 9:10:58 PM

Is there a function that will parse a multibyte string, producing a count and
distinct multibyte args (similar to CommandLineToArgvW)? The string I want to
parse is not a command line but I want to treat it exactly like a command line
and wind up with multibyte args. Thanks.
--
- Vince

Tim Roberts

Jun 4, 2009, 11:18:18 PM

How much sophistication do you need? CStringW::Tokenize can split a string
up into tokens based on separator characters. It doesn't handle quoted
parameters, however.

Or, you could just convert to Unicode and call CommandLineToArgvW...
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

David Wilkinson

Jun 5, 2009, 6:15:58 AM

In MFC there is the CCommandLineInfo class.

--
David Wilkinson
Visual C++ MVP

Vincent Fatica

Jun 5, 2009, 9:48:07 AM

On Thu, 04 Jun 2009 20:18:18 -0700, Tim Roberts <ti...@probo.com> wrote:

|How much sophistication do you need? CStringW::Tokenize can split a string
|up into tokens based on separator characters. It doesn't handle quoted
|parameters, however.

I want the handling of quoted args.

|Or, you could just convert to Unicode and call CommandLineToArgvW...

... and convert the individual args back to multibyte. I did exactly that,
imitating the allocation scheme (with LocalAlloc) of CommandLineToArgvW so that
a single LocalFree() will free the pointers and the strings. My allocation
assumes that the individual MBCS args will have the same lengths as their wide
counterparts (because I allocated before converting the strings back to
multibyte). Is there a way to *guarantee* that my assumption is valid?
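
On the length question: the assumption is not guaranteed. In a double-byte code page a single wide character can convert back to two bytes, so an MBCS argument can be longer than its wide counterpart. A safer pattern is to measure first and allocate second. The following is a minimal sketch of that pattern in portable C, using `wcstombs` with a null destination to obtain the byte count (documented POSIX and MSVC behaviour) and assuming the C locale matches the string's encoding. `wide_to_mb_alloc` is a hypothetical helper name. On Windows, `WideCharToMultiByte` called with an output buffer size of 0 returns the required size in the same way.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a freshly malloc'd
   multibyte string, measuring first instead of assuming that the
   wide and multibyte lengths match.
   Returns NULL on conversion or allocation failure; caller frees. */
char *wide_to_mb_alloc(const wchar_t *ws)
{
    /* A null destination asks for the required byte count (excluding NUL). */
    size_t needed = wcstombs(NULL, ws, 0);
    if (needed == (size_t)-1)
        return NULL;              /* some character has no multibyte form */

    char *mb = malloc(needed + 1);
    if (mb == NULL)
        return NULL;

    wcstombs(mb, ws, needed + 1); /* writes the bytes plus the terminator */
    return mb;
}
```

The same two-pass idea applies per argument: sum the measured sizes, allocate once, then convert into the block.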
--
- Vince

Vincent Fatica

Jun 5, 2009, 11:34:14 AM

On 5 Jun 2009 09:48:07 -0400, Vincent Fatica <vi...@blackholespam.net> wrote:

||Or, you could just convert to Unicode and call CommandLineToArgvW...
|
| ... and convert the individual args back to multibyte. I did exactly that,
|imitating the allocation scheme (with LocalAlloc) of CommandLineToArgvW so that
|a single LocalFree() will free the pointers and the strings. My allocation
|assumes that the individual MBCS args will have the same lengths as their wide
|counterparts (because I allocated before converting the strings back to
|multibyte). Is there a way to *guarantee* that my assumption is valid?

Never mind. Instead of MultiByteToWideChar and WideCharToMultiByte, I just did
the conversions by assignment. That will ensure the strings are the same
length, that the characters in the parsed string are exactly the same as in the
original string (and it doesn't mess with '"' and '\\' which are important to
CommandLineToArgvW).
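
For reference, a minimal sketch of the "conversion by assignment" idea in portable C. The helper names are hypothetical. Going through `unsigned char` is an addition that is not in the post: where `char` is signed, bytes above 127 would otherwise be sign-extended when widened. The round trip still recovers the original byte either way, but the intermediate wide values differ.

```c
#include <wchar.h>

/* Widen each byte to one wide character.
   dst must have room for every byte of src plus the terminator. */
void widen_by_assignment(const char *src, wchar_t *dst)
{
    /* Go through unsigned char so bytes above 127 map to 128..255
       rather than being sign-extended. */
    while ((*dst++ = (wchar_t)(unsigned char)*src++) != L'\0')
        ;
}

/* Narrow each wide character back to one byte by truncation.
   Only meaningful for wide strings produced by widen_by_assignment. */
void narrow_by_assignment(const wchar_t *src, char *dst)
{
    while ((*dst++ = (char)(unsigned char)*src++) != '\0')
        ;
}
```

This preserves bytes and lengths, but it does not decode the text: every byte becomes its own wide character regardless of the code page.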
--
- Vince

Tim Roberts

Jun 6, 2009, 10:13:11 PM

Vincent Fatica <vi...@blackholespam.net> wrote:
>
>On Thu, 04 Jun 2009 20:18:18 -0700, Tim Roberts <ti...@probo.com> wrote:
>
>|How much sophistication do you need? CStringW::Tokenize can split a string
>|up into tokens based on separator characters. It doesn't handle quoted
>|parameters, however.
>
>I want the handling of quoted args.
>
>|Or, you could just convert to Unicode and call CommandLineToArgvW...
>
> ... and convert the individual args back to multibyte.

Well, this is getting a bit off track of your original query, but you might
consider whether this is the time to convert your whole app to Unicode.
There are distinct advantages to doing so, including a slight performance
boost.

Vincent Fatica

Jun 7, 2009, 12:01:05 AM

On Sat, 06 Jun 2009 19:13:11 -0700, Tim Roberts <ti...@probo.com> wrote:

|Well, this is getting a bit off track of your original query, but you might
|consider whether this is the time to convert your whole app to Unicode.
|There are distinct advantages to doing so, including a slight performance
|boost.

I usually write everything in Unicode. But the project in question is a plugin
DLL for a MBCS app. At plugin init time, the host app passes a single (MBCS)
string, a user parameter. I wanted to parse it like a command line to give the
user greater flexibility (namely quoted strings being a single arg) and to allow
me to use a normal process_argv routine. I came up with this, which works well
(though it lacks error checking). It works just like CommandLineToArgvW.

CHAR** WINAPI MBStringToMBArgv(LPSTR str, INT *pargc)
{
    // alloc memory for a wide version of str
    LPWSTR wstr = (LPWSTR) LocalAlloc(LMEM_FIXED,
        (lstrlenA(str) + 1) * sizeof(WCHAR));

    // "copy" str to wstr
    WCHAR *wp = wstr;
    while ( *wp++ = *str++ );

    // parse wstr
    WCHAR **wargv = CommandLineToArgvW(wstr, pargc);

    // cleanup
    LocalFree(wstr);

    // determine memory needed for argv
    size_t needed = *pargc * (sizeof(CHAR*) + 1); // ptrs and NULs
    for ( INT i=0; i<*pargc; i++ ) // add arg lengths
        needed += lstrlenW(wargv[i]);

    // allocate memory for ptrs and args
    LPBYTE argv = (LPBYTE) LocalAlloc(LMEM_FIXED, needed);

    // fill the pointers and strings
    CHAR **ptrs = (CHAR**) argv;
    CHAR *parg = (CHAR*) ((CHAR**) argv + *pargc);
    for ( INT i=0; i<*pargc; i++ )
    {
        ptrs[i] = parg;
        wp = wargv[i];
        while ( *parg++ = (CHAR) *wp++ );
    }

    // cleanup
    LocalFree(wargv);

    // when done with it use LocalFree() on the returned pointer
    return (CHAR**) argv;
}
--
- Vince

random...@gmail.com

Jun 7, 2009, 12:59:09 AM

On Jun 6, 9:01 pm, Vincent Fatica <vi...@blackholespam.net> wrote:
> CHAR** WINAPI MBStringToMBArgv(LPSTR str, INT *pargc)
> {
> // alloc memory for a wide version of str
> LPWSTR wstr = (LPWSTR) LocalAlloc(LMEM_FIXED,
> (lstrlenA(str) + 1) * sizeof(WCHAR));
>
> // "copy" str to wstr
> WCHAR *wp = wstr;
> while ( *wp++ = *str++ );

You should use a proper multibyte to wide-character conversion function so
you don't do the wrong thing if someone sends you a multibyte string
with a multibyte character.

Vincent Fatica

Jun 7, 2009, 3:27:29 AM

On Sat, 6 Jun 2009 21:59:09 -0700 (PDT), random...@gmail.com wrote:

|>     // "copy" str to wstr
|>     WCHAR *wp = wstr;
|>     while ( *wp++ = *str++ );
|
|You should use a proper multibyte to wide-character conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.

I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
and wind up where you started.
--
- Vince

Scot T Brennecke

Jun 7, 2009, 5:01:27 AM

If not, it would be a reportable bug. Do you have any evidence to
suggest that wouldn't work? If so, let's report it.

Vincent Fatica

Jun 7, 2009, 9:59:46 AM

On Sun, 07 Jun 2009 04:01:27 -0500, Scot T Brennecke <Sc...@Spamhater.MVPs.org>
wrote:

|> I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
|> and wind up where you started.
|
|If not, it would be a reportable bug. Do you have any evidence to
|suggest that wouldn't work? If so, let's report it.

The file 00ff.bin contains each byte, 0~255. CP 875 is Greek. This code gives
the results below it.

BYTE before[256], after[256];
for ( INT i=0; i<256; i++ )
    before[i] = (BYTE) i;
WCHAR wbuf[256];
// convert all 256 byte values to wide characters and back again
MultiByteToWideChar(875, 0, (CHAR*) before, 256, wbuf, 256);
WideCharToMultiByte(875, 0, wbuf, 256, (CHAR*) after, 256, NULL, NULL);
// print every byte that did not survive the round trip
for ( INT i=0; i<256; i++ )
{
    if ( before[i] != after[i] )
        printf("%u %u\n", before[i], after[i]);
}

220 63
225 63
236 63
237 63
252 63
253 63

--
- Vince

Vincent Fatica

Jun 7, 2009, 10:24:52 AM

On Sat, 6 Jun 2009 21:59:09 -0700 (PDT), random...@gmail.com wrote:

|>     // "copy" str to wstr
|>     WCHAR *wp = wstr;
|>     while ( *wp++ = *str++ );
|
|You should use a proper multibyte to wide-character conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.

Do you think CommandLineToArgvW cares about that?
--
- Vince

Vincent Fatica

Jun 7, 2009, 11:10:58 AM

On Sun, 7 Jun 2009 10:41:34 -0400, "Igor Tandetnik" <itand...@mvps.org> wrote:

|Those codepoints are not valid in CP875. Of course you can only expect
|roundtrip if the original string is a valid MBCS string for its codepage
|to begin with. In case you are wondering, 63 is the code for question
|mark '?'.

Oddly, if I use MB_ERR_INVALID_CHARS in MultiByteToWideChar it still succeeds.
Going back, with WideCharToMultiByte and WC_ERR_INVALID_CHARS, it fails.

I don't want to be the policeman. Do you think my method of simple assignment
to convert CHAR <-> WCHAR will foul up CommandLineToArgvW? If the user provides
garbage, I figure he'll get garbage back.

|But I'm not sure why you _need_ a roundtrip. You say your plugin is
|Unicode, except for this one parameter string. So you only need to
|convert it one way, so that everything is now Unicode, right?

I said my plugin was MBCS (as is the hosting app). It uses no C library
functions. I could convert it to Unicode but I'd find myself calling "A"
functions most of the time anyway.
--
- Vince

Igor Tandetnik

Jun 7, 2009, 10:41:34 AM

Vincent Fatica wrote:
> On Sun, 07 Jun 2009 04:01:27 -0500, Scot T Brennecke
> <Sc...@Spamhater.MVPs.org> wrote:
>
>>> I'm not very confident that you can MultiByteToWideChar then
> >>> WideCharToMultiByte and wind up where you started.
>>
>> If not, it would be a reportable bug. Do you have any evidence to
>> suggest that wouldn't work? If so, let's report it.
>
> The file 00ff.bin contains each byte, 0~255. CP 875 is Greek. This
> code gives the results below it.
>

> 220 63
> 225 63
> 236 63
> 237 63
> 252 63
> 253 63

http://www.ascii.ca/ebc875.htm

Those codepoints are not valid in CP875. Of course you can only expect
roundtrip if the original string is a valid MBCS string for its codepage
to begin with. In case you are wondering, 63 is the code for question
mark '?'.

But I'm not sure why you _need_ a roundtrip. You say your plugin is
Unicode, except for this one parameter string. So you only need to
convert it one way, so that everything is now Unicode, right?

--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925

Vincent Fatica

Jun 7, 2009, 12:15:51 PM

On Sun, 7 Jun 2009 11:44:47 -0400, "Igor Tandetnik" <itand...@mvps.org> wrote:

|No. But if the current system codepage is in fact CP1253 (Windows
|codepage for Greek), and the caller did want to pass some Greek
|characters to you, you will silently convert them to accented latin
|characters that just happen to have the same codes in Latin-1 aka
|ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
|U+00FF correspond to, for historical reasons).
|
|For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
|CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A WITH
|ACUTE.

What's the problem? When I convert each Unicode argv back to MBCS with

while ( *p++ = (CHAR) *wp++ );

won't it go back to 193 (and again be interpreted as GREEK CAPITAL LETTER
ALPHA)? I don't think CommandLineToArgvW cares whether it's GREEK CAPITAL
LETTER ALPHA or LATIN CAPITAL LETTER A WITH ACUTE. I'm assuming
CommandLineToArgvW only **interprets** whitespace, backslashes, and
double-quotes.
--
- Vince

Igor Tandetnik

Jun 7, 2009, 11:44:47 AM

Vincent Fatica wrote:
> I don't want to be the policeman. Do you think my method of simple
> assignment to convert CHAR <-> WCHAR will foul up CommandLineToArgvW?

No. But if the current system codepage is in fact CP1253 (Windows
codepage for Greek), and the caller did want to pass some Greek
characters to you, you will silently convert them to accented latin
characters that just happen to have the same codes in Latin-1 aka
ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
U+00FF correspond to, for historical reasons).

For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A WITH
ACUTE.

In other words, your technique only works correctly if you are sure the
incoming string consists entirely of plain vanilla ASCII-7 characters
(codepoints 0 through 127).
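
A guard for that condition could look like the following sketch. `is_ascii7` is a hypothetical helper, not part of any Windows API. A caller could take the assignment shortcut only when it returns true and fall back to a real code-page conversion otherwise.

```c
#include <stdbool.h>

/* Returns true when every byte of s is 7-bit ASCII (0..127). */
bool is_ascii7(const char *s)
{
    for (; *s != '\0'; s++) {
        /* Compare as unsigned so high bytes are not seen as negative. */
        if ((unsigned char)*s > 127)
            return false;
    }
    return true;
}
```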

Igor Tandetnik

Jun 7, 2009, 12:49:09 PM

Ah, I didn't realize you were going to Unicode and back. Anyway, you'd
still have problems with true double-byte encodings, like Chinese BIG-5
or Japanese Shift-JIS. In these encodings, some characters are
represented by two bytes, called lead byte and trailing byte. Lead byte
always has high bit set, but trailing byte could have any value at all,
including values that just happen to be the same as ASCII codes for
space, backslash or double quote.

Your naive algorithm will convert such double-byte character to two
independent Unicode codepoints. The codepoint corresponding to the
trailing byte could then be interpreted by CommandLineToArgvW as a
separator. As a result, a) some parameter will be broken up in the
middle, and b) when your algorithm converts back from Unicode to MBCS,
you'll end up with a lead byte not followed by a trailing byte (or
followed by an unrelated ASCII character that will be misinterpreted as
a trailing byte).
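
A concrete case, as a sketch: in Shift-JIS the katakana character "ソ" is encoded as the two bytes 0x83 0x5C, and 0x5C is also the ASCII backslash. The code below contrasts a byte-wise scan with a lead-byte-aware scan. The function names are hypothetical, and the lead-byte ranges (0x81..0x9F and 0xE0..0xFC) are hard-coded here as an assumption about Shift-JIS; on Windows, `IsDBCSLeadByte` would answer the same question for the system code page.

```c
#include <stddef.h>

/* Assumed Shift-JIS lead-byte ranges: 0x81..0x9F and 0xE0..0xFC. */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-wise count: treats every 0x5C byte as a backslash. */
size_t count_backslashes_naive(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (*s == '\\')
            n++;
    return n;
}

/* Lead-byte-aware count: skips the trailing byte of each
   double-byte character so it is never read as ASCII. */
size_t count_backslashes_dbcs(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        if (sjis_is_lead((unsigned char)*s) && s[1] != '\0') {
            s += 2;               /* lead byte plus its trailing byte */
            continue;
        }
        if (*s == '\\')
            n++;
        s++;
    }
    return n;
}
```

For the two-byte string "\x83\x5C" the naive count reports one backslash while the lead-byte-aware count reports none, which is the misinterpretation described above.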

Vincent Fatica

Jun 7, 2009, 1:24:19 PM

Yes, I see. STDARGV.C deals with this (if (_ismbblead(c)) ...). Do you think
that I could (possibly with some effort) include STDARGV.C in my project and use
its parse_cmdline()?
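
To show the shape of a lead-byte-aware parser, here is a minimal sketch in portable C. It is not the CRT's code and it does not reproduce the full CommandLineToArgvW rules: it only splits on spaces and tabs, groups text inside double quotes, and treats a backslash followed by a double quote as a literal quote. `split_args_dbcs` and `is_lead_byte` are hypothetical names, and the hard-coded Shift-JIS lead-byte ranges are an assumption; a Windows build could call `_ismbblead` or `IsDBCSLeadByte` instead.

```c
/* Assumed Shift-JIS lead-byte ranges: 0x81..0x9F and 0xE0..0xFC. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Splits s in place and stores pointers to the arguments in argv.
   Returns the number of arguments found (at most max_args). */
int split_args_dbcs(char *s, char **argv, int max_args)
{
    int argc = 0;
    char *in = s;

    while (argc < max_args) {
        while (*in == ' ' || *in == '\t')      /* skip separators */
            in++;
        if (*in == '\0')
            break;

        char *out = in;                        /* compact the argument in place */
        int in_quotes = 0;
        argv[argc++] = out;

        while (*in != '\0') {
            unsigned char c = (unsigned char)*in;
            if (is_lead_byte(c) && in[1] != '\0') {
                *out++ = *in++;                /* copy the lead byte ... */
                *out++ = *in++;                /* ... and its trailing byte, uninterpreted */
            } else if (c == '\\' && in[1] == '"') {
                *out++ = '"';                  /* \" becomes a literal quote */
                in += 2;
            } else if (c == '"') {
                in_quotes = !in_quotes;        /* toggle quoting, drop the quote */
                in++;
            } else if (!in_quotes && (c == ' ' || c == '\t')) {
                break;                         /* end of this argument */
            } else {
                *out++ = *in++;
            }
        }

        int at_end = (*in == '\0');            /* check before terminating in place */
        *out = '\0';
        if (at_end)
            break;
        in++;                                  /* step past the separator */
    }
    return argc;
}
```

Because the lead-byte test comes first, a trailing byte equal to 0x5C is copied through as part of its character and never combines with a following quote.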
--
- Vince

David Wilkinson

Jun 7, 2009, 1:44:10 PM

Vincent Fatica wrote:
> |You should use a proper multibyte to wide-character conversion function so
> |you don't do the wrong thing if someone sends you a multibyte string
> |with a multibyte character.
>
> Do you think CommandLineToArgvW cares about that?

If it cares about getting the right answer, I would think it would care about
having the correct input. Only if all the characters are ASCII can you do the
conversion in a simple character-by-character manner.

uniteda...@gmail.com

Feb 4, 2017, 10:32:25 PM

Please see the WINE project

https://www.winehq.org/

It includes source code for an implementation of `CommandLineToArgvW`, which should meet your needs.