encodeURIComponent for C++

Notre Poubelle

Jan 13, 2005, 7:05:06 PM

Hi,

I'd like to use something like JavaScript's encodeURIComponent within my C++
code. I could invoke a scripting engine and have JavaScript call the
encodeURIComponent method, but this seems like overkill. I've also tried the
UrlEscape and InternetCanonicalizeUrl APIs, but they don't properly handle
the conversion of Unicode characters to UTF-8 encoding. For example, if the
user enters the Unicode character at code point U+5357 (0x5357 in UTF-16),
its UTF-8 encoding is E5 8D 97, which encodeURIComponent encodes as the
string %E5%8D%97. This is the output I'm looking for in my C++ code. Thanks.

Yan-Hong Huang[MSFT]

Jan 13, 2005, 10:59:21 PM

Hello,

Based on my understanding, you are looking for a C++ API equivalent to the
JavaScript method encodeURIComponent. Please correct me if I have
misunderstood anything.

In WinInet programming, there is a series of APIs for handling Uniform
Resource Locators. Please refer to
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wininet/wininet/handling_uniform_resource_locators.asp
for detailed information.

InternetCreateUrl may be what you need. It uses the information in the
URL_COMPONENTS structure to create a Uniform Resource Locator. Please feel
free to post here if you have any more concerns.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
Register to Access MSDN Managed Newsgroups!
- http://support.microsoft.com/default.aspx?scid=/servicedesks/msdn/nospam.asp&SD=msdn

This posting is provided "AS IS" with no warranties, and confers no rights.

Notre Poubelle

Jan 14, 2005, 12:39:07 PM

Hello, and thank you for your reply.

Yes, I am trying to find a C++ API that is equivalent to the JavaScript method
encodeURIComponent. The encodeURIComponent method can be used to encode part
of a query string so that it is safe for use on the web. If, for example, I
want to pass two pieces of information on the web that look like this:

param1=data
param2=some special data with spaces & the ampersand character

my encoded params would look like this:

param1=data
param2=some%20special%20data%20with%20spaces%20%26%20the%20ampersand%20character

and my URL might look like this:

http://localhost/default.asp?param1=data&param2=some%20special%20data%20with%20spaces%20%26%20the%20ampersand%20character

I would make two calls to encodeURIComponent in building the safe parts of
the querystring:

encodeURIComponent("data")
= data
encodeURIComponent("some special data with spaces & the ampersand character")
= some%20special%20data%20with%20spaces%20%26%20the%20ampersand%20character

I am looking for the same kind of functionality within C++.
InternetCanonicalizeUrl and UrlEscape work for the above example, but both
fail when the data I want to encode contains non-ASCII characters, such as
Japanese Unicode characters. encodeURIComponent handles these more
gracefully, essentially converting each non-ASCII character in the string to
UTF-8 and writing each resulting byte as a % symbol followed by two hex
digits, as exemplified in the original post.
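
For illustration, here is a minimal portable sketch of the byte-level escaping described above. It assumes the input has already been converted to UTF-8, and it leaves unescaped only the characters that encodeURIComponent leaves alone (letters, digits, and - _ . ! ~ * ' ( )). The function name percent_encode_utf8 is hypothetical; it is not part of WinInet or shlwapi.

```cpp
#include <string>

// Hypothetical helper (not a Windows API): percent-encodes a UTF-8 byte
// string in the style of JavaScript's encodeURIComponent.
// Assumption: the input is already valid UTF-8.
std::string percent_encode_utf8(const std::string& utf8) {
    static const char hex[] = "0123456789ABCDEF";
    // Punctuation that encodeURIComponent leaves unescaped.
    static const std::string unreserved_marks = "-_.!~*'()";

    std::string out;
    out.reserve(utf8.size() * 3);  // worst case: every byte becomes %XX
    for (unsigned char byte : utf8) {
        const bool is_alnum = (byte >= 'A' && byte <= 'Z') ||
                              (byte >= 'a' && byte <= 'z') ||
                              (byte >= '0' && byte <= '9');
        const bool is_mark =
            unreserved_marks.find(static_cast<char>(byte)) != std::string::npos;
        if (is_alnum || is_mark) {
            out += static_cast<char>(byte);  // safe character, copy as-is
        } else {
            out += '%';                      // everything else becomes %XX
            out += hex[byte >> 4];
            out += hex[byte & 0x0F];
        }
    }
    return out;
}
```

Passing the three bytes E5 8D 97 to this function yields the string %E5%8D%97, and a space or ampersand becomes %20 or %26, as in the examples above.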

Notre Poubelle

Jan 14, 2005, 12:45:02 PM

I forgot to mention in my last post that I tried InternetCreateUrl but I
couldn't get the results I was looking for. I am not looking to create a
complete URL; I am trying to encode only part of the URL. I tried
InternetCreateUrl with the ICU_ESCAPE flag and the INTERNET_SCHEME_PARTIAL in
the URL_COMPONENTS structure, but it didn't even properly encode a URL path
that contains the ampersand ("&") character; perhaps I am doing it wrong.
Here's my sample code.

// Components of the partial URL to build.
TCHAR * scheme = _T("http://");
TCHAR * path = _T("&");
INTERNET_SCHEME nScheme = INTERNET_SCHEME_PARTIAL;

URL_COMPONENTS url_comp;
memset(&url_comp, 0, sizeof(url_comp));
url_comp.lpszScheme = scheme;
url_comp.nScheme = nScheme;
url_comp.lpszHostName = NULL;
url_comp.nPort = 0;
url_comp.lpszUrlPath = path;
url_comp.lpszExtraInfo = NULL;
url_comp.dwStructSize = sizeof(url_comp);

// Ask InternetCreateUrl to escape the components while building the URL.
DWORD dwFlags = ICU_ESCAPE;
TCHAR lpszUrl[512];
DWORD dwUrlLength = 512;

BOOL bCreate = InternetCreateUrl(&url_comp, dwFlags, lpszUrl, &dwUrlLength);
if (bCreate == FALSE) {
    DWORD errorCode = GetLastError();
    ATLASSERT(FALSE);
}

Yan-Hong Huang[MSFT]

Jan 17, 2005, 12:30:24 AM

Hello Notre,

I consulted our WinINet team. If one needs to use UrlEscape with Unicode
URLs, do the following:

1. Convert your URL to UCS-2 encoding (use MultiByteToWideChar or other
APIs to convert).
2. Call UrlEscapeW on the UCS-2 encoded string.
3. Convert the resulting URL back to UTF-8 or ANSI using WideCharToMultiByte.

Please test the above and let me know whether it works on your side. Thanks
very much.

Notre Poubelle

Jan 17, 2005, 1:01:02 PM

Hi Yan-Hong,

Thanks again for your reply and continued research.

I'm not an expert in all the encoding schemes, but isn't UCS-2 basically the
same as UTF-16? I have a CComBSTR in a Unicode-enabled project, so I am
already using UTF-16. If there is a way to convert from UTF-16 to UCS-2 (if
indeed they are different), then I don't know which API to use.

If UTF-16 is the same as UCS-2, then I have already performed steps 1 and 2.
Step 3 will convert the bytes from UTF-16 to UTF-8. I think that will work
fine for the characters in the URL that can be expressed as ASCII (single
byte) characters, since they do not need any special encoding to begin with
(beyond that performed by UrlEscapeW). For those parts of the URL that
cannot be expressed in ASCII, where multiple bytes must be used,
WideCharToMultiByte will generate multiple raw bytes. This isn't quite what I
want, at least not directly. What I want is a *string* containing the hex
value of each UTF-8 byte of each non-ASCII character, with each byte prefixed
by the percent (%) character. I'm not sure if I'm expressing this very clearly.

What I'm beginning to think, based on our conversation so far, is that there
is no C++ friendly API function like JavaScript's encodeURIComponent that
will properly encode URL components that contain non-ASCII characters. What
it looks like I may need to do is to perform multiple steps:
1. Get my URL querystring components in UTF-16/UCS-2 (done, I think).
2. For each querystring component, iterate through each Unicode character in
the querystring component and test whether it can be represented as an ASCII
character or whether multiple bytes must be used.
If it can be represented as an ASCII character, then I should call UrlEscapeW
on it to escape any URL special characters. Then call WideCharToMultiByte to
get it back to UTF-8/ANSI.
If the Unicode character cannot be represented in ASCII, then call
WideCharToMultiByte to convert the character to UTF-8 encoding. This will
give me multiple bytes. I must then read each byte's hexadecimal
representation and form a string of the form %[2 digit hex code],
e.g. "%E5".
3. Concatenate all the little strings formed in step 2.

As an optimization, instead of calling UrlEscapeW and WideCharToMultiByte on
each individual Unicode character, I can call them on substrings of contiguous
characters with the same potential encoding. For example, if my URL
querystring subcomponent contains 5 ASCII-representable characters in a
row, followed by two multi-byte characters, I would call UrlEscapeW and
WideCharToMultiByte on these 5 characters and WideCharToMultiByte on the 2
multi-byte characters.

Does this sound accurate based on your understanding? Is there any other
ready-made API function that could simplify my task, or perhaps make it more
efficient?
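
The steps above can be collapsed into a single pass over the UTF-16 string. The following is a portable sketch only, not a Windows API: the names encode_uri_component and append_escaped are hypothetical, and the UTF-16 to UTF-8 conversion is written out by hand (on Windows, WideCharToMultiByte with CP_UTF8 could perform that step instead). One labeled assumption: an unpaired surrogate is replaced with U+FFFD here, whereas JavaScript's encodeURIComponent throws a URIError in that case.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends one byte as %XX using uppercase hex digits.
static void append_escaped(std::string& out, std::uint32_t byte) {
    static const char hex[] = "0123456789ABCDEF";
    out += '%';
    out += hex[(byte >> 4) & 0x0F];
    out += hex[byte & 0x0F];
}

// Hypothetical single-pass encoder: UTF-16 in, percent-encoded ASCII out.
std::string encode_uri_component(const std::u16string& in) {
    static const std::string marks = "-_.!~*'()";  // left unescaped
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // assumption: replace unpaired surrogates
        }

        if (cp < 0x80) {
            const bool is_alnum = (cp >= 'A' && cp <= 'Z') ||
                                  (cp >= 'a' && cp <= 'z') ||
                                  (cp >= '0' && cp <= '9');
            if (is_alnum ||
                marks.find(static_cast<char>(cp)) != std::string::npos) {
                out += static_cast<char>(cp);   // safe ASCII, copy as-is
            } else {
                append_escaped(out, cp);        // e.g. space -> %20
            }
        } else if (cp < 0x800) {                // 2-byte UTF-8 sequence
            append_escaped(out, 0xC0 | (cp >> 6));
            append_escaped(out, 0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {              // 3-byte UTF-8 sequence
            append_escaped(out, 0xE0 | (cp >> 12));
            append_escaped(out, 0x80 | ((cp >> 6) & 0x3F));
            append_escaped(out, 0x80 | (cp & 0x3F));
        } else {                                // 4-byte UTF-8 sequence
            append_escaped(out, 0xF0 | (cp >> 18));
            append_escaped(out, 0x80 | ((cp >> 12) & 0x3F));
            append_escaped(out, 0x80 | ((cp >> 6) & 0x3F));
            append_escaped(out, 0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Because every character is handled in the same loop, there is no need to split the input into ASCII and non-ASCII runs; the substring optimization would only matter if the work were delegated to UrlEscapeW and WideCharToMultiByte.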

Thanks again

Yan-Hong Huang[MSFT]

Jan 18, 2005, 1:31:49 AM

Hi Notre,

There are basically four ways to encode Unicode characters in bytes:

UTF-8
128 characters are encoded using 1 byte (the ASCII characters). 1920
characters are encoded using 2 bytes (Roman, Greek, Cyrillic, Coptic,
Armenian, Hebrew, Arabic characters). 63488 characters are encoded using 3
bytes (Chinese and Japanese among others). Characters above U+FFFF are
encoded using 4 bytes. (The original UTF-8 design allowed 5- and 6-byte
sequences, covering 2147418112 further code points, but RFC 3629 limits
UTF-8 to 4 bytes.) For more info about UTF-8, do `man 7 utf-8' (manpage
contained in the man-pages-1.20 package).

UCS-2 (this is typically what Windows recognizes as a Unicode string; if we
define UNICODE in the project, strings will be UCS-2)
Every character is represented as two bytes. This encoding can only
represent the first 65536 Unicode characters.

UTF-16
This is an extension of UCS-2 which can represent 1112064 Unicode
characters. The first 65536 Unicode characters are represented as two
bytes, the other ones as four bytes.

UCS-4
Every character is represented as four bytes.
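
As a small illustration of the size rules listed above, the following sketch computes how many bytes a single code point occupies in UTF-8 and in UTF-16. The function names utf8_length and utf16_length are hypothetical, and the UTF-8 function assumes the 4-byte limit of RFC 3629 rather than the older 6-byte scheme.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
// Assumption: cp is a valid code point (at most 0x10FFFF), so 4 bytes
// is the maximum, per RFC 3629.
int utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;      // 128 ASCII characters
    if (cp < 0x800) return 2;     // next 1920 characters
    if (cp < 0x10000) return 3;   // next 63488 characters
    return 4;                     // everything above U+FFFF
}

// Hypothetical helper: bytes needed to encode one code point in UTF-16.
int utf16_length(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one code unit, or a surrogate pair
}
```

For U+5357, the character from the original question, this gives 3 bytes in UTF-8 and 2 bytes in UTF-16.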

So, what is the format of your source data? If it is not UCS-2, I suggest
you use the MultiByteToWideChar API to convert it to UCS-2 first. Then use
UrlEscapeW on it and convert it back to the original format. UTF-8 and
UTF-16 both count as multibyte formats here.

All the available WinInet APIs are listed in the link from my first post. If
none of them has the same effect as the script method encodeURIComponent, I
think we may need to use several steps to accomplish the same goal.

Notre Poubelle

Jan 18, 2005, 7:19:03 PM

Ok, I think I'm getting a clearer picture now. Would it be fair to say that
a BSTR is using UCS-2 encoding rather than UTF-16?

Yan-Hong Huang[MSFT]

Jan 18, 2005, 10:30:15 PM

Right.

http://distributions.linux.com/howtos/Unicode-HOWTO-1.shtml can give you
some more information on it. :)

For BSTR, please refer to http://www.devguy.com/fp/Tips/COM/bstr.htm.

Hope that helps. Have a good day.