For one particular URL
(http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=19&l=en&oc=9100SAP&s=dhs)
I observed that, while reading data, one of the chunks contained a
null character in the middle, i.e. the memory representation of the data
read was of the form "aa bb cc 00 xx yy zz". When I tried to append this
chunk to the data that had already been read, it was appended only up to
the point where the 00 occurred, so the portion of the chunk after the
00, i.e. "xx yy zz", was lost.
When I opened this page in IE and did a View Source, I found that this
00 character corresponded to a blank space.
I tried converting the chunk of data to WideChar (using the
MultiByteToWideChar function) before appending it to the data already
read, but after conversion the WideChar buffer contained data only up to
the 00 character.
I also tried appending the MultiByte buffer to the data already read
using the call strContent->append(pszOutBuffer, dwSize). This copied
the complete chunk into the strContent buffer, but when I later tried
to manipulate this buffer, its content was considered only up to the
00 character.
Can anyone tell me where this 00 character comes from, how IE is able
to handle it, and what can be done to solve this problem so that data
is not lost?
The code snippet is attached below.
string *strContent = new string;
do
{
    // Check for available data.
    dwSize = 0;
    WinHttpQueryDataAvailable(hRequest, &dwSize);
    if (dwSize > 0)
    {
        // Allocate space for the buffer, plus one byte for a terminating NUL.
        pszOutBuffer = new char[dwSize + 1];
        ZeroMemory(pszOutBuffer, dwSize + 1);
        // Read the data.
        if (!WinHttpReadData(hRequest, (LPVOID)pszOutBuffer, dwSize,
                             &dwDownload))
        {
            TRACE(_T("Error %u in WinHttpReadData.\n"), GetLastError());
        }
        // This overload stops at the first 00 byte in the chunk.
        strContent->append(pszOutBuffer);
        delete [] pszOutBuffer;
    }
} while (dwSize > 0);
I would be very grateful if someone could share some information on
this.
Thank You
Gaurav Jain
I took a network sniff (netmon) while browsing to the site below. In the response, the charset is "utf-8". Since 0x00 is a legal UTF-8 character, things are normal from WinHttp's perspective.
Even though 0x00 is legal in a UTF-8 stream, it is not a legal XML character and probably not legal in HTML (I am not 100% sure). I think your problem here is how to deal with invalid HTML tokens; you may simply convert them to spaces (0x20).
You mentioned using "strContent->append(pszOutBuffer, dwSize)" below. I think that's good because you don't lose any information. After you read all the data, you would simply hand it over to your rendering layer and let it decide what to do.
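As a rough illustration of the two points above (append with an explicit length, then convert the stray 0x00 bytes to spaces), here is a minimal, portable sketch. The helper names appendChunk and replaceNulsWithSpaces are made up for this example, and a plain std::string stands in for strContent. In the real read loop, the length passed would presumably be the byte count reported by WinHttpReadData (dwDownload in the snippet).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append exactly `size` bytes of `chunk`,
// embedded 0x00 bytes included. The one-argument append(chunk)
// would stop at the first 0x00 instead.
void appendChunk(std::string& content, const char* chunk, std::size_t size) {
    content.append(chunk, size);
}

// Hypothetical helper: turn every 0x00 byte into a space (0x20) so that
// later NUL-terminated string handling sees the whole buffer.
void replaceNulsWithSpaces(std::string& content) {
    std::replace(content.begin(), content.end(), '\0', ' ');
}

// Small self-check showing the difference between the two append overloads.
void demonstrateTruncation() {
    const char chunk[] = {'a', 'b', '\0', 'x', 'y'};

    std::string truncated;
    truncated.append(chunk);              // stops at the embedded 0x00
    assert(truncated.size() == 2);

    std::string complete;
    appendChunk(complete, chunk, sizeof(chunk));
    assert(complete.size() == 5);         // all five bytes are kept

    replaceNulsWithSpaces(complete);
    assert(complete == "ab xy");
}
```

Note that a std::string keeps the bytes after a 0x00 without trouble. It is code that goes through c_str() or other NUL-terminated string functions that sees only the part before the first 0x00, which would explain the "content only till the 00 character" behaviour during later manipulation.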
Thanks,
Biao.W. [MSFT]
----- Gaurav wrote: -----