
Problem in Auto detecting Codepage


shresht...@gmail.com

Oct 26, 2006, 3:48:11 AM
Hi,

I am trying to auto-detect the code page of a text file that contains
English as well as other-language characters.
The text file is saved in UTF-8 format.
For this I tried IMultiLanguage2::DetectInputCodepage with the
MLDETECTCP_NONE flag.


The problem is that for certain files it detects the actual code page,
whereas for others it simply returns the English code page.


Here is the relevant piece of code that I am using (CoCreateInstance
has already been called).


if (S_OK == mycodePageRecognizer.GetIMultiLanguage2(&pMultiLanguage2))
{
    // Hand the raw interface pointer over to the smart-pointer wrapper.
    XInterface<IMultiLanguage2> xMultiLanguage2;
    xMultiLanguage2.Set(pMultiLanguage2);
    pMultiLanguage2 = 0;

    // pSrcStr (declared elsewhere) points to the raw bytes of the file.
    // cSrcSize is the number of bytes in that buffer.
    INT cSrcSize = myserialStream.GetNewFileSize();

    // Only one candidate is requested here.
    DetectEncodingInfo myEncodings[1];
    INT cEncodings = sizeof(myEncodings) / sizeof(DetectEncodingInfo);

    HRESULT hr = xMultiLanguage2.GetPointer()->DetectInputCodepage(
        MLDETECTCP_NONE, 0, pSrcStr, &cSrcSize, myEncodings, &cEncodings);

    if (SUCCEEDED(hr) && cEncodings > 0)
    {
        // Take the first candidate returned.
        myulCodePage = myEncodings[0].nCodePage;
    }
}


As an example, with a text file containing English and Japanese
characters, detection worked fine when the file had 199 characters but
failed with 200 or more.
When I increased the size to around 250 characters it started working
again (it returned the correct code page).
I know this probably has no correlation with the size as such, but I
mention it as an example.


**** To repeat: the file is saved in UTF-8 format. (Detection worked
fine for any number of characters when the file was saved in UTF-16 BE
or LE format.)


Please help me find where exactly I am going wrong.


Thanks and Regards,
Shreshth Luthra
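
Since the file in question is known to be UTF-8, one way to sidestep statistical detection for that case is to test the raw bytes for well-formed UTF-8 first and only fall back to DetectInputCodepage when that test fails. The following is a minimal sketch under that assumption, not something taken from the post or from MLang: LooksLikeUtf8 is a hypothetical helper name, and the check is plain byte validation with no Windows dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of MLang): returns true if the buffer is
// well-formed UTF-8, optionally preceded by the EF BB BF byte order mark.
// Pure ASCII input also returns true, because ASCII is valid UTF-8.
bool LooksLikeUtf8(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    std::size_t i = 0;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        i = 3;  // skip the byte order mark

    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t kMinCodePoint[] = {0x0, 0x80, 0x800, 0x10000};

    while (i < size) {
        const unsigned char lead = data[i];
        if (lead < 0x80) {  // single-byte (ASCII) character
            ++i;
            continue;
        }

        std::size_t extra = 0;   // number of continuation bytes expected
        std::uint32_t cp = 0;    // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra >= size - i)
            return false;   // sequence is cut off at the end of the buffer

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < kMinCodePoint[extra]) return false;       // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        if (cp > 0x10FFFF) return false;                   // out of range

        i += extra + 1;
    }
    return true;
}
```

A caller could run this on the same byte buffer that is passed to DetectInputCodepage and use code page 65001 (UTF-8) directly when it returns true. Note that a buffer truncated in the middle of a multi-byte character is reported as not UTF-8, so the check should be given whole characters.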

shresht...@gmail.com

Oct 27, 2006, 12:05:57 AM
One more thing I tried was passing a bigger array of
DetectEncodingInfo structures. The correct code page then shows up as
the 2nd or 3rd entry.
However, the last data member of that entry, nConfidence, is -1.

Can anyone explain what I can deduce from this information, and how to
use it to get a better result?

Regards,
Shreshth
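
The observation above, that the correct code page appears further down a larger result array, can be checked programmatically. The sketch below is only an illustration under stated assumptions: Candidate is a portable stand-in that mirrors the four fields of DetectEncodingInfo so the code compiles without the Windows SDK, and FindCodePage is a hypothetical helper, not an MLang function. It deliberately ignores nConfidence, because the meaning of the reported value -1 is not established in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in mirroring the fields of MLang's DetectEncodingInfo.
// With the Windows SDK available, the real structure would be used instead.
struct Candidate {
    std::uint32_t nLangID;      // language identifier
    std::uint32_t nCodePage;    // detected code page
    std::int32_t  nDocPercent;  // share of the document in this language
    std::int32_t  nConfidence;  // detector's confidence value
};

// Hypothetical helper: returns the index of the first candidate whose code
// page equals `wanted`, or -1 if the detector did not list it at all.
int FindCodePage(const Candidate* candidates, int count, std::uint32_t wanted)
{
    assert(candidates != nullptr || count == 0);
    for (int i = 0; i < count; ++i) {
        if (candidates[i].nCodePage == wanted)
            return i;
    }
    return -1;
}
```

For example, when there is independent evidence that a file is UTF-8, calling FindCodePage(results, cEncodings, 65001) shows whether the detector listed UTF-8 at all and at which position, rather than relying only on the first entry.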
