
Unicode corruption in Characterization field for Chinese characters


Dan Meineck

Oct 11, 2004, 8:23:03 AM
Hi there, I wonder if anyone can help me. I am creating an Index Server based
search plug-in for a .NET site, using Cisso to query Index Server. It's all
working nicely, and I have now started looking at how the server indexes pages
with Unicode characters, specifically, in this example, Chinese.

I am indexing flat HTML pages in a publish directory. Each page has an
MS.LOCALE meta element set to the locale of the page's language, in my case
'zh-CN'.

Setting the code page of the Cisso wrapper allows the foreign characters to
render correctly, and setting the locale ID to Chinese allows Chinese
characters to be accepted as search terms.

My problem is that when I run a search and get the results back, the value
retrieved straight from the Cisso dataset in the characterization column, for
the Chinese content result, is corrupt:

"my keywords. 锘?html>. Latest News鏅寸鍦板尯鏀垮簻 鈥撴湇鍔″ぇ浼? 鏅寸鍦板尯鏀垮簻
鈥撴湇鍔″ぇ浼楁櫞绌哄湴鍖烘斂搴滅幇鏈?5浣嶅鍛樸備粬浠潵鑷拰浠h〃鐫28涓夊尯浜烘皯缇や紬锛屽苟鍦ㄤ换鏈熺殑鍥涘勾閲岋紝璐熻矗鏅寸鍦板尯鐨勫畯瑙傛斂绛栦笌瑙勫垝锛屾彁渚涘叕鍏辨湇鍔″拰鍐冲畾鍚勭鏈嶅姟鐨勬敹璐广?
閫氳繃鏈綉绔欙紝鎮ㄥ彲浠ラ愭笎娣卞叆浜嗚В鏅寸鏀垮簻鍚勯」鏂逛究甯傛皯鐨勬湇鍔′互鍙婃斂搴滃姛鑳斤紝鍚勯儴闂ㄧ殑鑱旂郴鏂瑰紡鍜屾斂搴滃勾搴︽姤鍛娿俉hat's NewTwo Column Lorem ipsum dolor sit amet, consete"

- Notice the '?html' and that the 'W' is missing from 'What's NewTwo Column'.
I will add the HTML source of the indexed page below to clarify:

<html><head><title>dan</title>
<meta name="MS.LOCALE" content="zh-cn">
<meta name="keywords" content="my keywords">
<meta name="comments" content="">
<meta name="author" content="Admin">
<meta name="accessrights" content=",1,2,">
<meta name="immediacyurl" content="http://localhost/immsample501">
<meta name="lastsavedtm" content="08/10/2004 10:31:53">
<meta name="categories" content=",">
<meta name="language" content="--">
</head><body>Latest News晴空地区政府 –服务大众

晴空地区政府
–服务大众晴空地区政府现有55位委员。他们来自和代表着28个选区人民群众,并在任期的四年里,负责晴空地区的宏观政策与规划,提供公共服务和决定各种服务的收费。

通过本网站,您可以逐渐深入了解晴空政府各项方便市民的服务以及政府功能,各部门的联系方式和政府年度报告。What''s NewTwo Column
Lorem ipsum dolor sit
amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor
invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum
dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing
elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren.
dan dan dan dan dan dan my keywords my keywords my keywords my keywords my
keywords
</body>
</html>

- It looks as if the corruption comes straight out of Index Server. Can
anyone shed any light on this? Another problem I found is that if the title
is in Chinese text, it gets ignored, because for some reason the meta data
below it is corrupted.
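
A note on the sample above: the garbled text is consistent with the page's UTF-8 bytes being decoded as GBK (Windows code page 936) somewhere between the file and the result set. That is an inference from the sample, not something confirmed about Index Server internals. Below is a minimal Python sketch of that assumption (the function name is made up for illustration). The UTF-8 byte-order mark turns into '锘' plus a stray byte, and the last byte of the full stop '。' pairs with the following 'W', which would explain why the 'W' disappears. How the stray byte is rendered depends on the decoder, so the output is not expected to match Index Server's character for character.

```python
# Assumption: the indexed file is UTF-8 with a byte-order mark, and something
# on the way to the result set decodes those bytes as GBK (code page 936).

def misdecode_utf8_as_gbk(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were GBK."""
    return text.encode("utf-8").decode("gbk", errors="replace")


# The BOM is EF BB BF in UTF-8. Read as GBK, EF BB maps to one Chinese
# character and BF is left over as a stray lead byte.
bom_start = misdecode_utf8_as_gbk("\ufeff<html>")

# The full stop is E3 80 82 in UTF-8. Read as GBK, E3 80 becomes one
# character and 82 pairs with the byte for 'W' (57), swallowing the 'W'.
sentence_end = misdecode_utf8_as_gbk("。What's New")

print(bom_start)
print(sentence_end)
```

If the assumption holds, checking which code page each layer expects (the file itself, the Index Server filter, and the Cisso wrapper) would be a reasonable place to start.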

Thanks,

Dan

Hilary Cotter

Oct 11, 2004, 11:45:36 AM
Your server locale has to be Chinese for the characterization to show up
correctly.


Dan Meineck

Oct 11, 2004, 5:01:01 PM
Hi Hilary,

Thanks for looking at my question. I did see you mention this in some
previous answers to other people's questions. The problem I have is that this
search plug-in is designed for a multi-language site, so setting the server
locale to Chinese isn't possible, as we may want to search against Welsh,
Chinese, Japanese, Arabic etc. at the same time. Is this not possible with
Index Server?

Also, if the above is correct, does that mean that unless the server locale
matches the language of the indexed documents, the characterization will be
corrupted regardless of the MS.LOCALE setting (for example in HTML)? Is this
a bug or a 'feature'? ;)

Many thanks for your help - I see you are very helpful on these forums!

Best Regards,

Dan


Hilary Cotter

Oct 11, 2004, 9:40:16 PM
That's not quite true, and perhaps I wasn't very clear. To get the
characterization to show up correctly for some of the Asian languages, you
need a localized version of the server, i.e. you need the Traditional
Chinese version of Win2003. Non-Unicode text will show up fine, and it will
be hit and miss for the other languages.

This has no bearing on the MS.LOCALE metatag, which IS uses to break words
in HTML content according to the language rules for that language/locale.

So to host the languages you wish (Welsh, Chinese, Japanese, and Arabic) you
will
1) need the correct word breakers installed
2) need to select the localized version of the OS which matches the
majority of your users.

I take it also that you are using the description metatag for your
characterization as well.

If this is critical to you, I would open a support incident with Microsoft
PSS.



Dan Meineck

Oct 12, 2004, 11:53:02 AM
Hi Hilary,

Thanks for the answer - it looks like I will have to put up with the way
Index Server works. It's unfortunate, as we want to create a multi-language
site and obviously allow any language and its text system.

We are using everything after the <BODY> tag for the characterization
summary - I was under the impression this was generated by Index Server using
this info.

I could ring PSS and explain this issue, but I doubt they would be able to
provide me with a workaround if what you have explained is true. Having
worked for PSS in the old VB client team in the UK, though, I may end up
speaking to some old friends ;)

Thanks for all your help.

Dan


Hilary Cotter

Oct 12, 2004, 3:05:55 PM
If you want only what occurs between the body tags to show up, you are
out of luck. Characterization for HTML pages is the first 320 bytes (by
default), or the contents of the description metatag.
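
To make that rule concrete, here is a rough Python sketch of the selection logic as described: use the description metatag if the page has one, otherwise fall back to the first 320 bytes of page text. It is only an illustration, not how Index Server is actually implemented. The function and class names are made up, and measuring the limit in UTF-8 bytes, skipping the title, and collapsing whitespace are all assumptions.

```python
from html.parser import HTMLParser

MAX_BYTES = 320  # default limit mentioned above


class SummaryParser(HTMLParser):
    """Collects the description meta tag and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.text_parts = []
        self._skip_depth = 0  # inside <script>, <style> or <title> (assumption)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag in ("script", "style", "title"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style", "title") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def characterization(html: str, max_bytes: int = MAX_BYTES) -> str:
    """Return the description metatag if present, else the clipped page text."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description
    # Collapse runs of whitespace, then clip by bytes rather than characters.
    text = " ".join(" ".join(parser.text_parts).split())
    clipped = text.encode("utf-8")[:max_bytes]
    # A multi-byte character cut in half at the limit is dropped, not garbled.
    return clipped.decode("utf-8", errors="ignore")
```

Under these assumptions, adding a description metatag to each published page is the way to control what appears in the summary. A byte-based limit also means Chinese text gets far fewer characters than English text before it is cut off.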

I urge you to call PSS for their definitive answer to this question. It is
unlikely that your friends in the VB team will field your question. If by
chance they do, I would ask that they consult the Index Server PSS support
group in the US.
