Strange mojibake (Windows command line)


Darren Cook

Dec 10, 2009, 12:14:38 AM
to nlp-ja...@googlegroups.com
I've a weird issue that I cannot recall ever having seen before. I'm
passing a UTF-8 string from a PHP script to a Windows command-line tool
that then does a DB search. Certain characters get corrupted, but only
when they appear at the end of the string.

I've included a sample below [1]; all these characters are fine if
something comes after them (e.g. [2]). No, I found some exceptions to
that [3].

I've confirmed it is correct when it leaves the PHP script, and is
corrupted by the time it has turned into argv[] in my C++ code.

Has anyone seen this before?

Darren

[1]:
タ タ (OK)
チ チE
ツ チE
テ チE
ト チE

デ チE
ド チE

そ ぁE
た ぁE
ち ち (OK)
つ つ (OK)
て て (OK)
と と (OK)

芝 芁E
立 竁E
字 孁E

日 日
本 本


[2]:
立った日 立った日
トキ トキ
芝立字本 芝立字本


[3]:
ドド ドド
トx チE
トxx チEx
トxト チEチE
芝x 芝x
立xった日 立xった日

--
Darren Cook, Software Researcher/Developer
http://dcook.org/gobet/ (Shodan Go Bet - who will win?)
http://dcook.org/mlsn/ (Multilingual open source semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)

Julien veneziano

Dec 10, 2009, 12:35:14 AM
to nlp-ja...@googlegroups.com
Hi Darren,

The problem is quite simple: Windows by default uses JIS encoding. So the way to fix it would be to send the data to the command line in JIS and convert it back to UTF-8. Or there is a way to force the shell to use UTF-8 as its default; I think that would do the trick.
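
A rough sketch of the first suggestion, written in Python rather than the PHP and C++ used in this thread. It assumes "JIS" here means Shift-JIS as implemented by Windows code page 932, and both function names are hypothetical: the caller transcodes to CP932 before building the command line, and the tool transcodes back to UTF-8.

```python
def utf8_to_cp932(utf8_bytes: bytes) -> bytes:
    """Caller side: re-encode UTF-8 bytes as CP932 (Windows Shift-JIS)."""
    return utf8_bytes.decode("utf-8").encode("cp932")


def cp932_to_utf8(cp932_bytes: bytes) -> bytes:
    """Tool side: convert the CP932 argument back to UTF-8."""
    return cp932_bytes.decode("cp932").encode("utf-8")


original = "チ".encode("utf-8")           # three bytes: e3 83 81
on_the_wire = utf8_to_cp932(original)     # two bytes in CP932
restored = cp932_to_utf8(on_the_wire)
print(original.hex(" "), "->", on_the_wire.hex(" "), "->", restored.hex(" "))
```

One limitation of this approach: any character that has no CP932 mapping makes the encode step raise UnicodeEncodeError, so it only suits text that stays within the Shift-JIS repertoire.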

Regards,

Julien

2009/12/10 Darren Cook <dar...@dcook.org>

--

You received this message because you are subscribed to the Google Groups "nlp-Japanese" group.
To post to this group, send email to nlp-ja...@googlegroups.com.
To unsubscribe from this group, send email to nlp-japanese...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nlp-japanese?hl=en.



Darren Cook

Dec 10, 2009, 1:55:57 AM
to nlp-ja...@googlegroups.com
> The problem is quite simple: Windows by default uses JIS encoding. So the way
> to fix it would be to send the data to the command line in JIS and convert
> it back to UTF-8. Or there is a way to force the shell to use UTF-8 as its
> default; I think that would do the trick.

Hi Julien,
Thanks. I'd dismissed wrong encoding because the characters are fine in
different positions in the string, and because most characters are fine.

Here are the UTF-8 bytes:
タ e3 82 bf (OK)
チ e3 83 81 (Bad)
... ...
ミ e3 83 9f (Bad)
ム e3 83 a0 (OK)

Looks like you are on to something. 8381 to 839f must have special
meaning in Shift-JIS. But only at the end of a string??
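
One reading that fits the samples in this thread: bytes 0x81 to 0x9f (and 0xe0 to 0xfc) are lead bytes of two-byte characters in Shift-JIS, so if something reads the UTF-8 bytes as CP932, a lead byte at the very end has no trail byte to pair with. Whether the last byte is left unpaired depends on how the earlier bytes pair up, which would account for ドド surviving while ド alone does not. The sketch below uses Python's strict cp932 decoder as a stand-in for whatever Windows does internally; that equivalence is an assumption, and survives_cp932 is a hypothetical helper name.

```python
def survives_cp932(text: str) -> bool:
    """True if the UTF-8 bytes of `text` also parse as valid CP932."""
    try:
        # Strict decoding raises on an unpaired lead byte or an unassigned code.
        text.encode("utf-8").decode("cp932")
        return True
    except UnicodeDecodeError:
        return False


# Samples taken from [1], [2] and [3] in the first message.
samples = ["タ", "チ", "ち", "日", "立った日", "トキ", "ドド", "トx", "芝x"]
for s in samples:
    status = "OK" if survives_cp932(s) else "corrupted"
    print(s, s.encode("utf-8").hex(" "), status)
```

Under this model "トx" fails for a slightly different reason: the final byte of ト (0x88) pairs with "x" (0x78), and that two-byte code is not assigned in CP932, whereas the corresponding pair for "芝x" (0x9d 0x78) is.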

Aha, looks like the Windows shell is actually auto-detecting and
converting. If I change the encoding to Shift-JIS on just the PHP side, my
program now returns Shift-JIS (but the returned data has truncated strings).

Darren


Darren Cook

Dec 13, 2009, 9:17:55 PM
to nlp-ja...@googlegroups.com
> I've a weird issue that I cannot recall ever having seen before. I'm
> passing a UTF-8 string from a PHP script to a Windows command-line tool
> that then does a DB search. Certain characters get corrupted, but only
> when they appear at the end of the string.

Quick update for the archives: appending the word "dummy", and then
removing it in my Windows command-line tool, mostly fixed the problem, but
certain words with katakana long vowels still went wrong.
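
A possible explanation for why the suffix only mostly helps, sketched in Python under the assumption that the UTF-8 bytes are being parsed as CP932 somewhere along the way (Python's strict cp932 decoder stands in for that step). The suffix gives a trailing lead byte something to pair with, but it cannot help when two bytes in the middle of the string pair up into a code CP932 does not define. The string "トー" is a made-up test case, not one of the words that actually failed, and first_bad_offset is a hypothetical helper.

```python
from typing import Optional


def first_bad_offset(text: str) -> Optional[int]:
    """Byte offset where a strict CP932 parse of the UTF-8 bytes fails, or None."""
    try:
        text.encode("utf-8").decode("cp932")
        return None
    except UnicodeDecodeError as err:
        return err.start


for word in ["チ", "トー"]:  # "トー" is a constructed example, not from the thread
    print(word, first_bad_offset(word), first_bad_offset(word + "dummy"))
```

For "チ" the suffix removes the failure. For "トー" the bytes pair up as e3 83, 88 e3, 83 bc, and the last pair falls in a gap in the CP932 table, so the string still fails with or without the suffix.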

We couldn't track down a way to get the shell to work in UTF-8 (or we
did but it didn't help).

The eventual solution was to modify the command-line tool to take that UTF-8
string on stdin instead of as a command-line argument. A bit of a pain, as
it required switching from exec() to proc_open() on the PHP side, but it
works. (There is also a proc_open option, bypass_shell, that bypasses
CMD.EXE, which might also have done the job; I didn't try it.)
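
For the archives, a minimal sketch of the stdin approach in Python rather than PHP's proc_open(). The child process here is a hypothetical stand-in for the real command-line tool: it just echoes the bytes it reads from stdin, so the round trip can be checked. The point is that the string travels through a pipe as raw bytes and is never turned into a command-line argument.

```python
import subprocess
import sys

# Hypothetical stand-in for the command-line tool: echo stdin bytes unchanged.
CHILD_SOURCE = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"


def send_via_stdin(text: str) -> bytes:
    """Pass `text` to the child as UTF-8 on stdin and return the child's raw output."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],  # argument list, so no shell is involved
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


for word in ["チ", "芝", "トx"]:
    echoed = send_via_stdin(word)
    print(word, echoed == word.encode("utf-8"))
```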

Darren