mojibake recovery tool


Johan Fänge

Aug 19, 2007, 2:02:52 PM
to Honyaku E<>J translation list
Hi,
I'm writing an online mojibake recovery tool. I'd like some feedback
on how it's working:

https://mimizu.mine.nu:8443/tdict/mojibake.jsp

The idea is to paste something like "å 応æ
€§ã‚¹ãƒ‘ッタリング法", and it should respond with the original
text (in this case "反応性スパッタリング法"), along with some indication of how it was
corrupted.

If you have some mojibake lying around please try it with the tool.
I'm collecting examples of mojibake, especially mojibake that it can't
handle yet.

/Johan Fänge

Johan F

Aug 20, 2007, 12:24:11 AM
to hon...@googlegroups.com
How ironic: the example itself got mojibaked when I posted through the Google
Groups web interface. E.g. the euro sign (€) turned into "EURO". (Not something
I can handle right now.) The example string should have been:

å å¿OEæEURO§ã,¹ãƒ'ッã,¿ãƒªãƒ³ã,°æ³*

Oh, and the tool is (meant) for Japanese only (as it attempts to
recognize Japanese text).
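
A rough sketch of how such a recognizer could work, not necessarily how the tool itself does it: try each pair of (encoding the text was wrongly decoded with, encoding it was really in), undo the wrong decode, and rank the results by how much kana and kanji they contain. The names `japanese_ratio` and `guess_repairs`, the candidate list, and the scoring rule are all assumptions made for illustration.

```python
from itertools import product

# Candidate encodings to try; this list is an assumption, not the tool's.
ENCODINGS = ["utf-8", "shift_jis", "euc_jp", "cp1252", "latin-1"]


def japanese_ratio(text):
    """Fraction of characters that are hiragana, katakana, or common kanji."""
    if not text:
        return 0.0
    hits = sum(
        1 for c in text
        if "\u3040" <= c <= "\u30ff" or "\u4e00" <= c <= "\u9fff"
    )
    return hits / len(text)


def guess_repairs(mojibake):
    """Return (score, wrong_decoder, real_encoding, text), best guess first."""
    results = []
    for wrong, right in product(ENCODINGS, repeat=2):
        if wrong == right:
            continue
        try:
            # Undo the wrong decode, then decode with the candidate encoding.
            candidate = mojibake.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the string
        results.append((japanese_ratio(candidate), wrong, right, candidate))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Simulate one corruption path and look at the top-ranked guess.
garbled = "日本語".encode("shift_jis").decode("cp1252")
score, wrong, right, text = guess_repairs(garbled)[0]
print(text, "(decoded as", wrong, "but really", right + ")")
```

Note that Python's strict `cp1252` codec rejects five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a real tool would need a more lenient mapping for strings that contain them.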

/Johan Fänge

Johan Fänge wrote:


> Hi,
> I'm writing an online mojibake recovery tool. I'd like some feedback
> on how it's working:
>
> https://mimizu.mine.nu:8443/tdict/mojibake.jsp
>

> The idea is to paste something like "å å¿OEæ
> EURO§ã,¹ãƒ'ッã,¿ãƒªãƒ³ã,°æ³*", and it should respond with the original

Johan F

Aug 20, 2007, 1:37:42 AM
to hon...@googlegroups.com
Hrm, so that didn't work either. Seems like I can't expect to send
untampered UTF-8 through google groups. Here's the same text encoded in
a URL instead:

https://mimizu.mine.nu:8443/tdict/mojibake.jsp?q=%C3%A5%C2%8F%C2%8D%C3%A5%C2%BF%C5%93%C3%A6%E2%82%AC%C2%A7%C3%A3%E2%80%9A%C2%B9%C3%A3%C6%92%E2%80%98%C3%A3%C6%92%C6%92%C3%A3%E2%80%9A%C2%BF%C3%A3%C6%92%C2%AA%C3%A3%C6%92%C2%B3%C3%A3%E2%80%9A%C2%B0%C3%A6%C2%B3%E2%80%A2
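
For anyone curious what that query string contains, here is a minimal Python sketch, assuming the corruption was "UTF-8 bytes decoded as windows-1252". The helper names are hypothetical and this is not necessarily how the tool does it. Python's `cp1252` codec leaves five bytes undefined, so the sketch falls back to Latin-1 for those, which matches how browsers treat them.

```python
from urllib.parse import unquote

# Query string copied from the URL above (percent-encoded UTF-8 of the mojibake).
QUERY = (
    "%C3%A5%C2%8F%C2%8D%C3%A5%C2%BF%C5%93%C3%A6%E2%82%AC%C2%A7"
    "%C3%A3%E2%80%9A%C2%B9%C3%A3%C6%92%E2%80%98%C3%A3%C6%92%C6%92"
    "%C3%A3%E2%80%9A%C2%BF%C3%A3%C6%92%C2%AA%C3%A3%C6%92%C2%B3"
    "%C3%A3%E2%80%9A%C2%B0%C3%A6%C2%B3%E2%80%A2"
)


def to_cp1252_bytes(text):
    """Encode text as windows-1252, one character at a time.

    Characters that cp1252 cannot encode (e.g. U+008F, U+008D) are
    mapped through Latin-1 instead, so they become bytes 0x8F, 0x8D.
    Anything outside both code pages still raises UnicodeEncodeError.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")
    return bytes(out)


def recover_utf8_read_as_cp1252(mojibake):
    """Undo the wrong windows-1252 decode, then decode as UTF-8."""
    return to_cp1252_bytes(mojibake).decode("utf-8")


mojibake = unquote(QUERY)  # the garbled string as it should have appeared
print(recover_utf8_read_as_cp1252(mojibake))  # prints 反応性スパッタリング法
```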

/Johan Fänge


Warren Smith

Aug 20, 2007, 9:11:15 PM
to hon...@googlegroups.com
Hmmm...

I tried to get to this site because I just got a mojibake note from an
important client....

This is what Norton said when I tried to open this site:
-------------------
There is a problem with this website's security certificate.


The security certificate presented by this website was not issued by a
trusted certificate authority.

Security certificate problems may indicate an attempt to fool you or
intercept any data you send to the server.
We recommend that you close this webpage and do not continue to this
website.

If you arrived at this page by clicking a link, check the website address in
the address bar to be sure that it is the address you were expecting.
When going to a website with an address such as https://example.com, try
adding the 'www' to the address, https://www.example.com.
If you choose to ignore this error and continue, do not enter private
information into the website.

For more information, see "Certificate Errors" in Internet Explorer Help.

minoru

Aug 20, 2007, 11:00:55 PM
to Honyaku E<>J translation list
I needed to recover some mojibake sentences because I am using
Trados for this translation and the client gave me Trados data.
Interestingly, the Trados data worked for most of the sentences, but
it could not reproduce the previous sentences in fewer than 5% of
all cases.
Here is one example to which I applied your tool.

u [ L z [ X, L p [ ? , 'ア, O, i右 L p [ , ,Q-
{, z [ X j A, ,ア, , 'ア, , , , , z [ X, '[, 適 , u [ L t [ h, ?
, , , , A適-, -e , , , , , B

ブレーキホースのキャリパー?分の接?を外し(?キャリパー上の2本のホース)、ど?にも接?していないホースの端を?正なブレーキフルードの??チて
いる、?魔ネ容器に黒し込みまu。

As you can see the result is not perfect.

Minoru Mochizuki

On Aug 20, 3:02 AM, Johan Fänge <va...@vaste.mine.nu> wrote:
> Hi,
> I'm writing an online mojibake recovery tool. I'd like some feedback
> on how it's working:
>
> https://mimizu.mine.nu:8443/tdict/mojibake.jsp
>
> The idea is to paste something like "

> § , ' , ,° ", and it should respond with the original


> text (in this case "反応性スパッタリング法"), and something of how it was
> corrupted.
>
> If you have some mojibake lying around please try it with the tool.
> I'm collecting examples of mojibake, especially mojibake that it can't
> handle yet.
>

> /Johan F nge

Steve Venti

Aug 21, 2007, 12:05:22 AM
to hon...@googlegroups.com
Warren Smith wrote:

> Hmmm...


>
> This is what Norton said when I tried to open this site:

Warren, if Ralph Kramden was smart enough not to believe everything he
heard from Norton, maybe you should be, too. <g>

--
Steve Venti

The source of all unhappiness is other people.
--Wally
-----------------------------------------------------------------------

Johan F

Aug 21, 2007, 12:17:14 AM
to hon...@googlegroups.com
Ah, yes, it's a self-signed certificate. It should be nothing to worry
about, since this site isn't something "trusted" (like a bank).

It just means I haven't paid Verisign to authenticate me (and instead
I've "issued" a certificate myself). Check e.g. wikipedia for more info
about self-signed certificates.

I hope the site works just fine if you ignore the warnings.

/Johan Fänge


Johan F

Aug 21, 2007, 1:24:43 AM
to hon...@googlegroups.com
Hm, that looks very much like it was originally encoded in
Shift JIS and then decoded as windows-1252 ("Western European", often
loosely labelled "iso-8859-1"). I think this is the most common corruption path for Japanese.

However, that can't be the whole story, since the mojibake-string
contains kanji (e.g. 右 and 適). This just isn't possible if it was
simply decoded with windows-1252, since that character set only contains
"western" characters. Of course it might have been partly destroyed as
well. (E.g. if "日本語" gets corrupted to "???" there's not much one can
do. Almost no information is preserved, really only that it's 3 characters.)
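
To make both points concrete, here is a small Python sketch. The helper names are hypothetical, and the Latin-1 fallback for the five bytes that Python's `cp1252` codec leaves undefined is an assumption about how the wrong decode behaved. It shows that a plain Shift JIS to windows-1252 mix-up is reversible and never produces kanji, while replacement with "?" keeps only the length.

```python
def cp1252_decode(data):
    """Decode bytes as windows-1252, using Latin-1 for undefined bytes."""
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(bytes([b]).decode("latin-1"))
    return "".join(chars)


def cp1252_encode(text):
    """Inverse of cp1252_decode: map each character back to one byte."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")
    return bytes(out)


def contains_kanji(text):
    """True if any character is in the main CJK ideograph block."""
    return any("\u4e00" <= c <= "\u9fff" for c in text)


original = "日本語"

# Path 1: Shift JIS bytes read as windows-1252. Ugly, but nothing is lost.
garbled = cp1252_decode(original.encode("shift_jis"))
print(contains_kanji(garbled))                      # False
print(cp1252_encode(garbled).decode("shift_jis"))   # 日本語

# Path 2: unencodable characters replaced with "?". Only the length survives.
lost = original.encode("ascii", errors="replace").decode("ascii")
print(lost)                                         # ???
```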

Perhaps you could describe in more detail how the text was
corrupted? It would also help if I knew the original text (then I could
try to reproduce the corruption). Also, I'm not familiar with Trados:
how it works or what file formats it uses.

/Johan Fänge

PS. Google wasn't kind to your mojibake-string. However, I found it in
the logs. This time I've attached it inside a skeleton html-file. Let's
see how it survives.


test.html