[AOLSERVER] More with the Chinese translations...


Janine Sisk

Jun 25, 2009, 7:04:02 PM
to AOLS...@listserv.aol.com
The Java solution is working, but it's kind of slow.  I thought I'd give a try to what several of you suggested, namely using Tcl to do the conversion instead.  Of course I've run into problems here too... nothing could be easy about this! :)

To recap, I'm currently using a translator written in Java, from mandarintools.com.  My servlet requests a page from the Traditional Chinese site, setting the charset to UTF-8. It then uses the converter to translate it from UTF-8 to UTF-8S, which is a version of Simplified Chinese that's apparently somewhat obscure, but gives the right results.  It is then written out to the client with the charset once again set to UTF-8.

All of my attempts to recreate this in Tcl have resulted in garbage.  I started out assuming that my incoming data from ns_httpget will be in UTF-8, since the Traditional site is using it and Tcl strings default to that encoding.  So I tried

set page_body [ns_httpget "http://big5.hrichina.org"]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body

The outgoing charset is also set to UTF-8, via the old Arsdigita ReturnHeaders proc.  But this results in garbage.

After messing with this for a while I decided to make sure I could read the page in and spit it back out without error. Nope.
"encoding system" told me that the system encoding is iso8859-1, which seems correct. I've tried all combinations of converting from this, or not, and converting to utf-8, or not, and get garbage no matter what. I've also tried using "encoding system" to set Tcl's encoding to utf-8, but still no joy.

Any suggestions?

thanks,

janine


Stephen Deasey

Jun 25, 2009, 8:53:25 PM
to AOLS...@listserv.aol.com
On Fri, Jun 26, 2009 at 12:04 AM, Janine Sisk <jan...@furfly.net> wrote:
>
> set translated_page_body [encoding convertto gb2312 $page_body]


This isn't going to work. It's not the encoding of the characters you
need to change but the characters themselves.

You want to do the equivalent of this:

% string map {h H w W} "hello world"
Hello World

Where 'h' and 'H' are similar, but distinct, letters and the encoding
hasn't changed (happens to be utf8 in my terminal, could be ascii
etc.).
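
To make the distinction concrete, here is a minimal sketch in plain Tcl (no AOLserver commands). It assumes the Traditional/Simplified pair U+570B and U+56FD purely as an illustration; the variable names are made up. Re-encoding only changes the bytes, so decoding gives back the same character, while string map swaps the character itself:

```tcl
# Traditional U+570B and Simplified U+56FD are two different code points
# (an illustrative pair, not taken from the original posts).
set traditional "\u570b"
set simplified  "\u56fd"

# Re-encoding changes only the byte representation. Decoding those bytes
# gives back exactly the same character we started with.
set bytes     [encoding convertto utf-8 $traditional]
set roundtrip [encoding convertfrom utf-8 $bytes]
puts [string equal $roundtrip $traditional]   ;# prints 1

# string map replaces the character itself; no encoding is involved.
set mapped [string map [list $traditional $simplified] $traditional]
puts [string equal $mapped $simplified]       ;# prints 1
```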


In fact, looking at the source code for the software at
mandarintools.com that's all they're doing. It's a poor quality
conversion, according to what I've read, but if it's sufficient well
hey!

The file hcutf8.txt in the .zip source bundle contains a mapping of
simplified to traditional characters. The first character on each
line (that is not a comment) is the simplified character, followed by
one or more traditional candidate characters.

You could create a Tcl list in the format string-map is expecting,
with each of the traditional characters followed by the simplified
character. Without using a proxy setup, simply map the
utf8-traditional response body into utf8-simplified and send directly
to the browser.
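
A rough sketch of that idea, not a tested implementation. It assumes comment lines start with "#" and that everything after the first character on a line is a traditional candidate; check both against the real hcutf8.txt. The proc name build_trad_to_simp_map and the two sample lines are made up for illustration and stand in for the file's contents:

```tcl
# Build a {traditional simplified ...} list for [string map] from the text
# of a mapping file. Assumed line format (verify against the real file):
#   <simplified><traditional>[<traditional>...]
# with "#" starting a comment line.
proc build_trad_to_simp_map {mapping_text} {
    set map [list]
    foreach line [split $mapping_text "\n"] {
        set line [string trim $line]
        if {$line eq "" || [string index $line 0] eq "#"} {
            continue
        }
        set simplified [string index $line 0]
        foreach traditional [split [string range $line 1 end] ""] {
            # Skip separators and characters that are the same in both forms.
            if {[string is space $traditional] || $traditional eq $simplified} {
                continue
            }
            lappend map $traditional $simplified
        }
    }
    return $map
}

# Two hypothetical sample lines: U+56FD <- U+570B, and U+53D1 <- U+767C, U+9AEE.
set sample "# simplified first, then traditional candidates\n\u56fd\u570b\n\u53d1\u767c\u9aee\n"
set map [build_trad_to_simp_map $sample]

# Map a short traditional string; characters not in the map pass through.
set traditional_text "\u4e2d\u570b\u767c\u5c55"
set simplified_text  [string map $map $traditional_text]
puts $simplified_text
```

With the real file you would read it through a channel configured with "fconfigure $f -encoding utf-8" and pass the contents to the proc once at startup, then reuse the resulting list for every page.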

Janine Sisk

Jun 25, 2009, 9:19:21 PM
to AOLS...@listserv.aol.com
That was what I thought when I first started this, that that would
never work, but so many people have told me since then to try this
that I figured I was wrong. Now I'm not sure what to think!

janine

---
Janine Sisk
President/CEO of furfly, LLC
503-693-6407
