Unicode to Big5 converter

Edward Lipsett /t

Nov 8, 2011, 1:37:10 AM
to Chinese Mac

Thanks to Quark Xpress, we have a lot of old layout files made with B5 and
GB5 fonts. We are gradually replacing these with Illustrator and InDesign
files using new TC and SC Unicode fonts, but we still have to revise old
files pretty often.

In order to do this, we have to
1) Start with a DOC file of the translation
2) Using NJstar on a Windows machine, save it as B5 text
3) Open the B5 text file on an old Mac OS9 machine, and place it into
Illustrator using a B5 or GB5 font.
4) Once that's done, we can either finish the file under OS9, or move it
to an OSX machine and finish it there using the same font.

This process sucks. Seriously.


Until we get rid of the need for using these old B5 and GB5 fonts, then,
is there any simpler way under OSX to take a modern DOC file and produce a
B5 text file? We do not need any formatting at all, just text.

If I can make a B5 file on an OSX machine easily, I can bypass the Windows
machine entirely. With luck, I may be able to bypass the OS9 machine, too!

Any suggestions appreciated!

Thank you for your time.


----------
Edward Lipsett, Intercom, Ltd.
translation€@intercomltd.com (remove Euro mark)
Publishing: http://www.kurodahan.com
Translation & layout: http://www.intercomltd.com

Weizhong Yang

Nov 8, 2011, 4:18:46 AM
to chine...@googlegroups.com
Hi Edward,

If you want to convert a Microsoft Word document to a Big5 plain text file, as far as I know, the built-in TextEdit.app can do it.

Use TextEdit.app to open your .doc file, select "Make Plain Text" from the "Format" menu, and then you can select the encoding while saving the file. If Big5 and GB2312 do not appear in the list, please select "Customize Encoding List" in the menu of the file saving panel.
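
For batches of files, the re-encoding step itself can also be scripted. A minimal sketch, assuming the text has already been saved out of Word as UTF-8 plain text; the function name `reencode_file` and its parameters are made up for illustration, and it relies on Python's built-in `big5` codec rather than anything in TextEdit:

```python
from pathlib import Path


def reencode_file(src, dst, src_encoding="utf-8", dst_encoding="big5"):
    """Read src as src_encoding and write the same text to dst as dst_encoding.

    Raises UnicodeEncodeError if a character has no mapping in dst_encoding.
    Returns the number of characters converted.
    """
    text = Path(src).read_text(encoding=src_encoding)
    # Encode explicitly and write raw bytes so the output is exactly
    # the legacy-encoded text, with no further translation.
    Path(dst).write_bytes(text.encode(dst_encoding))
    return len(text)
```

Called as `reencode_file("in.txt", "out_b5.txt")` (hypothetical file names), it stops with an error at the first character that has no Big5 mapping, so nothing is dropped silently.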

Cheers,
zonble

--
Weizhong Yang (a.k.a zonble)
http://zonble.net
zon...@gmail.com

Nobumi Iyanaga

Nov 8, 2011, 8:55:59 AM
to chine...@googlegroups.com
Hello Edward,

On Nov 8, 2011, at 3:37 PM, Edward Lipsett /t wrote:

>
> Thanks to Quark Xpress, we have a lot of old layout files made with B5 and
> GB5 fonts. We are gradually replacing these with Illustrator and InDesign
> files using new TC and SC Unicode fonts, but we still have to revise old
> files pretty often.
>
> In order to do this, we have to
> 1) Start with a DOC file of the translation
> 2) Using NJstar on a Windows machine, save it as B5 text
> 3) Open the B5 text file on an old Mac OS9 machine, and place it into
> Illustrator using a B5 or GB5 font.
> 4) Once that's done, we can either finish the file under OS9, or move it
> to an OSX machine and finish it there using the same font.
>
> This process sucks. Seriously.

I understand that this is cumbersome, but I don't understand why you have to go through this whole process. Do you mean that your files contain Chinese text? If they are translations from Chinese into English (or other European languages), the encodings and the fonts should have almost no effect, no...?

Anyway (I mean even if these files contain Chinese texts), the first thing that I would try would be to open your old Word files in a newer version of Word on a Mac OS X machine, and save them as ".txt" files with Unicode as the encoding...

If this is not possible, you might want to try Nisus Writer as a converter. It can open Word files, and save files as Unicode text files (with some trick, you can even save foot/endnotes as Unicode text).

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

Edward Lipsett /ht

Nov 8, 2011, 9:08:51 AM
to chine...@googlegroups.com
The problem is not unicode; the problem is that we have a LOT of
old layouts made with B5/GB5 fonts, and thanks to various
software issues, it is very difficult to translate a Unicode DOC
file into a B5 text file.

Fortunately, a previous poster suggested a possible solution
which I will try tomorrow. There is also something called Cyclone
which seems to do a reasonable job, but fails on a variety of
symbols.

> Anyway (I mean even if these files contain Chinese texts), the
> first thing that I would try would be to open your old Word files
> in a newer version of Word on a Mac OS X machine, and save them
> as ".txt" files with Unicode as the encoding...

Edward Lipsett

Kerim Friedman

Nov 8, 2011, 9:08:39 AM
to chine...@googlegroups.com
1&2 seem to be about trying to get Chinese text into B5. Weizhong's answer explains how to do this without having to use NJStar. The built-in TextEdit program can do this just fine.

3 is the step I don't understand. It seems you have a workflow which requires using Illustrator in OS9? Adobe CS 5.5 won't open these files on OS X? 

Cheers,

Kerim


--
You received this message because you are subscribed to the Chinese Mac group.
For answers to frequently-asked questions, visit http://www.yale.edu/chinesemac
To start a new topic, send a new message to chine...@googlegroups.com
To unsubscribe, send a message to chinesemac-...@googlegroups.com
For more options, visit http://groups.google.com/group/chinesemac



--



Assistant Professor
Department of Indigenous Cultures
College of Indigenous Studies
National DongHwa University, TAIWAN
助理教授國立東華大學民族文化學系


Edward Lipsett /ht

Nov 8, 2011, 9:18:33 AM
to chine...@googlegroups.com

> 3 is the step I don't understand. It seems
> you have a workflow which requires using
> Illustrator in OS9? Adobe CS 5.5 won't open
> these files on OS X?

AI CSx under OS X, for whatever reason (probably because this is
the Japanese-native version of AI, I suspect) insists on
importing the B5 text as something else, usually Japanese. The
identical B5/GB5 fonts are installed on and work fine on both OS9
and OSX machines. Once the text is put into the right font in AI8
under OS9, that file can be transferred to the OSX machine and
worked on normally, whether B5 (TC) or GB5 (SC).

The problem exists for all versions of AI that run under OSX.
Likewise, things work the same under all versions of AI under
OS9.

Edward Lipsett /ht

Nov 8, 2011, 9:21:20 AM
to chine...@googlegroups.com

> > 3 is the step I don't understand. It seems
> > you have a workflow which requires using
> > Illustrator in OS9? Adobe CS 5.5 won't open
> > these files on OS X?

I should add, though, if TextEdit can produce a clean B5 text
file, then it should be possible to eliminate the need for OS9 at
all.
The fundamental problem is Word and/or Illustrator (native
Japanese versions) refusing to handle B5 properly.


Kerim Friedman

Nov 8, 2011, 9:21:21 AM
to chine...@googlegroups.com
Might be worth having someone who owns the English version of AI on OS X check if they can import the file without it turning into Japanese. Seems like you could save yourself a lot of trouble if that worked - there would no longer be any need to run OS 9.

- Kerim


Kerim Friedman

Nov 8, 2011, 9:27:05 AM
to chine...@googlegroups.com
Also, note that sometimes Unicode>Big5 conversion will fail because of the existence of individual characters not included in Big5. If so, you may need to do a search and replace for extraneous Japanese characters included in your Unicode text and remove them before conversion. A program like UnicodeChecker can sometimes help diagnose problems: http://earthlingsoft.net/UnicodeChecker/
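
A quick way to locate such characters before converting is to try encoding each one individually. This is only a sketch using Python's standard `big5` codec, which may not match the exact Big5 variant a given font or application expects; `find_unmappable` is a hypothetical helper name:

```python
import unicodedata


def find_unmappable(text, encoding="big5"):
    """Return (index, character, Unicode name) for each character that
    cannot be encoded in the given legacy encoding."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            name = unicodedata.name(char, "UNKNOWN")
            problems.append((index, char, name))
    return problems


# Traditional characters first, then a Simplified-only character.
sample = "漢字 and 汉字"
for index, char, name in find_unmappable(sample):
    print(f"{index}: {char} U+{ord(char):04X} {name}")
```

Anything the function reports has to be replaced or converted before a strict Big5 export can succeed.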

Cheers,

Kerim


Eric Rasmussen

Nov 8, 2011, 11:33:53 AM
to chine...@googlegroups.com
Hi Edward,

You definitely shouldn't need to go back into OS 9 to do the
Big5-to-Unicode conversion, which is basically what you are doing in
that workflow.

Do the DOC files open as gibberish in OS X and Word 2011 or TextEdit?
If so, then I think I would save them as TXT or RTF files in Windows
and then use Jedit X (my choice for this kind of thing) to open them,
apply the correct encoding to eliminate the gibberish, and then
convert the text to Unicode. You should then be able to just copy and
paste into CS 5.

Which brings us to your other problem, importing Chinese text into the
Japanese edition of CS 5. Is it still a problem, even with
Unicode-based Chinese text?

Eric

Tyler Thorsted

Nov 8, 2011, 12:25:21 PM
to chine...@googlegroups.com
Edward,

I recently had great success using the latest version of QuarkXPress 8 or 9 to open old Chinese documents. In the latest Quark you can specify in the preferences which language it should assume when opening documents that use non-Unicode fonts. Quark does the conversion to Unicode and then you can export to Word or HTML.

Tyler Thorsted


Edward Lipsett /t

Nov 8, 2011, 8:40:33 PM
to Chinese Mac

On 11/11/09 2:25, "Tyler Thorsted" <t.tho...@gmail.com> wrote:

>Edward,
>I recently had great success using the latest version of QuarkXPress 8 or
>9 to open old Chinese documents. In the latest Quark you can specify in
>the preferences which language it should open non-unicode fonts. Quark
>does the conversion to unicode and then you can export to word or html.
>
>Tyler Thorsted


Thank you, Tyler, but you would have to pay me a significant amount of
money to get me to touch Quark ever again.

We had to buy multiple languages for QXP3.3 (Japanese, Korean, TC and
SC), and customer service (in Japanese at least) was distinctly not
impressive. The company's inability to support multiple languages, and
delay in supporting Unicode, made it impossible for us to continue to use
it. Because of it, though, we have a lot of GB5 fonts which have been used
over the years to make various materials in Illustrator.
We are gradually getting rid of them, and using real fonts under OSX to
make new documents, but it is far easier to add three words to a map than
recreate the whole thing all over again... especially when we're only
getting paid for the three words.

I am just looking for a temporary workaround that will simplify my life
until we can get rid of the last of those legacy files.


Edward Lipsett


Edward Lipsett /t

Nov 8, 2011, 11:22:06 PM
to Chinese Mac
>You definitely shouldn't need to go back into OS 9 to do the
>Big5-to-Unicode conversion, which is basically what you are doing in
>that workflow.

Other way around... We start with a DOC file done in (for example) SimSun
or some reasonable Windows SC font, and for a variety of silly reasons
need to convert that into B5 encoding.
Being able to do that easily under OSX would certainly help.


>Do the DOC files open as gibberish in OS X and Word 2011 or TextEdit?
>If so, then I think I would save them as TXT or RTF files in Windows
>and then use Jedit X (my choice for this kind of thing) to open them,
>apply the correct encoding to eliminate the gibberish, and then
>convert the text to Unicode. You should then be able to just copy and
>paste into CS 5.

Using new software and fonts under OSX, everything works fine.

The only problem is that in order to get the B5 text displaying normally
using these old GB5 fonts, we have to move the whole thing to OS9, and use
AI8 or 9. Once the text has been placed in AI8/9, using B5 or GB5 fonts,
we can then move that file to OSX and continue work normally using the
EXACT SAME FONTs.


>Which brings us to your other problem, importing Chinese text into the
>Japanese edition of CS 5. Is it still a problem, even with
>Unicode-based Chinese text?

With new fonts, no problem at all. Import, cut-and-paste, whatever, they
all work fine.

Edward


Eric Rasmussen

Nov 9, 2011, 8:05:57 AM
to chine...@googlegroups.com
On Tue, Nov 8, 2011 at 11:22 PM, Edward Lipsett /t
<trans...@intercomltd.com> wrote:
> ... We start with a DOC file done in (for example) SimSun
> or some reasonable Windows SC font, and for a variety of silly reasons
> need to convert that into B5 encoding.
> Being able to do that easily under OSX would certainly help.

Okay, I think maybe I see now -- you need to distinguish between
character sets and encodings. Please forgive me if I'm still
misunderstanding you. Here's what I think is happening:

Your original text is in the Simplified Chinese character set and you
are using NJStar to convert it to the Traditional Chinese character
set so you can apply the Big5/GB5 fonts to it, but NJStar also
converts the encoding to Big5, which causes problems when importing
the converted text to the Japanese edition of CS 5.

So what you need is an SC-TC conversion tool on the OS X side that
will do what NJStar is currently doing for you, but without converting
the text to Big5 encoding (and thus without causing the ensuing
problem with CS 5-J). All Mac SC-TC conversion tools will do this --
everything stays in Unicode. Wenlin 4 gives you the most control --
you can have it manually ask you about each ambiguity (as in Wenlin 3)
or you can have it automatically choose, which is what NJStar is
doing.

Converting the text to the Big5 encoding (as opposed to the Big5
character set) is not necessary -- probably there is some way to tell
NJStar to keep the text Unicode-encoded. All Big5 characters are in
Unicode. Then you could just do the SC-TC conversions on the Windows
side and the text would go straight into CS 5-J on the Mac side, no
problem, where you can apply the Big5/GB5 (character set) fonts (which
have both Big5 and Unicode cmaps).
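
The difference between character set and encoding can be seen in a few lines of Python. This is an illustration only, using Python's standard codecs (which may differ in detail from the Big5 variant a particular font targets); `in_big5_charset` is a hypothetical helper name:

```python
traditional = "中文"

# The same two characters, stored under three different encodings.
as_big5 = traditional.encode("big5")
as_utf8 = traditional.encode("utf-8")
as_utf16 = traditional.encode("utf-16-be")

print(as_big5.hex(" "))   # a4 a4 a4 e5
print(as_utf8.hex(" "))   # e4 b8 ad e6 96 87
print(as_utf16.hex(" "))  # 4e 2d 65 87

# Different bytes, identical text once each is decoded correctly.
assert as_big5.decode("big5") == as_utf8.decode("utf-8") == traditional


def in_big5_charset(char):
    """True if the character belongs to the Big5 repertoire, regardless of
    how the surrounding text happens to be encoded."""
    try:
        char.encode("big5")
        return True
    except UnicodeEncodeError:
        return False
```

Unicode-encoded text whose characters all pass `in_big5_charset` is "Big5 character set" text in the sense used above, even though no Big5 bytes are involved.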

I mean, there is a reason why Big5/GB5 matching font sets are still
sold. It's the same reason matching GB/GB-T fonts exist -- because it
makes it trivial to produce both TC and SC versions of the same text,
just by changing the font.

> The only problem is that in order to get the B5 text displaying  normally
> using these old GB5 fonts, we have to move the whole thing to OS9, and use
> AI8 or 9. Once the text has been placed in AI8/9, using B5 or GB5 fonts,
> we can then move that file to OSX and continue work normally using the
> EXACT SAME FONTs.

See above, I think the only reason you are having this issue is that
NJStar is changing the encoding as well as the character set. This is
not necessary. The fonts don't care which encoding is used -- all they
care about is that the text be TC (i.e., the Big5 character set, not
the encoding).

Eric

TenThousandThings

Nov 9, 2011, 8:27:25 AM
to Chinese Mac
On Nov 9, 8:05 am, Eric Rasmussen <hello.ras...@gmail.com> wrote:
> I mean, there is a reason why Big5/GB5 matching font sets are still
> sold. It's the same reason matching GB/GB-T fonts exist -- because it
> makes it trivial to produce both TC and SC versions of the same text,
> just by changing the font.

A side note: This works best when you start with TC text. GB5 fonts
will always exist for this reason. There is very little ambiguity in
that direction, easy to check for. Going from SC to TC is trickier,
and using GB-T fonts to do it is more of a quick-and-dirty solution,
rather than a precise method -- for that you need good SC-TC software.

TenThousandThings

Nov 9, 2011, 8:36:35 AM
to Chinese Mac
On Nov 9, 8:05 am, Eric Rasmussen <hello.ras...@gmail.com> wrote:
> So what you need is an SC-TC conversion tool on the OS X side that
> will do what NJStar is currently doing for you, but without converting
> the text to Big5 encoding (and thus without causing the ensuing
> problem with CS 5-J). All Mac SC-TC conversion tools will do this --
> everything stays in Unicode. Wenlin 4 gives you the most control --
> you can have it manually ask you about each ambiguity (as in Wenlin 3)
> or you can have it automatically choose, which is what NJStar is
> doing.

One more thing. Here is another good option:

http://www.ideographer.com/chineserewriter/index_en.shtml

Edward Lipsett /t

Nov 9, 2011, 7:49:13 PM
to Chinese Mac

On 11/11/09 22:05, "Eric Rasmussen" <hello....@gmail.com> wrote:

>On Tue, Nov 8, 2011 at 11:22 PM, Edward Lipsett /t
><trans...@intercomltd.com> wrote:
>> ... We start with a DOC file done in (for example) SimSun
>> or some reasonable Windows SC font, and for a variety of silly reasons
>> need to convert that into B5 encoding.
>> Being able to do that easily under OSX would certainly help.
>
>Okay, I think maybe I see now -- you need to distinguish between
>character sets and encodings. Please forgive me if I'm still
>misunderstanding you. Here's what I think is happening:

>Your original text is in the Simplified Chinese character set and you
>are using NJStar to convert it to the Traditional Chinese character
>set so you can apply the Big5/GB5 fonts to it, but NJStar also
>converts the encoding to Big5, which causes problems when importing
>the converted text to the Japanese edition of CS 5.

Since the fonts in question are B5/GB5, they must use B5 encoding.
Even if I have a pure B5 text file on the OSX machine, there is no way to
get it to display properly under Illustrator UNLESS I first import it into
AI8/9 under OS9. Once I have the AI8/9 file with the B5/GB5 font
displaying, I can then transfer that file to OSX and AI and continue work
normally. The identical B5/GB5 fonts work normally on the OSX machine.

>So what you need is an SC-TC conversion tool on the OS X side that
>will do what NJStar is currently doing for you, but without converting
>the text to Big5 encoding (and thus without causing the ensuing
>problem with CS 5-J).

No. A conversion tool that will produce B5 text from DOC files (which is
Unicode using SC glyphs, I believe) would eliminate the need for NJStar, at
least.
It would have no effect on the problem of actually placing the text data
in Illustrator under OSX using the B5/GB5 fonts.

>Converting the text to the Big5 encoding (as opposed to the Big5
>character set) is not necessary -- probably there is some way to tell
>NJStar to keep the text Unicode-encoded. All Big5 characters are in
>Unicode. Then you could just do the SC-TC conversions on the Windows
>side and the text would go straight into CS 5-J on the Mac side, no
>problem, where you can apply the Big5/GB5 (character set) fonts (which
>have both Big5 and Unicode cmaps).
>
>I mean, there is a reason why Big5/GB5 matching font sets are still
>sold. It's the same reason matching GB/GB-T fonts exist -- because it
>makes it trivial to produce both TC and SC versions of the same text,
>just by changing the font.


This is exactly why we purchased them: So we could use Quark Xpress TC
versions to do both SC and TC character sets.
And they worked fine until Quark stopped offering its software for a
period of several years, and shifted to Unicode only years after the rest
of the software world in Japan did. They used to own the vast majority of
the layout market in Japan with QXP3.3; now they are essentially
invisible, with Illustrator and InD dominating. The change wasn't because
QXP was a bad program, it was because they just stopped evolving for a
period of several years, and let the market shift to other vendors.

>> The only problem is that in order to get the B5 text displaying normally
>> using these old GB5 fonts, we have to move the whole thing to OS9, and use
>> AI8 or 9. Once the text has been placed in AI8/9, using B5 or GB5 fonts,
>> we can then move that file to OSX and continue work normally using the
>> EXACT SAME FONTs.
>
>See above, I think the only reason you are having this issue is that
>NJStar is changing the encoding as well as the character set. This is
>not necessary. The fonts don't care which encoding is used -- all they
>care about is that the text be TC (i.e., the Big5 character set, not
>the encoding).

This is something I have to look into. The reason I asked this question
here in the first place is that you guys know a lot more about the inner
workings of Chinese fonts than I do. I was under the impression that
B5/GB5 fonts all use B5 encoding, just with different glyphs.
I thought there were only two factors: The encoding (which basically
defines the address where any specific character can be located) and the
glyph (the image displayed for that address). If I understand you
correctly, there is some third factor involved which I confess I do not
understand.
Wenlin and Ideographer's ChineseRewriter have been suggested; I'll have a
look and see what they come up with.

I did try Cyclone and TextEdit, and output files could not be imported
into AI under OSX in the correct font (total gibberish in a Japanese font;
font could be changed after placing into AI, but the gibberish remained,
indicating the text data has been corrupted at import).

Eric Rasmussen

Nov 10, 2011, 10:17:07 AM
to chine...@googlegroups.com
On Wed, Nov 9, 2011 at 7:49 PM, Edward Lipsett /t
<trans...@intercomltd.com> wrote:
> Since the fonts in question are B5/GB5, they must use B5 encoding.
> ... The identical B5/GB5 fonts work normally on the OSX machine.

Your first statement is not correct. The fonts require the Big5
character set, but not the encoding. They would not work at all in OS
X if they did not support Unicode.

>>So what you need is an SC-TC conversion tool on the OS X side that
>>will do what NJStar is currently doing for you, but without converting
>>the text to Big5 encoding (and thus without causing the ensuing
>>problem with CS 5-J).
>
> No. A conversion tool that will produce B5 text from DOC files (which is
> Unicode using SC glyphs, I believe) would eliminate the need for NJStar, at
> least.

The DOC file is Unicode text, but it is not correct to say "Unicode
with SC glyphs." It is not a font issue at this point. About two
thousand Simplified-form characters occupy different code points
within Unicode from their Traditional-form counterparts. It would be
correct to say "Unicode with SC code points." This is not a font
issue. That's why you need a conversion tool, to change those
Simplified-form characters/code points to the corresponding
Traditional-form characters/code points within Unicode. There's no
doubt that NJStar is doing that for you -- otherwise you could not end
up with the Big5 text that you do.
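
As a rough sketch of what such a conversion does at the code-point level (not how NJStar or any other tool is actually implemented), Python's `str.translate` can swap Simplified code points for Traditional ones. The four-entry table `SC_TO_TC` and the function `sc_to_tc` are hypothetical; a real converter needs thousands of entries and context rules for one-to-many cases such as 发, which becomes 發 or 髮 depending on meaning:

```python
# Hypothetical, deliberately tiny mapping table for illustration.
SC_TO_TC = str.maketrans({
    "汉": "漢",
    "们": "們",
    "国": "國",
    "东": "東",
})


def sc_to_tc(text):
    """Replace Simplified code points with Traditional ones where the
    table has an entry. Characters not in the table pass through."""
    return text.translate(SC_TO_TC)


converted = sc_to_tc("我们的汉字")
print(converted)

# The result is still Unicode text; encoding it as Big5 is a separate,
# optional step that only works once every character is in Big5.
big5_bytes = converted.encode("big5")
```

The text stays in Unicode throughout; only the last line changes the encoding.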

> It would have no effect on the problem of actually placing the text data
> in Illustrator under OSX using the B5/GB5 fonts.

I don't think there could be any problem placing the text data in OS X
Illustrator as long as it is Unicode-encoded. Again, I don't think the
fonts require Big5-encoded text to work. They would have to be ancient
for that to be the case, and if so they wouldn't work in OS X
Illustrator (which is Unicode-based) at all. They do, right?

>>Converting the text to the Big5 encoding (as opposed to the Big5
>>character set) is not necessary -- probably there is some way to tell
>>NJStar to keep the text Unicode-encoded. All Big5 characters are in
>>Unicode. Then you could just do the SC-TC conversions on the Windows
>>side and the text would go straight into CS 5-J on the Mac side, no
>>problem, where you can apply the Big5/GB5 (character set) fonts (which
>>have both Big5 and Unicode cmaps).
>>
>>I mean, there is a reason why Big5/GB5 matching font sets are still
>>sold. It's the same reason matching GB/GB-T fonts exist -- because it
>>makes it trivial to produce both TC and SC versions of the same text,
>>just by changing the font.
>
> This is exactly why we purchased them: So we could use Quark Xpress TC
> versions to do both SC and TC character sets.

Sounds like they are quite old, but still they must have Unicode cmaps
or you could not use them with Chinese text in OS X. During the
transition to Unicode, there was a tool [TrueKeys] available that,
among other things, added Unicode cmaps to older Chinese fonts without
them.

>>> ... Once the text has been placed in AI8/9, using B5 or GB5 fonts,
>>> we can then move that file to OSX and continue work normally using the
>>> EXACT SAME FONTs.
>>
>>See above. I think the only reason you are having this issue is that
>>NJStar is changing the encoding as well as the character set. This is
>>not necessary. The fonts don't care which encoding is used -- all they
>>care about is that the text be TC (i.e., the Big5 character set, not
>>the encoding).
>
> This is something I have to look into. The reason I asked this question
> here in the first place is that you guys know a lot more about the inner
> workings of Chinese fonts than I do. I was under the impression that
> B5/GB5 fonts all use B5 encoding, just with different glyphs.
> I thought there were only two factors: The encoding (which basically
> defines the address where any specific character can be located) and the
> glyph (the image displayed for that address). If I understand you
> correctly, there is some third factor involved which I confess I do not
> understand.

TrueType/OpenType fonts use "cmap" tables to map glyphs to multiple
encodings. CID-keyed PostScript fonts use something similar. So a
Chinese font is likely to have a Unicode cmap and at least one other
cmap, like one for Big5 and/or the Hong Kong SCS, or GB 2312 or GB
18030.
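
A toy model may make the "third factor" concrete. It assumes nothing about the actual fonts in question and does not read real font files (real cmap tables are binary structures inside the font); the names `make_toy_font`, `render` and `big5_code` are invented for this sketch, and glyph shapes are stood in for by ordinary characters:

```python
def big5_code(char):
    """Big5 code of a character as an integer."""
    return int.from_bytes(char.encode("big5"), "big")


def make_toy_font(glyph_shapes):
    """Build a toy font from {Traditional character: shape the font draws}.

    The font gets one glyph list and two cmaps that both point into it.
    """
    chars = list(glyph_shapes)
    return {
        "glyphs": [glyph_shapes[c] for c in chars],
        "unicode_cmap": {ord(c): i for i, c in enumerate(chars)},
        "big5_cmap": {big5_code(c): i for i, c in enumerate(chars)},
    }


def render(font, cmap_name, codes):
    """Look each code up in the chosen cmap and return the drawn shapes."""
    cmap = font[cmap_name]
    return "".join(font["glyphs"][cmap[code]] for code in codes)


# A "B5" font draws Traditional shapes; a matching "GB5" font draws
# Simplified shapes at the very same Traditional codes.
b5_font = make_toy_font({"漢": "漢", "字": "字"})
gb5_font = make_toy_font({"漢": "汉", "字": "字"})

text = "漢字"
unicode_codes = [ord(c) for c in text]
big5_codes = [big5_code(c) for c in text]

# Either encoding of the same TC text reaches the same glyphs.
assert render(b5_font, "unicode_cmap", unicode_codes) == "漢字"
assert render(b5_font, "big5_cmap", big5_codes) == "漢字"
print(render(gb5_font, "unicode_cmap", unicode_codes))  # 汉字
```

In this model the encoding only decides which lookup table is used; the glyphs, and therefore what appears on the page, are the same either way.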

> Wenlin and Ideographer's ChineseRewriter have been suggested; I'll have a
> look and see what they come up with.

Good.

> I did try Cyclone and TextEdit, and output files could not be imported
> into AI under OSX in the correct font (total gibberish in a Japanese font;
> font could be changed after placing into AI, but the gibberish remained,
> indicating the text data has been corrupted at import).

If you like, send me a couple of the DOC files that you are starting
with. I'm happy to take a look (and keep them confidential) and help
you find the smoothest possible workflow.

Eric

Edward Lipsett /t

Nov 10, 2011, 9:40:26 PM
to Chinese Mac
Thank you, Eric.

Your explanations have helped quite a bit.


>>Wenlin and Ideographer's ChineseRewriter have been suggested; I'll have a
>> look and see what they come up with.

I downloaded ChineseRewriter with generally good results. This is a
workable solution, although there are still some things that could be
simplified:

The following procedure works:

1. Click on Chinese Rewriter, select any file to open window.
Set Rewriter for SC (Unicode) to TC (B5)

2. Cut-and-paste SC Unicode DOC file content into Chinese Rewriter screen.
Save bottom window as B5 txt.

3. Create text box in Illustrator, set to some B5 font.
Place text.

Issues:
1. It would be even better if there were some way to output a Word DOC in
some format that can be input into Chinese Rewriter. Many documents are
extremely long, although formatting is not required.
2. Some symbol characters in the DOC file (in this case, a circle) were
skipped entirely in the Chinese Rewriter output. Presumably there is no
equivalent mapping. However, if the DOC file is converted via NJstar to a
B5 file, and that B5 file placed into AI8/OS9, the circles show. In this
particular instance this is not an issue, but in a long document with lots
of symbols, having symbols vanish would be painful. The best solution
would be if they are converted, of course, but at the least it would be
best if SOME marker character was output to indicate that something didn't
work.
All our documents are proofed by the Chinese translators after layout so
in theory these errors would be caught, but if they can be avoided even
better.
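
One way to get such a marker, sketched here in Python and not a feature of Chinese Rewriter or NJStar, is a custom codec error handler that substitutes a visible ASCII tag for every character the target encoding cannot represent. The handler name "mark" and the `[U+XXXX]` tag format are arbitrary choices for this example:

```python
import codecs


def mark_unmappable(error):
    """Codec error handler: replace each unencodable character with an
    ASCII marker such as [U+6C49] instead of dropping it."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    marker = "".join(f"[U+{ord(char):04X}]" for char in bad)
    # Resume encoding right after the offending character(s).
    return marker, error.end


# "mark" is a name chosen for this sketch, not a built-in handler.
codecs.register_error("mark", mark_unmappable)

encoded = "漢字 汉字".encode("big5", errors="mark")
print(encoded.decode("big5"))
```

A proofreader can then search the laid-out text for `[U+` to find every spot where a character did not survive the conversion.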

If you have any suggestions on addressing the above by all means please
let me know.

And, FYI, alternate methods that don't work with any settings:
- Open SC Unicode DOC file and save as txt: ChineseRewriter displays
garbage in both windows
- Open SC Unicode DOC file and save as ms-dos txt: ChineseRewriter
displays garbage in both windows


Thank you again -- and thank you to the rest of the people who made
suggestions that helped me get this far!

Edward Lipsett
Fukuoka, Japan


Eric Rasmussen

Nov 11, 2011, 8:24:48 AM
to chine...@googlegroups.com

On getting the TXT documents to work (so you can just drag-and-drop
them into Chinese Rewriter without copying and pasting), I'm going to
guess Word is encoding the text as UTF-16, instead of UTF-8 -- Chinese
Rewriter only supports UTF-8 for Unicode TXT documents. Unfortunately,
I don't see any way to change that setting in Word 2011.

You can use TextEdit to open the DOC files and save them as UTF-8
plain-text (TXT) files. In Preferences, under "Plain Text File Encoding",
set "Saving files" to "Unicode (UTF-8)". Jedit X can also do this.
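
If the guess about UTF-16 is right, the TXT files Word produces could also be re-encoded in bulk with a short script. A minimal sketch under that assumption: it treats the input as UTF-16 only when a UTF-16 byte-order mark is present and as UTF-8 otherwise, and `to_utf8_bytes` is a hypothetical helper name:

```python
import codecs


def to_utf8_bytes(raw):
    """Re-encode raw text bytes as UTF-8 without a byte-order mark.

    Assumes UTF-16 if the data starts with a UTF-16 BOM, UTF-8 otherwise.
    Other encodings are not detected.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")     # the BOM selects the byte order
    else:
        text = raw.decode("utf-8-sig")  # tolerates an optional UTF-8 BOM
    return text.encode("utf-8")
```

Reading a file with `open(path, "rb").read()`, passing the bytes through this function and writing the result back out gives a UTF-8 file of the kind Chinese Rewriter is said to accept.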

On the problems with symbols, it's hard to comment without knowing
precisely which symbols you are talking about, but I think maybe you
can edit the dictionary files in Chinese Rewriter to account for them.
Unfortunately, Chinese Rewriter has been around for a long time and it
does not run in OS X 10.7 Lion, so I can't check that feature for you.
You might want to see if Wenlin does a better job -- I think it will
just leave the symbols in place.

Also, I should have mentioned that OS X has a built-in Chinese-text
conversion capability. I use Wenlin for this (I want full control over
the conversion), so I haven't checked it out in quite a while, but it is
there in the Services menu in Lion. As a result, you could conceivably
do the whole process within TextEdit or Jedit X -- open the DOC file,
save it as TXT, select the entire text, then do "Convert Selected
Simplified Chinese Text" in the Services menu (in the application
menu). Depending on which OS X you are running, this may differ
slightly. Honestly, I'm not quite sure when it became part of the
Services menu -- maybe Tiger?

Eric
