Encoding for AntConc and WordSmith

Lan-fen

Aug 2, 2012, 9:24:13 AM
to ant...@googlegroups.com
Hi Laurence

Thanks again for your help! You might be interested in knowing this.

I resaved all my files using the UTF-8 encoding and successfully processed them in AntConc 3.3.5w. But when I went back to WordSmith 6, it asked me to change the encoding to Unicode; otherwise, the files couldn't be processed. I found the following message in the WS6 Help, for your reference.

"UTF8 is a name for a multi-byte character encoding method, widely used for Chinese, Japanese etc. WordSmith cannot handle UTF8, but you can convert UTF8 to Unicode first using Text Converter. This is a format which was devised for many languages some years ago when disk space was limited and character encoding was problematic. That's because it uses a variable number of bytes to represent the different characters. A to Z will be only 1 byte but for example Japanese characters may well need 2, 3 or even more bytes to represent one character. Quite a messy kludge."

I think I'll keep my files in Unicode, so I can use WS6 in the future and when I use AntConc, I'll have to change the Global Preferences to UTF-16LE, as you suggested.

Cheers

Lanfen

Lan-fen Huang
PhD in English Language, University of Birmingham
Assistant Professor, Language Centre
Shih Chien University, Kaohsiung, Taiwan


2012/8/1 Laurence Anthony <ant...@waseda.jp>
Great!

The UCS-2 LE encoding was probably a result of the PDF files
containing non-English characters (e.g., Greek characters). In fact,
the UCS-2 LE encoding may really be UTF-16LE, which is the default
on Windows systems. Other systems (Mac, Linux) use UTF-8 by
default, and the World Wide Web also uses UTF-8 as its international
standard. Microsoft Windows is just strange!

Regards,
Laurence.

###############################################################
Laurence Anthony, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: ant...@waseda.jp
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################


On Wed, Aug 1, 2012 at 5:14 PM, Lanfen Huang <lanfen...@gmail.com> wrote:
> Dear Laurence
>
> Thank you very much for your reply! I've resaved my files and successfully
> processed them in AntConc.
> I simply copied abstracts from PDF files and pasted in the notepad. I guess
> the default is UCS-2 LE.
> I didn't know I should have used UTF-8 encoding. Anyway, I learned a new
> thing this time.
>
> I've joined the AntConc discussion group. I wasn't sure if I could send you
> files via it.
> I'll raise questions there if there is any in the future.
>
> Many thanks!
>
> Lanfen
> 2012/8/1 Laurence Anthony <ant...@waseda.jp>
>>
>> Dear Lan-fen Huang,
>>
>> Thank you for coming to my workshop last week! It was nice to see you
>> there.
>>
>> I checked your files and it seems that you have saved these three
>> files in an unusual character encoding (UCS-2 LE). If you want to
>> process these files, you can go to Global Preferences->Character
>> Encodings in AntConc and change the encoding from the default (UTF-8)
>> to UCS-2LE. However, I would recommend that you resave all your files
>> using the UTF-8 encoding. This is a universal standard that most
>> people in corpus linguistics use.
>>
>> You can resave files by opening them and resaving them in Notepad or
>> another text editor.
>>
>> I hope that helps.
>>
>> Regards,
>> Laurence.
>>
>> p.s. Why don't you join the AntConc discussion group?
>> http://groups.google.com/group/antconc
>>
>>

Laurence Anthony

Aug 2, 2012, 10:09:28 AM
to ant...@googlegroups.com
Dear Lanfen,

It seems that you've switched from a personal email discussion to a
discussion within the AntConc discussion group. Please note that
everybody in the group can see your comments!

For others, who have not seen the earlier emails, you will have to
scroll down to the bottom of her posting to follow the thread.

Here's a reply to your latest comments:

I didn't realize that you used WordSmith. Yes, WordSmith basically only works
with UTF-16LE, which is the default internal encoding on Windows
systems. It's a shame because almost all other systems use UTF-8, and most
corpora in the world are now published in the UTF-8 encoding.

Saving your files in UTF-16LE is fine, but be careful when you send
them to anyone. Many people may not be able to read the files, even in
a standard program like Notepad.
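
If someone receives a file and is unsure of its encoding, one rough
check is to look at the first few bytes for a byte order mark (BOM).
The sketch below is only an illustration: the function name sniff_bom
is my own, and it assumes the file was saved with a BOM, which
not every program does. A file without a BOM simply returns None, so
this is a hint rather than a reliable detector.

```python
import codecs
from typing import Optional

# Checked in order. UTF-32 comes first because its little-endian BOM
# (FF FE 00 00) begins with the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# Example with in-memory bytes standing in for the start of a corpus file.
print(sniff_bom("\ufeffabstract".encode("utf-16-le")))
print(sniff_bom(b"abstract"))
```

To try it on a real file, read the first four bytes in binary mode
(mode "rb") and pass them to the function.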

Although the WordSmith help page doesn't explain this, the *huge*
advantage of UTF-8 over UTF-16LE is that legacy English (i.e., ASCII)
text is byte-for-byte identical in UTF-8. So, the English in your UTF-8
files will display correctly in virtually any software, including
Notepad, WordPad, browsers, etc. (That's why it is now so popular.)
Calling UTF-8 "Quite a messy kludge" is quite misleading. I'm surprised
Mike Scott writes this. Here is a detailed explanation of UTF-8:
http://en.wikipedia.org/wiki/UTF-8

You will see that it is the most commonly used standard.
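
The ASCII point is easy to check for yourself. The short Python sketch
below is just an illustration (the helper name byte_lengths and the
sample text are my own choices): it compares the bytes of an English
string in ASCII, UTF-8, and UTF-16LE, and then counts the bytes needed
for one Latin, one Greek, and one Japanese character in each encoding.

```python
from typing import Tuple


def byte_lengths(ch: str) -> Tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16LE) for the given text."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


sample = "corpus linguistics"

# Plain English text: UTF-8 produces exactly the ASCII bytes, UTF-16LE does not.
same_as_ascii_utf8 = sample.encode("utf-8") == sample.encode("ascii")
same_as_ascii_utf16 = sample.encode("utf-16-le") == sample.encode("ascii")
print("UTF-8 identical to ASCII:   ", same_as_ascii_utf8)
print("UTF-16LE identical to ASCII:", same_as_ascii_utf16)

# "A", Greek alpha (U+03B1), and the kanji U+8A9E.
for ch in ["A", "\u03b1", "\u8a9e"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print(repr(ch), "UTF-8:", utf8_len, "bytes", "UTF-16LE:", utf16_len, "bytes")
```

For these three characters UTF-8 needs 1, 2, and 3 bytes, while
UTF-16LE needs 2 bytes each. That is the "variable number of bytes"
the WordSmith help text refers to. UTF-16 is also variable-length,
though: characters outside the Basic Multilingual Plane take 4 bytes.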

> I think I'll keep my files in Unicode

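Both UTF-8 and UTF-16LE are *implementations* (encodings) of the 'Unicode'
standard. Windows programs tend to label UTF-16LE simply as 'Unicode',
which is why the WordSmith message reads as if UTF-8 were not Unicode.
UTF-8 is the most popular 'Unicode' implementation at present.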
However, if you do want to work with WordSmith, you will have to use the
UTF-16LE implementation. Just be careful when you send corpus files to
other people. Make sure they know what encoding you used. They are
also likely to recommend that you use UTF-8, so be ready to explain why you
have chosen UTF-16LE.
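
If you end up keeping two copies of a corpus (UTF-16LE for WordSmith,
UTF-8 for everything else), converting between them can be scripted.
Here is a minimal Python sketch, not something either program
provides: the names transcode and convert_file and the add_bom option
are my own, and whether WordSmith expects a BOM at the start of the
file is something I have not verified, so treat that flag as an
assumption to test on one file first.

```python
from pathlib import Path

BOM = "\ufeff"  # byte order mark as a Unicode character


def transcode(data: bytes, src_encoding: str, dst_encoding: str,
              add_bom: bool = False) -> bytes:
    """Decode bytes from src_encoding and re-encode them in dst_encoding."""
    text = data.decode(src_encoding)
    # Drop a leading BOM so it does not end up as a stray character.
    if text.startswith(BOM):
        text = text[1:]
    if add_bom:
        text = BOM + text
    return text.encode(dst_encoding)


def convert_file(src: str, dst: str, src_encoding: str, dst_encoding: str,
                 add_bom: bool = False) -> None:
    """Convert one file, working on raw bytes so line endings are left untouched."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(transcode(data, src_encoding, dst_encoding, add_bom))


# In-memory example: a UTF-16LE text with a BOM converted to UTF-8.
utf16_data = "\ufeffGreek \u03b1".encode("utf-16-le")
print(transcode(utf16_data, "utf-16-le", "utf-8"))
```

For a whole folder, you could call convert_file on each .txt file in
turn, writing the output to a separate folder so the originals are
never overwritten.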

Best regards,
Laurence.