Hi Laurence,
Thanks again for your help! You might be interested in knowing this.
I resaved all my files using the UTF-8 encoding and successfully processed them in AntConc 3.3.5w. But when I went back to WordSmith 6, it asked me to change the encoding to Unicode; otherwise, the files couldn't be processed. For your reference, here is the message I found in the WS6 Help:
"UTF8 is a name for a multi-byte character encoding method, widely used for
Chinese, Japanese etc. WordSmith cannot handle UTF8, but you can convert UTF8 to
Unicode first using Text Converter. This is a format which was devised for many
languages some years ago when disk space was limited and character encoding was
problematic. That's because it uses a variable number of bytes to represent the
different characters. A to Z will be only 1 byte but for example Japanese
characters may well need 2, 3 or even more bytes to represent one character.
Quite a messy kludge."
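As a rough illustration of the byte counts the Help text describes, here is a minimal Python sketch that compares how many bytes a few characters occupy in UTF-8 and in UTF-16LE. The sample characters and the helper name `byte_lengths` are arbitrary choices for illustration, not anything taken from WordSmith or AntConc.

```python
# Illustrative only: the sample characters below are arbitrary choices.
samples = ["A", "é", "α", "語", "😀"]


def byte_lengths(ch):
    """Return (utf8_bytes, utf16le_bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16LE = {utf16_len} bytes")
```

In this sketch, plain ASCII letters take 1 byte in UTF-8 and 2 in UTF-16LE, while characters outside the Basic Multilingual Plane (such as the emoji) take 4 bytes in both, so neither encoding is strictly fixed-width.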
I think I'll keep my files in Unicode so that I can use WS6 in the future. When I use AntConc, I'll change the Global Preferences to UTF-16LE, as you suggested.
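For anyone who needs to switch a corpus between the two tools often, the conversion can also be scripted rather than done by hand. The following is only a minimal sketch, assuming the files are plain text and that the target tool accepts UTF-16LE with a byte order mark (BOM); the function names are hypothetical, and whether a particular tool requires the BOM is an assumption I have not verified.

```python
import codecs
from pathlib import Path


def utf8_to_utf16le(src, dst):
    """Re-encode a UTF-8 text file as UTF-16LE with a BOM (assumed to be wanted)."""
    raw = Path(src).read_bytes()
    text = raw.decode("utf-8-sig")  # tolerates an optional UTF-8 BOM
    # Working with bytes keeps the original line endings untouched.
    Path(dst).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


def utf16_to_utf8(src, dst):
    """Re-encode a UTF-16 text file as UTF-8 without a BOM."""
    raw = Path(src).read_bytes()
    # The "utf-16" codec uses the BOM to pick the byte order and strips it;
    # without a BOM it falls back to the platform's native byte order.
    text = raw.decode("utf-16")
    Path(dst).write_bytes(text.encode("utf-8"))
```

A call such as `utf8_to_utf16le("abstract1.txt", "abstract1_utf16.txt")` (hypothetical file names) writes a new file and leaves the original untouched, which makes it easy to keep one copy per tool.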
Cheers
Lanfen
Lan-fen Huang
PhD in English Language, University of Birmingham
Assistant Professor, Language Centre
Shih Chien University, Kaohsiung, Taiwan
2012/8/1 Laurence Anthony <ant...@waseda.jp>
Great!
The UCS-2 LE encoding was probably a result of the PDF files
containing non-English characters (e.g., Greek characters). In fact,
the UCS-2 LE encoding might really be UTF-16LE, which is the default
on Windows systems. Other systems (Mac, Linux) use UTF-8 as the
default, and the World Wide Web also uses UTF-8 as an international
standard. Microsoft Windows is just strange!
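If you are ever unsure which of these encodings a file uses, the first few bytes often give a hint. Below is a minimal sketch, assuming the file starts with a byte order mark (BOM); the function name `guess_encoding_from_bom` is hypothetical, and files saved without a BOM cannot be identified this way. Note that UCS-2 and UTF-16 are byte-for-byte identical for characters in the Basic Multilingual Plane, so the check cannot tell those two apart.

```python
import codecs


def guess_encoding_from_bom(data):
    """Guess a Python codec name from a leading BOM, or return None.

    This is a heuristic: it only inspects the first bytes of the data.
    """
    # UTF-32 must be checked first: its little-endian BOM begins with
    # the same two bytes as the UTF-16LE BOM.
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no BOM found; the encoding has to be determined another way


def guess_file_encoding(path):
    """Read only the first four bytes of a file and inspect them for a BOM."""
    with open(path, "rb") as handle:
        return guess_encoding_from_bom(handle.read(4))
```

A result of `"utf-16-le"` would match what Notepad and AntConc label as UTF-16LE (or UCS-2 LE), while `None` simply means that no BOM was present.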
Regards,
Laurence.
###############################################################
Laurence Anthony, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: ant...@waseda.jp
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################
On Wed, Aug 1, 2012 at 5:14 PM, Lanfen Huang <lanfen...@gmail.com> wrote:
> Dear Laurence
>
> Thank you very much for your reply! I've resaved my files and successfully
> processed them in AntConc.
> I simply copied abstracts from PDF files and pasted them into Notepad. I guess
> the default is UCS-2 LE.
> I didn't know I should have used UTF-8 encoding. Anyway, I learned a new
> thing this time.
>
> I've joined the AntConc discussion group. I wasn't sure if I could send you
> files via it.
> I'll raise questions there if I have any in the future.
>
> Many thanks!
>
> Lanfen
> 2012/8/1 Laurence Anthony <ant...@waseda.jp>
>>
>> Dear Lan-fen Huang,
>>
>> Thank you for coming to my workshop last week! It was nice to see you
>> there.
>>
>> I checked your files and it seems that you have saved these three
>> files in an unusual character encoding (UCS-2 LE). If you want to
>> process these files, you can go to Global Preferences->Character
>> Encodings in AntConc and change the encoding from the default (UTF-8)
>> to UCS-2LE. However, I would recommend that you resave all your files
>> using the UTF-8 encoding. This is a universal standard that most
>> people in corpus linguistics use.
>>
>> You can resave files by opening them and resaving them in Notepad or
>> another text editor.
>>
>> I hope that helps.
>>
>> Regards,
>> Laurence.
>>
>> p.s. Why don't you join the AntConc discussion group?
>>
>> http://groups.google.com/group/antconc
>>
>>