AntConc RC - Character encoding of the text files


Umut Demirhan

Dec 3, 2013, 12:43:03 AM
to ant...@googlegroups.com

Dear Laurence,

 

As far as I know, the latest version of the application does not allow us to load text files whose names contain Unicode characters, and the previous versions generate weird characters in the “Corpus Files” pane when the filenames contain characters like İ, Ş, ı, etc. If it is possible and compatible with the backend programming language, could you please change the behaviour of that section?

Since we have to track some concordances and their related text files, it would be very useful for us. Otherwise, we will need to keep using version 3.2.4 for the texts that have Unicode characters in their names.

 

 

Best regards,

Umut

 

Laurence Anthony

Dec 3, 2013, 12:51:26 AM
to ant...@googlegroups.com
Dear Umut,

AntConc 3.2.4 and 3.3.5 should both work exactly the same in terms of the way they display file names. The only change is that the default encoding is now set to UTF-8 in AntConc 3.3.5. If you set it to "Latin 1", which is the default in 3.2.4, they should work identically. Can you confirm that this is a problem?
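
As a rough illustration of why the default encoding matters here: the characters mentioned at the start of the thread (İ, Ş, ı) cannot be represented in Latin 1 at all, but they can in ISO-8859-9 and in UTF-8. The sketch below uses Python only for illustration and says nothing about how AntConc itself is implemented; the helper name `can_encode` is hypothetical.

```python
# Characters quoted in the first message of this thread.
turkish_chars = "İŞı"

def can_encode(ch: str, encoding: str) -> bool:
    """Return True if `ch` has a representation in `encoding`."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Report which of the characters each encoding can represent.
for encoding in ("latin-1", "iso8859_9", "utf-8"):
    supported = [ch for ch in turkish_chars if can_encode(ch, encoding)]
    print(encoding, supported)
```

Under Latin 1 the list comes back empty, which fits the garbled file names reported for that setting.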

Regards,
Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################



Umut Demirhan

Dec 3, 2013, 12:57:45 AM
to ant...@googlegroups.com

Dear Laurence,

Let me show the problem via screenshots.

When I try to load all the text files in a directory, AntConc 3.2.4 shows invalid characters like these in the filenames (see ss1.png).

In addition, if I try to load the text files in 3.3.5, it never lists them in the “Corpus Files” pane and shows an error message (see ss2.png).

 

If the backend programming language allows it, could you please change that section to use UTF-8 encoding as well? By the way, all of my text files are processed with the tools that AntConc provides, such as Wordlist, Concordance, Collocation, etc.

 

Best regards,

Umut

ss1.png
ss2.png

Laurence Anthony

Dec 3, 2013, 1:18:00 AM
to ant...@googlegroups.com
Dear Umut,

I never realized that this was a problem. As I said, the two versions should be effectively identical in this respect. (But obviously they are not!)

At the moment, the default encoding is used to render both the file names and the file contents. So, if you set the encoding to UTF-8, UTF-8 file names should appear correctly. But if you are on a Windows system, I suspect that your file names are not encoded in UTF-8 at all. They are probably being rendered in your locale's encoding. (What country are you based in?) So, the problem is that the file names are in one encoding and the file contents are in another. File encodings are always a huge problem (especially on Windows systems).

I am just about to release a minor update to AntConc (ver. 3.4) in the next day or so. I will try to fix this problem before it goes live. Note that AntConc 4.0 is the big release that I had also hoped to finish this week but haven't managed to.
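
The mismatch between file-name encoding and file-content encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file name is made up, and it assumes the system hands the name back as cp1254 bytes (the Windows Turkish code page). It is not a description of AntConc's internals.

```python
# Hypothetical Turkish file name; not taken from the thread.
name = "İş_ışık.txt"

# Assumption: the file system returns the name as cp1254 bytes.
raw_name = name.encode("cp1254")

# Decoding those bytes with the wrong encoding garbles the name.
as_latin1 = raw_name.decode("latin-1")                 # wrong letters, no error
as_utf8 = raw_name.decode("utf-8", errors="replace")   # replacement characters
as_cp1254 = raw_name.decode("cp1254")                  # round-trips correctly

print(as_latin1)
print(as_utf8)
print(as_cp1254)
```

Without `errors="replace"`, the UTF-8 decode raises `UnicodeDecodeError` instead, so the same bytes can produce either garbled names or an outright failure, depending on which encoding is applied and how errors are handled.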

Laurence.

 



Umut Demirhan

Dec 3, 2013, 1:25:24 AM
to ant...@googlegroups.com

Dear Laurence,

 

We are still working with AntConc 3.2.4. My operating system is Windows 7 x64. The system locale is Turkish (I’m not sure whether it uses ANSI or UTF-8, but probably ANSI).

However, I also tried 3.2.4 and 3.3.5 on Windows Server 2008 x64. Its default language is English, and the problem still exists.

 

All my texts are encoded in UTF-8 since it covers a wide range of characters.

 

Hope to see AntConc 4.0 soon.

Laurence Anthony

Dec 3, 2013, 1:48:44 AM
to ant...@googlegroups.com
Hi,

Even though your files are encoded in UTF-8, I suspect that the file names are not. Can you try going to the AntConc character encoding global settings and choosing the ISO-8859-9 (Turkish) option? I suspect that your file names will suddenly appear correctly. The problem will be that the file content will then *not* appear correctly. But at least we will know where the problem lies. I can then add a new option that sets the file name encoding independently of the file content encoding.
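
The experiment suggested above, and the idea of a separate file-name encoding option, can be sketched as follows. This is a minimal Python illustration with made-up sample data; the function `decode_entry` and its parameter names are hypothetical and do not correspond to any actual AntConc setting.

```python
def decode_entry(raw_name, raw_content,
                 name_encoding="iso8859_9", content_encoding="utf-8"):
    """Decode a file name and its content with separate encodings."""
    return raw_name.decode(name_encoding), raw_content.decode(content_encoding)

# Hypothetical sample data; not taken from the thread.
text = "İstanbul'da ışık"
raw_name = "ışık.txt".encode("iso8859_9")   # name stored in the locale encoding
raw_content = text.encode("utf-8")          # content stored as UTF-8

# One global setting (ISO-8859-9) for both: the name is fine, the content is not.
name_single = raw_name.decode("iso8859_9")
content_single = raw_content.decode("iso8859_9")

# Separate settings: both come out as intended.
name_split, content_split = decode_entry(raw_name, raw_content)

print(repr(content_single))
print(name_split, "|", content_split)
```

With a single ISO-8859-9 setting, the UTF-8 content decodes without error but each Turkish letter turns into two unrelated characters, which is the kind of garbling the experiment is expected to show.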

Laurence.




Umut Demirhan

Dec 3, 2013, 2:14:00 AM
to ant...@googlegroups.com

Dear Laurence,

 

If I change the character encoding to ISO-8859-9 (Turkish), all the file names are listed correctly in both versions. Moreover, I tried to generate concordance lines with the ISO-8859-9 (Turkish) encoding, and there was no problem at all.

It is also worth mentioning that all the texts are encoded in UTF-8. The problem is resolved.

 

Thank you for everything.

Laurence Anthony

Dec 3, 2013, 2:22:35 AM
to ant...@googlegroups.com
Hi,

I'm not clear about one point. If your files are encoded in UTF-8, then using ISO-8859-9 should cause some characters to render incorrectly. Many characters in the ISO encodings for European languages (the ASCII range) are encoded identically in UTF-8, so they will be fine. But not all of them.

If you have no problem characters at all, it would seem that your files are actually encoded in ISO-8859-9 (not UTF-8)!
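
One way to check which of the two encodings a file is actually in is to attempt a strict UTF-8 decode of its raw bytes. The Python sketch below is only an illustration; `looks_like_utf8` is a hypothetical helper and the sample text is made up. Note that pure ASCII text passes the check under either encoding, so the test is only informative when the file contains non-ASCII characters.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict decoding: any invalid byte raises
    except UnicodeDecodeError:
        return False
    return True

# Hypothetical sample text containing Turkish letters.
sample = "ışık şeker"

print(looks_like_utf8(sample.encode("utf-8")))      # bytes written as UTF-8
print(looks_like_utf8(sample.encode("iso8859_9")))  # bytes written as ISO-8859-9
print(looks_like_utf8(b"plain ascii"))              # valid under both encodings
```

For a real file, the bytes could be obtained with `open(path, "rb").read()` before calling the helper.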

Laurence.



Laurence Anthony

Dec 3, 2013, 2:27:04 AM
to ant...@googlegroups.com
By the way, as you are working on a Windows system, you should probably use cp 1254 (WinTurkish), which is the Windows variant of the ISO-8859-9 encoding.

They are almost identical, but you may find that a couple of problem characters now render correctly.
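
To see where the two encodings differ, one can compare how each decodes every possible byte value. The sketch below uses Python's built-in `cp1254` and `iso8859_9` codecs, whose tables are assumed here to match the ones AntConc uses.

```python
differing = []
for value in range(256):
    byte = bytes([value])
    iso_char = byte.decode("iso8859_9")
    try:
        win_char = byte.decode("cp1254")
    except UnicodeDecodeError:
        win_char = None  # this byte is unassigned in cp1254
    if iso_char != win_char:
        differing.append(value)

# Show the byte values on which the two encodings disagree.
print([hex(v) for v in differing])

# The Turkish letters themselves are encoded identically in both.
letters = "İŞışğĞ"
print(letters.encode("cp1254") == letters.encode("iso8859_9"))
```

In Python's tables the differences are confined to the 0x80 to 0x9F range, where ISO-8859-9 has control codes and cp1254 has printable characters such as the euro sign and curly quotation marks.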

Laurence.



Umut Demirhan

Dec 3, 2013, 2:32:07 AM
to ant...@googlegroups.com

Dear Laurence,

I closed and reopened the application. This time, both 3.2.4 and 3.3.5 generated the concordances with errors, but the filenames appear correctly. You are right, there is a problem with the tools, as you expected. In my previous attempt, I probably just closed all the tools and files and reopened them after generating the concordances.

ss3.png

Laurence Anthony

Dec 3, 2013, 2:34:24 AM
to ant...@googlegroups.com
Dear Umut,

Great. That's what I would expect. So, we now know exactly what the problem is. I also know how to fix it.

Laurence.



Umut Demirhan

Dec 3, 2013, 4:33:14 AM
to ant...@googlegroups.com

Dear Laurence,

It’s great to hear that a fix is possible.

If you need someone to test the new version before the release, I can test the application and send feedback.

Laurence Anthony

Dec 3, 2013, 4:36:34 AM
to ant...@googlegroups.com
Thank you!

Laurence.