Diacritics and punctuation show as \x plus number

Maria Alba

Nov 11, 2018, 1:42:25 PM
to AntConc-Discussion
Hello everyone,

First of all, many thanks to Mr. Anthony for offering us his fantastic tools for free.

I am using AntConc 3.5.7 on Mac OS X 10.11.6. My file is a .txt file encoded in UTF-8.

My little problem is that some special characters used in Spanish and Asturian (another Romance language) are displayed as codes rather than as the characters themselves. They are the following:

\x97 for ó
\x87 for á
\x92 for í
\x96 for ñ
\x9C for ú
\x8E for é
\x9F for ü

It's not only a problem with diacritics: I found that the following punctuation marks are also not decoded correctly:
\C0 for ¿
\C1 for ¡

I read a thread about a similar problem with German characters and tried changing the options in "global settings", first ticking both "punctuation" and "mark" and then entering my characters under the "user-defined token class" option, but nothing seems to work.

I can't figure out how to solve the issue and would appreciate any help very much.

Thanks in advance for your time!

Best,
María
[Attachment: Screen Shot 2018-11-11 at 13.00.01.png]

Laurence Anthony

Nov 15, 2018, 6:45:23 AM
to ant...@googlegroups.com
Hi María,

If the files are correctly encoded in UTF-8, you shouldn't need to do anything to get them working properly in AntConc. Can you go back to your files and double-check that they really are encoded correctly? This error almost certainly looks like an encoding problem.

If the files are encoded correctly, you may find that the encoding problem happened at an earlier stage (e.g. Microsoft Word to Text) and the data was already corrupted before the UTF-8 encoding stage.
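
One way to run that check is to try decoding the raw bytes of the file strictly as UTF-8. The following is only a minimal Python sketch, not an AntConc feature, and the function name `diagnose_encoding` is made up for illustration. It also decodes the codes from the first post as Mac OS Roman, because those byte values happen to match that encoding's table; whether the file in question really was Mac OS Roman is an assumption.

```python
def diagnose_encoding(data: bytes) -> str:
    """Report whether raw bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict mode raises on any invalid sequence
        return "valid UTF-8"
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return f"not UTF-8: invalid byte 0x{bad:02X} at offset {err.start}"


# The codes reported in the first post, read as single bytes.
# Assumption: they come from a legacy Mac OS Roman file.
reported = bytes([0x97, 0x87, 0x92, 0x96, 0x9C, 0x8E, 0x9F, 0xC0, 0xC1])
decoded = reported.decode("mac_roman")

print(decoded)
print(diagnose_encoding("año".encode("utf-8")))
print(diagnose_encoding(b"a\x96o"))
```

To test a real file, pass `open(path, "rb").read()` to the function. Note that a file label saying "UTF-8" does not guarantee that the bytes inside are valid UTF-8.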

I hope that helps.

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



Maria Alba

Nov 16, 2018, 11:57:26 AM
to AntConc-Discussion

Dear Mr. Anthony,

That was indeed the problem. The file details showed "UTF-8" as the encoding, so I thought the file was OK, but it turned out to be corrupted. I simply saved a new document as .txt and selected UTF-8 manually. Now all the characters appear in AntConc as they should.
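
For anyone who would rather script this fix than re-save by hand, here is a rough Python sketch. It assumes the original bytes were Mac OS Roman, which is a guess based on the codes listed in the first post and not something confirmed in this thread. The function name `reencode_to_utf8` is hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "mac_roman") -> None:
    """Read src in a legacy encoding and write the same text to dst as UTF-8."""
    # Decode with the assumed legacy encoding (Mac OS Roman by default).
    text = src.read_bytes().decode(source_encoding)
    # Write bytes directly so that line endings are left exactly as they were.
    dst.write_bytes(text.encode("utf-8"))
```

Writing to a separate output file keeps the original intact in case the assumed source encoding turns out to be wrong.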

Thank you so much for taking the time to help us. I will recommend AntConc to everyone working in corpus linguistics.

Best regards.
Maria

Laurence Anthony

Nov 18, 2018, 4:21:47 AM
to ant...@googlegroups.com
Great that you got the problem sorted out!

Laurence
