'utf-8' codec can't decode byte 0x96 in position 12: invalid start

mek

Apr 27, 2022, 5:16:14 AM
to AntConc-Discussion
I have a corpus of 79 cleaned .txt files from the early months of the COVID-19 pandemic. AntConc 3.5.7 has no trouble loading and processing all of them. However, AntConc 4.0.10 declines to load 29 of them. Here is one of the error reports:

C:/Users/MEK/Desktop/CORPORA/2020 COVID-19 corpus/COVID-19 cleaned txt/2020_Article_CancerSurgeryAndCOVID19.txt

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 29: invalid start byte


How can I fix whatever's wrong in the file? Isn't it odd that 3.5.7 will process it but 4.0.10 won't?

Laurence Anthony

Apr 27, 2022, 7:32:15 AM
to ant...@googlegroups.com
Hi,

As the error reports, your files are not properly encoded in UTF-8 (the default setting of AntConc). As some of your files did load, it means your corpus has mixed encodings, which is not so good.

In AntConc 3.x, the program did not check the file encoding and would just load what it could (possibly introducing errors if the user tried to load files not matching the encoding in the settings). AntConc 4 now checks this and warns the user. So, if you look at the warning list and open those files, you'll see that they are not UTF-8 encoded. If you resave the files in the UTF-8 encoding, they will work perfectly. I don't know how you created those files, but if you used AntFileConverter to convert Word/PDF files to text, the output will always be in the correct UTF-8 encoding.
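For anyone who wants to check and resave a whole folder at once, here is a minimal Python sketch of the idea. It is not part of AntConc or AntFileConverter, and the function names `find_non_utf8` and `resave_as_utf8` are made up for illustration. It also assumes the problem files are in Windows-1252 (`cp1252`), which is only a guess based on the byte 0x96 in the error report (an en dash in that encoding); if that guess is wrong, the text will be decoded incorrectly, so keep backup copies before running anything like this.

```python
from pathlib import Path


def find_non_utf8(folder):
    """Return (path, error) pairs for .txt files that fail strict UTF-8 decoding."""
    bad = []
    for path in sorted(Path(folder).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((path, err))
    return bad


def resave_as_utf8(path, source_encoding="cp1252"):
    """Rewrite one file in place as UTF-8.

    source_encoding is an assumption supplied by the caller; it is not detected.
    """
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

Calling `find_non_utf8` on the corpus folder first lists the files AntConc 4 would reject, together with the offending byte and position, so `resave_as_utf8` only needs to be applied to those files. Files that are already valid UTF-8 should be left alone, since decoding them as `cp1252` would corrupt any non-ASCII characters.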

I hope that helps!

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


mek

Apr 28, 2022, 4:40:59 AM
to AntConc-Discussion
Thanks so much for the fast reply, Laurence. I'll follow your instructions and also find out how the invalid files were created. This was a team-created corpus that a small group of Mediterranean Editors and Translators compiled early in the pandemic. Some files were converted with AntFileConverter, some with Sketch Engine, and some with Adobe. (After that, we all followed the same instructions to clean up the files by hand, removing all references, for example.) Once I know if there was a pattern to which files went rogue, I'll get back to the group. Thanks again for everything!

Laurence Anthony

Apr 28, 2022, 5:47:51 AM
to ant...@googlegroups.com
Sounds good!

In the next release of AntConc, I'll certainly be adding the footer-ignore function that you mentioned at our recent seminar. It's a great suggestion.

Laurence.

Stephen Gourlay

Apr 29, 2022, 7:34:30 AM
to AntConc-Discussion
If you can add a footer-ignore option that would be great. How about an 'extreme right margin ignore' function? Some publishers put a line of text in the extreme right margin that appears as a long vertical string of text - it's a pain to edit out.
Best wishes
Stephen

Laurence Anthony

Apr 29, 2022, 8:37:42 AM
to ant...@googlegroups.com
Hi Stephen,

How is the "extreme right margin" formatted? My guess is that it might be hard-coded as some kind of table element that is very difficult to identify (unless it was HTML/XML). In plain text, I've seen left margin formatting (with the Brown corpus being a classic example), but I haven't seen plain text with a right-side margin.

Laurence.

Stephen Gourlay

Apr 30, 2022, 8:31:04 AM
to AntConc-Discussion
Hi Laurence
Apologies - my habitual confusion of left and right - of course it's the left margin! Readable text is e.g. "Downloaded by ..." which was transformed into 29 rows of text like: f=f+l

\o
t-rO
c\
Best wishes
Stephen

Laurence Anthony

Apr 30, 2022, 8:35:50 AM
to ant...@googlegroups.com
Hi Stephen,

Removing the left margin is certainly something I've considered adding (especially after using the Brown/Lob raw corpus files). Let me look into this. It should be relatively easy to implement (especially if we go with a fixed width).
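To illustrate what a fixed-width approach could look like outside AntConc, here is a rough Python sketch. It is not how AntConc does or will do it; the function name `strip_left_margin` and the sample lines are invented for the example, and it assumes the margin occupies the same number of characters on every line.

```python
def strip_left_margin(text, width):
    """Drop the first `width` characters of every line.

    Assumes a fixed-width margin; lines shorter than `width` become empty.
    """
    return "\n".join(line[width:] for line in text.splitlines())


# Invented sample: a 9-character reference column in front of the running text.
sample = "A01 0010 The first line\nA01 0020 The second line"
cleaned = strip_left_margin(sample, 9)
```

This would not help with margin text that a PDF converter has scattered into separate short rows, as in Stephen's example, because those rows are not a fixed-width prefix of the real text.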

Laurence.

