Importing .txt files

Adrian Stenton

Oct 2, 2024, 9:27:45 PM
to AntConc-Discussion
Hello, I’ve just started using AntConc 4.3.1 for macOS. When I imported .txt files from a folder containing 556 files, only 424 were imported. Does this mean that the rest were corrupt?

Laurence Anthony

Oct 2, 2024, 9:30:41 PM
to ant...@googlegroups.com
Hi Adrian,

My guess is that the rest were not properly encoded in UTF-8. If you go to the Global file settings, you can have AntConc ignore errors in your files. I wouldn't recommend this, but you might try it to establish that it is an encoding problem.
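
For anyone who wants to test the encoding guess outside AntConc, the following is a minimal Python sketch, not AntConc's own code. It lists the files in a folder that fail strict UTF-8 decoding. The function name `find_non_utf8` and the folder name `my_corpus` are placeholders made up for this illustration.

```python
from pathlib import Path


def find_non_utf8(folder, pattern="*.txt"):
    """Return (path, error) pairs for files that fail strict UTF-8 decoding."""
    failures = []
    for path in sorted(Path(folder).glob(pattern)):
        try:
            # Strict decoding: any invalid byte sequence raises an error.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            failures.append((path, err))
    return failures


if __name__ == "__main__":
    # "my_corpus" is a placeholder; point this at the real corpus folder.
    for path, err in find_non_utf8("my_corpus"):
        print(f"{path.name}: {err}")
```

If the number of files reported here matches the number AntConc skipped, that would support the encoding explanation; if not, something else is probably going on.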

Regards,

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


Adrian Stenton

Oct 3, 2024, 9:34:26 PM
to AntConc-Discussion
Thanks, Laurence. That seems to be exactly right. I’ll try reformatting the files, which were originally converted from .docx to .txt about seven years ago.
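
One possible way to do that kind of reformatting for genuine plain-text files is sketched below in Python. It reads a file in an assumed legacy encoding and writes the same text back out as UTF-8. The source encoding is a guess: `cp1252` (Windows-1252) is only a common default for older Windows exports, and files saved on a Mac might need something else such as `mac_roman`. The function name `reencode_to_utf8` is hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src using source_encoding and write the same text to dst as UTF-8.

    source_encoding is an assumption about how the file was originally saved.
    A wrong guess can raise UnicodeDecodeError or silently produce wrong
    characters, so spot-check the output before relying on it.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

Writing to a separate output path keeps the original file untouched in case the encoding guess turns out to be wrong.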

With best wishes, Adrian

Laurence Anthony

Oct 3, 2024, 9:40:51 PM
to ant...@googlegroups.com
Hi Adrian,

If you have the original doc files, I'd recommend you just load those directly into AntConc. Converting docx files to txt can lead to all sorts of trouble, so it's better to just let AntConc process the files as necessary.

I hope that helps!

Laurence.


Adrian Stenton

Oct 5, 2024, 1:16:37 AM
to ant...@googlegroups.com
Thanks, Laurence. That’s interesting. I’d always assumed that .txt files were better (because they’re simpler?), but I’ll use the .docx files instead. I finally worked out that my problem was down to about half of my Word files being .doc rather than .docx. Both AntConc and AntFileConverter tend to spit out the .doc files!

All the best, Adrian

Laurence Anthony

Oct 5, 2024, 1:23:58 AM
to ant...@googlegroups.com
Have you tried the latest versions of AntConc and AntWordProfiler? They should support doc files (as well as docx files).

Laurence.



Adrian Stenton

Oct 5, 2024, 7:53:52 PM
to ant...@googlegroups.com
Laurence, to clarify: if I try to load my 555 files through “Quick Corpus”, I lose about 200 files with that error message. If I try to load those same files through “Corpus Manager”, it rejects just the one file. I’ll continue with “Corpus Manager”!

Adrian

On Sat, Oct 5, 2024 at 3:49 PM Adrian Stenton <adrian...@gmail.com> wrote:
Hi Laurence,

I’m using AntConc 4.3.1 and AntFileConverter 2.1.0 and I get the same message in both cases:

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte"
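
A byte 0xd0 at position 0 is consistent with, though not proof of, a legacy Word .doc file being read as text: the OLE2 compound file container used by .doc begins with the bytes D0 CF 11 E0 A1 B1 1A E1, whereas .docx files are ZIP archives that begin with "PK". The Python sketch below is an illustration only, not what AntConc or AntFileConverter does internally. It labels a file's first bytes by signature and shows that decoding the OLE2 signature as UTF-8 raises this kind of error. The names `sniff_container` and `utf8_error` are made up for the example.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"  # .docx files are ZIP archives


def sniff_container(first_bytes):
    """Classify the leading bytes of a file by container signature."""
    if first_bytes.startswith(OLE2_MAGIC):
        return "ole2"
    if first_bytes.startswith(ZIP_MAGIC):
        return "zip"
    return "other"


def utf8_error(first_bytes):
    """Return the UTF-8 decoding error message for the bytes, or None if they decode."""
    try:
        first_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(sniff_container(OLE2_MAGIC))
    print(utf8_error(OLE2_MAGIC))
```

To apply this to real files, read the first eight bytes of each one in binary mode, for example with `open(path, "rb").read(8)`, and pass them to `sniff_container`.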

Adrian


Laurence Anthony

Oct 5, 2024, 7:58:31 PM
to ant...@googlegroups.com
That's interesting. Both Quick Corpus and Corpus Manager use the same back-end engine. Have you changed the settings to ignore Unicode errors? I wonder if that might be affecting things. If you are allowed to send me your corpus, I can have a look at it here.

Laurence.


Adrian Stenton

Oct 6, 2024, 6:52:11 PM
to ant...@googlegroups.com
That’s very kind of you. The copyright owner doesn’t permit the sharing of the data for analysis, but I assume they would be OK with sharing for technical issues. The .zip file is 232.9 MB. Are you OK with that?

I have tried switching to “Ignore errors”, and that resulted in fewer files being imported (232/555 vs. 305/555).

Adrian

Laurence Anthony

Oct 6, 2024, 6:54:47 PM
to ant...@googlegroups.com
Hi,
I'm mostly intrigued by the difference between Quick Corpus and Corpus Manager. They should behave the same way. If you can email a link to, say, a Drive folder with the corpus in it, that would be fine.

Laurence.


Adrian Stenton

Oct 7, 2024, 7:59:22 PM
to ant...@googlegroups.com
Hi Laurence, You should have been sent a link by e-mail to my <AntConc files> folder on Google Drive.

Adrian

Laurence Anthony

Oct 7, 2024, 8:12:41 PM
to ant...@googlegroups.com
Hi,

I've got it. Thanks! I'll reply to you directly, and perhaps you can report here what the conclusion was.

Laurence.



Adrian Stenton

Oct 8, 2024, 7:45:18 AM
to ant...@googlegroups.com