Conversion to text files seems to create word storms!


Deborah Davidson

Jan 13, 2025, 8:18:56 PM
to AntConc-Discussion
Dear Professor Anthony,
I am working on a Windows 11 PC. 

I have created a database from 73 healthcare documents relating to the English NHS and used AntFileConverter to convert these documents from PDFs to text files. At this early stage, my analysis looks at the frequencies and collocates/accumulated collocates of three terms:
admin*
manag*
lead* 
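
For readers unfamiliar with wildcard searches, here is a minimal Python sketch of how a term such as manag* expands into a list of wordforms with counts. It is only an illustration: the function name `wordform_counts` and the letters-only tokenisation are assumptions, and AntConc's own tokenisation settings may differ.

```python
import re
from collections import Counter

def wordform_counts(text, pattern):
    """Count wordforms matching a trailing-wildcard pattern such as 'manag*'.

    Assumption: tokens are runs of the letters a-z after lowercasing.
    """
    prefix = pattern.rstrip("*").lower()
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t.startswith(prefix))

sample = "Managers manage; management teams are managed by managers."
for wordform, count in wordform_counts(sample, "manag*").most_common():
    print(wordform, count)
```

Any wordform that a conversion error has altered after the prefix shows up in such a list as a separate, usually rare, entry.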

I've generated frequency spreadsheets from AntConc and started to develop a series of graphics. However, when examining the wordform outputs from these searches (using the Word tool in AntConc), I noticed that for admin* and manag* the wordforms include a large number of misspelt words (see below). On closer inspection of the documents, these seem to have been created by converting the files into text. Some are simply misspelt, but the majority appear to have resulted from the conversion. Here is an example from one report, where not only the key term has been changed but many of the other words as well:



[Attachment: Misspelt words.jpg]
This has left me at a loss as to what to do, as going through each of the 73 text files and correcting them is a huge task, and one I don't really have the time for. What can I do to correct this?

Many thanks 
Deborah

Laurence Anthony

Jan 14, 2025, 1:02:20 AM
to ant...@googlegroups.com
Hi Deborah,

Conversion from PDF to text is never perfect, but my guess is that some of your PDF files are particularly problematic and might even be designed to not convert to text properly. Authors sometimes do this to prevent copying.
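
To find out which of the converted files are the problematic ones without reading all of them, a rough check along the following lines might help. This is a heuristic sketch, not a feature of AntConc or AntFileConverter: the function name `singleton_share` is hypothetical, and it rests on the assumption that garbled wordforms rarely recur in other files, so a badly converted file has an unusually high share of tokens found nowhere else in the corpus.

```python
import re
from collections import Counter

def singleton_share(texts):
    """Map each document name to the share of its tokens whose
    wordform occurs in no other document of the corpus."""
    tokenised = {name: re.findall(r"[a-z]+", body.lower())
                 for name, body in texts.items()}
    # Document frequency: in how many documents each wordform appears.
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))
    shares = {}
    for name, tokens in tokenised.items():
        only_here = sum(1 for t in tokens if doc_freq[t] == 1)
        shares[name] = only_here / len(tokens) if tokens else 0.0
    return shares

# Toy corpus; the file names and contents are made up for illustration.
corpus = {
    "report_a.txt": "the manager leads the team",
    "report_b.txt": "the manager leads the team well",
    "report_c.txt": "tbe rnanager Ieads the tearn",
}
for name, share in sorted(singleton_share(corpus).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.2f}")
```

Files near the top of the ranking are candidates for re-conversion. Legitimate rare words, such as names, also raise a file's score, so the ranking is only a starting point for manual checking.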

If it is essential to use these files in your corpus, you might want to try uploading the PDF files to ChatGPT and asking it to convert them to text. The future of PDF to text conversion is almost certainly going to involve the help of AI.

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



Deborah Davidson

Jan 14, 2025, 7:56:05 AM
to ant...@googlegroups.com
Thank you Professor Anthony,
I did wonder about this. 
Many thanks.

Deborah Davidson


Deborah Davidson

Jan 14, 2025, 8:14:48 PM
to ant...@googlegroups.com
For information, ChatGPT is OK, but it is equally poor or worse at cleaning and converting some documents.

Deborah Davidson

