Inquiry on Analyzing Large Text Files


Amirmasoud Iravani

Oct 3, 2024, 9:34:23 PM
to AntConc-Discussion

Dear Prof. Anthony,
I hope this message finds you well. I am a corpus linguistics student seeking your guidance on analyzing large text files (over 2 GB). Some friends suggested splitting these files into smaller segments, but I'm unsure which tools or methods to use for this. I've also heard that the Linux `grep` tool is useful for searching large text files. I am particularly interested in extracting frequency word lists and conducting cluster (n-gram) or collocation analysis. Any advice on handling large datasets, proper toolkits for doing so or suggestions for text splitting would be greatly appreciated. Thank you for your time.

Best regards,
Amirmasoud Iravani,
PhD. C. in Linguistics

Laurence Anthony

Oct 3, 2024, 9:45:13 PM
to ant...@googlegroups.com
Hi,

You should be able to just load the files directly into AntConc. They may take a while to be initially processed and converted into a database, but once that's finished, accessing the files should be very, very fast. The only problem will be attempting to view the files using the File View tool. My guess is that you'll run out of memory and the interface will become very slow. All other tools should work fine.

That said, I'm not sure why your files are so big. If they are actually multiple texts combined into a single large blob, I'd recommend recovering the original texts and loading them separately. If you do that, you'll then be able to get proper range statistics and dispersion values.
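If the combined files mark the boundary between texts with a separator line, recovering the individual texts can be done in one streaming pass. The following is only a minimal sketch under that assumption: the `<|endoftext|>` delimiter, the `split_blob` name, and the `text_000001.txt` naming scheme are all hypothetical and are not something AntConc requires or provides. The file is read line by line, so the whole 2 GB never has to sit in memory.

```python
from pathlib import Path


def split_blob(blob_path, out_dir, delimiter="<|endoftext|>"):
    """Split one combined file into one output file per original text.

    `delimiter` is an ASSUMED separator line; replace it with whatever
    marker (if any) your corpus actually uses between texts.
    Returns the number of texts written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    out = None
    try:
        with open(blob_path, encoding="utf-8") as src:
            for line in src:  # streams the file; never loads it whole
                if line.strip() == delimiter:
                    # End of the current text; the next content line
                    # will open a new output file.
                    if out is not None:
                        out.close()
                        out = None
                    continue
                if out is None:
                    count += 1
                    name = f"text_{count:06d}.txt"
                    out = open(out_dir / name, "w", encoding="utf-8")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return count
```

The resulting directory of smaller files can then be loaded into AntConc as a corpus of separate texts. If the blob has no reliable separator, this approach does not apply and the original sources would need to be tracked down instead.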

I hope that helps.

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



Amirmasoud Iravani

Oct 5, 2024, 1:16:32 AM
to AntConc-Discussion
Hi prof,
Thanks for your answer. The files load into AntConc with ease, but the problem is that my machine cannot handle analyzing such large files (cluster analysis + frequency word list), even with 32 GB of RAM. I have 70 files of 2 GB each, together making up a large Persian corpus for large language modelling.

Thank You,
Amir.

Laurence Anthony

Oct 5, 2024, 1:21:31 AM
to ant...@googlegroups.com
Hi Amir,

AntConc should be able to handle those files without problem. The RAM is not an issue either, because everything is stored in a database. The only problem you might have is if you try to *view* the entire list of clusters or the entire list of words on a single page. Start with the word list tool. If you build your corpus and generate a word list showing only the first 100 words, do you still have a problem?
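As an independent cross-check outside AntConc, the same "top 100 words only" idea can be sketched in a few lines of Python. This is a rough illustration, not AntConc's method: the `top_words` function and the `\w+` token pattern are assumptions made for the example. Memory use grows with the number of distinct word types rather than with file size, because each file is streamed line by line.

```python
import re
from collections import Counter

# Crude tokenizer (an assumption for this sketch): runs of Unicode
# letters, digits, and underscores count as one token.
TOKEN = re.compile(r"\w+")


def top_words(paths, n=100):
    """Return the n most frequent tokens across all files in `paths`."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:  # one line in memory at a time
                counts.update(TOKEN.findall(line.lower()))
    return counts.most_common(n)
```

Note that `\w+` does not match the zero-width non-joiner (U+200C), so Persian words written with it will be split into two tokens here. Counts from this sketch will therefore differ from a word list built with a token definition that treats that character as part of a word.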

Laurence.



Amirmasoud Iravani

Oct 7, 2024, 3:21:40 AM
to ant...@googlegroups.com
Hi Prof,
Thanks for the time you dedicated to guiding us. I will check again, and if anything unexpected happens, I will let you know.
Best regards,
Amir
