How to "familize" offlist words?

dpmi...@ucsc.edu

Jun 23, 2017, 1:36:02 PM
to AntWordProfiler-Discussion

I'm interested in identifying the frequency and range of word families across approximately 1,500 academic articles. Two questions:

1. I'm hoping to sort the families by GSL 1000, GSL 2000, AWL, and offlist. The program seems to provide this information for the first three lists, but it uses types (rather than families) for offlist words. Is there a way for the program to "familize" offlist words? I've considered trying the BNC/COCA lists, but there is so much overlap with the GSL 1000/2000 and AWL.

2. Approximately how long should it take to run 1,500 files (ranging in size from 100 to 2,000 KB) through the profiler? I've now tried on three different computers, for 4, 6, and 10 hours, and it never gets beyond "creating lexical profile". (Note: I used AntFileConverter to convert the PDFs to txt, and then EncodeAnt to ensure the files are UTF-8.)

Thanks for any advice!

Laurence Anthony

Jun 24, 2017, 3:41:37 AM
to antword...@googlegroups.com
Hi,

Here are the answers to your questions:

1) I am fairly sure that each word in the texts is assigned to the first list in which it appears. So, if a word is duplicated across lists, the later lists won't be considered for it. This means that you can load the BNC/COCA lists after the AWL, and only words not already in the earlier lists will be counted against them.

I suggest you check this with a toy example first.
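To make the expected behaviour concrete, here is a minimal Python sketch of "first list wins" matching. It is not AntWordProfiler's own code, only an illustration of the behaviour described above, which should still be confirmed in the program itself. The function name `profile_tokens`, the list labels, and the tiny word-to-headword dictionaries are all made up for the toy example.

```python
from collections import Counter

def profile_tokens(tokens, ordered_lists):
    """Count word families, assigning each token to the FIRST list containing it.

    ordered_lists is a sequence of (list_name, {word_form: family_headword}).
    Tokens found in no list are counted as plain types under "offlist".
    """
    counts = {name: Counter() for name, _ in ordered_lists}
    counts["offlist"] = Counter()
    for token in tokens:
        word = token.lower()
        for name, families in ordered_lists:
            if word in families:
                counts[name][families[word]] += 1
                break  # later lists are ignored for this token
        else:
            counts["offlist"][word] += 1
    return counts

# Hypothetical miniature lists; "use" deliberately appears in two of them.
toy_lists = [
    ("GSL1000", {"use": "use", "uses": "use", "used": "use"}),
    ("AWL", {"analyse": "analyse", "analysis": "analyse"}),
    ("BNC_COCA", {"use": "use", "genome": "genome", "genomes": "genome"}),
]
toy_tokens = "Analysis of genomes uses genome data".split()
result = profile_tokens(toy_tokens, toy_lists)

for level, families in result.items():
    print(level, dict(families))
```

In this sketch "uses" is counted under the GSL 1000 family "use" and never reaches the later list, while "genomes" and "genome" are grouped into one family by the last list instead of appearing as two offlist types. If the program behaves the same way on a comparable toy text, the overlap between the lists should not cause double counting.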

2) I suspect your files are not UTF-8 encoded. The program should complete in just a few minutes. Try processing batches of 100 and see where the program suddenly slows down and that will help you identify a problem file. My EncodeAnt program might also help you to identify files that are not UTF-8 encoded.
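If it helps, the two checks can also be scripted outside the program. The following is a rough Python sketch, assuming the converted files are .txt files in a single folder; the function names `find_non_utf8` and `make_batches` and the folder name in the usage note are hypothetical.

```python
from pathlib import Path

def find_non_utf8(folder):
    """Return (file name, byte offset) for each .txt file that is not valid UTF-8."""
    bad = []
    for path in sorted(Path(folder).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte
            bad.append((path.name, err.start))
    return bad

def make_batches(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For example, `find_non_utf8("converted_txt")` would list any suspect files, and `make_batches(sorted(Path("converted_txt").glob("*.txt")))` would give groups of 100 files to load one group at a time. Note that a file can be valid UTF-8 and still contain odd characters left over from PDF conversion, so a clean result here does not rule out a problem file.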

I hope that helps.

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################
