Question about AntConc limitations & performance issues 3.2.4w/3.3.5w


Diabolicus

Dec 5, 2012, 11:21:20 AM
to ant...@googlegroups.com
Hi Laurence,
thanks for letting me join this group so I can write comments in here ;-)
I have been using your excellent program for quite a while now, and it has helped me a lot in my personal language studies, mostly Spanish. Thank you very much for that!
I usually tend to update software as soon as a newer version becomes available, hoping for better performance, fixed bugs, or extended functionality. A while ago, after using version 3.2.4w for almost a year, I gave 3.3.2w (I think?) a try and noticed it was significantly slower than the older version. However, since 3.2.4w worked just fine, I didn't investigate the issue any further. Recently, I decided to check whether the problem had been resolved in version 3.3.5w, but it still exists.

18 months ago, when I started to learn Spanish, I wanted to do it in the most efficient way, i.e. learn the most common words first. I took a frequency list of a large Spanish corpus and worked my way through the first couple hundred words to see if there were any unknown ones for me to learn. I soon realized that this method of learning was somewhat unsatisfactory, because the frequency list was not lemmatized, and thus the words were not in their real order of importance. So I went on looking, found your very useful little program, built myself a lemma list from various sources as well as an ever-growing stoplist, and used all this to generate a frequency list of "words yet to learn" for any given Spanish text.

Right now, I have the following files I am working with:
list_of_known_spanish_words with 8,488 unique words in it, taken from my vocabulary database
spanish_lemma_list with 385,231 entries / 46,994 lemmata
stoplist_file with 31,573 unique entries (mostly names, places, typos etc.)
~200 ebooks / texts / movie subtitle-files etc. I plan to read / watch / listen to

Let me describe my workflow to you so you can better understand what I am trying to do:
I open AntConc, load my ~200 texts as corpus files, load my lemma_list_file, add my list_of_known_spanish_words and my stoplist_file to the word list range, and set it to "use a stoplist listed below".
"Treat all data as lowercase", "use lemma list file", and "treat word list range as lemma list range" are all checked as well. Then I hit Start on the Word List tab to generate a list of the words I am still missing.
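For readers who want to replicate this workflow outside AntConc, it can be sketched in Python. The file formats and the "lemma->form,form" notation below are assumptions for illustration, not necessarily the exact formats AntConc expects:

```python
# Sketch of the "words yet to learn" workflow: count word frequencies,
# map each form to its lemma, and drop known words and stoplisted entries.
import re
from collections import Counter

def load_lemma_map(path):
    # Assumed line format: "lemma->form1,form2,..." (one lemma per line).
    form_to_lemma = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "->" not in line:
                continue
            lemma, forms = line.split("->", 1)
            lemma = lemma.strip().lower()
            for form in forms.split(","):
                form_to_lemma[form.strip().lower()] = lemma
    return form_to_lemma

def words_to_learn(texts, lemma_map, known_words, stoplist):
    # Return (lemma, frequency) pairs, most frequent first, excluding
    # already-known words and stoplisted entries.
    counts = Counter()
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            lemma = lemma_map.get(token, token)
            if lemma not in known_words and lemma not in stoplist:
                counts[lemma] += 1
    return counts.most_common()
```

The resulting list plays the same role as the filtered Word List output described above: the highest-frequency unknown lemmas appear first.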

The 3.3.5w performance issues I noticed seem to be somehow related to the size of the stoplist, so I ran various tests for comparison:

Results for a small single file with 827 Word Types, 3738 Tokens:
Processing time, no stoplist:
3.2.4w: ~2 seconds
3.3.5w: 5 seconds
Processing time, 5,000 words stoplist:
3.2.4w: ~2 seconds
3.3.5w: 26 seconds

Results for 215 files with a total of 86,365 Word Types and 13,083,833 Word Tokens:
Processing time, no stoplist:
3.2.4w: 72 seconds
3.3.5w: 150 seconds
Processing time, 100 words stoplist:
3.2.4w: 72 seconds
3.3.5w: aborted after 5 minutes, progress bar stopped at ~10%
Processing time, 1,000 words stoplist:
3.2.4w: 72 seconds
3.3.5w: aborted after 5 minutes, progress bar at 0%
Processing time, 5,000 words stoplist:
3.2.4w: 72 seconds
3.3.5w: -

So what exactly is the problem here?
What are the AntConc limitations for
- number of corpus files used
- number of words processed
- number of word types
- number of word tokens
- size of lemma list
- size of stop list

Thanks,

Olaf

Laurence Anthony

Dec 5, 2012, 1:06:30 PM
to ant...@googlegroups.com
Dear Olaf,

Thank you for taking the time and care to look into the problem and
report such useful numbers for comparison.

AntConc 3.2.4 and 3.3.5 are actually very, very different under the
hood. 3.2.4 uses something called PerlTk to create the graphical
interface, whereas 3.3.5 uses something called Tcl, which is a
completely different programming language. Although Tcl produces a
prettier interface (and works much better on OS X), it does seem to be
more sluggish in performance. Since releasing 3.3.x, I've made some
changes to overcome the problem (in some places) and improve the
performance. But, it seems you have found another area that needs
addressing.

Let me look at the code and see if there is a bottleneck in the stop
list processing.
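One classic cause of this kind of stoplist slowdown (purely a hypothetical illustration here, not a description of AntConc's actual code) is checking every token against the stoplist with a linear scan instead of a hash lookup:

```python
# Hypothetical illustration: why a stoplist check can dominate runtime.
# Membership in a plain list is O(n) per lookup; in a set it is O(1).
import time

stoplist = [f"stopword{i}" for i in range(2_000)]    # 2,000-entry stoplist
stopset = set(stoplist)                              # same words, hashed
tokens = [f"token{i}" for i in range(20_000)]        # none are stopwords

t0 = time.perf_counter()
hits_list = sum(1 for t in tokens if t in stoplist)  # full scan per token
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
hits_set = sum(1 for t in tokens if t in stopset)    # hash lookup per token
t_set = time.perf_counter() - t0

print(f"list: {t_list:.3f}s, set: {t_set:.5f}s")
```

With a 5,000-word stoplist and tens of thousands of word types, the gap between the two approaches can easily be the difference between seconds and minutes, which would match the pattern in the timings above.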

An important point to note is that AntConc (3.2.x and 3.3.x) does most
of its processing in RAM. So, the only real limit is the amount of
memory that you have. However, AntConc is also built as a 32-bit
application, which means that it cannot address more than 4GB of memory.
In practice, this means that AntConc generally works well with small
corpora of 1-5 million words, but struggles with bigger corpora.

Another important point to note is that I now have a grant to redesign
AntConc to work much faster and with much bigger corpora. I'm using a
new computer language to code the program to avoid the sluggish
performance of 3.3.x and I'm using the BNC (100 million words) as my
basic test corpus. This is being developed each day and will be
released in the new year.

If you find that 3.2.4 works well, I would continue to use that until
the new version is released. However, keep checking on the progress of
3.3.x, because it does have some very nice features that 3.2.x does
not have. I will keep updating 3.3.x to address performance problems
until the new version is released. If you monitor this discussion
group, you will get notices as soon as a new update is released.

Thank you again for posting such useful information.

Laurence.

Gregory

Dec 6, 2012, 5:40:42 AM
to ant...@googlegroups.com
Hello

Along these same lines, I've been building a corpus of art writing from texts I've been gathering, and it seems to me (I haven't got any hard data) that AntConc works faster when dealing with the corpus as a set of smaller files than it did with the same corpus data concatenated into one rather large file, all other factors being equal.

I'm using AntConc 3.3.5u in Linux Mint LMDE XFCE4 64 bit.

Gregory

Laurence Anthony

Dec 6, 2012, 6:56:07 AM
to ant...@googlegroups.com
Gregory,

Your observation about the difference in performance when using smaller
files rather than large files is very true. One sure way to crash AntConc
is to try to load a *single* file holding a corpus of 10-20 million
words. Because the program has to read the entire file into memory
and process it, it can sometimes use up all your memory and freeze
the system. It is much safer to have a corpus divided up into separate
files (as the BNC is designed).

Having said that, I am now addressing the memory problem, too. In the new
version of AntConc, the system will process even huge single file
corpora incrementally and never use up all the memory.
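The incremental approach described above can be sketched like this (a generic illustration in Python, not the actual AntConc code): read the file in fixed-size chunks and carry any word fragment cut off at a chunk boundary over to the next chunk, so memory use stays roughly constant regardless of file size.

```python
# Count word tokens incrementally, never holding the whole file in memory.
import re
from collections import Counter

def count_tokens_incrementally(path, chunk_size=1 << 20):
    counts = Counter()
    carry = ""  # fragment of a word cut off at the chunk boundary
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = carry + chunk
            # Hold back a possible partial word at the end for the next round.
            carry = re.search(r"\w*\Z", text).group(0)
            body = text[: len(text) - len(carry)]
            counts.update(re.findall(r"\w+", body.lower()))
    counts.update(re.findall(r"\w+", carry.lower()))
    return counts
```

The chunk size trades off speed against peak memory; the token counts come out identical to a whole-file read.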

Laurence.

Mustafa Özer

Feb 4, 2024, 9:08:50 PM
to AntConc-Discussion
Hi Laurence,

I hope all is well.
Using v4.2.4, I had the same issue on two different systems, Mac and Windows, both powerful setups. Even though I tried several versions of the same corpus (an 8-million-word corpus divided into 4, 8, and 17 smaller files), the issue persisted, so I think it is the size rather than the number of files that matters. When attempting to jump into the file view from the KWIC module, the software crashed every time on Apple silicon. I'm not sure whether AntConc was exhausting system resources, though.
I wonder if there is a limit to the size of corpus that AntConc can work well with, and if so, what that size range is. This really limits the usability of the software in class.

I started a new discussion, but you seem to have overlooked it. Below is the link to the discussion I started on 13 October 2023:

Kind regards.
Mustafa.

Laurence Anthony

Feb 5, 2024, 9:21:59 PM
to ant...@googlegroups.com
Hi,

>an 8-million-word corpus divided into 4, 8, and 17 smaller files

My guess is that the problem is that you are trying to view very large files in a single view. For example, with four divisions, you are trying to view 2 million words in a single view, which is equivalent to trying to view the entire contents of several books in the same view window. This should work, but you would need to wait, and you may exhaust your RAM.

The number of files should not be a problem at all. 

Ideally, you should compose your corpus from individual, meaningful files. You might even consider dividing your corpus into individual sentences or paragraphs. You will then find that the performance in AntConc is much better.
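Splitting a single large file into smaller, meaningful pieces, as suggested above, can be done with a short script. This sketch (file naming is illustrative) writes each blank-line-separated paragraph to its own numbered file:

```python
# Split a large text file into one file per paragraph, ready to load
# into AntConc as separate corpus files.
import os

def split_by_paragraph(path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(path, encoding="utf-8") as f:
        # Paragraphs are assumed to be separated by blank lines.
        paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs, 1):
        out_path = os.path.join(out_dir, f"part_{i:05d}.txt")
        with open(out_path, "w", encoding="utf-8") as g:
            g.write(para + "\n")
    return len(paragraphs)
```

The same idea works at the sentence level with a sentence splitter, at the cost of many more (but much smaller) files.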

I hope that helps.

Laurence.

