Output difference between clusters and n-grams


Warren Tang

May 11, 2014, 10:40:37 PM
to ant...@googlegroups.com
Hi Laurence,
Continuing from the clusters/n-grams tool post, there is an output difference between clusters and n-grams.

In the clusters, for example, the following 2-word sequences

be; that
be: that
be 23 that

(the last example is hypothetical, but numbers are also read into clusters) are treated as separate clusters even though the default word definition is "letters".

These, however, are treated as the same in n-grams (the n-gram tool correctly removes characters outside the word definition). So the n-gram output frequencies match those in the collocates tool (for those instances), but the cluster frequencies do not.

Having the output frequencies match would make analysis easier. I am guessing this kind of tweak shouldn't be too difficult, and that clusters and n-grams are simply using slightly different definitions at the moment.

Thanks as always.


Warren

Laurence Anthony

May 12, 2014, 3:50:33 AM
to ant...@googlegroups.com
Hi Warren,

Just a quick comment on this post. As you say, the cluster and n-gram definitions are different. I designed the cluster tool to work consistently with the KWIC tool (hence the appearance of ;, :, and 23 between the two words in your example). Note that these are not being counted as 'words' - the tool is showing chunks, where the 23, for example, is just another two characters, like the spaces before and after it. This is also what you would see in the KWIC tool. So, clicking on a cluster would give you all the correct hits in the KWIC tool. But people have a narrower and more generally accepted definition of n-grams. They don't think of "be 23 that" as an n-gram. So, the output of the n-gram tool is just 'words' separated by a space character (which has no relation to any space character in the original file - hence "be: that" becomes "be that").
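The two behaviours can be sketched like this (a minimal illustration, not AntConc's actual implementation; the regex [a-z]+ stands in for the "letters" word definition): clusters keep the raw character span from the first token to the last, while n-grams tokenize on the word definition and join tokens with a single space.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z]+")  # stands in for the "letters" word definition

def ngrams(text, n=2):
    """Tokenize on the word definition, then join tokens with one space."""
    tokens = WORD.findall(text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clusters(text, n=2):
    """Keep the raw character span between the first and last token,
    including any punctuation or digits that fall between them."""
    matches = list(WORD.finditer(text.lower()))
    return Counter(
        text.lower()[matches[i].start():matches[i + n - 1].end()]
        for i in range(len(matches) - n + 1)
    )

text = "be; that be: that be 23 that"
# ngrams(text) counts "be that" three times;
# clusters(text) keeps "be; that", "be: that", and "be 23 that" as
# three distinct types, each with a count of 1.
```

On the hypothetical text above, clicking the cluster "be 23 that" would find exactly that span in a KWIC view, which is the consistency the cluster tool was designed for.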

Saying all that, many people complained that the n-gram results did not match the KWIC tool! So, from AntConc 3.3.5 onwards, I changed the way searches work throughout AntConc. Now, the default is for non-tokens to be ignored in searches (a bit like in Google), so "be: that" is equivalent to "be that" under default settings. As a result, I could now change the output of the cluster tool to match the n-gram tool and still produce sensible results in the KWIC tool.
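The "non-tokens are ignored" matching described above can be sketched as a normalization step applied to both the query and the text before comparison (again a hypothetical illustration, with [a-z]+ standing in for the "letters" word definition):

```python
import re

TOKEN = re.compile(r"[a-z]+")  # stands in for the "letters" word definition

def normalize(s: str) -> str:
    """Collapse a string to its tokens joined by single spaces,
    so characters outside the word definition are ignored in matching."""
    return " ".join(TOKEN.findall(s.lower()))

# Under this normalization, "be: that" and "be 23 that" both reduce to
# the same search key as the query "be that".
```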

The question is, what do people expect/want to see in the clusters tool? Something that matches the n-gram results? Something that matches the KWIC results?

Any thoughts?

Laurence.



###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################



Warren Tang

May 16, 2014, 11:05:44 PM
to ant...@googlegroups.com
Hi Laurence,
This post is relevant to the other bug question I posted.

I have attached test.txt, an artificial corpus of sentences that revolve around the verb "own".

The settings are:
  1. treat all data as lowercase
  2. word definition = letters

The problem is clearly illustrated in the clusters.jpg output, where the minimum frequency is set to 2 and the results are sorted by probability. If the clusters tool were working correctly (like the other sort functions), we would have had 4 fewer counts. This is worrying, considering these are valid instances and should have been counted together as a single cluster type.

So even if the bug didn't exist, the results would still have been far off once the minimum frequency and sort-by-probability settings had been applied.

The counts in the n-gram and collocates tools match when I try to cross-check the results internally. But again, the display of results that should have been filtered out (note the clusters/n-grams type/token quantities and the number of listed results) means:

  1. the bug is within the clusters/n-grams tool, and
  2. the cleaning of characters outside the word definition is different between clusters and n-grams.

This makes consistent results hard to present. Clusters should follow the results of n-grams and collocates. So my solution would be to make the cluster results dependent on the word definition, as n-grams and collocates already are. The confusion only comes from not knowing that word definitions can be tweaked.


Warren

test.txt
ngrams.jpg
collocates.jpg
clusters.jpg