keyword clusters based on n-gram lists


Hannes Widmann

Jul 27, 2011, 10:39:16 AM7/27/11
to wordsmi...@googlegroups.com
Hi all,

as far as I understand it, keyword clusters are usually two-word clusters with a variable number of intervening words.
Has anyone managed to make WS create keyword clusters based on more than two words?

Here's what I tried:
I created a 1-word wordlist and a 2-4 word wordlist. Then I combined the two files and then I did the same for my reference corpus.
Then I ran keywords on the basis of these two combined files.
However, the problem is that this produces a very high number of meaningless single-keyword + keyword-n-gram artifacts (such as "cultural cultural diversity" or "a distinct cultural identity distinct cultural identity"), because I have not yet found a way to make WS automatically filter out the repetitive chunks, even though each of them, taken on its own, is a valid keyword or keyword chunk.
Anyone got an idea for me?
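One way to screen such artifacts out after the fact: a minimal post-processing sketch in Python, assuming the combined keyword-cluster list can be exported as plain text with one cluster per line (`has_immediate_repeat` is my hypothetical helper, not a WordSmith feature). It drops any cluster in which a word or phrase is immediately repeated, which catches both artifact types quoted above:

```python
def has_immediate_repeat(cluster: str) -> bool:
    """Return True if any word or phrase inside the cluster is
    immediately repeated, e.g. "cultural cultural diversity" or
    "a distinct cultural identity distinct cultural identity"."""
    words = cluster.lower().split()
    # Try every repeat length, from a single word up to half the cluster.
    for size in range(1, len(words) // 2 + 1):
        for i in range(len(words) - 2 * size + 1):
            if words[i:i + size] == words[i + size:i + 2 * size]:
                return True
    return False

# Keep only the clusters without repetitive artifacts.
clusters = [
    "cultural diversity",
    "cultural cultural diversity",
    "a distinct cultural identity distinct cultural identity",
]
clean = [c for c in clusters if not has_immediate_repeat(c)]
print(clean)  # → ['cultural diversity']
```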

PS: sorry Mike for being confusing with my emails. If this post still doesn't show up in the WS group, I'll get my personal Google account, I promise :-)


Mike

Jul 27, 2011, 2:55:11 PM7/27/11
to WordSmith Tools
Hannes, hi

> Has anyone managed to make WS create keyword clusters based on more than two
> words?

Try this:
(1) make a WordList index of your text.
(2) use it to create a 2-5 word wordlist, and save that as
mytext.lst.
(3) make a WordList index of your reference corpus.
(4) use it to create a 2-5 word wordlist, and save that as
ref_corp.lst.
(5) run the KeyWords procedure using mytext.lst as your word list and
ref_corp.lst as your reference corpus list.

Cheers -- Mike

Yenny Kwon

Jul 4, 2012, 2:28:17 AM7/4/12
to wordsmi...@googlegroups.com
Dear Scott
 
I've made keyword clusters (4 words) based on n-gram lists (after making an index in WordList).
 
But I'm wondering what principle is applied to produce the n-gram clusters.
 
I've used Collocate 1.0 (Barlow, 2004) to produce n-gram lists before, but I still did not understand what principles were applied to produce the n-grams.
 
Will you briefly explain, or let me know which articles or books I could refer to?
 
Thank you so much in advance,
 
Yenny
 

Mike Scott

Jul 4, 2012, 4:59:59 AM7/4/12
to WordSmith Tools
Yenny, hi

N-grams (in WordSmith) are computed like this. With a piece of text
containing say

My uncle worked for the BBC

when computing 3-grams, the program stores these sequences

my uncle worked
uncle worked for
worked for the
for the BBC

thus using a moving 3-word window. It may also take note of sentence or other
breaks as it goes. As it works, it keeps the sequences sorted, and
whenever it finds a repeat it increments a counter for that sequence.
When it finishes, the program has a count for each sequence it found
and saves the list for the user to see.

I hope that helps. You will find more explanation in the WordSmith help at
http://www.lexically.net/downloads/version6/HTML/index.html?wordlist_clusters.htm

Best -- Mike