Using StopList for ngrams


Matthew Humphries

Aug 24, 2013, 9:44:42 PM
to ant...@googlegroups.com
Hey Laurence, I have another question related to AntConc (hope I'm not flooding the discussion). Right now I'm attempting to generate an ngram list that is filtered with a StopList. This works for the Word List tool that generates only single words (rather than ngrams), but unless I'm doing something wrong it doesn't seem to filter words when generating ngrams. For example, the word 'the' will appear in the ngram list even though it will be removed when looking at the Word List. 

Is there a way to get around this, or am I doing something wrong?

Thanks

Laurence Anthony

Aug 25, 2013, 5:16:07 AM
to ant...@googlegroups.com
Hi Matthew,

Posts to the discussion group are great. You are certainly not flooding the discussion.

I'm sorry to say that AntConc does not have a stop list function for n-grams. There is a simple workaround though. If you export the results of the N-gram search to a good text editor like Notepad++, you can then use the search and replace function to delete all entries that contain your stop list words. You will need to use a regular expression for this to work. You will also have to remove each stopword one by one, but assuming the stoplist is not that long it is relatively simple.
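For readers who would rather script this than repeat the search and replace by hand, here is a minimal Python sketch of the same idea. It is not an AntConc feature. The function name `drop_lines_with_stopwords` and the tab-separated row layout (rank, frequency, n-gram) are assumptions made for illustration, so check them against your own exported file. It builds one regular expression from the whole stop list, which removes every matching entry in a single pass.

```python
import re

def drop_lines_with_stopwords(lines, stopwords):
    """Return only the lines that contain none of the stop words as whole words."""
    if not stopwords:
        return list(lines)
    # One alternation covers the whole stop list at once; \b makes sure
    # "the" does not match inside a longer word such as "other".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in stopwords) + r")\b",
        re.IGNORECASE,
    )
    return [line for line in lines if not pattern.search(line)]

# Hypothetical exported rows (assumed layout: rank, frequency, n-gram).
exported = [
    "1\t120\tof the",
    "2\t95\tlignin degradation",
    "3\t80\tin other words",
    "4\t60\tThe results",
]
kept = drop_lines_with_stopwords(exported, ["the", "of", "a"])
for line in kept:
    print(line)
```

With this sample, the rows for "of the" and "The results" are dropped, while "lignin degradation" and "in other words" are kept because the stop words only occur inside longer words there.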

By the way, why do you want to apply a stoplist? Often, the combination of content words and function words reveals interesting patterns.

Laurence.

Liliana Wai

Aug 25, 2013, 8:48:59 AM
to ant...@googlegroups.com
Dear all: I need some help using AntConc. I am working with Appraisal theory and I need to analyze all the adjectives that appear in the corpus. How can I do this?

Thanks,

Liliana


2013/8/25 Laurence Anthony <antho...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "AntConc-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.
To post to this group, send email to ant...@googlegroups.com.
Visit this group at http://groups.google.com/group/antconc.
For more options, visit https://groups.google.com/groups/opt_out.

Laurence Anthony

Aug 25, 2013, 9:00:11 AM
to ant...@googlegroups.com
Liliana,

Can you post a new question for this, rather than adding it to a different thread?

Laurence.

Matthew Humphries

Aug 25, 2013, 8:16:11 PM
to ant...@googlegroups.com
Ah great, thanks for the workaround! And as for why I want to use a stop list, the texts I'm working with are rather large in size so I think it might help me sift through the many results that the ngram tool generates. I might be losing out on some interesting patterns, but I'm willing for now to let that slide to make things a bit easier for myself.

Cheers!
Matt

Warren Tang

Aug 25, 2013, 8:22:11 PM
to ant...@googlegroups.com
Hi Matthew and Anthony,
This sounds like a job more suited for clusters in combination with wildcards rather than ngrams. A brief example might help us help you better. 

Another workaround may be to make an n-gram list, then save that list as a .txt file for further analysis in the Word List tool. Then you can use the stop list feature.


Warren. 


umu...@gmail.com

Feb 15, 2014, 7:37:14 AM
to ant...@googlegroups.com
Dear Dr. Lawrence

First of all, I wish to thank you from the bottom of my heart for all the effort you put into AntConc and your other software, which are real life savers and perfect magic wands for anyone working with texts and corpora. While browsing the discussion board with a specific question in mind, I came across Matthew Humphries's question (thanks, Matthew, by the way). I am working with a corpus of 90 full-length English engineering textbooks, and for teaching and testing purposes I have to extract bigrams and trigrams that are beyond the GSL 2000 and AWL levels to use with my students.

The point here is that vocabulary levels tests have indicated that the students are proficient at these levels to some extent, and they are going to receive extra vocabulary study on the GSL + AWL lists. Therefore, I wish to explore bigrams and trigrams that occur frequently and are found in a range of textbooks in my corpus. Your suggestion to work with a word processor and remove each stopword one by one seems like a lot of hard work to me, considering the 2,570+ word families in the GSL + AWL lists.

For example, when I looked at the results in the "Clusters" window for my search on the word "lignin", I found that terms such as "lignin degradation" and "anoxic biotrickling filter" are frequently used across many texts, but I found them only through manual searches.

I was wondering whether there is any way to use the GSL + AWL lists as a stop list and find the most frequently occurring n-grams across the large number of texts in my corpus, so that I can do more corpus-informed teaching with my students.

Thank you once again. 

Best Regards 

Umut Salihoglu



On Sunday, 25 August 2013 at 04:44:42 UTC+3, Matthew Humphries wrote:

Laurence Anthony

Feb 15, 2014, 9:53:57 AM
to ant...@googlegroups.com
Hi Umut,

I'm a little confused. Did I suggest that you use a word processor to remove stop words? I rarely use word processors and generally dislike stop lists. This does not sound like me. Can you cite the post where I said that?

Am I right in thinking that you want to create a set of n-grams, but do not want to include 'noisy' n-grams that contain words like "the" and "a"? Please confirm this, and I'll think of an easy way to do it.

Laurence.



###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################



umu...@gmail.com

Feb 15, 2014, 10:16:14 AM
to ant...@googlegroups.com

Dear Dr. Anthony
I am delighted and really surprised to receive your quick response. To clear up the confusion about the word processor: I was actually referring to your message dated 25 August 2013 that was written for Matthew. I will block-quote it below.

"Hi Matthew,

Posts to the discussion group are great. You are certainly not flooding the discussion.

I'm sorry to say that AntConc does not have a stop list function for n-grams. There is a simple workaround though. If you export the results of the N-gram search to a good text editor like Notepad++, you can then use the search and replace function to delete all entries that contain your stop list words. You will need to use a regular expression for this to work. You will also have to remove each stopword one by one, but assuming the stoplist is not that long it is relatively simple. 

By the way, why do you want to apply a stoplist? Often, the combination of content words and function words reveals interesting patterns.

Laurence."

Based on the sentences that I boldfaced in the quotation above, I thought that this process had to be completed manually. I may have misunderstood the explanation about the manual work. What I actually wish to do is to find n-grams that can be assumed to be unknown to the students, which I will soon confirm with the Vocabulary Knowledge Scale test by Paribakht & Wesche.

I would very much appreciate your suggestions on this matter. Once again, thank you very much for reserving your precious time for people like me.


Regards 

Umut    

  

On Sunday, 25 August 2013 at 04:44:42 UTC+3, Matthew Humphries wrote:

Laurence Anthony

Feb 15, 2014, 10:21:55 AM
to ant...@googlegroups.com
Umut,

If you are intending to find which n-grams students know or not, how do you justify applying a stop list? Can you be sure that the students know the n-grams with the stop list words that you are about to delete before the experiment starts?

What about "on the house"? A stop list is likely to delete this n-gram because it includes "the", but I would suspect that many students do not know what it means in the sense of "free".

Again, am I right in thinking that you want to create a set of n-grams, but do not want to include 'noisy' n-grams that contain words like "the" and "a"? Please confirm this, and I'll think of an easy way to do it.

Laurence.



umut s

Feb 15, 2014, 10:55:04 AM
to ant...@googlegroups.com
Dear Dr. Anthony

You are entirely right about the application of stop lists in educational settings. Students might miss the opportunity to be exposed to various n-grams, idiomatic expressions, and lexical bundles; however, for my current project the students will get the chance to use various n-grams in communicative vocabulary learning activities.

On the other hand, I need the "above GSL level" n-grams (let's say bigrams) to test the effectiveness of various types of glossing in vocabulary learning tasks. I will use single-word and multiple-word glosses. I had the option of using non-words or very rare items for testing purposes, but for ethical reasons I didn't want my students to learn those non-words and rare items.

I wish to supply them with bigrams that they may meet in their future school years and careers. That's why I aim to find bigrams that are not only unknown to my students but are also frequent and occur in a range of books in their field (so that the students may regard the words as useful).

I share all your concerns about the authenticity of the language that the students are exposed to. Therefore, I will aim to provide them with single words and n-grams based on corpus findings for vocabulary learning.


You are absolutely  right in thinking that I want to create a set of n-grams, but do not want to include the n-grams that contain words from the GSL + AWL lists.
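The kind of filtering described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method proposed in this thread and not an AntConc function. The names `tokenize`, `ngrams`, and `filtered_ngrams` are hypothetical, the tokenizer handles only unaccented a-z words, and the sample texts and the short stop list stand in for the textbooks and the GSL + AWL lists. A real stop list would have to contain every member of each word family, not just the headwords.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and return simple a-z word tokens (an assumption)."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def filtered_ngrams(texts, stoplist, n=2, min_range=1):
    """Return (ngram, frequency, range) rows for n-grams with no stop-list word."""
    stop = {w.lower() for w in stoplist}
    freq = Counter()        # total occurrences across all texts
    text_range = Counter()  # number of texts in which the n-gram occurs
    for text in texts:
        # N-grams are formed first and filtered afterwards, so words that were
        # separated by a stop word are never joined into a false n-gram.
        grams = [g for g in ngrams(tokenize(text), n) if stop.isdisjoint(g)]
        freq.update(grams)
        text_range.update(set(grams))
    rows = [(" ".join(g), freq[g], text_range[g])
            for g in freq if text_range[g] >= min_range]
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))

# Hypothetical stand-ins for the textbooks and for the GSL + AWL word list.
texts = [
    "The lignin degradation rate was high.",
    "Lignin degradation occurs in the anoxic biotrickling filter.",
    "An anoxic biotrickling filter removes the gas.",
]
stoplist = ["the", "was", "in", "an", "rate", "high", "occurs", "removes", "gas"]

bigrams = filtered_ngrams(texts, stoplist, n=2, min_range=2)
trigrams = filtered_ngrams(texts, stoplist, n=3, min_range=2)
```

Here `min_range=2` keeps only n-grams that occur in at least two texts, which corresponds to the requirement that items be found in a range of books and not just be frequent in one.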

What I aim to do might not seem like a sound approach, but I think my quest to find words and n-grams that are both unknown and useful requires this kind of work.

Thanks 

Umut





Laurence Anthony

Feb 15, 2014, 10:59:43 AM
to ant...@googlegroups.com
OK. I understand what you need to do. What size of n-grams do you need to filter?

Laurence.




umut s

Feb 15, 2014, 11:03:29 AM
to ant...@googlegroups.com

I wish to extract bigrams and trigrams to start with.

Umut

On 15 Feb 2014 at 18:00, "Laurence Anthony" <antho...@gmail.com> wrote:

Laurence Anthony

Feb 15, 2014, 11:06:38 AM
to ant...@googlegroups.com
OK. Let me think if there's an easy way to do this without writing a programming script.

Laurence.




umut s

Feb 15, 2014, 11:13:11 AM
to ant...@googlegroups.com

Dr. Anthony
As I have told you in my first e-mail,  you are a true inspiration to a great number of people. I will be waiting for your reply. I do appreciate your every effort.

Thank you

Regards Umut

On 15 Feb 2014 at 18:07, "Laurence Anthony" <antho...@gmail.com> wrote: