R-package for Text Mining: QUANTEDA (Quantitative Analysis of Textual Data)


Neeraj Kaushik

unread,
May 30, 2022, 12:59:54 AM5/30/22
to dataanalysistraining
Dear Friends

The R package quanteda (Quantitative Analysis of Textual Data), developed by Prof. Kenneth Benoit and his team, is one of the most powerful packages for text analysis. Its version 3 was released on 01.03.2022.

Prof Kenneth Benoit (https://www.lse.ac.uk/Methodology/People/Academic-Staff/Kenneth-Benoit/Kenneth-Benoit) is Director of the Data Science Institute at the London School of Economics and Professor of Computational Social Science in the Department of Methodology. The initial development of QUANTEDA was supported by the European Research Council (https://quanteda.io/).

QUANTEDA is a family of packages, which includes the following:
1. quanteda
2. quanteda.textstats
3. quanteda.textplots
4. quanteda.textmodels
5. quanteda.sentiment
6. quanteda.dictionaries
7. quanteda.tidy

It is one of the most powerful yet easiest R packages for text analysis, and it can replace much of the proprietary software used for this purpose.

I've explained it in this video:

Quanteda-1. Introduction to R-package Quanteda: https://youtu.be/ypMODS_onn0

Happy Learning
Neeraj

Shubham Kakran

unread,
May 30, 2022, 1:20:19 AM5/30/22
to dataanalys...@googlegroups.com
Thank you, Sir, for sharing this. The package is really helpful.

--
Protocols of this Group:
 
1. Please search previous posts in the group before posting a question.
2. Don't write your query in someone else's post. Always use the New topic option for a new question. You can do this by writing to dataanalys...@googlegroups.com
3. It's better to give a proper subject to your post/query. It'll help others while searching.
4. Never write open-ended queries. This group is intended to help research scholars, NOT TO WORK FOR THEM.
5. Never write words like URGENT in your posts. People will help when they are free.
6. Never upload any information about National Seminars/Conferences. Send such information by personal email. And feel free to share any RESEARCH-related information.
7. No Happy New Year, Happy Diwali, Happy Holi, Happy Birthday, Happy Anniversary, etc. allowed in this group.
8. A few months back there was a facility for asking for and sharing research papers. There is no longer any provision for requesting research papers here.
 
Let's make a better research environment.
---
You received this message because you are subscribed to the Google Groups "DataAnalysis" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataanalysistrai...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataanalysistraining/CAAd%3Dc8MX-rBXPY0Rbuo-F35Y-WvtGZJYt_LQ%3Dnoq2K0kMhTq%3DA%40mail.gmail.com.

Dr. Raja Sankaran

unread,
May 30, 2022, 1:27:29 AM5/30/22
to Data Analysis
Dear Prof. Neeraj Kaushik,

It is great to see tons of useful and informative videos on various research tools from you. 
It is a definite treat for academicians and research scholars who are part of this group.
 
Regards

Dr Raja Sankaran



Neeraj Kaushik

unread,
May 30, 2022, 9:24:39 PM5/30/22
to dataanalysistraining
Dear Friends

Working with Quanteda is very easy, as we work through a sequence of functions. It goes like this:

1. readtext (reading the text files)
2. corpus (storing the documents and maintaining the position of each word in each document)
3. tokens (text cleaning at the token level)
4. dfm (document-feature matrix: counting the frequency of each word, called a feature here, in each document)
5. applying higher-level text-analysis techniques to the dfm output
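
The sequence above can be sketched in a few lines of R. This is a minimal illustration, assuming a folder "mytexts/" of plain-text files (the folder name is my own placeholder):

```r
library(readtext)   # step 1: read raw text
library(quanteda)   # steps 2-4: corpus, tokens, dfm

raw  <- readtext("mytexts/*.txt")      # 1. read the text files
corp <- corpus(raw)                    # 2. build a corpus
toks <- tokens(corp,
               remove_punct   = TRUE,  # 3. clean while tokenising
               remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)                     # 4. document-feature matrix
topfeatures(dfmat, 10)                 # 5. e.g. the ten most frequent features
```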

I've explained this sequence in this video:

Quanteda-2. Flowchart of Quanteda Functions: https://youtu.be/IUqpeToYqq8

Happy Learning
Neeraj

Neeraj Kaushik

unread,
May 31, 2022, 8:42:15 PM5/31/22
to dataanalysistraining
Dear Friends

Before we start our joyride with Quanteda, I've provided a brief summary of the prominent functions of different quanteda family packages in this video:

Quanteda-3. Introduction to different functions: https://youtu.be/MQjS8i5-p5Y

From the next video, we'll see the fun of practical working with quanteda.

Happy Learning
Neeraj

Neeraj Kaushik

unread,
Jun 2, 2022, 4:50:49 AM6/2/22
to dataanalysistraining
Dear Friends

Now it is time for practical work in RStudio.
I've explained the steps for installing the quanteda family of packages from CRAN and GitHub, then importing text files and creating a corpus, in this video:

Quanteda-4 Installing packages, importing text and creating corpus: https://youtu.be/2M5n5MJk3_g
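
As a rough sketch, the installation looks like this. Which family members sit on CRAN versus GitHub may change over time, so check each repository before installing:

```r
# Core family packages released on CRAN
install.packages(c("quanteda",
                   "quanteda.textstats",
                   "quanteda.textplots",
                   "quanteda.textmodels"))

# Some family members live on GitHub and need the remotes package
install.packages("remotes")
remotes::install_github("quanteda/quanteda.sentiment")
```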

Happy Learning
Neeraj

Neeraj Kaushik

unread,
Jun 3, 2022, 2:35:23 AM6/3/22
to dataanalysistraining
Dear Friends

After creating the corpus, the next step in QUANTEDA is to clean the text using the tokens function and then pass the clean tokens into the document-feature matrix (dfm). I've explained these in this video:

Quanteda-5.  Text cleaning using tokens, convert to dfm and finding word frequency: https://youtu.be/p7FA7mC7PT4
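
A tiny self-contained sketch of that cleaning-and-counting step (the two example sentences are my own; textstat_frequency comes from quanteda.textstats):

```r
library(quanteda)
library(quanteda.textstats)

corp <- corpus(c(d1 = "Text mining in R is fun.",
                 d2 = "R makes text mining easy and fun."))
toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))          # clean at the token level
dfmat <- dfm(toks)                        # convert clean tokens to a dfm
textstat_frequency(dfmat, n = 5)          # top features with frequencies
```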

Happy Learning
Neeraj

Neeraj Kaushik

unread,
Jun 3, 2022, 11:19:45 PM6/3/22
to dataanalysistraining
Dear Friends

One of the common problems in text analysis is that although we can count the frequency of a word and compute its association with other words, we miss the context in which a particular word is used. For this, Quanteda provides the keyword-in-context function, KWIC.

KWIC shows the given word along with the 5 words before it and the 5 words after it. Of course, we can change the number of words shown on either side. This helps us understand the full context in which a word is used.

Further, while a number of text-analysis packages provide visualization functionality, Quanteda surpasses them by providing a new visualization, the X-ray plot.

The X-ray plot shows where, and how often, a particular word occurs across the text.
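
A short sketch of both features, using quanteda's built-in corpus of US inaugural speeches (the keyword "liberty" is my own illustrative choice):

```r
library(quanteda)
library(quanteda.textplots)

toks <- tokens(data_corpus_inaugural)   # built-in corpus of speeches

# Keyword in context: 5 words either side by default; window is adjustable
kwic(toks, pattern = "liberty", window = 5)

# X-ray (lexical dispersion) plot: where "liberty" occurs in each speech
textplot_xray(kwic(toks, pattern = "liberty"))
```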

I've explained these features of Quanteda in this video:

Quanteda-6  Comparison cloud, Keyword in context and X-ray plot: https://youtu.be/qiSdsSECygE

Happy Learning
Neeraj

samiran

unread,
Jun 4, 2022, 7:04:10 AM6/4/22
to dataanalys...@googlegroups.com
Dear Neeraj Sir

Thank you so much for this update. Yes, though we can make word clouds, frequency counts, topic models, etc., it was problematic to find the context. Will definitely try to apply this.

Regards
Samiran Sur


Neeraj Kaushik

unread,
Jun 5, 2022, 8:51:18 PM6/5/22
to dataanalysistraining
Dear Friends

While discussing text analysis, we often encounter a problem related to multi-word expressions, e.g. New York or Los Angeles. Technically these are two separate words, but they are always read together as the name of a city. A multi-word expression of two words is called a bigram, one of three words a trigram (Republic of India), and so on.

Conventional text analysis, however, treats the parts as separate words and handles them accordingly. Other text-analysis packages do have procedures for handling bigrams and trigrams, but they are quite cumbersome.

Quanteda provides two functions: textstat_collocations to identify such expressions and tokens_compound to join their words with an underscore.
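
A sketch of these two functions (the example sentence is my own; on such a tiny text the statistics are purely illustrative):

```r
library(quanteda)
library(quanteda.textstats)

txt  <- "I flew from New York to Los Angeles, then from Los Angeles back to New York."
toks <- tokens(txt, remove_punct = TRUE)

# Identify multi-word expressions (size = 2 for bigrams, 3 for trigrams)
colls <- textstat_collocations(toks, size = 2, min_count = 2)

# Join each expression into a single underscore token: New_York, Los_Angeles
toks2 <- tokens_compound(toks, pattern = colls)
```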

I've explained this concept in this video:

Quanteda-7. Multi-word Expressions (Bigrams and Trigrams): https://youtu.be/hXnq0-PLR2g

Happy Learning
Neeraj


Neeraj Kaushik

unread,
Jun 7, 2022, 12:17:04 AM6/7/22
to dataanalysistraining
Dear Friends

The text frequency gives the words used most in a given text.
But at times we need to find the words that are unique to one document as compared to another.
This concept is called keyness.
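
A rough sketch of a keyness comparison on the built-in inaugural corpus (the split into "recent" vs "older" speeches is my own illustrative grouping):

```r
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

dfmat <- tokens(data_corpus_inaugural, remove_punct = TRUE) |> dfm()

# Group the documents into two sets to compare
dfmat <- dfm_group(dfmat,
  groups = ifelse(docvars(dfmat, "Year") >= 2000, "recent", "older"))

key <- textstat_keyness(dfmat, target = "recent")  # chi-squared by default
head(key)               # words most characteristic of the recent speeches
textplot_keyness(key)   # diverging bar plot of keyness scores
```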

I've explained this concept in this video:

Quanteda-8. Keyness in a target document: https://youtu.be/erXLNi5ENfw

Happy Learning
Neeraj

Neeraj Kaushik

unread,
Jun 7, 2022, 8:11:15 PM6/7/22
to dataanalysistraining
Dear Friends

Quanteda provides functions for computing similarities and distances between different texts.
It also provides measures of lexical diversity (https://en.wikipedia.org/wiki/Lexical_diversity).
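
A brief sketch of all three, again on a slice of the built-in inaugural corpus (the first five speeches and the chosen measures are my own illustrative picks):

```r
library(quanteda)
library(quanteda.textstats)

toks  <- tokens(data_corpus_inaugural[1:5])
dfmat <- dfm(toks)

textstat_simil(dfmat, method = "cosine")     # pairwise document similarity
textstat_dist(dfmat, method = "euclidean")   # pairwise document distance
textstat_lexdiv(toks, measure = "TTR")       # lexical diversity (type-token ratio)
```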

I've explained these concepts in this video:
Quanteda-9. Similarity, Distance  and Lexical Diversity of documents: https://youtu.be/nB9SAhKtfF4

Happy Learning
Neeraj

Neeraj Kaushik

unread,
Jan 8, 2024, 7:12:14 PM1/8/24
to dataanalysistraining
Dear Friends

Topic modeling can be done using the quanteda and stm packages.
I've explained the workflow in this video:

Quanteda-10  Topic Modeling using stm package: https://youtu.be/nOnKT714j3g
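
As a rough sketch of this workflow (the built-in inaugural corpus and K = 5 topics are my own illustrative choices, not from the video):

```r
library(quanteda)
library(stm)

dfmat <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_trim(min_termfreq = 5)       # drop very rare terms

# quanteda dfm objects convert directly to stm's input format
stm_input <- convert(dfmat, to = "stm")
fit <- stm(documents = stm_input$documents,
           vocab     = stm_input$vocab,
           K = 5, verbose = FALSE)  # K = number of topics
labelTopics(fit)                    # top words per topic
```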

Happy Learning
Neeraj

Niranjan Daddikar

unread,
Jan 10, 2024, 4:29:53 AM1/10/24
to dataanalys...@googlegroups.com
Thank you for sharing Sir!



--
Niranjan V. Daddikar
+91 98864 37515

Neeraj Kaushik

unread,
Nov 3, 2024, 2:31:45 AM11/3/24
to dataanalysistraining
Dear Friends

Now that the US Presidential Elections are just 3-4 days away, I thought of using the QUANTEDA family of
R packages to analyze the speeches of Donald Trump and Kamala Harris. I downloaded 2 speeches of Donald Trump and 3 speeches of Kamala Harris.

The following research questions are explored:
1. Choice of words - words they speak
2. Choice of words - words they never speak
3. Who uses simpler English in their speeches?
4. How much novelty is there in the speeches?
5. Who has the richer vocabulary?
6. In which contexts are different words spoken, and how much emphasis is given to them?
7. Relative positioning of documents
8. Relative positioning of words in documents
9. Probability of each word for different documents
10. Which topics/themes are stressed by each speaker?
11. Which emotions are triggered most by each candidate?

I explained these in the following 3 videos:

Text Analysis of USA Presidential Speeches using QUANTEDA Family Packages Part-1: https://youtu.be/VWeOpN_XSpw

Text Analysis of USA Presidential Speeches using QUANTEDA Family Packages Part-2: https://youtu.be/jRgUhDqX2kk

Text Analysis of USA Presidential Speeches using QUANTEDA Family Packages Part-3: https://youtu.be/ELWYMMhaWGo

All the relevant files are given in 

Happy Learning
Neeraj

Aleena Ilyas

unread,
Nov 3, 2024, 4:56:13 AM11/3/24
to dataanalys...@googlegroups.com
Dear Kaushik Sir,

Thank you so much Sir for teaching us text mining using Quanteda.

Recently, I started learning text analytics in R studio from your videos on quanteda and topic modelling. They have been very helpful. 

While referring to some recent papers in Technological Forecasting and Social Change, I read some new concepts like topic correlation, topic prevalence, topic prevalence using any co-variate, diagnostic values by number of topics using held-out likelihood and residual method. I am keen to learn these techniques also to enhance the quality of my paper that I am currently working on. 

Please suggest some books to learn about these topics. Your videos on similar topics would also be very helpful. 

Regards,  
--
Aleena Ilyaz
Research Scholar
Department of Management Studies
Jamia Millia Islamia, New Delhi







Neeraj Kaushik

unread,
Nov 3, 2024, 8:43:58 PM11/3/24
to dataanalys...@googlegroups.com
Dear Aleena

I will read up on these concepts and get back to you soon.
In the meanwhile, enjoy sentopics, one of the fastest R packages for topic modelling.
We get excellent output in just 5 lines of code.

I explained this R package in this video:

Text Analysis of USA Presidential Speeches using sentopics: https://youtu.be/dFPv_0brRH0

Happy Learning
Neeraj


Neeraj Kaushik

unread,
Nov 22, 2024, 7:00:48 PM11/22/24
to dataanalysistraining
Dear Friends

In the text-analysis series so far, I missed discussing stemming and lemmatization.

Stemming and lemmatization are both techniques used in text analysis to reduce words to their base or root form. However, they differ in their approach and results.

Here's a comparison (courtesy: ChatGPT):

1. Stemming
Definition: Stemming reduces words to their root form by chopping off suffixes, often producing non-dictionary forms.
Method: It uses heuristic, rule-based algorithms (e.g., Porter Stemmer, Snowball Stemmer) that operate on the surface structure of words.
Output: The root may not always be a valid word in the language.
Example:
"running" → "run"
"better" → "better" (unchanged; stemmers cannot map irregular forms)
"studies" → "studi"
"festivals" → "festiv"
Use Case: Quick, computationally efficient tasks where approximate results are acceptable, such as search engines.

2. Lemmatization
Definition: Lemmatization reduces words to their base or dictionary form (the lemma) by considering the word's context and grammatical role.
Method: It uses a vocabulary and morphological analysis to ensure that the root form is a meaningful word. In particular, it handles singulars and plurals and verb forms such as go-went-gone.
Output: The root is always a valid word.
Example:
"running" → "run"
"better" → "good"
"studies" → "study"
Use Case: Tasks requiring high precision and accuracy, such as language translation or sentiment analysis.

Illustrations:
Stemming: ("argues," "argued," "arguing") → "argu"
Lemmatization: ("argues," "argued," "arguing") → "argue"
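
Both operations can be sketched in quanteda. Stemming is built in via tokens_wordstem; lemmatization is typically done with tokens_replace and a lemma lookup table (the three-entry table below is a tiny illustration of my own; in practice you would use a full lexicon such as lexicon::hash_lemmas):

```r
library(quanteda)

toks <- tokens("She argues that running daily improves her studies.",
               remove_punct = TRUE)

# Stemming: rule-based, may yield non-words (Snowball/Porter via SnowballC)
tokens_wordstem(toks)

# Lemmatization: dictionary lookup mapping each form to its lemma
lemmas <- c(argues = "argue", running = "run", studies = "study")
tokens_replace(toks, pattern = names(lemmas), replacement = unname(lemmas))
```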

I've explained the working of lemmatization in this video:

Quanteda-13  Lemmatization of words: https://youtu.be/lgnSyquQC2I

The correct sequence of text analysis is as follows:
import text → corpus → lemmatize → tokens → dfm → topic modeling → predictive text analysis

Happy Learning
Neeraj

Oussama R

unread,
Nov 23, 2024, 6:23:11 PM11/23/24
to dataanalys...@googlegroups.com

Hello Professor Neeraj, could you make a video on regression based on factors (topics) extracted from a text corpus? For example, we may want to understand the impact of an extracted topic on overall satisfaction. Please see the attached file.

Thank you :) 


factors influencing recommanadation for women LDA.pdf

Neeraj Kaushik

unread,
Nov 23, 2024, 6:30:40 PM11/23/24
to dataanalys...@googlegroups.com

Oussama R

unread,
Nov 23, 2024, 8:12:19 PM11/23/24
to dataanalys...@googlegroups.com

Neeraj Kaushik

unread,
Dec 1, 2024, 6:56:48 PM12/1/24
to dataanalys...@googlegroups.com
Dear Friends

LDAvis is an excellent package for quickly visualizing topic models.
I've explained the same in this video:

LDAvis R-package for Interactive Visualization of Topic Models: https://youtu.be/zJTtMhggZxc



Happy Learning
Neeraj


LDAvis.R

Neeraj Kaushik

unread,
Dec 21, 2024, 5:42:33 AM12/21/24
to dataanalys...@googlegroups.com
Dear Friends
Scholars at the start of their research journey can perform an SLR (systematic literature review) and then work on the following:
1. Hypothesis formulation
2. Meta-analysis
3. Bibliometric analysis
4. Text analysis
5. Content analysis

In text analysis, one way of working is to export data from Scopus in CSV format and apply text analysis to it.
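
A sketch of that route. Scopus CSV exports include an abstract column; the file name "scopus_export.csv" and the column name "Abstract" are assumptions, so check them against your own export:

```r
library(quanteda)

scopus <- read.csv("scopus_export.csv", stringsAsFactors = FALSE)
corp   <- corpus(scopus, text_field = "Abstract")  # abstracts as documents
toks   <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_remove(stopwords("en"))
dfmat  <- dfm(toks)
topfeatures(dfmat, 20)   # most frequent terms across the abstracts
```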

I've explained the same in this video:

Quanteda-14 Working on Scopus Abstract: https://youtu.be/YmjBEYUp8tY

Happy Learning
Neeraj
Pre processing Text Analysis.R