🔥 Beat the Heat with Cool Advanced Text Analysis🔥: A Supervised & Unsupervised Machine Learning Text Analysis Series!

Neeraj Kaushik

May 31, 2026, 7:22:51 PM
to dataanalysistraining

Dear Friends

I would like to begin by expressing my heartfelt gratitude to Dr. Fakhar Shahzad, The Hong Kong Polytechnic University, Hong Kong (SAR), China, whose motivation and encouragement inspired me to explore and learn these powerful concepts in machine learning and text analysis.

Let's ponder this brilliant quote by John Maxwell:

[image.png: quote by John Maxwell]

In today’s fast-changing world, the best investment we can make is in learning something new. Whether you're a student, researcher, or professional, stepping out of your comfort zone and exploring new concepts like machine learning can open doors to endless opportunities. So let’s challenge ourselves to learn, experiment, and grow together!

In this series, we will explore both Supervised and Unsupervised Machine Learning techniques for text analysis, combining conceptual clarity with practical implementation in R.

Supervised machine learning works with labeled data to predict outcomes. We will cover:

  • Naive Bayes (NB / NBA) – a probabilistic classifier based on Bayes’ theorem, highly effective for text classification.
  • Logistic Regression (LR) – widely used for binary and multiclass prediction problems.
  • Support Vector Machines (SVM) – powerful models that find optimal boundaries between classes.
  • PLS-DA (Partial Least Squares Discriminant Analysis) – ideal for high-dimensional datasets with many features.

Unsupervised machine learning, on the other hand, focuses on discovering hidden patterns in unlabeled data. We will explore:

  • Latent Semantic Analysis (LSA) – uncovers relationships between terms and documents using dimensionality reduction.
  • Word Embeddings (GloVe, Word2Vec) – transform words into meaningful vector representations capturing semantic relationships.
  • Principal Component Analysis (PCA) – reduces dimensionality by transforming features into principal components.
  • t-SNE – useful for visualizing high-dimensional text data in a lower-dimensional space.
  • UMAP – an advanced technique for dimension reduction that better preserves structure and clusters.
  • Latent Dirichlet Allocation (LDA) – a popular method for extracting topics from large text corpora.

We are starting this journey with Naive Bayes Analysis (NBA), building a strong foundation in probability, preprocessing, model building, and evaluation step by step.

In the first video, I revisited the foundational concepts that are essential for understanding supervised machine learning, especially in the context of text analysis. 

We begin with probability, which helps quantify uncertainty and measure how likely an event is to occur. Building on this, we introduce conditional probability, which tells us the likelihood of an event happening given that another event has already occurred—an important idea when dealing with real-world data dependencies.

The video then progresses to Bayes’ Theorem, a powerful mathematical formula that connects prior knowledge with new evidence. We explain how prior probability (what we already know), likelihood (the probability of observing the data), and evidence combine to compute the posterior probability—our updated belief after seeing the data. This concept forms the backbone of the Naive Bayes algorithm. 
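To make the pieces of Bayes' Theorem concrete, here is a small worked example in base R. The prior and likelihood values are made up purely for illustration and are not taken from the video.

```r
# Hypothetical numbers, chosen only to illustrate the arithmetic.
prior      <- c(Democratic = 0.4,  Republican = 0.6)    # P(class)
likelihood <- c(Democratic = 0.05, Republican = 0.01)   # P(word | class)

# Evidence: total probability of observing the word, P(word)
evidence <- sum(prior * likelihood)

# Posterior: P(class | word) = P(word | class) * P(class) / P(word)
posterior <- prior * likelihood / evidence
print(round(posterior, 3))
```

With these assumed inputs, the posterior for the Democratic class rises from a prior of 0.40 to about 0.77, because the word is five times more likely under that class.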

I've explained this in the video:  Naive Bayes Analysis-1: https://youtu.be/kOxveoGyCx4

Happy Learning

Neeraj

NBA.pdf

Neeraj Kaushik

Jun 1, 2026, 7:26:28 PM
to dataanalysistraining
Dear Friends

In this second video, we walk through the step-by-step workflow of performing Naive Bayes analysis for text data using the quanteda framework. 

The process begins with importing raw text data, which is then converted into a quanteda corpus—a structured format that makes text easier to manage and analyze.

Next, we transform the corpus into tokens, where detailed text cleaning takes place, such as removing punctuation, stopwords, and unnecessary characters. These cleaned tokens are then converted into a document-feature matrix (DFM), which represents the frequency of terms across documents. To improve model efficiency, we apply dfm_trim to remove sparse and less informative features.

Finally, the dataset is split into training and testing sets, allowing us to train the Naive Bayes model and evaluate its performance. This workflow provides a clear and practical pipeline for text classification tasks.
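The first half of this workflow (text → corpus → tokens → DFM → dfm_trim) might look roughly like the sketch below. The three toy documents and the trimming threshold are assumptions for illustration, not the data or settings used in the video.

```r
library(quanteda)

# Toy documents standing in for the imported raw text (hypothetical data).
txt <- c(d1 = "Freedom and liberty for all the people!",
         d2 = "The people want jobs, jobs and lower taxes.",
         d3 = "Liberty, freedom and justice for the people.")

corp <- corpus(txt)                                    # raw text -> corpus
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))           # drop stopwords
dfmat <- dfm(toks)                                     # tokens -> DFM

# Keep only features found in at least 2 documents (illustrative threshold).
dfmat_trim <- dfm_trim(dfmat, min_docfreq = 2, docfreq_type = "count")
print(featnames(dfmat_trim))
```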

Naive Bayes Analysis-2: https://youtu.be/dC8ahihbw8M

Happy Learning
Neeraj

Neeraj Kaushik

Jun 2, 2026, 8:12:39 PM
to dataanalysistraining
Dear Friends

In this 3rd video, we dive into the practical implementation of Naive Bayes Analysis (NBA) using R.

Starting with reading text data from an Excel file, we extract the input text and corresponding labels for classification.

The preprocessing phase involves cleaning the text by removing numbers, punctuation, and common stopwords to ensure meaningful feature extraction.

The cleaned text is then transformed using the quanteda-based workflow, where we generate a document-feature matrix and apply dfm_trim to retain only the most relevant terms. Next, we split the data into training and testing sets using a reproducible sampling method.

We then build the Naive Bayes model using the training data and align the test features with the training set using dfm_match.

Finally, we generate predictions and evaluate model performance using a confusion matrix. This end-to-end demonstration provides a clear understanding of implementing text classification in R.
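A minimal sketch of the split → train → dfm_match → predict → confusion matrix steps is given below. Since speech.xlsx is not reproduced here, it assumes the inaugural speeches shipped with quanteda (data_corpus_inaugural) as stand-in data; the seed, split ratio, and trimming threshold are also illustrative choices.

```r
library(quanteda)
library(quanteda.textmodels)

# Stand-in data: inaugural speeches shipped with quanteda, labelled by party.
corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))
dfmat <- dfm_trim(dfmat, min_docfreq = 2)

set.seed(123)                                    # reproducible split
train_id <- sample(ndoc(dfmat), size = round(0.8 * ndoc(dfmat)))
dfm_train <- dfmat[train_id, ]
dfm_test  <- dfmat[-train_id, ]

nb <- textmodel_nb(dfm_train, y = factor(as.character(dfm_train$Party)))

# Give the test set exactly the features the model was trained on.
dfm_test <- dfm_match(dfm_test, features = featnames(dfm_train))
pred <- predict(nb, newdata = dfm_test)

conf_mat <- table(predicted = pred,
                  actual = factor(as.character(dfm_test$Party)))
print(conf_mat)
```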

Naive Bayes Analysis-3: https://youtu.be/hStV7HR7BuY

Happy Learning
Neeraj

NBYT.R
speech.xlsx

Neeraj Kaushik

Jun 3, 2026, 7:17:04 PM
to dataanalysistraining
Dear Friends

In the 4th video, we build upon the concepts covered in the previous lessons by revising the complete Naive Bayes workflow and applying it to predict outcomes on new, unseen text data.

We demonstrate how to preprocess fresh inputs by converting text to lowercase, performing lemmatization, and transforming it into tokens and a document-feature matrix using quanteda. The new data is then aligned with the training features using dfm_match to ensure compatibility with the trained model.

We then generate predictions along with their associated probabilities, helping viewers understand not just the classification but also the model’s confidence.
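As a rough sketch of scoring a new sentence with class probabilities: the example below again assumes data_corpus_inaugural as stand-in training data and a made-up input sentence. Lemmatization is left out because the post does not say which package is used for it.

```r
library(quanteda)
library(quanteda.textmodels)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfm_train <- dfm(tokens(corp, remove_punct = TRUE))
nb <- textmodel_nb(dfm_train, y = factor(as.character(dfm_train$Party)))

# A new, unseen sentence (made up for illustration).
new_text <- c(new1 = "We will cut taxes and secure the border.")
toks_new <- tokens_tolower(tokens(new_text, remove_punct = TRUE))
dfm_new  <- dfm(toks_new)

# Align to the training vocabulary; words the model never saw are dropped.
dfm_new <- dfm_match(dfm_new, features = featnames(dfm_train))

pred_class <- predict(nb, newdata = dfm_new)                        # label
pred_prob  <- predict(nb, newdata = dfm_new, type = "probability")  # confidence
print(pred_class)
print(round(pred_prob, 3))
```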

Additionally, the video introduces the dfm_lookup function, which allows us to search for specific words (e.g., “immigrant”) within the dataset and analyze their frequency and distribution across documents.

To deepen understanding, we also convert the results into a structured dataframe for interpretation.
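One possible way to look up a word and turn the counts into a data frame is sketched below. The dictionary key and the glob pattern are my own choices, and data_corpus_inaugural is assumed as stand-in data.

```r
library(quanteda)

dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))

# Dictionary with one key; the glob pattern also matches "immigrants".
dict <- dictionary(list(immigrant = "immigrant*"))
hits <- dfm_lookup(dfmat, dictionary = dict)

# Convert to a data frame: one row per document, one column per key.
hits_df <- convert(hits, to = "data.frame")
print(head(hits_df[order(-hits_df$immigrant), ], 5))
```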

Finally, we discuss different types of DFMs available in quanteda, giving learners a broader perspective on text feature representation and analysis techniques.

Naive Bayes Analysis-4: https://youtu.be/E37eVfXhGqM

Happy Learning
Neeraj

Neeraj Kaushik

Jun 4, 2026, 7:14:53 PM
to dataanalysistraining
Dear Friends

In the 5th video, we extend our Naive Bayes text classification analysis by exploring how to extract the top words associated with each class (party) from the trained model.

Using the model parameters, we identify the most influential terms for both Democratic and Republican categories and compare them to find overlapping as well as unique words that distinguish each group. This helps in interpreting the model and understanding key linguistic patterns across classes.

Next, we perform additional and more rigorous text cleaning using the tm package, including removing numbers, punctuation, and stopwords, followed by rebuilding the document-feature matrix with optimized trimming thresholds. The model is retrained on this refined dataset to observe improvements in feature quality and classification insights.

Finally, we use dfm_group and textplot_wordcloud to create a comparison word cloud, visually representing the most important terms for each party. This graphical approach makes it easier to interpret differences in language usage and enhances the overall understanding of text classification results.
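The sketch below shows one way the top words per class and the comparison cloud could be produced. It assumes that the fitted object stores the class-conditional word probabilities in nb$param (one row per class), uses data_corpus_inaugural as stand-in data, and cleans with quanteda only rather than tm; the thresholds are illustrative.

```r
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))
dfmat <- dfm_trim(dfmat, min_termfreq = 5)
party <- factor(as.character(dfmat$Party))

nb <- textmodel_nb(dfmat, y = party)

# Assumption: nb$param holds P(word | class), one row per class.
top_dem <- names(sort(nb$param["Democratic", ], decreasing = TRUE))[1:20]
top_rep <- names(sort(nb$param["Republican", ], decreasing = TRUE))[1:20]
shared   <- intersect(top_dem, top_rep)   # prominent in both parties
only_dem <- setdiff(top_dem, top_rep)     # prominent only for Democratic
only_rep <- setdiff(top_rep, top_dem)     # prominent only for Republican

# One row per party, then a comparison word cloud.
dfm_party <- dfm_group(dfmat, groups = party)
textplot_wordcloud(dfm_party, comparison = TRUE, max_words = 100)
```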

Naive Bayes Analysis-5: https://youtu.be/lY5uIM5QDqw

Happy Learning
Neeraj

Neeraj Kaushik

Jun 7, 2026, 10:26:44 PM
to dataanalysistraining
Dear Friends,

Having explored Naive Bayes Analysis, we now move to Logistic Regression (LR) for text analysis. While Naive Bayes assumes feature independence, Logistic Regression is a discriminative model that estimates the probability of a document belonging to a specific class using the logistic function.

So we now examine how LR handles high-dimensional text data. We will cover the implementation of binary and multiclass classification, the importance of regularization to prevent overfitting, and how to interpret model coefficients to identify key predictive terms.

We start by revisiting the key differences between supervised and unsupervised machine learning techniques to help understand where Logistic Regression fits in the broader machine learning landscape.

Next, we focus on the Document-Feature Matrix (DFM)—a crucial representation in text analysis. We explain what a DFM is and what its values signify, giving insight into how text data gets transformed into a structured numerical format suitable for modeling.

We then introduce Green’s formula for sample size in regression, highlighting an important practical challenge in text analysis: having a small number of documents but a very large number of features (words). This imbalance can lead to issues such as overfitting and instability in Logistic Regression models.
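For reference, the commonly cited form of Green's (1991) rules of thumb can be written as a tiny helper. The function name and the two example values of m are my own; the video may present the formula differently.

```r
# Green's (1991) rules of thumb, with m = number of predictors:
#   testing the overall model:        N >= 50 + 8 * m
#   testing individual predictors:    N >= 104 + m
green_min_n <- function(m) c(overall = 50 + 8 * m, individual = 104 + m)

small_model <- green_min_n(5)      # a regression with 5 predictors
text_model  <- green_min_n(2000)   # a DFM with 2,000 word features
print(small_model)
print(text_model)
```

With 2,000 word features the rule asks for 16,050 documents for the overall test, which shows why a few dozen speeches fall far short and regularization becomes necessary.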

To address this, we explain the concept of regularization, which helps control model complexity. Finally, we compare Lasso (L1) and Ridge (L2) regression, discussing how each technique penalizes model coefficients differently and helps improve model performance in high-dimensional text data scenarios.
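In symbols, the two penalties can be written as follows. This is the generic textbook form, where \ell(\beta) is the logistic log-likelihood, p is the number of features, and \lambda \ge 0 controls the penalty strength; specific software may scale the terms slightly differently.

```latex
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \Big[ -\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big]
\qquad
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \Big[ -\ell(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2} \Big]
```

The absolute-value (L1) penalty can push coefficients exactly to zero, so Lasso also selects features. The squared (L2) penalty shrinks coefficients towards zero but generally keeps all of them in the model.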

This video sets the stage for effectively applying Logistic Regression to real-world text datasets.

Logistic Regression PDF: https://drive.google.com/file/d/1Mf1aL_TPyNVCwHhpDYZlzgqFW_A4Vubi/view?usp=drive_link

Logistic Regression-1: https://youtu.be/KP3TgR3NvkA

Happy Learning
Neeraj

Neeraj Kaushik

Jun 8, 2026, 7:17:21 PM
to dataanalysistraining
Dear Friends

In this second video, we move from concepts to hands-on implementation of Logistic Regression in R for text analysis.
We begin by constructing the Document-Feature Matrix (DFM) and then explain two important transformations: dfm_trim, which helps reduce sparsity by removing less informative terms, and dfm_tfidf, which assigns weights to words based on their importance across documents.

A key highlight of this video is understanding common pitfalls in preprocessing pipelines.

We discuss how an incorrect sequence like dfm → dfm_trim → dfm_tfidf → Split → dfm_match can lead to data leakage, where information from the test set unintentionally influences the model.

Similarly, using dfm → dfm_trim → Split → dfm_tfidf → dfm_match can cause feature weight desynchronization, resulting in inconsistent term weighting between training and testing data.

To address these issues, we demonstrate the correct workflow:
dfm → dfm_trim → Split → dfm_match → dfm_tfidf

This ensures proper separation of training and testing data and consistent feature representation. By the end, viewers gain both practical skills and a deeper understanding of building reliable and robust text classification models.
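The recommended order (dfm → dfm_trim → Split → dfm_match → dfm_tfidf) can be sketched as below. The stand-in data (data_corpus_inaugural), seed, split ratio, and trimming threshold are assumptions for illustration.

```r
library(quanteda)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE))        # dfm
dfmat <- dfm_trim(dfmat, min_docfreq = 2)              # dfm_trim

set.seed(42)                                           # Split
train_id <- sample(ndoc(dfmat), size = round(0.8 * ndoc(dfmat)))
dfm_train <- dfmat[train_id, ]
dfm_test  <- dfmat[-train_id, ]

# dfm_match: the test set takes exactly the training features, in the same order.
dfm_test <- dfm_match(dfm_test, features = featnames(dfm_train))

# dfm_tfidf: weighting comes last and is applied to each set separately.
dfm_train_w <- dfm_tfidf(dfm_train)
dfm_test_w  <- dfm_tfidf(dfm_test)
```

Note that dfm_tfidf computes document frequencies from whichever dfm it is given, so the training weights here never see the test documents.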

Logistic Regression-2: https://youtu.be/jQkriszaoeY

Happy Learning
Neeraj

Neeraj Kaushik

Jun 9, 2026, 8:43:31 PM
to dataanalysistraining
Dear Friends

Continuing from the previous discussion, let's begin by importing the text data and performing thorough preprocessing: removing numbers, punctuation, and stopwords. The cleaned text then passes through a structured pipeline (lower → lemmatization → corpus → tokens → dfm) using a custom function to ensure consistency and efficiency.

Next, we refine the document-feature matrix using dfm_trim, then group the documents by party and apply the textstat_keyness function to identify the words that most distinguish each group. Specifically, we analyze the inaugural speeches of Democratic and Republican U.S. presidents to uncover the most characteristic terms used by each party.

“Keyness” refers to words that are statistically significantly more frequent in one group than in another. Unlike simple frequency counts, keyness highlights words that truly differentiate one category from another, making it highly valuable for comparative text analysis.

Using textstat_keyness, we demonstrate how to compute these distinguishing features from a document-feature matrix grouped by political party. The output helps us identify which terms are overused or underused by each party, providing insights into their ideological focus, communication style, and priorities. This approach is particularly useful in political text analysis, sentiment studies, and discourse analysis.
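A minimal sketch of the keyness step, assuming the inaugural speeches bundled with quanteda (data_corpus_inaugural) and illustrative cleaning thresholds, could look like this:

```r
library(quanteda)
library(quanteda.textstats)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))
dfmat <- dfm_trim(dfmat, min_termfreq = 5)

# Collapse to one row per party, then compare Democratic against the rest.
dfm_party <- dfm_group(dfmat, groups = factor(as.character(dfmat$Party)))
key <- textstat_keyness(dfm_party, target = "Democratic")

print(head(key, 10))   # most over-represented in Democratic speeches
print(tail(key, 10))   # most over-represented in Republican speeches
```

The default measure is chi-squared; positive values mark terms used more than expected by the target group, negative values mark terms used more by the reference group.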

I've explained these concepts in this video:

Logistic Regression-3: https://youtu.be/YL_vPDBVLK8

Happy Learning
Neeraj

Neeraj Kaushik

Jun 10, 2026, 7:15:53 PM
to dataanalysistraining
Dear Friends

Next we bring together all the concepts learned so far and demonstrate a complete end-to-end pipeline for Logistic Regression (LR) in text analysis using R. We begin with importing the text data and performing thorough preprocessing by removing numbers, punctuation, and stopwords. The cleaned text is then passed through a structured pipeline (lower → lemmatization → corpus → tokens → dfm) using a custom function to ensure consistency and efficiency.

Next, we refine the document-feature matrix using dfm_trim, followed by grouping data to extract key discriminating words for each party using textstat_keyness. This provides interpretability by identifying features that distinguish Democratic and Republican speeches.

We then split the data into training and testing sets, apply TF-IDF weighting, and build the Logistic Regression model. Model performance is evaluated first on test data and then on new and random words, highlighting how predictions work in practice.

Finally, we examine model coefficients and discuss an important limitation of Lasso regularization (default in textmodel_lr)—it tends to retain only a very small number of features (in this case, just a few words), which may oversimplify complex text patterns.

Logistic Regression-4: https://youtu.be/7kYAv_dE2Hg

Happy Learning
Neeraj

Neeraj Kaushik

Jun 11, 2026, 7:20:20 PM
to dataanalysistraining
Dear Friends

In the next video, we advance our text analysis journey by implementing Ridge Logistic Regression, a powerful alternative to Lasso that helps overcome the limitations of excessive feature reduction.

We begin by importing and preprocessing the text data using a structured pipeline, followed by creating a Document-Feature Matrix (DFM) and applying dfm_trim to retain meaningful terms. The data is then split into training and testing sets, and TF-IDF weighting is applied to ensure proper feature scaling.

A key highlight of this video is the use of the glmnet package to implement Ridge Regression with automatic penalty (lambda) selection using cross-validation. This ensures that the model is optimally tuned, especially important for small datasets with high-dimensional text features. We then evaluate model performance on test data using a confusion matrix and also test predictions on new unseen text inputs.

Finally, we extract and analyze model coefficients to identify the top words associated with each party. Unlike Lasso, Ridge retains a larger set of words, providing richer insights into text patterns.
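To see the Lasso vs. Ridge difference in retained words, here is a rough sketch with glmnet. It fits both models on all documents (no train/test split, to keep the example short), assumes data_corpus_inaugural as stand-in data, and uses an arbitrary seed and fold count.

```r
library(quanteda)
library(glmnet)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))
dfmat <- dfm_trim(dfm_remove(dfmat, stopwords("en")), min_docfreq = 5)
x <- as(dfm_tfidf(dfmat), "dgCMatrix")         # sparse matrix for glmnet
y <- factor(as.character(dfmat$Party))

set.seed(2026)
ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 5)  # L2
lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)  # L1

# Coefficients at the cross-validated lambda, intercept dropped.
b_ridge <- coef(ridge, s = "lambda.min")[-1, 1]
b_lasso <- coef(lasso, s = "lambda.min")[-1, 1]
print(c(ridge = sum(b_ridge != 0), lasso = sum(b_lasso != 0)))

# Positive coefficients push towards the second factor level ("Republican").
print(head(sort(b_ridge, decreasing = TRUE), 10))
```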

This video clearly demonstrates how Ridge regression improves model interpretability and stability in text classification tasks.

Logistic Regression-5: https://youtu.be/oWJU2ULnlAs

Happy Learning
Neeraj

Neeraj Kaushik

Jun 14, 2026, 7:26:10 PM
to dataanalysistraining
Dear Friends

Now we can proceed with the next technique: Support Vector Machine (SVM), one of the most powerful supervised machine learning algorithms for text classification.

We start by comparing the "Decision Makers" of the NLP family, explicitly differentiating between Naive Bayes (NBA) (the fast probabilistic classifier), Logistic Regression (LR) (the probability predictor), and SVM (the geometric boundary finder that searches for the optimal separating hyperplane).

Next, we discuss:

Navigating the Docs: We consult rdocumentation.org live to break down the exact syntax, arguments, and underlying structures required to implement an SVM in R.

The e1071 Package: A deep dive into the industry-standard e1071 R package, looking at how to set up, tune, and execute your classification models.

Feature Engineering Weights: The ultimate text classification workflow showdown!

We discuss and implement the workflow using three distinct types of matrix weighting strategies to see how they impact model accuracy:

dfm_tfidf: Term frequency-inverse document frequency weighting.
dfm_weight (Proportion): Relative word frequencies within documents.
dfm_weight (Boolean): Binary presence/absence tracking (1 or 0).
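The three weighting schemes can be compared side by side on a toy DFM. The three short documents below are made up for illustration.

```r
library(quanteda)

# Toy documents (hypothetical) to make the weights easy to read.
txt <- c(d1 = "tax tax jobs", d2 = "jobs peace", d3 = "peace peace peace tax")
dfmat <- dfm(tokens(txt))

w_tfidf <- dfm_tfidf(dfmat)                        # tf-idf weights
w_prop  <- dfm_weight(dfmat, scheme = "prop")      # within-document proportions
w_bool  <- dfm_weight(dfmat, scheme = "boolean")   # 1 if present, 0 if absent

print(as.matrix(w_tfidf))
print(as.matrix(w_prop))
print(as.matrix(w_bool))
```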

I've explained these concepts in this video:
Support Vector Machine-1: https://youtu.be/jDOwJe8vRVY

Happy Learning
Neeraj
SVM.R

Neeraj Kaushik

Jun 15, 2026, 7:18:23 PM
to dataanalysistraining
Dear Friends

In the next video, we jump straight into R to build and deploy high-performance Support Vector Machine (SVM) classifiers for text data. We move from theory to implementation by exploring two vital ecosystem approaches for text mining.

What we cover in this hands-on coding tutorial:

Fast Linear SVMs with textmodel_svmlin: We dive into the quanteda.textmodels package to build a fast, linear SVM classifier. You will see how to fit the model directly on a sparse Document-Feature Matrix (DFM) and use the predict() function to classify brand-new, unseen text strings.  

Deep Feature Extraction: Using the classic e1071 package, we look past simple prediction accuracy to explore model interpretability. We extract the internal feature weights to isolate and rank the top 20 most influential words driving the classification split for each target group or political party.

By the end of this session, we will know exactly how to train an SVM, project new textual data into the decision space, and reverse-engineer the model to pull out the keywords shaping its predictions.
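Both approaches can be sketched roughly as follows. The data (data_corpus_inaugural), the new sentence, the proportion weighting, and scale = FALSE are my assumptions; the weight formula t(coefs) %*% SV is the usual way to recover a linear-kernel weight vector from an e1071 fit, and the sign convention depends on the class order e1071 stores.

```r
library(quanteda)
library(quanteda.textmodels)
library(e1071)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))
dfmat <- dfm_trim(dfm_remove(dfmat, stopwords("en")), min_docfreq = 5)
party <- factor(as.character(dfmat$Party))

# 1. Fast linear SVM fitted directly on the sparse DFM.
svm_lin <- textmodel_svmlin(dfmat, y = party)
new_dfm <- dfm(tokens("We will cut taxes and create jobs", remove_punct = TRUE))
new_dfm <- dfm_match(new_dfm, features = featnames(dfmat))
pred_new <- predict(svm_lin, newdata = new_dfm)

# 2. Linear-kernel SVM with e1071 on a dense matrix, then word weights.
x <- as.matrix(dfm_weight(dfmat, scheme = "prop"))
svm_e <- svm(x, party, kernel = "linear", scale = FALSE)
w <- drop(t(svm_e$coefs) %*% svm_e$SV)             # one weight per word

top20_pos <- head(sort(w, decreasing = TRUE), 20)  # pull towards one class
top20_neg <- head(sort(w, decreasing = FALSE), 20) # pull towards the other
```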

I've explained these concepts in this video:

Support Vector Machine-2: https://youtu.be/DPlaJR7GL0U

Happy Learning
Neeraj

Neeraj Kaushik

Jun 16, 2026, 7:18:47 PM
to dataanalysistraining
Dear Friends

Before we discuss Partial Least Squares Discriminant Analysis, let's revisit the math and mechanics of factor reduction by Principal Component Analysis (PCA). Using a practical dataset of 21 items and 21 computed factors, we move away from automated software black-boxes to calculate the core metrics completely using MS-Excel.

What we cover in this manual walkthrough:

1. Eigenvalues & Variance Explained: Learn exactly how to compute eigenvalues and determine the precise percentage of variance each principal component captures.

2. Communality Calculations: See step-by-step how to calculate communalities to see how much variance in each original item is accounted for by the retained factors.

3. Factor vs. Construct: We break down the crucial theoretical distinction between a statistical "factor" (a mathematically derived variance cluster) and a theoretical "construct" (the underlying psychological or real-world concept you are trying to measure).
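The same eigenvalue, variance-explained, and communality calculations can be cross-checked in base R. The sketch below uses six numeric columns of the built-in mtcars dataset as illustrative data (not the 21-item dataset from the video) and the eigenvalue > 1 rule as one common retention choice.

```r
# Illustrative data: six numeric columns of the built-in mtcars dataset.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
R <- cor(X)                                    # correlation matrix

eig <- eigen(R)
eigenvalues <- eig$values
var_explained <- eigenvalues / sum(eigenvalues) * 100   # % variance per component

# Loadings: eigenvectors scaled by the square root of their eigenvalue.
loadings <- eig$vectors %*% diag(sqrt(eigenvalues))
rownames(loadings) <- colnames(X)

# Retain components with eigenvalue > 1 (one common rule, assumed here).
k <- sum(eigenvalues > 1)
communalities <- rowSums(loadings[, seq_len(k), drop = FALSE]^2)

print(round(var_explained, 1))
print(round(communalities, 3))
```

The eigenvalues of a correlation matrix sum to the number of items, and if every component is retained each communality equals 1, which is a handy check on a manual spreadsheet.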

I've explained all these in this video:

Principal Component Analysis: https://youtu.be/8iFRnEuqYFc


Happy Learning
Neeraj

Neeraj Kaushik

Jun 17, 2026, 7:19:39 PM
to dataanalysistraining
Dear Friends

Next, we dive into Partial Least Squares Discriminant Analysis (PLS-DA) for text mining, a powerful method when handling high-dimensional, highly collinear data where you have far more features than samples.

We first contrast the workflow architecture of standard algorithms against the structural shift required by PLS-DA.

The Workflow Contrast:

Standard Pipeline (NBA, LR, SVM): Starts with a Document-Feature Matrix (dfm), trims it globally (dfm_trim), splits into train/test sets, aligns vocabularies via dfm_match, applies dfm_tfidf, and feeds the result straight into the model.

PLS-DA Pipeline: Shifts the order significantly. The dfm is split into train/test sets before trimming. The training set is trimmed (dfm_trim), the test set is aligned to the training features (dfm_match), both sets are weighted via dfm_tfidf, and both are then converted with as.matrix, because PLS-DA requires a dense matrix rather than a sparse quanteda object.

We walk through preparing our target labels by converting our X and Y values using the ifelse() function to format our classes. Finally, we build the PLS-DA model on our structured training matrix and deploy it to predict our unseen test data.
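The data preparation for PLS-DA described above might be sketched as follows. The post does not name the package used to fit the PLS-DA model, so the sketch stops at the dense matrices and coded labels; the data (data_corpus_inaugural), seed, thresholds, and the 1/0 coding are assumptions.

```r
library(quanteda)

corp <- corpus_subset(data_corpus_inaugural,
                      Party %in% c("Democratic", "Republican"))
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))

set.seed(7)                                            # split first
train_id <- sample(ndoc(dfmat), size = round(0.8 * ndoc(dfmat)))
dfm_train <- dfm_trim(dfmat[train_id, ], min_docfreq = 3)  # trim training only
dfm_test  <- dfm_match(dfmat[-train_id, ], features = featnames(dfm_train))

# Weight, then densify: as.matrix() turns the sparse dfm into a base matrix.
x_train <- as.matrix(dfm_tfidf(dfm_train))
x_test  <- as.matrix(dfm_tfidf(dfm_test))

# Numeric class coding with ifelse() (1 = Democratic, 0 = Republican; arbitrary).
y_train <- ifelse(dfm_train$Party == "Democratic", 1, 0)
y_test  <- ifelse(dfm_test$Party == "Democratic", 1, 0)
```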

Partial Least Square Discriminant Analysis-1: https://youtu.be/qf9gHAgI8-A

Happy Learning
Neeraj

Neeraj Kaushik

Jun 18, 2026, 7:16:25 PM
to dataanalysistraining
Dear Friends

Next we continue our deep dive into Partial Least Squares Discriminant Analysis (PLS-DA) by turning our focus toward model interpretability. While PLS-DA is excellent at handling high-dimensional text data, its true power lies in its latent variables—allowing us to see exactly which vocabulary words are driving the separation between categories.

We walk through how to extract feature loadings and predict the most impactful terms for two separate PLS-DA dimensions, tackling them one by one.

What we cover in this step-by-step coding session:

Extracting Dimension Loadings: Navigating the internal structure of the trained PLS-DA object in R to pull out the exact feature weights for your target latent variables.

Isolating Top 20 Words (One by One): Writing the filtering and sorting script to extract the top 20 most influential words from individual dimensions, looking at how they align with your underlying target classes.

Interpreting the Latent Space: Understand how these isolated keyword arrays reveal the semantic themes the model uses to draw its ultimate classification boundaries.

Partial Least Square Discriminant Analysis-2: https://youtu.be/9Wjd9_1Z8eo

Happy Learning
Neeraj

Neeraj Kaushik

Jun 21, 2026, 7:46:24 PM
to dataanalysistraining
Dear Friends

Continuing, let's break down two of the most popular tree-based algorithms—Decision Trees and Random Forests—using the classic mtcars dataset in R!

First, we dive into the fundamentals of a Decision Tree. You’ll learn how it splits data based on features (like cylinders or am) to predict a target variable (like wt, disp, etc.), creating an intuitive, flowchart-like structure.

Next, we take it a step further with Random Forests. We explore how this powerful ensemble method combines multiple decision trees to reduce overfitting, handle variance, and deliver much more accurate and robust predictions.

What We Cover:
  • Introduction to the mtcars dataset in R
  • How Decision Trees split data and make predictions
  • Moving from a single tree to a Random Forest (Ensemble Learning)
  • Comparing the performance of both models
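A single regression tree on mtcars can be sketched with the rpart package. The choice of rpart, the target (wt), and the predictors (cyl, am) are assumptions based on the variables mentioned above, not necessarily the exact model from the video.

```r
library(rpart)

# Regression tree: predict weight (wt) from cylinders and transmission type.
tree <- rpart(wt ~ cyl + am, data = mtcars, method = "anova")
print(tree)                        # text view of the splits

# In-sample predictions and root mean squared error.
pred <- predict(tree, newdata = mtcars)
rmse <- sqrt(mean((mtcars$wt - pred)^2))
print(rmse)
```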

I've explained these concepts in this video:

Decision Tree (on mtcars): https://youtu.be/MQWZF0EYdAY

Happy Learning
Neeraj
mtcars.csv
mtcars.sav

Neeraj Kaushik

Jun 22, 2026, 7:19:09 PM
to dataanalysistraining
Dear Friends

Using the mtcars dataset, this video breaks down how to transition from a single decision tree to a powerful forest of diverse, aggregated models to maximize predictive accuracy and prevent overfitting.

What we will learn:
  • Ensemble Modeling with randomForest: Setting up the core package and initializing a robust forest structure.
  • The Power of Bagging (ntree = 500): Understanding how building 500 distinct, randomized decision trees creates a stable, democratic voting system for predictions.
  • Reproducible Workflows: Implementing set.seed to ensure your randomized ensemble yields consistent and verifiable results every time.
  • The SMPC Evaluation: Walking through the complete pipeline—splitting your data, training the forest model, predicting unseen test data, and generating a Confusion Matrix to evaluate overall classification performance.
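The pipeline in the last bullet, which I read as Split, Model, Predict, Confusion matrix, could be sketched like this. The target variable (am), the 70/30 split, and the seed are assumptions for illustration.

```r
library(randomForest)

set.seed(123)                                   # reproducible split and forest
df <- mtcars
df$am <- factor(df$am, labels = c("automatic", "manual"))   # 0/1 -> labels

train_id <- sample(nrow(df), size = round(0.7 * nrow(df)))  # Split
train <- df[train_id, ]
test  <- df[-train_id, ]

rf <- randomForest(am ~ ., data = train, ntree = 500)       # Model
pred <- predict(rf, newdata = test)                         # Predict
conf_mat <- table(predicted = pred, actual = test$am)       # Confusion matrix
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```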
I've explained these concepts in this video:

Random Forest (on mtcars): https://youtu.be/UMUza6chC6Y

Happy Learning
Neeraj
decision tree random forest knn on mtcars.R
mtcars.csv
mtcars.sav

Neeraj Kaushik

Jun 23, 2026, 7:14:12 PM
to dataanalysistraining
Dear Friends

Next, we continue our machine learning journey by diving deep into the K-Nearest Neighbors (KNN) algorithm using R!

Before running the model, we break down the critical concept of data normalization—explaining why scaling your features is essential for distance-based algorithms like KNN to prevent certain variables from dominating the model. From there, we walk through the complete workflow: splitting the dataset into training and testing sets, training the KNN classifier, and evaluating its accuracy on the unseen test data.

To wrap things up, we bring everything together with a comprehensive recap.

What We Cover:
  • The intuition behind K-Nearest Neighbors (KNN)
  • Why and how to normalize data before applying distance metrics
  • Step-by-step train-test data splitting in R
  • Training KNN and evaluating model accuracy on test data
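A compact sketch of normalize → split → KNN → accuracy is given below, using class::knn on mtcars. The chosen features, target (am), k = 3, and seed are assumptions; the min-max ranges are taken from the training rows only, so the test rows do not influence the scaling.

```r
library(class)

features <- c("mpg", "hp", "wt", "disp")
y <- factor(mtcars$am, labels = c("automatic", "manual"))

set.seed(99)
train_id <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train_x <- mtcars[train_id, features]
test_x  <- mtcars[-train_id, features]

# Min-max normalization using the training minimum and range of each feature.
mins   <- sapply(train_x, min)
ranges <- sapply(train_x, max) - mins
scale_mm <- function(d) as.data.frame(scale(d, center = mins, scale = ranges))

pred <- knn(train = scale_mm(train_x), test = scale_mm(test_x),
            cl = y[train_id], k = 3)
accuracy <- mean(pred == y[-train_id])
print(accuracy)
```

Without this scaling, disp and hp (values in the hundreds) would dominate the Euclidean distance over wt (values around 2 to 5).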
I've explained all these in this video:

K-Nearest Neighbors-KNN (Supervised Machine Learning using R-Studio): https://youtu.be/gDu2WiO7xNY

Happy Learning
Neeraj