🔥 Beat the Heat with Cool Advanced Text Analysis🔥: A Supervised & Unsupervised Machine Learning Text Analysis Series!

Neeraj Kaushik

May 31, 2026, 7:22:51 PM
to dataanalysistraining

Dear Friends

I would like to begin by expressing my heartfelt gratitude to Dr. Fakhar Shahzad, The Hong Kong Polytechnic University, Hong Kong (SAR), China, whose motivation and encouragement inspired me to explore and learn these powerful concepts in machine learning and text analysis.

Let's ponder this brilliant quote by John Maxwell:

[Image: quote by John Maxwell]

In today’s fast-changing world, the best investment we can make is in learning something new. Whether you're a student, researcher, or professional, stepping out of your comfort zone and exploring new concepts like machine learning can open doors to endless opportunities. So let’s challenge ourselves to learn, experiment, and grow together!

In this series, we will explore both Supervised and Unsupervised Machine Learning techniques for text analysis, combining conceptual clarity with practical implementation in R.

Supervised machine learning works with labeled data to predict outcomes. We will cover:

  • Naive Bayes (NB / NBA) – a probabilistic classifier based on Bayes’ theorem, highly effective for text classification.
  • Logistic Regression (LR) – widely used for binary and multiclass prediction problems.
  • Support Vector Machines (SVM) – powerful models that find optimal boundaries between classes.
  • PLS-DA (Partial Least Squares Discriminant Analysis) – well suited to high-dimensional datasets where the features are numerous and often correlated.

Unsupervised machine learning, on the other hand, focuses on discovering hidden patterns in unlabeled data. We will explore:

  • Latent Semantic Analysis (LSA) – uncovers relationships between terms and documents using dimensionality reduction.
  • Word Embeddings (GloVe, Word2Vec) – transform words into meaningful vector representations capturing semantic relationships.
  • Principal Component Analysis (PCA) – reduces dimensionality by transforming features into principal components.
  • t-SNE – useful for visualizing high-dimensional text data in a lower-dimensional space.
  • UMAP – a dimension-reduction technique that often preserves more of the global structure than t-SNE while keeping local clusters intact.
  • Latent Dirichlet Allocation (LDA) – a popular method for extracting topics from large text corpora.

We are starting this journey with Naive Bayes Analysis (NBA), building a strong foundation in probability, preprocessing, model building, and evaluation step by step.

In the first video, I revisited the foundational concepts that are essential for understanding supervised machine learning, especially in the context of text analysis. 

We begin with probability, which helps quantify uncertainty and measure how likely an event is to occur. Building on this, we introduce conditional probability, which tells us the likelihood of an event happening given that another event has already occurred—an important idea when dealing with real-world data dependencies.

The video then progresses to Bayes’ Theorem, a formula that connects prior knowledge with new evidence. We explain how the prior probability (what we believe before seeing the data), the likelihood (the probability of observing the data given a particular class), and the evidence (the overall probability of the data across all classes) combine to give the posterior probability: our updated belief after seeing the data. This concept forms the backbone of the Naive Bayes algorithm.
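To make the formula concrete, here is a small worked example in base R. All the numbers and the word "offer" are made up for illustration; they are not taken from the video or the attached material.

```r
# Bayes' theorem: posterior = likelihood * prior / evidence
# All values below are hypothetical, chosen only to show the arithmetic.
prior_spam <- 0.3              # P(spam): share of spam before seeing the text
prior_ham  <- 1 - prior_spam   # P(ham)
lik_spam   <- 0.6              # P("offer" appears | spam)
lik_ham    <- 0.1              # P("offer" appears | ham)

# Evidence: overall probability of seeing "offer", summed over both classes
evidence <- lik_spam * prior_spam + lik_ham * prior_ham

# Posterior: updated belief that a message is spam once "offer" is observed
posterior_spam <- lik_spam * prior_spam / evidence
print(round(posterior_spam, 2))
```

Seeing the word moves the belief in "spam" well above the 0.3 prior, because the word is much more likely under spam than under ham.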

I've explained this in the first video, Naive Bayes Analysis-1: https://youtu.be/kOxveoGyCx4

Happy Learning

Neeraj

NBA.pdf

Neeraj Kaushik

Jun 1, 2026, 7:26:28 PM
to dataanalysistraining
Dear Friends

In this second video, we walk through the step-by-step workflow of performing Naive Bayes analysis for text data using the quanteda framework. 

The process begins with importing raw text data, which is then converted into a quanteda corpus—a structured format that makes text easier to manage and analyze.

Next, we transform the corpus into tokens, where detailed text cleaning takes place, such as removing punctuation, stopwords, and unnecessary characters. These cleaned tokens are then converted into a document-feature matrix (DFM), which represents the frequency of terms across documents. To improve model efficiency, we apply dfm_trim to remove sparse and less informative features.
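The corpus, tokens, DFM and trimming steps described above could be sketched roughly as follows. This is only a minimal illustration: the four sentences are invented, and the cleaning options and the `min_docfreq = 2` threshold are my assumptions, not necessarily the settings used in the video.

```r
library(quanteda)

# Invented example texts (not the data used in the video)
txt <- c(d1 = "The economy is growing fast, and jobs are up!",
         d2 = "Our team won the match; the fans are happy.",
         d3 = "Jobs and wages: the economy needs 3 new reforms.",
         d4 = "The match was close, but the team played well.")

# 1. Raw text -> quanteda corpus
corp <- corpus(txt)

# 2. Corpus -> tokens, with cleaning done at the token stage
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# 3. Tokens -> document-feature matrix (term counts per document)
dfmat <- dfm(toks)

# 4. Drop sparse features: keep terms that occur in at least 2 documents
#    (the threshold is an assumption chosen for this tiny example)
dfmat_trim <- dfm_trim(dfmat, min_docfreq = 2)

print(featnames(dfmat_trim))
```

On a real corpus the trimming threshold would normally be chosen by looking at how many features survive, rather than fixed in advance.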

Finally, the dataset is split into training and testing sets, allowing us to train the Naive Bayes model and evaluate its performance. This workflow provides a clear and practical pipeline for text classification tasks.
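A reproducible train/test split followed by model fitting might look something like the sketch below. The eight labelled sentences, the seed, the 6/2 split and the helper docvar `id_numeric` are all hypothetical choices for illustration; `textmodel_nb()` comes from the quanteda.textmodels package.

```r
library(quanteda)
library(quanteda.textmodels)

# Invented labelled texts (not the data used in the video)
txt <- c("the economy and jobs are growing",
         "parliament passed the new budget",
         "the minister spoke about taxes",
         "voters want lower taxes and more jobs",
         "the team won the final match",
         "fans cheered as the striker scored",
         "the coach praised the team",
         "the match ended with a late goal")
label <- rep(c("politics", "sport"), each = 4)

corp <- corpus(txt, docvars = data.frame(label = label))
corp$id_numeric <- seq_len(ndoc(corp))   # hypothetical helper id for splitting

dfmat <- dfm(tokens(corp, remove_punct = TRUE))

# Reproducible random split: 6 documents for training, 2 for testing
set.seed(300)
id_train <- sample(seq_len(ndoc(corp)), size = 6, replace = FALSE)

dfmat_train <- dfm_subset(dfmat, id_numeric %in% id_train)
dfmat_test  <- dfm_subset(dfmat, !id_numeric %in% id_train)

# Fit the Naive Bayes classifier on the training documents only
nb <- textmodel_nb(dfmat_train, y = dfmat_train$label)
print(nb)
```

Setting the seed before `sample()` is what makes the split repeatable from one run to the next.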

Naive Bayes Analysis-2: https://youtu.be/dC8ahihbw8M

Happy Learning
Neeraj

Neeraj Kaushik

Jun 2, 2026, 8:12:39 PM
to dataanalysistraining
Dear Friends

In this third video, we dive into the practical implementation of Naive Bayes Analysis (NBA) using R.

Starting with reading text data from an Excel file, we extract the input text and corresponding labels for classification.

The preprocessing phase involves cleaning the text by removing numbers, punctuation, and common stopwords to ensure meaningful feature extraction.

The cleaned text is then transformed using the quanteda-based workflow, where we generate a document-feature matrix and apply dfm_trim to retain only the most relevant terms. Next, we split the data into training and testing sets using a reproducible sampling method.

We then build the Naive Bayes model using the training data and align the test features with the training set using dfm_match.
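Why the alignment step matters: Naive Bayes can only score features it saw during training, so the test DFM needs exactly the same columns as the training DFM. A tiny sketch of what `dfm_match` does, using made-up one-line documents:

```r
library(quanteda)

# Two tiny invented DFMs standing in for the training and test sets
dfmat_train <- dfm(tokens(c(a = "economy jobs wages",
                            b = "team match fans")))
dfmat_test  <- dfm(tokens(c(c = "economy match stadium")))

# Give the test DFM exactly the training features:
# - "stadium" (unseen in training) is dropped
# - training terms missing from the test text are added as zero columns
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))

print(featnames(dfmat_matched))
print(as.matrix(dfmat_matched))
```

The matched DFM is what would then be passed as `newdata` when predicting with the fitted model.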

Finally, we generate predictions and evaluate model performance using a confusion matrix. This end-to-end demonstration provides a clear understanding of implementing text classification in R.
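For readers new to confusion matrices, here is a minimal base R illustration. The two vectors are invented; in the actual workflow `predicted` would come from calling `predict()` on the fitted model with the matched test DFM, and `actual` from the test labels.

```r
# Invented labels for six test documents (illustration only)
actual    <- c("politics", "politics", "sport", "sport", "politics", "sport")
predicted <- c("politics", "sport",    "sport", "sport", "politics", "politics")

# Confusion matrix: rows are predicted classes, columns are actual classes
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Accuracy: correctly classified documents (the diagonal) over all documents
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

The off-diagonal cells show which class is being mistaken for which, something a single accuracy figure cannot tell you.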

Naive Bayes Analysis-3: https://youtu.be/hStV7HR7BuY

Happy Learning
Neeraj

NBYT.R
speech.xlsx