Unsupervised Topic Modeling for Text Analysis

Neeraj Kaushik

Jun 30, 2026, 8:31:53 PM

to dataanalysistraining

Dear Friends,

Welcome to the July 2026 learning series!

So far, we have explored Supervised Machine Learning for Text Analysis, where the dependent variable (DV) is known and guides the learning process. Now, it's time to move on to Unsupervised Machine Learning for Text Analysis, where no dependent variable is predefined. Instead, the goal is to uncover hidden patterns, structures, and relationships within textual data.

In this session, we'll begin by exploring the fascinating world of semantic vector spaces and develop an intuition for how machines learn and understand relationships between words.

Before we begin, let's understand two important terms:

A vector is simply a list of numbers that represents an object. For example, the word "King" might be represented as [0.82, -0.15, 0.64, ...] , while "Queen" has its own unique numeric representation.
A vector space is a geometric space where these vectors are placed. Words with similar meanings, such as King–Queen or Doctor–Nurse , tend to appear closer together than unrelated words like King–Banana .

Some of the topics we'll cover include:

The intuition behind semantic vector spaces: Learn how words are converted into vectors so that mathematical operations can capture semantic meaning.
The famous vector algebra example: King − Man + Woman ≈ Queen. We'll see how shifting vectors in a high-dimensional space lets machines recover meaningful relationships and analogies: the arithmetic produces a new vector, and "Queen" is the word whose vector lies closest to it.
Navigating semantic spaces with cosine similarity: Learn how cosine similarity uses the angle between two word vectors to measure how conceptually similar the words are, regardless of the vectors' lengths.
Building vs. importing semantic spaces: Explore the two primary approaches for obtaining semantic vector spaces for your projects:

Training your own semantic space: Generate a domain-specific semantic space by applying Latent Semantic Analysis (LSA) to your own text corpus.
Using pre-trained semantic spaces: Leverage large, publicly available semantic spaces trained on billions of words from trusted repositories, such as the Homepage of Fritz Günther – Semantic Spaces
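
To make the cosine similarity and King − Man + Woman ideas concrete, here is a minimal sketch in base R. The three-dimensional vectors are made-up toy numbers chosen for illustration, not values from any trained semantic space, and the helper name cosine() is my own.

```r
# Toy 3-dimensional word vectors (made-up numbers, for illustration only)
vecs <- rbind(
  king   = c(0.9, 0.8, 0.1),
  queen  = c(0.9, 0.1, 0.8),
  man    = c(0.1, 0.9, 0.1),
  woman  = c(0.1, 0.1, 0.9),
  banana = c(-0.7, 0.2, 0.2)
)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Stretching a vector does not change its cosine with another vector
cos_kq        <- cosine(vecs["king", ], vecs["queen", ])
cos_kq_scaled <- cosine(10 * vecs["king", ], vecs["queen", ])

# Analogy: build king - man + woman, then look for the nearest remaining word
target     <- vecs["king", ] - vecs["man", ] + vecs["woman", ]
candidates <- setdiff(rownames(vecs), c("king", "man", "woman"))
sims       <- sapply(candidates, function(w) cosine(target, vecs[w, ]))
nearest    <- names(which.max(sims))
```

The words used in the equation are excluded from the candidate list, which is the usual convention when answering analogy queries.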

I've explained these concepts in detail in the following video:

Semantic Vector Space: https://youtu.be/fmvnie-5JOw

Happy learning

Neeraj

Neeraj Kaushik

Jul 1, 2026, 7:50:25 PM

to dataanalysistraining

Dear Friends

In the second video, we explore the mathematical foundation of Latent Semantic Analysis (LSA) by understanding how the term-document matrix built from a text corpus is decomposed into three matrices using Singular Value Decomposition (SVD).

We discuss the role of the

Term–Concept Matrix (Tk), which represents the relationship between words and latent concepts;
Singular Value Matrix (Sk), which captures the importance of each latent concept; and
Document–Concept Matrix (Dk), which shows how strongly each document associates with those concepts.

Together, these matrices enable dimensionality reduction, uncover hidden semantic structures, and form the basis for building meaningful semantic vector spaces used in modern text analysis.
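
A minimal sketch of this decomposition, using base R's svd() on a tiny made-up term-document matrix. The counts, the choice k = 2, and the object names are assumptions for illustration only; this is not the code from the video.

```r
# Toy term-document matrix: 5 terms (rows) x 4 documents (columns); counts are made up
X <- matrix(c(2, 0, 1, 0,
              1, 0, 2, 0,
              0, 3, 0, 1,
              0, 1, 0, 2,
              1, 1, 1, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("term", 1:5), paste0("doc", 1:4)))

s <- svd(X)          # X = U diag(d) V'
k <- 2               # number of latent concepts to keep

Tk <- s$u[, 1:k]     # term-concept matrix      (5 x k)
Sk <- s$d[1:k]       # singular values          (length k)
Dk <- s$v[, 1:k]     # document-concept matrix  (4 x k)

# Rank-k approximation of the original matrix (the dimensionality reduction step)
Xk <- Tk %*% diag(Sk) %*% t(Dk)

# With all dimensions kept, the product reproduces X up to rounding error
X_full <- s$u %*% diag(s$d) %*% t(s$v)
```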

Let's walk through the end-to-end workflow of Latent Semantic Analysis (LSA) using a practical example of 20 speeches and 800 words: how raw text moves from a trimmed Document-Feature Matrix (dfm_trim) to a weighted TF-IDF matrix (dfm_tfidf) before LSA is applied with 10 dimensions (nd=10).

Key Concepts Covered:

Feature/Token Vector Space: How the 800x10 matrix maps features into a dense vector space to find semantic "Cosine Neighbors."

Singular Values (Sk): How to compute eigenvalues (the squared singular values) and determine the total Variance Explained.

Document Vector Space (Dk): How the 20x10 matrix maps documents to evaluate similarity and calculate distances for downstream Cluster Analysis.
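
The step from singular values to variance explained can be sketched as follows. The ten singular values are hypothetical numbers, not output from the speeches model, and the shares are relative to the retained dimensions only.

```r
# Hypothetical singular values from a 10-dimension LSA fit (made-up numbers)
sk <- c(9.1, 6.4, 5.2, 4.0, 3.3, 2.9, 2.1, 1.8, 1.2, 0.9)

eigenvalues   <- sk^2                             # eigenvalue = squared singular value
prop_variance <- eigenvalues / sum(eigenvalues)   # share of the variance in the retained dimensions
cum_variance  <- cumsum(prop_variance)            # running total across dimensions

variance_table <- round(
  data.frame(dimension = seq_along(sk), eigenvalues, prop_variance, cum_variance), 3)
```

To express the shares relative to the whole corpus rather than the ten retained dimensions, the denominator would have to be the sum of all squared singular values of the full matrix.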

I've explained these concepts in this video:

Latent Semantic Analysis-1: https://youtu.be/FF-uDKgPyAo

Happy Learning

Neeraj

Unsupervised ML Text Analysis.xopp

Neeraj Kaushik

Jul 2, 2026, 7:48:20 PM

to dataanalysistraining

Dear Friends

Next we build a complete Latent Semantic Analysis (LSA) pipeline from scratch using R. We walk through importing raw text files and executing foundational text cleaning using the tm package—including removeNumbers, removePunctuation, dropping stopwords via removeWords, and lemmatization.

Next, I pull back the curtain on my custom R package built on quanteda. We pass our cleaned tokens into this package to seamlessly generate a Document-Feature Matrix (DFM).

We cover the following:

Matrix Refinement: How to trim the sparse data (dfm_trim) and apply TF-IDF weighting (dfm_tfidf) to isolate highly meaningful features.

Dimensionality Reduction: Fitting the LSA model using textmodel_lsa.

The Math of Sk: Understanding singular values (Sk) and learning how to calculate eigenvalues to determine the exact total Variance Explained by your semantic dimensions.
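
The matrix refinement step can be sketched in base R on a toy matrix. The counts, the "at least 2 documents" threshold, and the weighting (raw count times log10 of N over document frequency) are assumptions for illustration; check the quanteda documentation for the exact defaults of dfm_trim and dfm_tfidf.

```r
# Toy document-feature matrix: 4 documents x 5 features (counts are made up)
dfm_toy <- matrix(c(2, 1, 0, 0, 1,
                    0, 1, 3, 0, 1,
                    1, 0, 0, 0, 1,
                    0, 2, 1, 1, 1),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:4),
                                  c("freedom", "people", "economy", "rare", "the")))

doc_freq <- colSums(dfm_toy > 0)   # number of documents containing each feature

# Trimming: keep features that appear in at least 2 documents (hypothetical threshold)
trimmed <- dfm_toy[, doc_freq >= 2, drop = FALSE]

# TF-IDF: raw count multiplied by log10(N / document frequency)
n_docs <- nrow(trimmed)
idf    <- log10(n_docs / colSums(trimmed > 0))
tfidf  <- sweep(trimmed, 2, idf, `*`)
```

A feature that occurs in every document, such as "the" here, gets an inverse document frequency of zero and therefore drops out of the weighted matrix.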

I've explained all of it in this video:

Latent Semantic Analysis-2: https://youtu.be/H1SrWJUrakc

Happy Learning

Neeraj

Neeraj Kaushik

Jul 3, 2026, 8:47:35 PM

to dataanalysistraining

Dear Friends

Next we dive deep into the document portion (dk) of the Latent Semantic Analysis (LSA) output. Moving from text vectors to visual insights, we work with a 47 texts × 10 dimensions table to map out and analyze document similarities.

What we cover in this step-by-step coding tutorial:

1. Distance Matrices: Transforming high-dimensional LSA document coordinates into a geometric distance matrix.
2. Hierarchical Clustering (hclust): Applying agglomerative clustering to group similar speeches and texts together based on their underlying semantic profiles.
3. Advanced Cluster Visualization: Plotting the final dendrograms side-by-side using R's native base graphics as well as the polished, highly customizable ggdendro package.
4. String Manipulation (str_sub): Using the stringr package to clean up, truncate, and beautifully label the leaves of our dendrogram for publication-ready presentation.
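
The distance, clustering, and label-trimming steps can be sketched with base R alone. The document coordinates and labels below are hypothetical, the linkage method is simply hclust's default, and substr() stands in for stringr's str_sub().

```r
set.seed(1)
# Hypothetical document coordinates: 6 documents x 3 LSA dimensions, in two loose groups
dk <- rbind(matrix(rnorm(9, mean = 0, sd = 0.1), nrow = 3),
            matrix(rnorm(9, mean = 5, sd = 0.1), nrow = 3))
rownames(dk) <- paste0("speech_", 1:6, "_full_title_goes_here")

d  <- dist(dk, method = "euclidean")   # pairwise distance matrix between documents
hc <- hclust(d)                        # agglomerative clustering (default complete linkage)

hc$labels <- substr(hc$labels, 1, 8)   # shorten the leaf labels for plotting
groups    <- cutree(hc, k = 2)         # cut the tree into two clusters

# plot(hc) would draw the dendrogram with base graphics
```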

I've explained these concepts in this video:

Latent Semantic Analysis-3: https://youtu.be/wLRtBu1uREg

Happy Learning

Neeraj

Neeraj Kaushik

Jul 5, 2026, 8:00:12 PM

to dataanalysistraining

Dear Friends

Next, let's tackle a crucial technique for improving the interpretability of your Latent Semantic Analysis (LSA) models: Varimax rotation.

While raw LSA dimensions provide orthogonal vectors, they can often be mathematically abstract and difficult to interpret conceptually. By applying a Varimax rotation to our document matrix (dk), we maximize the variance of the squared loadings, effectively forcing the documents to align more distinctly with specific semantic themes.

What we cover in this hands-on session:

1. Applying Varimax Rotation: Step-by-step implementation in R to shift our coordinate system for clearer factor separation.

2. Rotated 2D Scatter Plots: Mapping the 2 newly rotated dimensions using R’s base graphics package to visualize clean, distinct document clusters.

3. The Ultimate Comparison: Side-by-side evaluation of the unrotated vs. rotated dimensions. We plot both coordinate spaces to visually demonstrate exactly how rotation "sharpens" the boundaries and simplifies data interpretation.
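
A minimal sketch of the rotation step using varimax() from R's stats package. The document coordinates are random placeholders rather than real LSA output, and normalize = FALSE is my own choice for the illustration.

```r
set.seed(42)
# Hypothetical unrotated document coordinates: 20 documents x 2 dimensions
dk <- matrix(rnorm(40), nrow = 20, ncol = 2,
             dimnames = list(paste0("doc", 1:20), c("dim1", "dim2")))

rot        <- varimax(dk, normalize = FALSE)   # returns rotated loadings and the rotation matrix
dk_rotated <- dk %*% rot$rotmat                # document coordinates in the rotated system

# The rotation is orthogonal, so distances between documents do not change
same_distances <- isTRUE(all.equal(c(dist(dk)), c(dist(dk_rotated))))

# Side-by-side comparison with base graphics (uncomment to draw):
# op <- par(mfrow = c(1, 2))
# plot(dk, main = "Unrotated"); plot(dk_rotated, main = "Varimax-rotated")
# par(op)
```

Because only the axes move, any clustering based on distances gives the same result before and after rotation; what changes is how cleanly each axis can be read.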

I've explained these concepts in this video:

Latent Semantic Analysis-4: https://youtu.be/FmvaUyYqcJ4

Happy Learning

Neeraj

Neeraj Kaushik

Jul 7, 2026, 3:19:17 AM

to dataanalysistraining

Dear Friends

Next, we shift our focus to the features of the Latent Semantic Analysis (LSA) output to understand exactly which words drive our semantic dimensions.

Instead of treating the model like a black box, we extract the Features x 10 Dimensions matrix to uncover the mathematical weights assigned to each token in our vocabulary.

What we cover in this practical R walkthrough:

Extracting Top Semantic Words: Step-by-step code to isolate the top 20 most impactful words from both the positive and negative sides of each dimension, revealing the underlying conceptual contrasts.

Data Wrangling with dplyr: Master the core data manipulation verbs needed to slice and dice your text metrics.

I break down how to use:
rownames() to preserve token identities.
select() to isolate target dimensions.
arrange() to sort weights from highest to lowest (and vice versa).
head() to capture the top 20 extreme values.
pull() to extract clean vector inputs for downstream analysis.
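
The same idea can be sketched in base R, where sort() and head() play the roles of arrange() and head(), and names() plays the role of pull(). The feature matrix is random placeholder data and top_words() is a hypothetical helper, not a function from the video.

```r
set.seed(7)
# Hypothetical feature matrix: 50 tokens x 10 LSA dimensions
features <- matrix(rnorm(500), nrow = 50, ncol = 10,
                   dimnames = list(paste0("word", 1:50), paste0("dim", 1:10)))

# Return the n most positive and n most negative tokens on one dimension
top_words <- function(mat, dim, n = 20) {
  w <- mat[, dim]                                    # named vector: token -> weight
  list(positive = names(head(sort(w, decreasing = TRUE),  n)),
       negative = names(head(sort(w, decreasing = FALSE), n)))
}

tw <- top_words(features, "dim1", n = 5)
```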

Latent Semantic Analysis-5: https://youtu.be/GXC92WF4qPc

Happy Learning

Neeraj

Neeraj Kaushik

Jul 7, 2026, 8:05:07 PM

to dataanalysistraining

Dear Friends

Next, we tackle the ultimate goal of Latent Semantic Analysis: projecting a brand-new, unseen text string directly into our trained LSA semantic space to make predictions.

We walk through building a production-ready text prediction pipeline step-by-step using a sample search query: "poverty reduction and economic opportunity for family".

What we cover in this code-along tutorial:

Tokenization & Stopword Stripping: Converting the raw user_query string into a tokens object (t1) and cleaning it with tokens_remove() to drop standard stopwords and custom domain-specific words like "american", "nation", and "world".

Vocabulary Alignment via dfm_match: A crucial step! We create a Document-Feature Matrix (dq = dfm(t1)) and force it to match our original training vocabulary structure using dfm_match().

LSA Projection & Coordinate Extraction: Using the predict() function on our LSA model (l_new) with our aligned query_dfm. We then extract the exact multi-dimensional semantic coordinates of the new query from pred$docs_newspace.
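
To show what vocabulary alignment and projection do numerically, here is a base R sketch of the standard LSA "fold-in" formula, q_hat = q V_k S_k^(-1). The training counts are made up, and I am assuming rather than asserting that this matches what predict() computes internally.

```r
# Toy training document-feature matrix: 4 documents x 5 features (made-up counts)
train <- matrix(c(2, 1, 0, 0, 1,
                  0, 1, 3, 0, 0,
                  1, 0, 0, 2, 1,
                  0, 2, 1, 1, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("doc", 1:4),
                                c("poverty", "economic", "family", "freedom", "opportunity")))

k  <- 2
s  <- svd(train)
Vk <- s$v[, 1:k]     # feature-concept matrix (features x k)
Sk <- s$d[1:k]       # top k singular values

# New query, already tokenized and stripped of stopwords
query_tokens <- c("poverty", "reduction", "economic", "opportunity", "family")

# Vocabulary alignment: count only training features, in training column order;
# tokens never seen in training (here "reduction") are dropped
q <- table(factor(query_tokens, levels = colnames(train)))
q <- matrix(as.numeric(q), nrow = 1, dimnames = list("query", colnames(train)))

# Fold-in projection of the query into the k-dimensional document space
q_hat <- q %*% Vk %*% diag(1 / Sk)
```

Applying the same formula to a training document returns that document's own row of the document matrix, which is a handy sanity check.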

I've explained these concepts in this video:

Latent Semantic Analysis-6: https://youtu.be/VYTnSmV4-Ak

Happy Learning

Neeraj

Neeraj Kaushik

Jul 8, 2026, 7:18:58 PM

to dataanalysistraining

Dear Friends

Finally, we explore the geometric side of Latent Semantic Analysis (LSA) by breaking down eight essential semantic-similarity functions from the LSAfun package and demonstrating how to align different R text mining frameworks.

We walk through the core calculations required to measure semantic distance and evaluation metrics, including:

Cosine() & pairwise(): Measuring similarity from a single word to another versus checking words in pre-defined pairs.

multicos() & neighbors(): Generating a complete symmetric cross-similarity grid matrix and hunting down the top 10 nearest conceptual vocabulary neighbors.

costring() & multicostring(): Stepping up to document levels by comparing phrase vs. phrase, and matching a phrase vector against an array of separate words.

coherence() & Matrix Math: Running text flow structural evaluations where local coherence measures the step-by-step thematic transitions between adjacent sentences, alongside computing cosine values between sets of sentences or documents.

The Ultimate Framework Fix:
To wrap up the session, we run the classic lsa R package. A major pitfall for many programmers is that the lsa package natively expects a Term-Document Matrix (TDM). I demonstrate exactly how to successfully pass your quanteda pipelines into it by first transposing your Document-Feature Matrix (DFM) to ensure identical, mathematically consistent results.
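
Why transposing is enough can be checked with plain svd(): the singular values are identical in both orientations, and the document vectors simply move from the u side to the v side. This sketch uses random counts and base R only; it does not call the lsa or quanteda packages.

```r
set.seed(3)
# Hypothetical document-feature matrix: 6 documents x 8 features
dfm_mat <- matrix(as.numeric(rpois(48, lambda = 2)), nrow = 6, ncol = 8)
tdm_mat <- t(dfm_mat)              # term-document orientation (8 terms x 6 documents)

s_dfm <- svd(dfm_mat)              # documents live in u, features in v
s_tdm <- svd(tdm_mat)              # terms live in u, documents in v

# The singular values do not depend on the orientation
same_sk <- isTRUE(all.equal(s_dfm$d, s_tdm$d))

# Document-by-document similarity rebuilt from each decomposition
sim_from_dfm <- s_dfm$u %*% diag(s_dfm$d^2) %*% t(s_dfm$u)
sim_from_tdm <- s_tdm$v %*% diag(s_tdm$d^2) %*% t(s_tdm$v)
```

Individual columns of u and v may differ in sign between the two runs, which is why the comparison is made on the similarity matrices rather than on the raw vectors.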

LSAfun - An R package for computations based on Latent Semantic Analysis | Behavior Research Methods

I've explained these concepts in this video:

Latent Semantic Analysis-7: https://youtu.be/HceIKJLZqxw

Happy Learning

Neeraj

Neeraj Kaushik

Jul 12, 2026, 8:25:06 PM

to dataanalysistraining

Dear Friends

Let's continue demystifying modern Natural Language Processing (NLP) with this comprehensive guide to Word2Vec, a foundational technique for learning word embeddings with a shallow neural network.

This concept explains how machines move beyond simple counting to truly understand the contextual meaning and relationships between words.

Topics discussed:

Vectorization vs. Embeddings: Understanding the critical shift from sparse, high-dimensional matrices (like TF-IDF) to dense, continuous word embeddings.

CBOW vs. Skip-gram: Comparing the two core Word2Vec architectures—predicting a target word from its context versus using a single word to predict its surrounding context.

The Complete Workflow: Walking through text cleaning, structuring input data (including practical hard drive export/re-import pipelines), and vector space optimization.

Vector Space Arithmetic: Demonstrating the magic of semantic queries and vector math, e.g., King − Man + Woman ≈ Queen.
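
The difference between the two architectures is easiest to see in the training examples each one builds from the same text. A minimal sketch in base R, assuming a window of 2 and using the opening words of the novel as a toy sentence; no model is trained here.

```r
tokens <- c("it", "is", "a", "truth", "universally", "acknowledged")
window <- 2

# Positions of the context words around position i (the target itself is excluded)
context_idx <- function(i) {
  setdiff(max(1, i - window):min(length(tokens), i + window), i)
}

# Skip-gram: one (target, context) pair per context word; the target predicts each context word
pairs <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  data.frame(target = tokens[i], context = tokens[context_idx(i)])
}))

# CBOW: the same windows, but all context words together predict the target
cbow <- lapply(seq_along(tokens), function(i) {
  list(context = tokens[context_idx(i)], target = tokens[i])
})
```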

I've explained these concepts in this video:

Word2Vec-1: https://youtu.be/2zXLrgp2l_Y

Happy Learning

Neeraj

On Fri, Jul 10, 2026 at 4:47 AM Neeraj Kaushik <kaushi...@gmail.com> wrote:

Dear Friends

Let's set the stage for our upcoming Natural Language Processing (NLP) projects. I introduce the core techniques we will use—GloVe and text2vec—and reveal the text dataset we’ll analyze: Jane Austen’s classic novel, Pride and Prejudice.

What we cover next:

An Inspiration: Introducing Julia Silge, a professional data scientist at Posit whose incredible work combining data science with Jane Austen's literature inspired this approach.

Her FREE eBooks:
Welcome to Text Mining with R
Supervised Machine Learning for Text Analysis in R
Tidy Modeling with R

The Dataset: A short, quick summary of the plot of Pride and Prejudice to establish the context for our future text mining and word embedding models.

Let's start with a story:

Pride and Prejudice by Jane Austen: https://youtu.be/1UgkTdUvsCk

Happy Learning
Neeraj

word2vec.R

Unsupervised ML Text Analysis.xopp

Neeraj Kaushik

Jul 13, 2026, 8:25:12 PM

to dataanalysistraining

Dear Friends

Next let's learn the entire lifecycle of text modeling using Word2Vec—from extreme corpus cleaning to advanced vector space arithmetic.

What we discuss:

Advanced Text Preprocessing: Cleaning textual data using tm and textstem by stripping punctuation, numbers, custom stop words, and short noise words.

Model Training & Architecture: Writing structured text to disk and training a Word2Vec model with customized hyper-parameters like 50-dimensional vectors, a context window of 5, and negative sampling.

Semantic Cross-Comparison Matrices: Querying directional cosine similarity profiles between main characters (e.g., Darcy vs. Elizabeth, Wickham, and Bingley) and building comparative cross-grids for male and female leads.

Vector Mathematics & Concept Spaces: Executing classic word equations (such as modifying character traits using wealth/poverty attributes) and creating a unified "Marriage & Romance" theme vector using vector averaging.

Vector Rejection: Use the reject() function to strip the romance theme out of specific character vectors to reveal their raw, underlying thematic associations.
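
Theme averaging and vector rejection reduce to a few lines of arithmetic. In this base R sketch the word vectors are made-up numbers and reject_vec() is a hypothetical helper written for the illustration; it is not the reject() function from the package used in the video.

```r
# Toy word vectors (made-up numbers, 4 dimensions)
vecs <- rbind(
  marriage = c(0.8, 0.1, 0.3, 0.0),
  love     = c(0.7, 0.2, 0.1, 0.1),
  wedding  = c(0.9, 0.0, 0.2, 0.1),
  darcy    = c(0.5, 0.7, 0.1, 0.4)
)

# Theme vector: the average of the vectors of the theme words
theme <- colMeans(vecs[c("marriage", "love", "wedding"), ])

# Vector rejection: subtract from v its projection onto u
reject_vec <- function(v, u) v - (sum(v * u) / sum(u * u)) * u

darcy_no_romance <- reject_vec(vecs["darcy", ], theme)
```

What remains after the rejection is orthogonal to the theme vector, so its cosine similarity with the theme is zero.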

I've explained these concepts in this video:

Word2Vec-2: https://youtu.be/I6mleNbbnYc

Happy Learning

Neeraj

Unsupervised ML Text Analysis.pdf

Neeraj Kaushik

Jul 14, 2026, 8:13:22 PM

to dataanalysistraining

Dear Friends

Next, let's break down the theoretical shift from local context windows to global statistics, giving you a comprehensive look at how Global Vectors for Word Representation (GloVe) capture the overarching meaning of an entire text corpus.

Contents:

Word2Vec vs. GloVe Architecture: Understand the key difference between Word2Vec's predictive, local window-based approach (CBOW/Skip-gram) and GloVe's count-based, global matrix factorization approach.

Text Cleaning & Standardization: The critical preprocessing steps required to prepare raw textual corpora for global co-occurrence tracking.

Structuring Input Data: Build the massive Word-Context Co-occurrence Matrix that forms the backbone of the GloVe algorithm.

Vector Space Optimization: How GloVe uses a weighted least-squares objective function to fit dense word vectors whose dot products approximate the logarithm of the word co-occurrence counts.

Vector Space Queries & Statistics: Run semantic queries, measure cosine similarities, and extract meaningful statistical relationships from the optimized embedding space.
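
The two building blocks, the co-occurrence matrix and the weighted squared-error term, can be sketched in base R. The toy sentence, the window of 2, and the 1/distance weighting are assumptions for illustration; x_max = 100 and alpha = 0.75 are the values suggested in the original GloVe paper, and no optimization is run here.

```r
tokens <- c("darcy", "loves", "elizabeth", "and", "elizabeth", "loves", "darcy")
window <- 2
vocab  <- sort(unique(tokens))

# Symmetric co-occurrence counts, each weighted by 1 / distance between the two words
tcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (i in seq_along(tokens)) {
  for (j in seq_len(window)) {
    if (i + j <= length(tokens)) {
      a <- tokens[i]
      b <- tokens[i + j]
      tcm[a, b] <- tcm[a, b] + 1 / j
      tcm[b, a] <- tcm[b, a] + 1 / j
    }
  }
}

# GloVe weighting function: down-weights rare pairs and caps the weight at 1
glove_weight <- function(x, x_max = 100, alpha = 0.75) pmin((x / x_max)^alpha, 1)

# Weighted squared error for one word pair, given hypothetical vectors and biases
pair_loss <- function(wi, wj, bi, bj, xij) {
  glove_weight(xij) * (sum(wi * wj) + bi + bj - log(xij))^2
}
```

Training then amounts to summing pair_loss over all non-zero cells of the matrix and adjusting the vectors and biases to make that sum small.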

I've explained these concepts in this video:

Global Vector for Word Representation (GloVe)-1: https://youtu.be/OPn7CYixWNg

Happy Learning

Neeraj

Neeraj Kaushik

Jul 15, 2026, 8:04:31 PM

to dataanalysistraining

Dear Friends

Next let's discuss the text2vec pipeline to build, optimize, and interrogate a global co-occurrence space to uncover deep semantic interactions and societal themes.

Contents covered:

Tokenization & TCM Construction: Cleaning a literary corpus and building a Word-Context Term-Co-occurrence Matrix (TCM) with a customized 10-word skip-gram window.

Training Global Vectors: Fitting a GloVe model (rank = 30, n_iter = 50) and mathematically combining main and context vector components for high-fidelity representations.

Direct Similarity & Mapping: Calculating 1-to-1 cosine similarities (e.g., Darcy vs. Wickham) and mapping social distance circles or thematic shifts across main characters.

Unsupervised Word Clustering: Utilizing K-Means clustering to group global vectors and automatically isolate thematic buckets, such as words closely tied to "marriage."

Vector Algebra & Contrasts: Performing vector subtraction to capture exclusive character traits (e.g., separating the personalities of Elizabeth and Jane) and executing semantic analogy tasks such as Darcy − Money + Poor.

Quantifying Text Bias (WEAT): Building a Word Embedding Association Test matrix to measure structural biases comparing Male vs. Female leads against Wealth and Family attributes.
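
A WEAT-style score can be sketched in a few lines of base R. The two-dimensional embeddings below are invented so that the pattern is visible by construction, which means the numbers say nothing about the novel itself; the word lists and the helper names are also my own assumptions.

```r
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy embeddings (made-up numbers): dimension 1 leans "wealth", dimension 2 leans "family"
emb <- rbind(
  darcy     = c(0.9, 0.2), bingley = c(0.8, 0.3),   # target set X (male leads)
  elizabeth = c(0.3, 0.8), jane    = c(0.2, 0.9),   # target set Y (female leads)
  fortune   = c(1.0, 0.1), estate  = c(0.9, 0.0),   # attribute set A (wealth)
  sister    = c(0.1, 1.0), mother  = c(0.0, 0.9)    # attribute set B (family)
)
X <- c("darcy", "bingley")
Y <- c("elizabeth", "jane")
A <- c("fortune", "estate")
B <- c("sister", "mother")

# Association of one word with attribute set A relative to attribute set B
assoc <- function(w) {
  mean(sapply(A, function(a) cosine(emb[w, ], emb[a, ]))) -
    mean(sapply(B, function(b) cosine(emb[w, ], emb[b, ])))
}

s_all       <- sapply(c(X, Y), assoc)
weat_stat   <- sum(s_all[X]) - sum(s_all[Y])                      # test statistic
effect_size <- (mean(s_all[X]) - mean(s_all[Y])) / sd(s_all)      # standardized difference
```

A positive statistic means the X targets sit closer to the A attributes than the Y targets do; with real embeddings a permutation test is normally added to judge whether the difference is larger than chance.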

I've explained these concepts in this video:

Global Vector for Word Representation (GloVe)-2: https://youtu.be/a8ENXL29McI

Happy Learning

Neeraj

Neeraj Kaushik

Jul 16, 2026, 8:09:07 PM

to dataanalysistraining

Dear Friends

Next, this tutorial focuses entirely on the downstream applications of word embeddings, demonstrating how to extract profound semantic, thematic, and sociological insights from an optimized vector space.

Contents covered:

Unsupervised Word Clustering: Implementing K-Means clustering (kmeans) on continuous word vectors to automatically discover hidden thematic buckets, such as extracting vocabulary tied specifically to "marriage."

Intersecting Semantic Spaces: Computing overlapping cosine similarities (sim2) to uncover shared semantic worlds, such as identifying the structural vocabulary bounding the relationship between characters like Elizabeth and Jane.

Vector Subtraction & Contrast Geometry: Performing vector arithmetic to strip away shared traits and isolate distinguishing character profiles, mapping words that are uniquely "More Elizabeth" versus "More Jane."

Word Algebra & Analogy Tasks: Executing classic geometric equations (Darcy − Money + Poor = ?) and applying logical filtering matrices to clean raw query outputs.

Quantifying Bias: Building a manual Word Embedding Association Test matrix to measure, isolate, and print directional biases comparing target groups (Male vs. Female leads) against specific attribute domains (Wealth vs. Family).
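
The clustering step can be sketched with kmeans() from base R. The word vectors here are random placeholders drawn around two well-separated centres, so the clean split is built in; real GloVe vectors will be far less tidy, and the word lists are my own.

```r
set.seed(123)
# Hypothetical word vectors: two well-separated groups in 5 dimensions
marriage_words <- matrix(rnorm(20, mean =  2, sd = 0.2), nrow = 4,
                         dimnames = list(c("marriage", "wedding", "husband", "wife"), NULL))
estate_words   <- matrix(rnorm(20, mean = -2, sd = 0.2), nrow = 4,
                         dimnames = list(c("estate", "fortune", "income", "pounds"), NULL))
wv <- rbind(marriage_words, estate_words)

# K-Means with several random starts to avoid a poor local solution
km <- kmeans(wv, centers = 2, nstart = 10)

# The thematic bucket: every word that shares a cluster with "marriage"
bucket <- names(km$cluster)[km$cluster == km$cluster["marriage"]]
```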

I've explained these concepts in this video:

Global Vector for Word Representation (GloVe)-3: https://youtu.be/xIrugOgV4a8

R-Script for GloVe:
https://drive.google.com/file/d/1T6Xek7nAvJSDHJ-xYuD_3czcs6EYpo2y/view?usp=sharing

PDF file:
https://drive.google.com/file/d/11tT41fM1wa6Pscg5iEnwYgksy12bIIWs/view?usp=sharing

Tutorial on GloVe:
https://nlp.stanford.edu/projects/glove/

Happy Learning

Neeraj

Neeraj Kaushik

Jul 17, 2026, 5:49:17 AM

to dataanalysistraining

Dear Friends

Next, we move from abstract mathematics to visual maps, demonstrating how to project complex word relationships from Pride and Prejudice into highly interpretable 2D and interactive 3D spaces.

Contents covered:

Dimensionality Reduction with PCA: Subsetting a trained semantic space and executing standard PCA (prcomp) to capture maximum vocabulary variance.

Varimax Axis Rotation: Applying the Varimax rotation method (varimax) to raw loadings to clean up structural overlap and maximize the alignment of text concepts along clear visual axes.

Geometric Grouping & Annotation: Using ggplot2 to map characters and themes, adding red reference intercepts, and overlaying semi-transparent bounding boxes (geom_rect) to isolate distinct "Semantic Zones."

Interactive 3D Semantic Mapping: Expanding the vocabulary corpus to construct a multi-dimensional dataframe across three principal components.

Dynamic Visualization with Plotly: Building an interactive 3D scatter plot (plot_ly) where you can rotate, pan, and explore character relationships across custom thematic axes like Social Status and Emotional Tone.
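
The numerical core of this workflow, PCA followed by a Varimax rotation, needs only base R. The embedding matrix below is random placeholder data, and the choice of three components is an assumption made to mirror the 3D plot; the ggplot2 and plotly layers are left out.

```r
set.seed(10)
# Hypothetical embedding subset: 12 words x 30 dimensions
wv <- matrix(rnorm(12 * 30), nrow = 12,
             dimnames = list(paste0("word", 1:12), NULL))

pca       <- prcomp(wv, center = TRUE, scale. = FALSE)
coords_2d <- pca$x[, 1:2]                      # word coordinates on PC1 and PC2
var_share <- pca$sdev^2 / sum(pca$sdev^2)      # proportion of variance per component

# Varimax on the loadings of the first three components,
# then the same rotation matrix is applied to the word scores
rot        <- varimax(pca$rotation[, 1:3])
coords_rot <- pca$x[, 1:3] %*% rot$rotmat
```

The columns of coords_2d or coords_rot are what would be handed to a 2D scatter plot or to a 3D plot as the x, y, and z coordinates.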

I've explained these concepts in this video:

Visualization using PCA (after GloVe): https://youtu.be/xh4ZvZimsJU

Happy Learning

Neeraj

Neeraj Kaushik

Jul 19, 2026, 8:03:53 PM

to dataanalysistraining

Dear Friends

Next let's move beyond linear projections like standard PCA to map complex, non-linear text semantics from Pride and Prejudice into highly interpretable low-dimensional narrative clusters.

Contents covered:

Non-Linear Manifold Learning (t-SNE & UMAP): Implementing Rtsne and umap to capture local structures and complex word distributions, mapping them into precise 2D layouts.

The ReLU Trick for NMF: Transforming dense GloVe embeddings into entirely non-negative feature matrices by isolating positive and negative spatial blocks.

NMF Factorization (nmf): Decomposing the rectified GloVe matrix using the Brunet method down to fundamental semantic dimensions to track pure character-to-theme association signals.

Polished Data Visualization: Building multi-layered layouts in ggplot2 featuring optimized axis limits, customized thematic borders, and overlap-free label scaling via ggrepel.

Narrative Communities in 3D: Uncovering core plot groups by applying K-Means clustering (centers = 4) to your vector structures and rendering the final interactive output in plotly.
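
One common way to read the "ReLU trick" is to split each embedding dimension into a positive block and a negative block, so that the result has no negative entries and can be passed to NMF. The sketch below shows that reading in base R on random placeholder data; the exact construction used in the video may differ, and the NMF step itself is not run.

```r
set.seed(5)
# Hypothetical GloVe-style embeddings: 6 words x 4 dimensions, with negative values
glove <- matrix(rnorm(24), nrow = 6,
                dimnames = list(paste0("word", 1:6), paste0("d", 1:4)))

pos_part <- pmax(glove, 0)     # ReLU(x): keeps the positive entries, zero elsewhere
neg_part <- pmax(-glove, 0)    # ReLU(-x): magnitudes of the negative entries, zero elsewhere

# Place the two blocks side by side: 6 words x 8 non-negative features
nonneg <- cbind(pos_part, neg_part)
colnames(nonneg) <- c(paste0(colnames(glove), "_pos"), paste0(colnames(glove), "_neg"))

# Nothing is lost: the original matrix is the difference of the two blocks
recovered <- pos_part - neg_part
```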

I've explained these in this video:

Visualization using tSNE UMAP NMF: https://youtu.be/msM2VB-NLSM

Happy Learning

Neeraj

Neeraj Kaushik

Jul 20, 2026, 8:42:02 PM

to dataanalysistraining

Dear Friends,

In our continued exploration of unsupervised machine learning, let's next discuss FastText. While previous models like GloVe treat words as atomic units, FastText, developed at Facebook's AI Research (FAIR) lab by a team that included Tomáš Mikolov, breaks words down into smaller character n-grams.

This approach addresses a critical limitation in traditional word embeddings known as the out-of-vocabulary (OOV) problem. In standard models, if a word was not present in the training data, the model could not generate a vector for it, effectively rendering the word "unknown."

FastText solves this by representing words as a sum of their sub-word n-grams. Consequently, even if a specific word is missing from the training set, the model can still construct a meaningful vector by aggregating the representations of its constituent character sequences.

This sub-word information allows the model to capture the internal structure of words, making it exceptionally effective at handling rare words and out-of-vocabulary terms. In the video, we discuss how this approach improves morphological understanding and enhances performance across various NLP tasks.
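
The sub-word idea can be sketched in base R. The boundary markers "<" and ">" and the 3 to 6 character range follow the fastText paper; the lookup table of n-gram vectors is random placeholder data, and char_ngrams() is a hypothetical helper that skips the hashing step used by the real implementation.

```r
# Split a word into character n-grams, with "<" and ">" marking the word boundaries
char_ngrams <- function(word, n_min = 3, n_max = 6) {
  w     <- paste0("<", word, ">")
  chars <- strsplit(w, "")[[1]]
  out   <- character(0)
  for (n in n_min:n_max) {
    if (length(chars) >= n) {
      starts <- seq_len(length(chars) - n + 1)
      out <- c(out, sapply(starts, function(i) paste(chars[i:(i + n - 1)], collapse = "")))
    }
  }
  unique(c(out, w))            # the whole word is kept as one extra unit
}

grams <- char_ngrams("where", n_min = 3, n_max = 3)

# Toy lookup table: one random 4-dimensional vector per n-gram seen in "training"
set.seed(1)
ngram_vecs <- matrix(rnorm(length(grams) * 4), nrow = length(grams),
                     dimnames = list(grams, NULL))

# An unseen word still gets a vector: sum the vectors of the n-grams it shares
oov_grams <- intersect(char_ngrams("wherever", n_min = 3, n_max = 3), grams)
oov_vec   <- colSums(ngram_vecs[oov_grams, , drop = FALSE])
```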

You can watch the full explanation here:

FastText Explained: Solving the Out-of-Vocabulary (OOV) Problem in NLP Part-1: https://youtu.be/gXU-_0iCF64

Happy learning,
Neeraj

Neeraj Kaushik

Jul 21, 2026, 8:01:01 PM

to dataanalysistraining

Dear Friends,

In the second video of our FastText series, we dive into practical implementation using English word vectors. While models like GloVe treat words as atomic units, FastText breaks them down into subword information, allowing for a deeper understanding of language structure.

In this session, we cover:

Working with pre-trained English word vectors (https://fasttext.cc/docs/en/english-vectors.html) from the official FastText repository.
Understanding how these vectors capture semantic meaning more effectively than traditional methods.
Practical steps for integrating these vectors into your text analysis projects.

You can access the resources and explore the vectors here:

English word vectors · fastText

Word vectors for 157 languages · fastText

FastText Explained: Solving the Out-of-Vocabulary (OOV) Problem in NLP Part-2: https://youtu.be/_rhYbTi3jVY

Happy Learning,
Neeraj Kaushik

Neeraj Kaushik

Jul 22, 2026, 7:56:21 PM

to dataanalysistraining

Dear Friends,

Next, we explore ELMo (Embeddings from Language Models) and how it addresses the fundamental limitations of previous models like Word2Vec, GloVe, and FastText.

While Word2Vec, GloVe, and FastText were revolutionary, they share a common limitation: they produce static embeddings, which leaves them unable to handle polysemy, the phenomenon where one word has multiple meanings. In these models, a word has a single fixed vector regardless of its context. This means the word "bank" has the same mathematical representation whether you are talking about a "river bank" or a "financial bank." FastText added subword information to handle out-of-vocabulary words, but each word still receives one context-independent vector, so polysemy remains unresolved.

ELMo solves this by introducing contextualized word representations. Instead of a fixed lookup table, ELMo looks at the entire sentence before assigning a vector to each word. It achieves this using a deep bidirectional LSTM (BiLSTM) trained on a massive corpus.

Key advantages of ELMo include:

Contextual Awareness: The representation of a word changes based on the words surrounding it, accurately capturing different meanings in different contexts.

Deep Representations: ELMo combines the internal states of the BiLSTM layers, allowing it to capture both low-level morphological features and high-level semantic nuances.

Improved Performance: By integrating ELMo into existing NLP models, we see significant improvements across tasks like question answering, sentiment analysis, and named entity recognition.

I have detailed these concepts and the transition from static to dynamic embeddings in the following video:

ELMo Explained: Solving Polysemy and Context in Word Embeddings: https://youtu.be/wg8V03hjdLw

Happy learning,
Neeraj

Neeraj Kaushik

Jul 23, 2026, 7:22:18 PM

to dataanalysistraining

Dear Friends,

Next, we explore BERT (Bidirectional Encoder Representations from Transformers), a landmark model in NLP. Unlike previous models that read text sequentially, BERT processes words in relation to all other words in a sentence simultaneously. This bidirectional approach allows it to capture deep contextual meaning and resolve ambiguities in language more effectively than ever before.

While transformer models are traditionally built in Python, this video bridges the gap for R users—demonstrating how to leverage modern R packages to implement contextual language models without abandoning our preferred workflow.

Contents covered:

The BERT Paradigm Shift: How bidirectional transformer architectures capture deep contextual meaning compared to static embeddings (Word2Vec, GloVe, FastText).

Demystifying the BERT Ecosystem: Breaking down specialized and optimized BERT variants, including

DistilBERT (lightweight),
ALBERT (parameter-efficient),
DeBERTa (enhanced attention), and
domain-specific models like Legal-BERT and BioMed/Medical-BERT.

R vs. Python for NLP: Comparing the language ecosystems and bridging Python's deep learning stack with R's data-wrangling efficiency.

The "Talk, Text, Topics" Framework: Exploring Prof. Oscar's approach (talk, text, topics) to structure modern NLP pipelines and streamline transformer workflows directly inside R.

Meet Prof. Oscar: https://harmonyresearchgroup.com/author/oscar/

I've explained these concepts in this video:

BERT Explained: Solving Context and Bidirectionality in NLP Part-1: https://youtu.be/H34fe61LMfo

Happy learning,
Neeraj

BERT.R

Neeraj Kaushik

Jul 24, 2026, 11:13:31 PM

to dataanalysistraining

Dear Friends

Next let's step into practical transformer modeling by bridging R and Python to analyze real-world political speeches.

Let's work with the text R package, which serves as a seamless interface between R and Python's Hugging Face ecosystem. Using a dataset of 47 US Presidential Inaugural Speeches (speech.xlsx), we will learn how to extract rich text representations and use them in a supervised classification pipeline.

Contents covered:

Bridging R & Python: Installing and configuring the text package to access Python-backed transformer models directly inside your R environment.

Dataset Walkthrough: Structuring and preparing the US Presidential Inaugural Speeches dataset filtered for Democratic and Republican party affiliations.

Generating Text Representations: Converting raw speech text into high-dimensional numerical embeddings using modern NLP architectures.

Predictive Modeling with LDA: Applying Linear Discriminant Analysis (LDA) to predict political party labels based on the underlying linguistic features of each speech.

Model Evaluation: Evaluating how effectively discriminant analysis separates political discourse and classifies unseen speeches.
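
The classification and evaluation steps can be sketched with lda() from the MASS package, which is bundled with most R installations. Everything below is placeholder data: 40 simulated "speeches" with three made-up embedding features each, a hand-picked hold-out set, and none of the real text package output.

```r
library(MASS)   # provides lda() for Linear Discriminant Analysis

set.seed(2026)
# Hypothetical data: 20 speeches per party, each reduced to 3 embedding features
n   <- 20
emb <- rbind(matrix(rnorm(n * 3, mean = -1), ncol = 3),
             matrix(rnorm(n * 3, mean =  1), ncol = 3))
dat <- data.frame(party = factor(rep(c("Democratic", "Republican"), each = n)), emb)

test_idx <- c(1:5, 21:25)                      # hold out 5 speeches per party

fit  <- lda(party ~ ., data = dat[-test_idx, ])            # train on the remaining 30
pred <- predict(fit, newdata = dat[test_idx, ])$class      # classify the unseen speeches

confusion <- table(actual = dat$party[test_idx], predicted = pred)
accuracy  <- mean(pred == dat$party[test_idx])
```

The confusion table shows which party's speeches are misclassified, and accuracy summarizes the hold-out performance in a single number.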

I've explained these concepts in this video:

BERT Explained: Solving Context and Bidirectionality in NLP Part-2: https://youtu.be/43UT_IWo3Xk

Happy Learning

Neeraj

speech.xlsx
