Problem with CaGalt() function

Gabriel Parriaux

Mar 12, 2024, 1:24:46 PM

to FactoMineR users

Hello,

I’m doing textual data analysis on a corpus of audio recordings that have been transcribed.

After running a Reinert clustering with the Rainette method in R, I want to perform a correspondence analysis that crosses the frequency of words in the segments with the other categorical variables that I have for my corpus.

Some investigation brought me to the CaGalt() function, which seems to be the one I need.

I referred to this article which presents the method: https://journal.r-project.org/archive/2015/RJ-2015-010/RJ-2015-010.pdf

Actually, my data.frame contains around 2800 segments as individuals in rows, around 1000 words in columns with frequencies as numeric values and nearly 20 categorical variables in columns (the docvars from my Quanteda corpus object).

One question I have is: do I have to convert my categorical variables to dummy variables before the analysis?

I thought it would be done automatically by the CaGalt() function, but that did not seem to be the case.

Then I read somewhere that the categorical variables needed to be dummy, so I converted them to dummy.

But now, I have an error when I perform the following function:

res.cagalt <- CaGalt(Y = table_for_ca_galt[, 108:ncol(table_for_ca_galt)], X = table_for_ca_galt[, 1:107], type = "n", conf.ellip = TRUE, nb.ellip = 100, level.ventil = 0, sx = NULL, graph = TRUE, axes = c(1, 2))

Here is the error message:

Error in 1:ncol(U) : argument of length 0

Is it appropriate to use CaGalt in my case?

Do I have to convert my categorical variables to dummy variables?

Do you have an idea how I can solve the error I get?

Thanks a lot for your help,

Gabriel

François Husson

Mar 13, 2024, 4:00:23 AM

to factomin...@googlegroups.com

Hello,

The categorical variables must be treated as categorical, so they must be factors (not characters). The function will transform them into dummy variables, so you do not have to transform them yourself.
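As a minimal sketch of that advice (the data frame `toy` and its column names are hypothetical, not taken from the poster's data), character columns can be converted to factors in base R before calling CaGalt():

```r
# Hypothetical toy data frame; names and values are illustrative only.
toy <- data.frame(
  speaker = c("teacher1", "teacher2", "teacher1"),
  gender  = c("female", "male", "female"),
  etre    = c(0, 1, 4),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left untouched.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Inspect the classes: factors for categorical variables, numeric for word counts.
sapply(toy, class)
```

The factor columns can then be passed as X with type = "n", without building dummy variables by hand.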

FH

--
François Husson
Department Statistics & Computer Science
L'Institut Agro
65 rue de St-Brieuc - 35042 Rennes
Tel: +33 2 23 48 58 86
https://husson.github.io/
https://www.youtube.com/@HussonFrancois/videos

Gabriel Parriaux

Mar 13, 2024, 11:43:50 AM

to FactoMineR users

Thanks a lot for helping!

I could create my table for CaGalt appropriately, I think, with Factor columns for variables and numeric columns for word frequencies.

When I run the CaGalt() function, I no longer get an error in the console, but the process gets stuck for quite a long time and I get no output, even after several minutes (I didn’t wait more than 10 minutes…). I have to interrupt the process on each attempt. My machine (MacBook Pro M2) usually computes things quite fast, even for heavy computations like clustering. I have the impression that the process is stuck, but I have no clue what is happening.

My table is composed of 3990 observations × 18 variables and 1073 words.

Here is the beginning of the output of str(myTable):

'data.frame': 3990 obs. of 1091 variables:
$ Speaker : Factor w/ 11 levels "teacher1","teacher10",..: 1 1 1 1 1 1 1 1 1 5 ...
$ To : Factor w/ 3 levels "address to class",..: 1 1 1 1 1 1 1 1 1 1 ...
$ lesson_id : Factor w/ 10 levels "cl01_pr1","cl01_pr2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ lesson_topic : Factor w/ 2 levels "lesson prog1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ programming_type : Factor w/ 2 levels "textual programing",..: 2 2 2 2 2 2 2 2 2 2 ...
$ class_id : Factor w/ 5 levels "class01","class02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ gender : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
$ age_range : Factor w/ 3 levels "age < 23 yo",..: 1 1 1 1 1 1 1 1 1 3 ...
$ professional_role : Factor w/ 2 levels "lower sec teacher",..: 2 2 2 2 2 2 2 2 2 2 ...
$ discipline : Factor w/ 3 levels "discipl maths",..: 3 3 3 3 3 3 3 3 3 3 ...
$ teaching_experience : Factor w/ 4 levels "teaching exp < 3y",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cs_teaching_experience : Factor w/ 3 levels "cs teaching exp 0y",..: 1 1 1 1 1 1 1 1 1 1 ...
$ teaching_qualification : Factor w/ 2 levels "graduated teacher",..: 2 2 2 2 2 2 2 2 2 2 ...
$ degree : Factor w/ 4 levels "Lower sec teacher's Master deg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ cs_education : Factor w/ 3 levels "cs ed inside teacher ed",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Integrated_TPCK_Mastery : Factor w/ 4 levels "Integ TPCK Mastery Fair",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Foundational_Knowledge_Base: Factor w/ 4 levels "Found Knowledge Base Fair",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Cluster : Factor w/ 40 levels "cluster_1","cluster_10",..: 8 34 34 34 19 14 26 15 34 34 ...
$ être : num 0 1 4 2 0 1 2 3 5 0 ...
$ avoir : num 0 1 0 0 0 0 0 1 0 1 ...
$ faire : num 0 1 0 0 0 0 1 1 2 1 ...
$ pouvoir : num 0 0 0 0 0 0 0 0 0 0 ...
$ aller : num 1 2 0 0 0 2 0 2 6 0 ...
$ là : num 0 1 0 0 0 0 0 0 0 0 ...
$ alors : num 1 1 0 0 0 0 0 0 0 0 ...
$ ouais : num 0 0 0 0 1 2 1 0 1 0 ...
$ donc : num 0 0 0 0 0 0 0 2 1 0 ...

…

The function I run is this one:

res.cagalt <- CaGalt(Y=table_for_ca_galt[,19:1091],X=table_for_ca_galt[,1:18],type="n")

If I try to run it with less categorical variables at the same time, like:

res.cagalt <- CaGalt(Y=table_for_ca_galt[,19:1091],X=table_for_ca_galt[,1:2],type="n")

I have the same problem… 7 minutes and waiting.

And if I try to limit the numeric columns (with word frequencies), like this:

res.cagalt_temp<-CaGalt(Y=table_for_ca_galt[,19:250],X=table_for_ca_galt[,1:2],type="n")

I get the following error:

Error in eigen(crossprod(X, X), symmetric = TRUE) :
infinite or missing values in 'x'
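One possible cause, which this thread does not confirm, is that restricting the word columns leaves some segments with no words at all, so that their row total is zero. A rough sketch of how to look for that on a hypothetical toy table (all names below are made up for illustration):

```r
# Hypothetical toy table: 2 factor columns followed by 4 word-count columns.
toy <- data.frame(
  f1 = factor(c("a", "b", "a", "b")),
  f2 = factor(c("x", "x", "y", "y")),
  w1 = c(1, 0, 2, 0),
  w2 = c(0, 0, 1, 3),
  w3 = c(0, 4, 0, 0),
  w4 = c(0, 0, 0, 1)
)

# Keep only the first two word columns, analogous to restricting Y to a column range.
Y_sub <- toy[, 3:4]

# Rows and columns whose counts sum to zero within the subset.
empty_rows <- which(rowSums(Y_sub) == 0)
empty_cols <- names(Y_sub)[colSums(Y_sub) == 0]

empty_rows  # segments with no remaining words
empty_cols  # words that never occur in the subset
```

In this toy case the second row is non-empty in the full table but becomes empty once w3 and w4 are dropped. Whether this explains the eigen() error here is an assumption.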

Do you have an idea of what I’m doing wrong?

Should I just be more patient and wait longer?

Thanks a lot for helping again,

Gabriel

Gabriel Parriaux

Mar 17, 2024, 1:07:48 PM

to FactoMineR users

Hi everyone,

I’m coming back to the problem I mentioned with CaGalt().

I triple-checked my data to be sure that there was no problem with the table I want to analyse.

In my dataframe:

1. categorical variable columns are all factors (18 factor columns)
2. tokens columns are all numeric (1062 tokens)
3. there are no NA values

I checked for NA values with this function:

df[!complete.cases(df), ]

and the output is this one:

<0 rows> (or 0-length row.names)

I understand this as “there are no NA values”.

4. I have no column with only 0 values (no token that would not belong to any of my documents)

I checked that by sorting the numeric columns in descending order and then computing the sum of the last column (the one with the smallest values):

sum(df[ , ncol(df)])

I get an output of

[1] 3

The sum of values in the last column is 3, which shows that there are no columns with only zeros.

5. I have no rows with only 0 values (no document that would not contain any of the tokens)

I checked that by subsetting my dataframe with the condition of keeping rows that would have a zero sum on the numeric columns:

rows_with_zero_sums <- subset(df, rowSums(df[ , 19:ncol(df)], na.rm = TRUE) == 0)

and the resulting table has zero rows.
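The checks above can be gathered into one helper. This is only a sketch: `check_table` and `first_word_col` are hypothetical names, and the layout (factor columns first, word counts after) is assumed to match the table described here. It also counts non-finite values, since complete.cases() flags NA and NaN but not Inf.

```r
# Hypothetical helper: summarises the checks listed above for a table whose
# categorical columns come first and whose word counts start at first_word_col.
check_table <- function(df, first_word_col) {
  cats  <- df[, seq_len(first_word_col - 1), drop = FALSE]
  words <- df[, first_word_col:ncol(df), drop = FALSE]
  list(
    all_factors  = all(vapply(cats, is.factor, logical(1))),
    all_numeric  = all(vapply(words, is.numeric, logical(1))),
    n_missing    = sum(is.na(df)),
    n_non_finite = sum(!is.finite(as.matrix(words))),  # also catches Inf
    n_empty_cols = sum(colSums(words, na.rm = TRUE) == 0),
    n_empty_rows = sum(rowSums(words, na.rm = TRUE) == 0)
  )
}

# Toy example (illustrative values only): the second row has no words.
toy <- data.frame(
  f1 = factor(c("a", "b", "a")),
  w1 = c(1, 0, 2),
  w2 = c(0, 0, 1)
)
res <- check_table(toy, first_word_col = 2)
str(res)
```

For the table in this post, the call would presumably be check_table(df, first_word_col = 19).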

So my data seems to be clean… it’s a dataframe with 4032 documents × 18 categorical variables and 1062 tokens.

I also exported it as .rds file and imported it to a new project where I just loaded those libraries:

library(FactoMineR)
library(factoextra)
library(tidyverse)

I imported my dataframe and executed CaGalt() this way:

res.cagalt<-CaGalt(Y=df[,19:ncol(df)],X=df[,1:18],type="n")

It hangs for minutes, even hours. I have to interrupt the process with the Esc key, and then I get no error or warning message. Nothing happens at all.

I also tried to reduce the number of variables by only taking two factors that have 2 levels each:

res.cagalt<-CaGalt(Y=df[,19:ncol(df)],X=df[,4:5],type="n")

But I get the same result: it hangs forever.

If you have any idea of what I could be doing wrong, it would be of great help!

Thanks a lot,

Gabriel
