Interactive terms in model formulae

21 views
Skip to first unread message

Jared Anderson

unread,
Jan 25, 2022, 6:13:04 PM1/25/22
to breedR


devtools::install_github('famuvie/breedR')
library(breedR)
library(jsonlite)
#I have a data from a Split Plot experiment design 
> fake_data factor1 factor2 rep response 1 C U 1 84.02 2 B T 1 84.02 3 C U 2 218.56 4 B T 2 218.56 5 C U 3 330.00 6 B T 3 330.00 7 C V 1 409.31 8 B T 1 409.31 9 C V 2 458.20 10 B T 2 458.20 11 C V 3 293.85 12 B T 3 293.85 13 C W 1 117.39 14 B T 1 117.39 15 C W 2 302.99 16 B T 2 302.99 17 C W 3 36.25 18 B T 3 36.25 19 C X 1 53.82 20 B T 1 53.82 21 C X 2 440.62 22 B T 2 440.62 23 C X 3 469.63 24 B T 3 469.63 25 C Y 1 375.44 26 B T 1 375.44 27 C Y 2 209.88 28 B T 2 209.88 29 C Y 3 128.76 30 B T 3 128.76 31 C Z 1 195.09 32 B T 1 195.09 33 C Z 2 70.87 34 B T 2 70.87 35 C Z 3 467.48 36 B T 3 467.48 37 D U 1 325.11 38 A S 1 325.11 39 D U 2 368.38 40 A S 2 368.38 41 D U 3 355.41 42 A S 3 355.41 43 D V 1 171.54 44 A S 1 171.54 45 D V 2 492.26 46 A S 2 492.26 47 D V 3 279.85 48 A S 3 279.85 49 D W 1 312.87 50 A S 1 312.87 51 D W 2 480.80 52 A S 2 480.80 53 D W 3 65.35 54 A S 3 65.35 55 D X 1 59.26 56 A S 1 59.26 57 D X 2 273.04 58 A S 2 273.04 59 D X 3 286.85 60 A S 3 286.85 61 D Y 1 306.65 62 A S 1 306.65 63 D Y 2 145.89 64 A S 2 145.89 65 D Y 3 351.04 66 A S 3 351.04 67 D Z 1 2.20 68 A S 1 2.20 69 D Z 2 156.83 70 A S 2 156.83 71 D Z 3 141.18 72 A S 3 141.18


fake_fixed_formula = "response ~ 1 + factor1 + factor2 + factor1:factor2"
fake_random_formula =  "~ rep + rep:factor1"
output = tryCatch({
  outputREMLf90 =breedR::remlf90(fixed = as.formula(fake_fixed_formula),
                                 random = as.formula(fake_random_formula),
                                 data = fake_data,method='ai')
},
warning = function(war){
  print(paste('WARNING:', war))
},
error = function(err){
  print(paste('ERROR:', err))
  return(err)
})

I'm used to expressing model formulae to have a colon to signify interactive terms, but the remlf90 model seems to have an issue with any terms with colons in them (separate issue that I can reproduce as well)

Fixed_New_Formula    <- gsub(':','_',fake_fixed_formula)

treatment_factor_levels = attr(terms(as.formula(Fixed_New_Formula)),"term.labels")
if (any(grepl('_',treatment_factor_levels))){
  new_cols = treatment_factor_levels[grep('_',attr(terms(as.formula(Fixed_New_Formula)),"term.labels"))]
  for (nc in new_cols){
    interacting_factors = lapply(nc,function(x){strsplit(x,'_')[[1]]})[[1]]
    fake_data[,nc] = do.call(paste, c(fake_data[interacting_factors], sep = "_"))
  }
}

Random_New_Formula   <- gsub(':','_',fake_random_formula)

blocking_factor_levels = attr(terms(as.formula(Random_New_Formula)),"term.labels")
if (any(grepl('_',blocking_factor_levels))){
  new_cols = blocking_factor_levels[grep('_',attr(terms(as.formula(Random_New_Formula)),"term.labels"))]
  for (nc in new_cols){
    interacting_blocking_factors = lapply(nc,function(x){strsplit(x,'_')[[1]]})[[1]]
    fake_data[,nc] = do.call(paste, c(fake_data[interacting_blocking_factors], sep = "_"))
  }
}


> fake_data factor1 factor2 rep response factor1_factor2 rep_factor1 1 C U 1 84.02 C_U 1_C 2 B T 1 84.02 B_T 1_B 3 C U 2 218.56 C_U 2_C 4 B T 2 218.56 B_T 2_B 5 C U 3 330.00 C_U 3_C 6 B T 3 330.00 B_T 3_B 7 C V 1 409.31 C_V 1_C 8 B T 1 409.31 B_T 1_B 9 C V 2 458.20 C_V 2_C 10 B T 2 458.20 B_T 2_B 11 C V 3 293.85 C_V 3_C 12 B T 3 293.85 B_T 3_B 13 C W 1 117.39 C_W 1_C 14 B T 1 117.39 B_T 1_B 15 C W 2 302.99 C_W 2_C 16 B T 2 302.99 B_T 2_B 17 C W 3 36.25 C_W 3_C 18 B T 3 36.25 B_T 3_B 19 C X 1 53.82 C_X 1_C 20 B T 1 53.82 B_T 1_B 21 C X 2 440.62 C_X 2_C 22 B T 2 440.62 B_T 2_B 23 C X 3 469.63 C_X 3_C 24 B T 3 469.63 B_T 3_B 25 C Y 1 375.44 C_Y 1_C 26 B T 1 375.44 B_T 1_B 27 C Y 2 209.88 C_Y 2_C 28 B T 2 209.88 B_T 2_B 29 C Y 3 128.76 C_Y 3_C 30 B T 3 128.76 B_T 3_B 31 C Z 1 195.09 C_Z 1_C 32 B T 1 195.09 B_T 1_B 33 C Z 2 70.87 C_Z 2_C 34 B T 2 70.87 B_T 2_B 35 C Z 3 467.48 C_Z 3_C 36 B T 3 467.48 B_T 3_B 37 D U 1 325.11 D_U 1_D 38 A S 1 325.11 A_S 1_A 39 D U 2 368.38 D_U 2_D 40 A S 2 368.38 A_S 2_A 41 D U 3 355.41 D_U 3_D 42 A S 3 355.41 A_S 3_A 43 D V 1 171.54 D_V 1_D 44 A S 1 171.54 A_S 1_A 45 D V 2 492.26 D_V 2_D 46 A S 2 492.26 A_S 2_A 47 D V 3 279.85 D_V 3_D 48 A S 3 279.85 A_S 3_A 49 D W 1 312.87 D_W 1_D 50 A S 1 312.87 A_S 1_A 51 D W 2 480.80 D_W 2_D 52 A S 2 480.80 A_S 2_A 53 D W 3 65.35 D_W 3_D 54 A S 3 65.35 A_S 3_A 55 D X 1 59.26 D_X 1_D 56 A S 1 59.26 A_S 1_A 57 D X 2 273.04 D_X 2_D 58 A S 2 273.04 A_S 2_A 59 D X 3 286.85 D_X 3_D 60 A S 3 286.85 A_S 3_A 61 D Y 1 306.65 D_Y 1_D 62 A S 1 306.65 A_S 1_A 63 D Y 2 145.89 D_Y 2_D 64 A S 2 145.89 A_S 2_A 65 D Y 3 351.04 D_Y 3_D 66 A S 3 351.04 A_S 3_A 67 D Z 1 2.20 D_Z 1_D 68 A S 1 2.20 A_S 1_A 69 D Z 2 156.83 D_Z 2_D 70 A S 2 156.83 A_S 2_A 71 D Z 3 141.18 D_Z 3_D 72 A S 3 141.18 A_S 3_A

# Make sure factors are R-factors
factor_names <- c(treatment_factor_levels, blocking_factor_levels)
fnId         <- 1:length(factor_names)
for(kk in fnId){
  fake_data[, factor_names[kk]] <- as.factor(as.character(fake_data[, factor_names[kk]]))
}

output = tryCatch({
  outputREMLf90 =breedR::remlf90(fixed = as.formula(Fixed_New_Formula),
                                 random = as.formula(Random_New_Formula),
                                 data = fake_data,method='ai')
},
warning = function(war){
  print(paste('WARNING:', war))
},
error = function(err){
  print(paste('ERROR:', err))
  return(err)
})

Now the model fits without error, but the results for the fixed effects look wrong... it appears that it voids some values for the fixed effects tables for the individual factors. In a single factor fixed effect model, factor1 would produce the means of each treatment as expected. But with these interactive terms introduced it seems to defaults the means and standard errors of single fixed effects such as factor1 to 0. 

> fixef(object = outputREMLf90) $factor1 value s.e. A 254.1394 41.35154 B 259.0089 41.35154 C 0.0000 0.00000 D 0.0000 0.00000 $factor2 value s.e. S 0 0 T 0 0 U 0 0 V 0 0 W 0 0 X 0 0 Y 0 0 Z 0 0 $factor1_factor2 value s.e. A_S 0.0000 0.00000 B_T 0.0000 0.00000 C_U 210.8600 85.53294 C_V 387.1200 85.53294 C_W 152.2100 85.53294 C_X 321.3567 85.53294 C_Y 238.0267 85.53294 C_Z 244.4800 85.53294 D_U 349.6333 85.53294 D_V 314.5500 85.53294 D_W 286.3400 85.53294 D_X 206.3833 85.53294 D_Y 267.8600 85.53294 D_Z 100.0700 85.53294

This looks half-right to me, as the means by factor level I see using aggregate function in R look like this:

> aggregate(response ~ factor1,data=fake_data,FUN=mean) factor1 response 1 A 254.1394 2 B 259.0089 3 C 259.0089 4 D 254.1394 > aggregate(response ~ factor2,data=fake_data,FUN=mean) factor2 response 1 S 254.1394 2 T 259.0089 3 U 280.2467 4 V 350.8350 5 W 219.2750 6 X 263.8700 7 Y 252.9433 8 Z 172.2750

> aggregate(response ~ factor1 + factor2,data=fake_data,FUN=mean) factor1 factor2 response 1 A S 254.1394 2 B T 259.0089 3 C U 210.8600 4 D U 349.6333 5 C V 387.1200 6 D V 314.5500 7 C W 152.2100 8 D W 286.3400 9 C X 321.3567 10 D X 206.3833 11 C Y 238.0267 12 D Y 267.8600 13 C Z 244.4800 14 D Z 100.0700

I was wondering how that could be happening? Is there a better way for me to specify the factors and formula?


Facundo Muñoz

unread,
Jan 26, 2022, 4:30:53 AM1/26/22
to bre...@googlegroups.com

Dear Jared,

The issue is that your design matrix is singular. I.e., the arrangement of the observations does not allow to identify all the effects that you are specifying in the model.

You can see this in the contingency table of the number of observations by categorical factor:

> xtabs( ~ factor1+factor2, fake_data)
       factor2
factor1  S  T  U  V  W  X  Y  Z
      A 18  0  0  0  0  0  0  0
      B  0 18  0  0  0  0  0  0
      C  0  0  3  3  3  3  3  3
      D  0  0  3  3  3  3  3  3

from where you can see that you cannot possibly estimate separate effects for A, for S and for AxS, and the same occurs with the effects of B, T and BxT. They are "confounded" and are not identifiable.

For the block with 3 observations per group, the situation is trickier to see, but the underlying problem is the same. You can, for instance, estimate the effect of C by aggregating observations from the third row, and the effects of U..Z by aggregating the respective columns. But then you ran out of observations and cannot assess the effect of D nor any of the interactions.

What breedR does is to arbitrarily put some of the estimates at 0, and give results for those that can be estimated. It is similar to what function lm() does:

summary(
  lm(response ~ 1 + factor1 + factor2 + factor1_factor2, data = fake_data)
)
#> 
#> Call:
#> lm(formula = response ~ 1 + factor1 + factor2 + factor1_factor2, 
#>     data = fake_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -267.54 -110.19   20.55   97.99  238.12 
#> 
#> Coefficients: (10 not defined because of singularities)
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         254.139     34.676   7.329  8.2e-10 ***
#> factor1B              4.869     49.039   0.099   0.9212    
#> factor1C             -9.659     91.743  -0.105   0.9165    
#> factor1D           -154.069     91.743  -1.679   0.0985 .  
#> factor2T                 NA         NA      NA       NA    
#> factor2U            249.563    120.120   2.078   0.0422 *  
#> factor2V            214.480    120.120   1.786   0.0794 .  
#> factor2W            186.270    120.120   1.551   0.1264    
#> factor2X            106.313    120.120   0.885   0.3798    
#> factor2Y            167.790    120.120   1.397   0.1678    
#> factor2Z                 NA         NA      NA       NA    
#> factor1_factor2B_T       NA         NA      NA       NA    
#> factor1_factor2C_U -283.183    169.875  -1.667   0.1009    
#> factor1_factor2C_V  -71.840    169.875  -0.423   0.6739    
#> factor1_factor2C_W -278.540    169.875  -1.640   0.1065    
#> factor1_factor2C_X  -29.437    169.875  -0.173   0.8630    
#> factor1_factor2C_Y -174.243    169.875  -1.026   0.3093    
#> factor1_factor2C_Z       NA         NA      NA       NA    
#> factor1_factor2D_U       NA         NA      NA       NA    
#> factor1_factor2D_V       NA         NA      NA       NA    
#> factor1_factor2D_W       NA         NA      NA       NA    
#> factor1_factor2D_X       NA         NA      NA       NA    
#> factor1_factor2D_Y       NA         NA      NA       NA    
#> factor1_factor2D_Z       NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 147.1 on 58 degrees of freedom
#> Multiple R-squared:  0.1517, Adjusted R-squared:  -0.03843 
#> F-statistic: 0.7979 on 13 and 58 DF,  p-value: 0.6595

Unfortunately, there is not much you can do about it as the data simply does not contain the information necessary for identifying those effects (it does not mean that the effects do not exist). You can always use the model for prediction. But estimating the individual effects seems unfeasible with these data.

Hope it helps

ƒacu.-

--
Report issues at https://github.com/famuvie/breedR/issues
---
You received this message because you are subscribed to the Google Groups "breedR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to breedr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/breedr/6cffae9f-3e6d-4db6-b535-67fbc9f4b8fan%40googlegroups.com.

Jared Anderson

unread,
Jan 26, 2022, 2:31:25 PM1/26/22
to breedR
Thanks for the response Facundo. Is there another example of breedR running a model with interactive terms on a different dataset that does not have rank problems? 

Facundo Muñoz

unread,
Jan 26, 2022, 3:02:46 PM1/26/22
to bre...@googlegroups.com

Jared,

There is a toy example [1] in breedR's "Overview" which briefly explains how to introduce interactions. You already discovered that. Although simple, the example works correctly.

I have used interactions to illustrate Genotype x Environment effects in the slides [2]. These are from the workshop materials [3], which I also include in case of interest.

Hope it helps.

ƒacu.-


[1] https://github.com/famuvie/breedR/wiki/Overview#interactions

[2] https://famuvie.github.io/breedR/workshop_IBL/GEI.html

[3] https://famuvie.github.io/breedR/workshop_IBL/

Reply all
Reply to author
Forward
0 new messages