
Contrast pitfall I: Dummy coding and the main effects

Nowadays, linear (mixed) models are frequently used in psychology and other fields. One important concept in applying them is contrast coding, which is how categorical independent variables are represented in the model. Contrast coding is a relatively unfamiliar notion, especially for those who first learned t-tests or ANOVA. Not too surprisingly, then, linear (mixed) models are sometimes applied incorrectly due to an imperfect understanding of contrast coding.

In this blog, I would like to discuss one common mistake: interpreting certain coefficients as main effects when a linear (mixed) model employs dummy coding, which is the default in R, Julia, Python, and possibly other programming languages as well.

Setup

library(tidyverse)

Let’s simulate a data set with a 2 \(\times\) 2 between-subjects design:

# a 2*2 between-subject design
N_subj <- 50   # each group
A_labels <- c("a1", "a2")
B_labels <- c("b1", "b2")

# true values (used for simulating data)
a1_b1 <- 30
a2_b1 <- 20
a1_b2 <- 25
a2_b2 <- 50

# sd for the residuals
e_sd <- 10

set.seed(20201111)
A <- rep(gl(n = length(A_labels), k = N_subj, labels = A_labels), times = 2)
B <- gl(n = length(B_labels), k = N_subj * 2, labels = B_labels)
y_ <- rep(c(a1_b1, a2_b1, a1_b2, a2_b2), each = N_subj)  # true cell mean for each row
Y <- y_ + rnorm(length(y_), 0, e_sd)                     # add Gaussian noise

df <- tibble(A, B, Y)
str(df)
## tibble[,3] [200 × 3] (S3: tbl_df/tbl/data.frame)
##  $ A: Factor w/ 2 levels "a1","a2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ B: Factor w/ 2 levels "b1","b2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Y: num [1:200] 35.9 41.7 30.6 35.6 36.2 ...

For simplicity, a between-subjects design is simulated, but all of the following discussion applies to linear mixed models as well.

What is the main effect?

Let’s first make sure that we are on the same page about what a main effect is. Taking the simulated design above as an example, the main effect of A refers to the difference between a1 and a2, averaged over the levels of B. In other words, the main effect of A is (a1_b1 + a1_b2)/2 - (a2_b1 + a2_b2)/2, and its true population value is -7.5. Similarly, the main effect of B is (a1_b1 + a2_b1)/2 - (a1_b2 + a2_b2)/2, and its true population value is -12.5.
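
To make this concrete, the true main effects can be computed directly from the cell means defined in the setup above:

(a1_b1 + a1_b2) / 2 - (a2_b1 + a2_b2) / 2  # true main effect of A
## [1] -7.5
(a1_b1 + a2_b1) / 2 - (a1_b2 + a2_b2) / 2  # true main effect of B
## [1] -12.5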

Note: if our understandings of the main effect differ, it may not be necessary for you to read the rest of this blog. But please clarify what you mean by main effects in your publications.

What we get from ANOVA

The notion of the main effect comes from ANOVA. Therefore, an ANOVA is performed first, and its results will be compared with those obtained from linear models. (The design here is balanced, so the type of sums of squares does not matter for this comparison.)

anova(lm(Y ~ A * B, df))
## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## A           1  2429.1  2429.1  25.280 1.115e-06 ***
## B           1  6109.8  6109.8  63.586 1.247e-13 ***
## A:B         1 17611.0 17611.0 183.281 < 2.2e-16 ***
## Residuals 196 18833.1    96.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA does not report the effects (i.e., the differences) directly; instead, it provides F-values and p-values. Therefore, if an effect estimated in a linear model corresponds to a main effect in ANOVA, their statistical evidence should match. Specifically, they should have (nearly) identical p-values. Note that linear models report t-values instead of F-values; when a factor has only two levels, the square of the t-value should match the F-value (for more, see here or here).
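
As a minimal sanity check of this t-to-F relation, we can fit a one-factor model on the simulated data and compare the squared t-value of the single coefficient with the ANOVA F-value (m1 is just a name used here):

# for a single two-level factor, the squared t-value equals the ANOVA F-value
m1 <- lm(Y ~ A, df)
summary(m1)$coefficients["Aa2", "t value"]^2  # squared t-value of the A coefficient
anova(m1)["A", "F value"]                     # the F-value; identical to the above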

What we get from linear models (by default)

With the default dummy coding (a1 and b1 are coded as 0, and a2 and b2 are coded as 1), the linear model is fitted:

# with the default dummy coding
summary(lm(Y ~ A * B, df))
## 
## Call:
## lm(formula = Y ~ A * B, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.003  -5.772   1.103   6.386  24.202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   31.087      1.386  22.425  < 2e-16 ***
## Aa2          -11.797      1.960  -6.018 8.55e-09 ***
## Bb2           -7.713      1.960  -3.934 0.000116 ***
## Aa2:Bb2       37.535      2.773  13.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.802 on 196 degrees of freedom
## Multiple R-squared:  0.5813, Adjusted R-squared:  0.5749 
## F-statistic: 90.72 on 3 and 196 DF,  p-value: < 2.2e-16
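
Before interpreting these coefficients, it can help to inspect how R actually coded the two factors; contrasts() shows the coding matrix (the default dummy coding corresponds to contr.treatment in R):

contrasts(df$A)  # a1 = 0, a2 = 1; contrasts(df$B) is analogous
##    a2
## a1  0
## a2  1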

Some researchers (mistakenly) regard Aa2 and Bb2 as the main effects of A and B observed in ANOVA, respectively. Let’s compare the t-values and p-values with those obtained in ANOVA. Unfortunately, the square of the t-value of Aa2 (i.e., (-6.018)^2 = 36.216324) does not match the F-value of the main effect of A (25.280), and neither does the p-value. Similarly, the statistical results of Bb2 do not match those of the main effect of B in ANOVA.

In fact, Aa2 tests a2_b1 - a1_b1 (the simple effect of A at b1; its true population value is -10), and Bb2 tests a1_b2 - a1_b1 (the simple effect of B at a1; its true population value is -5). That is, these coefficients are two simple effects, not main effects, as the check below confirms.
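
We can verify this against the observed cell means (a quick check; cell_means is just a name used here, and because the model includes the full interaction, the fitted cell means equal the observed ones):

# observed cell means
cell_means <- df %>%
  group_by(A, B) %>%
  summarise(m = mean(Y), .groups = "drop")

# simple effect of A at b1: compare with the Aa2 estimate (-11.797)
with(cell_means, m[A == "a2" & B == "b1"] - m[A == "a1" & B == "b1"])

# simple effect of B at a1: compare with the Bb2 estimate (-7.713)
with(cell_means, m[A == "a1" & B == "b2"] - m[A == "a1" & B == "b1"])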

What if I would like to get the main effects with linear models?

If you would like to test the main effects of A and B with linear models, you need to apply a different contrast coding, e.g., sum coding via contr.sum().
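
For a two-level factor, contr.sum() codes the first level as 1 and the second as -1:

contr.sum(2)
##   [,1]
## 1    1
## 2   -1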

summary(lm(Y ~ A * B, df, 
           contrasts = list(A = "contr.sum",
                            B = "contr.sum")))
## 
## Call:
## lm(formula = Y ~ A * B, data = df, contrasts = list(A = "contr.sum", 
##     B = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.003  -5.772   1.103   6.386  24.202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.7153     0.6931  44.314  < 2e-16 ***
## A1           -3.4851     0.6931  -5.028 1.12e-06 ***
## B1           -5.5271     0.6931  -7.974 1.25e-13 ***
## A1:B1         9.3838     0.6931  13.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.802 on 196 degrees of freedom
## Multiple R-squared:  0.5813, Adjusted R-squared:  0.5749 
## F-statistic: 90.72 on 3 and 196 DF,  p-value: < 2.2e-16

With contr.sum(), the square of the t-value of A1 is (-5.028)^2 (i.e., 25.280784), which matches the F-value of the main effect of A in ANOVA (25.280), as does the p-value. Likewise, the square of the t-value of B1 is (-7.974)^2 (i.e., 63.584676), which matches the F-value of the main effect of B (63.586). One caveat: with this 1/-1 coding, each coefficient is half of the corresponding mean difference, so the estimated main effect of A is 2 × (-3.4851) ≈ -6.97 (true value: -7.5), and that of B is 2 × (-5.5271) ≈ -11.05 (true value: -12.5).
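
If you want the mean differences themselves, you can simply double the sum-coded coefficients (a quick sketch; m_sum is just a name used here):

# refit with sum coding and double the coefficients to obtain the mean differences
m_sum <- lm(Y ~ A * B, df,
            contrasts = list(A = "contr.sum", B = "contr.sum"))
2 * coef(m_sum)[c("A1", "B1")]  # approx. -6.97 and -11.05 (true: -7.5 and -12.5)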

Take-home message

With the default dummy coding, the first two (or several) coefficients estimated in a linear model are simple effects and do not correspond to the main effects in ANOVA.

PS: If you would like to know how to set contrast coding in more complex models, you may check my previous blogs: a \(2\times2\) between-subjects design, a \(2\times2\times2\) between-subjects design, and a \(2\times3\) within-subjects design.