
How to simulate a dataset for fitting linear mixed models

[DEPRECATED] Please refer to this updated method for simulating data.

Recently I have been exploring how the linear mixed model (LMM) works and how to fit data with it. I remember that when I started learning the linear model, simulating a dataset and then fitting it helped me understand how the model works, so the same approach should help with LMMs. Unfortunately, I could not find an R function that simulates a dataset for an LMM directly, so I decided to write the code myself.

To simulate a dataset for an LMM, we first assume an experimental design from psychology. Second, we set the population means (μ) of all conditions and the standard deviations (σ) of the residual and the random effects. Third, the dependent variable is calculated from these parameters. Finally, the fitted fixed and random effects are compared with the pre-set parameters.

1 Simulate the Experiment Design

(Apologies in advance: due to my limited knowledge, the code is not flexible enough, so the experimental design of this simulated dataset has to be a 2 × 2 within-subject design or simpler.)

On each trial, one face is presented to the participant, and the task is to judge whether the eyes are upright or inverted. The face is upright intact, upright scrambled, inverted intact, or inverted scrambled. We only use response time (RT) as the dependent variable and assume that RT is normally distributed.

There are two independent variables in this experiment (IV1: upright vs. inverted; IV2: intact vs. scrambled). Participants completed the trials of all stimuli in all conditions (2 × 2 within-subject design).

The question we are interested in is whether the two independent variables (and their interaction) have effects on the dependent variable.

set.seed(42)

# set the parameters for experiment design
IV1_levels = c("upright", "inverted")  # (has to be two levels)
IV2_levels = c("intact", "scrambled")  # (has to be two levels)
num_Subj = 30
num_Stim = 20

2 Set the Parameters for this LMM

# set the mu for every bin (every condition)
IV1.1_IV2.1 = 500  # upright intact
IV1.2_IV2.1 = 600  # inverted intact
IV1.1_IV2.2 = 800  # upright scrambled
IV1.2_IV2.2 = 850  # inverted scrambled

# set the standard deviations (SDs) for the lmm
# (note: these variables are named var_*, but they are used as SDs in rnorm())
var_residual = 30  # residual
var_rnd_int_subj = 40  # random intercept for Subject   
var_rnd_int_stim = 50  # random intercept for Stimuli
var_rnd_slp_IV1_subj <- 60  # random slope of IV1 for Subject
var_rnd_slp_IV2_subj <- 70  # random slope of IV2 for Subject
var_rnd_slp_inter_subj <- 20  # random slope of IV1*IV2 for Subject
var_rnd_slp_IV1_stim <- 80  # random slope of IV1 for Stimuli
var_rnd_slp_IV2_stim <- 90  # random slope of IV2 for Stimuli
var_rnd_slp_inter_stim <- 20  # random slope of IV1*IV2 for Stimuli

3 Calculate the Dependent Variable (RT)

# load libraries
library(tidyverse)
library(lme4)

Generate the design matrix for participant code (Subject), stimulus code (Stimuli) and all independent variables (IV1 and IV2):

# number of levels for IV1 and IV2
numlevel_IV1 = length(IV1_levels)  
numlevel_IV2 = length(IV2_levels)
num_total = numlevel_IV1 * numlevel_IV2 * num_Stim * num_Subj

# generate factors for all non-dependent variables
IV1 <- gl(n = numlevel_IV1, k = 1, length = num_total, labels = IV1_levels)
IV2 <- gl(n = numlevel_IV2, k = numlevel_IV1, length = num_total, labels = IV2_levels)
Stim <- gl(n = num_Stim, k = numlevel_IV1 * numlevel_IV2, length = num_total, labels = paste0("stim", 1:num_Stim))
Subj <- gl(n = num_Subj, k = numlevel_IV1 * numlevel_IV2 * num_Stim, length = num_total, labels = paste0("P", 1:num_Subj))

expdesign <- tibble(Subject = Subj, Stimuli = Stim, IV2 = IV2, IV1 = IV1)

str(expdesign)
## tibble [2,400 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Subject: Factor w/ 30 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stimuli: Factor w/ 20 levels "stim1","stim2",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ IV2    : Factor w/ 2 levels "intact","scrambled": 1 1 2 2 1 1 2 2 1 1 ...
##  $ IV1    : Factor w/ 2 levels "upright","inverted": 1 2 1 2 1 2 1 2 1 2 ...
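
As a quick sanity check (my addition, not part of the original steps), we can cross-tabulate the factors to confirm that the design is fully crossed, with every Subject contributing 20 trials (one per stimulus) in each of the four conditions:

# optional check: each Subject should have num_Stim trials in every IV1 x IV2
# cell; here we display the 2 x 2 table for participant P1 only
xtabs(~ IV1 + IV2 + Subject, data = expdesign)[, , "P1"]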

Calculate the dependent variable (RT), which consists of three parts: (1) RT from the fixed effects, (2) RT from the random intercepts and (3) RT from the random slopes.

3.1 RT for the Fixed Effects (residuals included)

The intercept and slopes of the fixed effects are calculated from the \(\mu\) of each condition (see the code below).

# RT for fixed factors
modmat.fix <- model.matrix( ~ IV1 * IV2, expdesign)

# calculate the betas (dummy/treatment coding)
intercept = IV1.1_IV2.1  # mean of the baseline cell (upright intact)
int_IV1.2 = IV1.2_IV2.1 - intercept  # effect of IV1 (inverted) at IV2 = intact
slp_IV2.2 = (IV1.1_IV2.2 - intercept) / 1  # effect of IV2 (scrambled) at IV1 = upright; divided by 1 (dummy coding)
interaction = (IV1.2_IV2.2 - IV1.1_IV2.2) - int_IV1.2  # interaction term
coeff_fix <- c(intercept, int_IV1.2, slp_IV2.2, interaction)

# add the residual noise to the fixed-effect part of RT
RT_fix <- rnorm(n = num_total, mean = modmat.fix %*% coeff_fix, sd = var_residual)
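
As a check on the beta calculation (my own addition, not in the original workflow), the noiseless predictions modmat.fix %*% coeff_fix should reproduce the four pre-set \(\mu\)s (500, 600, 800 and 850) in the corresponding cells:

# optional check: the fixed-effect predictions (without residual noise) should
# equal the pre-set condition means
distinct(tibble(IV1 = IV1, IV2 = IV2, mu = as.vector(modmat.fix %*% coeff_fix)))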

3.2 RT for the Random Intercepts

The random intercepts of RT for Subjects and Stimuli are both calculated.

# Random intercepts of RT for every subject or stimulus
rnd_int_subj <- rnorm(num_Subj, 0, var_rnd_int_subj)  # Subject
rnd_int_stim <- rnorm(num_Stim, 0, var_rnd_int_stim)  # Stimulus

# The sum of the random intercepts of Subjects and Stimuli
RT_rnd_int <- rnd_int_subj[as.numeric(Subj)] +  rnd_int_stim[as.numeric(Stim)]
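
A small aside of mine: because only 30 subject intercepts and 20 stimulus intercepts are drawn, their sample standard deviations will not match the pre-set 40 and 50 exactly. This already hints at why, in Section 5, the random-effect estimates deviate more from the pre-set values than the fixed effects do.

# optional check: sample SDs of the simulated random intercepts
sd(rnd_int_subj)  # pre-set SD: 40
sd(rnd_int_stim)  # pre-set SD: 50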

3.3 RT for the Random Slopes

The random-slope part of RT is a bit more complicated. To my knowledge, there are at most six components for the current design: the random slopes of IV1 (inverted), IV2 (scrambled) and their interaction, separately for Subjects and for Stimuli.

# RT for random slopes
rnd_slp_IV1_subj <- rnorm(num_Subj, 0, var_rnd_slp_IV1_subj)
rnd_slp_IV2_subj <- rnorm(num_Subj, 0, var_rnd_slp_IV2_subj)
rnd_slp_inter_subj <- rnorm(num_Subj, 0, var_rnd_slp_inter_subj)

rnd_slp_IV1_stim <- rnorm(num_Stim, 0, var_rnd_slp_IV1_stim)
rnd_slp_IV2_stim <- rnorm(num_Stim, 0, var_rnd_slp_IV2_stim)
rnd_slp_inter_stim <- rnorm(num_Stim, 0, var_rnd_slp_inter_stim)


IV1_2 <- paste0("IV1", IV1_levels[2])  # column name  
IV2_2 <- paste0("IV2", IV2_levels[2])  # column name

IV1_dummy <- modmat.fix[, IV1_2]
IV2_dummy <- modmat.fix[, IV2_2]
inter_dummy <- modmat.fix[, paste(IV1_2, IV2_2, sep = ":")]

RT_rnd_slp <- rnd_slp_IV1_subj[as.numeric(Subj)] * IV1_dummy +
  rnd_slp_IV2_subj[as.numeric(Subj)] * IV2_dummy +
  rnd_slp_inter_subj[as.numeric(Subj)] * inter_dummy + 
  rnd_slp_IV1_stim[as.numeric(Stim)] * IV1_dummy +
  rnd_slp_IV2_stim[as.numeric(Stim)] * IV2_dummy +
  rnd_slp_inter_stim[as.numeric(Stim)] * inter_dummy
# Sum up all the components of RT
simudata <- {
  expdesign %>% 
    data.frame(RT = RT_fix + RT_rnd_int + RT_rnd_slp)
}

str(simudata)
## 'data.frame':    2400 obs. of  5 variables:
##  $ Subject: Factor w/ 30 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stimuli: Factor w/ 20 levels "stim1","stim2",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ IV2    : Factor w/ 2 levels "intact","scrambled": 1 1 2 2 1 1 2 2 1 1 ...
##  $ IV1    : Factor w/ 2 levels "upright","inverted": 1 2 1 2 1 2 1 2 1 2 ...
##  $ RT     : num  642 664 831 815 580 ...
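
Before fitting the model, a quick descriptive check (again my addition) is to compare the observed cell means with the pre-set \(\mu\)s; they should be close to 500, 600, 800 and 850, shifted somewhat by the particular random effects that happened to be drawn:

# optional check: observed mean RT per condition vs. the pre-set mus
simudata %>% 
  group_by(IV1, IV2) %>% 
  summarise(mean_RT = mean(RT))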

4 Fit the LMM

A full (maximal) LMM is fitted:

# lmm (full model)
lmm_fit <- lmer(RT ~ IV1 * IV2 + 
                  (1 + IV1 * IV2|Subject) + 
                  (1 + IV1 * IV2|Stimuli), 
                data = simudata,
                control = lmerControl(optimizer = "bobyqa"))  # use the bobyqa optimizer to aid convergence
summary(lmm_fit)
## Linear mixed model fit by REML ['lmerMod']
## Formula: RT ~ IV1 * IV2 + (1 + IV1 * IV2 | Subject) + (1 + IV1 * IV2 |  
##     Stimuli)
##    Data: simudata
## Control: lmerControl(optimizer = "bobyqa")
## 
## REML criterion at convergence: 23867.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2918 -0.6738  0.0025  0.6693  4.0265 
## 
## Random effects:
##  Groups   Name                     Variance Std.Dev. Corr             
##  Subject  (Intercept)               744.4   27.28                     
##           IV1inverted              4305.6   65.62     0.04            
##           IV2scrambled             4623.0   67.99    -0.04  0.22      
##           IV1inverted:IV2scrambled  477.8   21.86    -0.40 -0.28 -0.35
##  Stimuli  (Intercept)              2145.4   46.32                     
##           IV1inverted              9295.9   96.42    -0.20            
##           IV2scrambled             9012.6   94.93     0.30 -0.13      
##           IV1inverted:IV2scrambled  288.7   16.99    -0.57  0.17  0.32
##  Residual                           894.7   29.91                     
## Number of obs: 2400, groups:  Subject, 30; Stimuli, 20
## 
## Fixed effects:
##                          Estimate Std. Error t value
## (Intercept)               518.522     11.557  44.865
## IV1inverted                84.305     24.724   3.410
## IV2scrambled              302.183     24.652  12.258
## IV1inverted:IV2scrambled  -45.069      6.027  -7.477
## 
## Correlation of Fixed Effects:
##             (Intr) IV1nvr IV2scr
## IV1inverted -0.155              
## IV2scrambld  0.220 -0.043       
## IV1nvrt:IV2 -0.414 -0.016  0.040
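
With a maximal random-effects structure like this, it is also worth checking that the fit is not singular (a diagnostic I would add; it is not part of the original post):

# optional diagnostic: a maximal model can produce a singular fit, especially
# with only 30 subjects and 20 stimuli
isSingular(lmm_fit)  # TRUE would indicate a degenerate random-effects estimate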

5 Compare the Results of LMM and Pre-set Parameters

Comparisons of the fixed effects:

# confidence intervals of fixed effects
if (!file.exists("2019-01-02-simuluate-data-confint_fix.RData")) {
  confint_fix <- as_tibble(confint(lmm_fit, parm = "beta_", oldNames = FALSE))  # this step can take a while
  
  save(confint_fix, file = "2019-01-02-simuluate-data-confint_fix.RData")
} else {
  load("2019-01-02-simuluate-data-confint_fix.RData")
}

# combine results of lmm and parameters together
fix_coef_ci <- cbind(data.frame(fixef(lmm_fit)), confint_fix, assumed.coeff = coeff_fix)

fix_coef_ci
##                          fixef.lmm_fit.     2.5 %    97.5 % assumed.coeff
## (Intercept)                   518.52220 495.49760 541.54681           500
## IV1inverted                    84.30550  35.14353 133.46746           100
## IV2scrambled                  302.18303 253.24465 351.12139           300
## IV1inverted:IV2scrambled      -45.06862 -57.00729 -33.12995           -50

From the data frame above, we can see that the coefficients (betas) fitted with the LMM are similar to those of the assumed population, and the assumed coefficients all lie within the 95% confidence intervals (though it is possible for them to fall outside).

Comparisons of the random effects:

# # confidence intervals of random effects
# confint_rnd <- as.data.frame(confint.merMod(lmm_fit, parm = "theta_", oldNames = FALSE))  # This step takes too long

full_var_df <- tibble(
  grp = c(rep("Stimuli", 4), rep("Subject", 4), "Residual"),
  var1 = c(rep(c("(Intercept)", IV1_2, IV2_2, paste(IV1_2, IV2_2, sep = ":")), 2), NA)
)

rnd_var <- {
  as_tibble(VarCorr(lmm_fit)) %>% 
    filter(is.na(var2)) %>% 
    select(grp, var1, sdcor) %>% 
    rename(sd = sdcor)
}
rnd_var_full <- merge(full_var_df, rnd_var, all = TRUE)

assumed_variance <- c(var_residual,
  var_rnd_int_stim, var_rnd_slp_IV1_stim, var_rnd_slp_inter_stim, var_rnd_slp_IV2_stim, 
  var_rnd_int_subj, var_rnd_slp_IV1_subj, var_rnd_slp_inter_subj, var_rnd_slp_IV2_subj
  )

compare_rnd <- tibble(rnd_var_full, assumed_variance)

compare_rnd
## # A tibble: 9 x 4
##   grp      var1                        sd assumed_variance
##   <chr>    <chr>                    <dbl>            <dbl>
## 1 Residual <NA>                      29.9               30
## 2 Stimuli  (Intercept)               46.3               50
## 3 Stimuli  IV1inverted               96.4               80
## 4 Stimuli  IV1inverted:IV2scrambled  17.0               20
## 5 Stimuli  IV2scrambled              94.9               90
## 6 Subject  (Intercept)               27.3               40
## 7 Subject  IV1inverted               65.6               60
## 8 Subject  IV1inverted:IV2scrambled  21.9               20
## 9 Subject  IV2scrambled              68.0               70

All the standard deviations fitted with the LMM are similar to the pre-set values (labelled assumed_variance in the table, although they are in fact standard deviations). You will probably notice that the differences between the assumed and estimated values are larger for the random effects than for the fixed effects. I think this is because the number of data points available for estimating these variance components is small (here 30 Subjects and 20 Stimuli). It is similar to how the standard deviation estimated from only a few draws of a normal distribution will, in most cases, differ noticeably from the true \(\sigma\). You can try this yourself with sd(rnorm(n, 0, sigma)) in R, as sketched below.
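
For example, a minimal sketch (the seed and number of replications are arbitrary choices of mine):

# a small demo of the point above: with only 20-30 draws, the sample SD can
# differ noticeably from the true sigma
set.seed(1)  # arbitrary seed, purely for illustration
replicate(5, sd(rnorm(20, mean = 0, sd = 50)))  # compare with sigma = 50
replicate(5, sd(rnorm(30, mean = 0, sd = 40)))  # compare with sigma = 40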

Hope you enjoyed this post.