Imputing missing values

We saw how to make estimations by assigning reasonable predictions to each respondent in our survey. However, this approach offers a number of limitations. Firstly, our ability to actually make predictions will depend on whether the model we use is able to work with missing data on the RHS itself. Second, the approach we used does not tell us much about the variance of the prediction.

In this session, our goal is slightly different. We want to fill-in not only one variable but—potentially—a lot of them and what we want is not so much to impute an outcome variable but rather the regressors.

library(ggplot2)
library(reshape2)
library(mice)
Loading required package: lattice

Attaching package: 'mice'
The following objects are masked from 'package:base':

    cbind, rbind
library(VIM)
Loading required package: colorspace
Loading required package: grid
Loading required package: data.table

Attaching package: 'data.table'
The following objects are masked from 'package:reshape2':

    dcast, melt
Registered S3 methods overwritten by 'car':
  method                          from
  influence.merMod                lme4
  cooks.distance.influence.merMod lme4
  dfbeta.influence.merMod         lme4
  dfbetas.influence.merMod        lme4
VIM is ready to use. 
 Since version 4.0.0 the GUI is in its own package VIMGUI.

          Please use the package to use the new (and old) GUI.
Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues

Attaching package: 'VIM'
The following object is masked from 'package:datasets':

    sleep

Most of these methods are computationally intensive, so let’s make our life a bit easier by selecting a smaller version of our dataset.

CLEAN_DATA <- file.path(DATA_DIR, "cis-clean-data.RDS")
dta <- readRDS(CLEAN_DATA)

impdata <- c("interest",
             "gender",
             "age",
             "education",
             "ideology")
dta <- dta[, impdata]

Let’s now take a look at the dataset.

summary(dta)
    interest     gender           age              education   
 Min.   :1.0   man  :7895   age24   :1229   hs          :4851  
 1st Qu.:2.0   woman:8299   age25t34:2116   no educ     : 983  
 Median :3.0                age35t44:3023   possecondary:3579  
 Mean   :2.8                age45t54:3096   primary     :2791  
 3rd Qu.:4.0                age55t64:2635   secondary   :3902  
 Max.   :4.0                age65   :4095   NA's        :  88  
 NA's   :32                                                    
    ideology   
 Min.   : 1    
 1st Qu.: 3    
 Median : 5    
 Mean   : 5    
 3rd Qu.: 6    
 Max.   :10    
 NA's   :2507  

What we see is the common pattern in many surveys. Some people do not report information that obviously exists and some others do not report information that may simply not apply to them (or that they do not know how to report). This observation is important because it points to different types of missing values.

The structure of the missing data

Let’s gain a bit more insight about the missing values by looking at the structure of the missing values.

aggr(dta, prop=c(TRUE, FALSE))

What we see here is that most of the time, people simply do not report ideology and that people either do not report interest for public affairs alone or in combination with ideology. What can we make of that?

A useful tool to learn more about these missing values is to look at conditional plots. Let’s look for instance at the way in which ideology is missing, conditional on interest.

spineMiss(dta[, c("interest", "ideology")], interactive=FALSE)

What do learn from this relation?

Missing data mechanisms

The previous finding poses an interesting question to us. To what extent can we “explain” the missingness in our dataset? It sounds intuitive that depending on the answer to this question we will have different ways of addressing the problem—and even to decide whether it is a problem to begin with. We can think of three different scenarios:

  1. Data is unconditionally missing (MCAR).
  2. Data is conditionally missing on observables (MAR).
  3. Data is conditionally missing on unobservables (MNAR).

Let’s think for a second about what each of situations imply for us, some examples from common surveys, and—even more importantly—to what extent we can decide what situation applies in each case.

Imputation methods

Simple imputation

It is often tempting to impute unconditionally a central statistic of a given variable. For instance, a common recommendation that you will find is to impute the mean of the incomplete variable. Let’s try that out:

imputed_ideology <- ifelse(is.na(dta$ideology),
                           round(mean(dta$ideology, na.rm=TRUE)),
                           dta$ideology)

Notice that we are already making an assumption through our na.rm argument. Let’s now look at the distribution of our imputed variable:

mean(imputed_ideology)
[1] 4.75
mean(dta$ideology, na.rm=TRUE)
[1] 4.7

But let’s look at the variability in the variable:

sd(imputed_ideology)
[1] 1.88
sd(dta$ideology, na.rm=TRUE)
[1] 2.04

An obvious solution to this problem is to use more information:

n <- sum(is.na(dta$ideology))
mu <- mean(dta$ideology, na.rm=TRUE)
sigma <- sd(dta$ideology, na.rm=TRUE)
imputed_ideology[is.na(dta$ideology)] <- round(rnorm(n, mu, sigma))

The variable now looks better in a way but also very strange:

summary(dta$ideology)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1       3       5       5       6      10    2507 
summary(imputed_ideology)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -2.0     3.0     5.0     4.7     6.0    13.0 

We could be tempted to trim those odd values, but let’s think about it for a second. Look at this histogram and think about what happens if we trim things:

hist(imputed_ideology)

Of course this is not an unsolvable problem (Why not use a different distribution?) but points to some of the assumptions that we are implicitly making in the imputation process. In fact, let’s think about another assumption we are making: Why are we relying on the variable itself alone? Is there any other information that we can use to help us come up with better imputations? What are some of the challenges that can appear if we try to use additional variables?

Hint: Just consider this model:

summary(lm(ideology ~ gender + age + education + interest, data=dta))

Call:
lm(formula = ideology ~ gender + age + education + interest, 
    data = dta)

Residuals:
   Min     1Q Median     3Q    Max 
-4.554 -1.398 -0.025  1.197  5.988 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             3.7280     0.0838   44.49  < 2e-16
genderwoman            -0.0182     0.0347   -0.53   0.5995
ageage25t34            -0.0296     0.0799   -0.37   0.7108
ageage35t44             0.0624     0.0752    0.83   0.4066
ageage45t54             0.0973     0.0748    1.30   0.1930
ageage55t64            -0.0104     0.0770   -0.13   0.8928
ageage65                0.5187     0.0783    6.62  3.7e-11
educationno educ       -0.1392     0.0912   -1.53   0.1270
educationpossecondary   0.0179     0.0475    0.38   0.7059
educationprimary        0.0651     0.0600    1.09   0.2777
educationsecondary      0.1277     0.0477    2.68   0.0074
interest                0.2949     0.0192   15.38  < 2e-16

Residual standard error: 2.01 on 13586 degrees of freedom
  (2596 observations deleted due to missingness)
Multiple R-squared:  0.0332,    Adjusted R-squared:  0.0325 
F-statistic: 42.5 on 11 and 13586 DF,  p-value: <2e-16

Multiple imputation

In the previous situation, we accepted the fact that we could only impute one variable at time, working under the assumption that there was only one variable with missing values. However that’s not often the standard situation.

A popular method to deal with multivariate missingness is through chained equations. The intuition is simple. We impute what we can using a sequence of conditional models and we cycle through them until the sequence of imputations converges.

However, we do not want to make only one imputation because we want to be able to say something about the variance that is due to the uncertainty in the model—and to the fact that we have performed imputations to begin with, i.e., due to the fact that we don’t have observed but estimated data. Notice that the idea here is that we will create a number of replicate datasets each of them with different imputations. The goal is then to make complete-case inference in each of the datasets and combine the estimates accounting for the between and within variance across different imputed datasets.

Chained imputations

The common implementation of the chained equations method is very computationally intensive, so let’s start by reducing our dataset.

oo <- sample(1:nrow(dta), 1000)
dta <- dta[oo, ]

We can now do a dry-run to see how things are set up:

M <- mice(data=dta, m=1, maxit=0)
print(M)
Class: mids
Number of multiple imputations:  1 
Imputation methods:
 interest    gender       age education  ideology 
       ""        ""        "" "polyreg"     "pmm" 
PredictorMatrix:
          interest gender age education ideology
interest         0      1   1         1        1
gender           1      0   1         1        1
age              1      1   0         1        1
education        1      1   1         0        1
ideology         1      1   1         1        0

There are two main pieces here. First, the method attribute indicates the method that will be used for each of the conditional models. mice often does a reasonable job a selecting them but we may want to fine-tune them. Notice that this is one of the rare occassions in survey statistics in which runtime should be a factor to keep under consideration.

M$method[names(M$method) == "interest"] <- "pmm"

The other piece is the predictorMatrix object which indicates the variables that will be used in the RHS for each model. More often that not, we will have to tune this matrix iteratively after identifying issues that may affect our ability to computationally run each of the conditional models.

pos <- which(colnames(M$predictorMatrix) == "ideology")
M$predictorMatrix[, pos] <- 0

We now run mice again and create 5 imputations to be selected after running 20 iterations of the full sequence of models. Notice that I am choosing 5 here although commonly now people use over 20 imputations.

imps <- mice(dta,
             m=5,
             maxit=10,
             method=M$method,
             predictorMatrix=M$predictorMatrix) 

 iter imp variable
  1   1  education  ideology
  1   2  education  ideology
  1   3  education  ideology
  1   4  education  ideology
  1   5  education  ideology
  2   1  education  ideology
  2   2  education  ideology
  2   3  education  ideology
  2   4  education  ideology
  2   5  education  ideology
  3   1  education  ideology
  3   2  education  ideology
  3   3  education  ideology
  3   4  education  ideology
  3   5  education  ideology
  4   1  education  ideology
  4   2  education  ideology
  4   3  education  ideology
  4   4  education  ideology
  4   5  education  ideology
  5   1  education  ideology
  5   2  education  ideology
  5   3  education  ideology
  5   4  education  ideology
  5   5  education  ideology
  6   1  education  ideology
  6   2  education  ideology
  6   3  education  ideology
  6   4  education  ideology
  6   5  education  ideology
  7   1  education  ideology
  7   2  education  ideology
  7   3  education  ideology
  7   4  education  ideology
  7   5  education  ideology
  8   1  education  ideology
  8   2  education  ideology
  8   3  education  ideology
  8   4  education  ideology
  8   5  education  ideology
  9   1  education  ideology
  9   2  education  ideology
  9   3  education  ideology
  9   4  education  ideology
  9   5  education  ideology
  10   1  education  ideology
  10   2  education  ideology
  10   3  education  ideology
  10   4  education  ideology
  10   5  education  ideology

Are our imputations any good? For that we have to ensure, at the very least, that the sequence of models has converged to a stationary distribution.

plot(imps)