We saw how to make estimates by assigning reasonable predictions to each respondent in our survey. However, this approach has a number of limitations. First, our ability to make predictions at all depends on whether the model we use can itself handle missing data on the RHS. Second, the approach does not tell us much about the variance of the prediction.
In this session, our goal is slightly different. We want to fill in not just one variable but potentially many of them, and what we want to impute is not so much an outcome variable as the regressors.
library(ggplot2)
library(reshape2)
library(mice)
Loading required package: lattice
Attaching package: 'mice'
The following objects are masked from 'package:base':
cbind, rbind
library(VIM)
Loading required package: colorspace
Loading required package: grid
Loading required package: data.table
Attaching package: 'data.table'
The following objects are masked from 'package:reshape2':
dcast, melt
Registered S3 methods overwritten by 'car':
  method                          from
  influence.merMod                lme4
  cooks.distance.influence.merMod lme4
  dfbeta.influence.merMod         lme4
  dfbetas.influence.merMod        lme4
VIM is ready to use.
Since version 4.0.0 the GUI is in its own package VIMGUI.
Please use the package to use the new (and old) GUI.
Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
Most of these methods are computationally intensive, so let’s make our life a bit easier by selecting a smaller version of our dataset.
CLEAN_DATA <- file.path(DATA_DIR, "cis-clean-data.RDS")
dta <- readRDS(CLEAN_DATA)
impdata <- c("interest",
             "gender",
             "age",
             "education",
             "ideology")
dta <- dta[, impdata]
Let’s now take a look at the dataset.
summary(dta)
    interest      gender          age            education
 Min.   :1.0   man  :7895   age24   :1229   hs          :4851
 1st Qu.:2.0   woman:8299   age25t34:2116   no educ     : 983
 Median :3.0                age35t44:3023   possecondary:3579
 Mean   :2.8                age45t54:3096   primary     :2791
 3rd Qu.:4.0                age55t64:2635   secondary   :3902
 Max.   :4.0                age65   :4095   NA's        :  88
 NA's   :32
    ideology
 Min.   : 1
 1st Qu.: 3
 Median : 5
 Mean   : 5
 3rd Qu.: 6
 Max.   :10
 NA's   :2507
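A quick tally makes the per-variable counts of missing values explicit:
## Number of missing values in each column
colSums(is.na(dta))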
What we see is a common pattern in many surveys. Some people do not report information that obviously exists, and others do not report information that may simply not apply to them (or that they do not know how to report). This observation is important because it points to different types of missing values.
Let’s gain a bit more insight by looking at the structure of the missingness:
aggr(dta, prop=c(TRUE, FALSE))
What we see here is that, most of the time, people simply do not report ideology, and that those who do not report interest in public affairs do so either on its own or in combination with ideology. What can we make of that?
A useful tool to learn more about these missing values is to look at conditional plots. Let’s look for instance at the way in which ideology is missing, conditional on interest.
spineMiss(dta[, c("interest", "ideology")], interactive=FALSE)
What do we learn from this relation?
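One way to answer is to put a number on the pattern we just saw: tabulate the share of missing ideology within each level of interest.
## Proportion of missing ideology responses at each level of interest
tapply(is.na(dta$ideology), dta$interest, mean)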
The previous finding poses an interesting question to us. To what extent can we “explain” the missingness in our dataset? It sounds intuitive that, depending on the answer to this question, we will have different ways of addressing the problem, and even of deciding whether it is a problem to begin with. We can think of three different scenarios:
- Missing completely at random (MCAR): the probability that a value is missing does not depend on the data at all, observed or unobserved.
- Missing at random (MAR): the probability that a value is missing depends only on other observed values, as in the relation between interest and ideology above.
- Missing not at random (MNAR): the probability that a value is missing depends on the unobserved value itself.
Let’s think for a second about what each of these situations implies for us, some examples from common surveys, and, even more importantly, to what extent we can decide which situation applies in each case.
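To build intuition about the first two scenarios, here is a small simulation (all names here are ours, purely for illustration). When missingness depends on an observed variable that is correlated with the incomplete one, the naive mean of the observed cases is biased; under MCAR it is not:
set.seed(1)
interest_sim <- sample(1:4, 1000, replace=TRUE)
x <- interest_sim + rnorm(1000)             # x is correlated with interest_sim
mcar <- runif(1000) < 0.25                  # missing completely at random
mar <- runif(1000) < 0.10 * interest_sim    # missingness depends on interest_sim
c(full=mean(x), mcar=mean(x[!mcar]), mar=mean(x[!mar]))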
It is often tempting to impute unconditionally a central statistic of a given variable. For instance, a common recommendation that you will find is to impute the mean of the incomplete variable. Let’s try that out:
## Replace each missing value with the (rounded) mean of the observed values
imputed_ideology <- ifelse(is.na(dta$ideology),
                           round(mean(dta$ideology, na.rm=TRUE)),
                           dta$ideology)
Notice that we are already making an assumption through our na.rm argument. Let’s now look at the distribution of our imputed variable:
mean(imputed_ideology)
[1] 4.75
mean(dta$ideology, na.rm=TRUE)
[1] 4.7
But let’s look at the variability in the variable:
sd(imputed_ideology)
[1] 1.88
sd(dta$ideology, na.rm=TRUE)
[1] 2.04
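The shrinkage is not an accident but a predictable consequence of placing every imputed case at the (rounded) mean: those cases contribute (almost) nothing to the sum of squared deviations. A quick check, up to the rounding:
n_total <- length(dta$ideology)
n_obs <- sum(!is.na(dta$ideology))
## var(imputed) is roughly (n_obs - 1) / (n_total - 1) * var(observed)
sd(dta$ideology, na.rm=TRUE) * sqrt((n_obs - 1) / (n_total - 1))
which matches the 1.88 we just obtained.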
An obvious solution to this problem is to use more information:
## Draw the imputations from a normal with the observed mean and SD
n <- sum(is.na(dta$ideology))
mu <- mean(dta$ideology, na.rm=TRUE)
sigma <- sd(dta$ideology, na.rm=TRUE)
imputed_ideology[is.na(dta$ideology)] <- round(rnorm(n, mu, sigma))
The variable now looks better in a way but also very strange:
summary(dta$ideology)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      1       3       5       5       6      10    2507
summary(imputed_ideology)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -2.0     3.0     5.0     4.7     6.0    13.0
We could be tempted to trim those odd values, but let’s think about it for a second. Look at this histogram and think about what happens if we trim things:
hist(imputed_ideology)
Of course, this is not an unsolvable problem (why not use a different distribution?), but it points to some of the assumptions we are implicitly making in the imputation process. In fact, let’s think about another assumption we are making: why are we relying on the variable itself alone? Is there any other information we can use to come up with better imputations? And what challenges can appear if we try to use additional variables?
Hint: Just consider this model:
summary(lm(ideology ~ gender + age + education + interest, data=dta))
Call:
lm(formula = ideology ~ gender + age + education + interest,
    data = dta)

Residuals:
   Min     1Q Median     3Q    Max
-4.554 -1.398 -0.025  1.197  5.988

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             3.7280     0.0838   44.49  < 2e-16
genderwoman            -0.0182     0.0347   -0.53   0.5995
ageage25t34            -0.0296     0.0799   -0.37   0.7108
ageage35t44             0.0624     0.0752    0.83   0.4066
ageage45t54             0.0973     0.0748    1.30   0.1930
ageage55t64            -0.0104     0.0770   -0.13   0.8928
ageage65                0.5187     0.0783    6.62  3.7e-11
educationno educ       -0.1392     0.0912   -1.53   0.1270
educationpossecondary   0.0179     0.0475    0.38   0.7059
educationprimary        0.0651     0.0600    1.09   0.2777
educationsecondary      0.1277     0.0477    2.68   0.0074
interest                0.2949     0.0192   15.38  < 2e-16

Residual standard error: 2.01 on 13586 degrees of freedom
  (2596 observations deleted due to missingness)
Multiple R-squared: 0.0332,  Adjusted R-squared: 0.0325
F-statistic: 42.5 on 11 and 13586 DF,  p-value: <2e-16
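One answer suggested by the model above is (stochastic) regression imputation: predict the missing values from the covariates and add residual noise so that we do not shrink the variance again. A minimal sketch, which also exposes one of the challenges, namely that rows with missing covariates cannot be predicted:
fit <- lm(ideology ~ gender + age + education + interest, data=dta)
miss <- is.na(dta$ideology)
pred <- predict(fit, newdata=dta[miss, ])   # NA whenever a covariate is missing
reg_imputed <- dta$ideology
reg_imputed[miss] <- pred + rnorm(sum(miss), 0, sigma(fit))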
In the previous situation, we accepted the fact that we could only impute one variable at a time, working under the assumption that there was only one variable with missing values. However, that is often not the situation we face.
A popular method to deal with multivariate missingness is through chained equations. The intuition is simple. We impute what we can using a sequence of conditional models and we cycle through them until the sequence of imputations converges.
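To fix ideas, here is a hand-rolled toy version of one cycle for two incomplete numeric variables x and y in a hypothetical data frame df, with logical masks miss_x and miss_y marking the originally missing cells (assume those cells were initialized with, say, the observed means). mice implements this logic, with better conditional models, for us:
chain_cycle <- function(df, miss_x, miss_y) {
  fx <- lm(x ~ y, data=df)                        # conditional model for x
  df$x[miss_x] <- predict(fx, df[miss_x, , drop=FALSE]) +
    rnorm(sum(miss_x), 0, sigma(fx))              # stochastic redraw of x
  fy <- lm(y ~ x, data=df)                        # conditional model for y
  df$y[miss_y] <- predict(fy, df[miss_y, , drop=FALSE]) +
    rnorm(sum(miss_y), 0, sigma(fy))              # stochastic redraw of y
  df                                              # cycle this until stable
}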
However, we do not want to make only one imputation, because we want to be able to say something about the variance that is due to the uncertainty in the model, and to the fact that we performed imputations to begin with, i.e., that we are working with estimated rather than observed data. The idea is that we will create a number of replicate datasets, each with different imputed values. The goal is then to run the complete-case analysis on each dataset and combine the estimates, accounting for the within- and between-imputation variance.
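For reference, the combining step follows Rubin's rules. A bare-bones version for a single coefficient, assuming we already have the m point estimates and their squared standard errors in hand:
pool_rubin <- function(est, se2) {
  m <- length(est)
  qbar <- mean(est)            # pooled point estimate: the average
  w <- mean(se2)               # average within-imputation variance
  b <- var(est)                # between-imputation variance
  c(estimate=qbar, se=sqrt(w + (1 + 1/m) * b))
}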
The common implementation of the chained equations method is very computationally intensive, so let’s start by reducing our dataset.
oo <- sample(1:nrow(dta), 1000)
dta <- dta[oo, ]
We can now do a dry-run to see how things are set up:
M <- mice(data=dta, m=1, maxit=0)
print(M)
Class: mids
Number of multiple imputations:  1
Imputation methods:
 interest    gender       age education  ideology
       ""        ""        "" "polyreg"     "pmm"
PredictorMatrix:
          interest gender age education ideology
interest         0      1   1         1        1
gender           1      0   1         1        1
age              1      1   0         1        1
education        1      1   1         0        1
ideology         1      1   1         1        0
There are two main pieces here. First, the method attribute indicates the method that will be used for each of the conditional models. mice often does a reasonable job at selecting them, but we may want to fine-tune them. Notice that this is one of the rare occasions in survey statistics in which runtime should be a factor to keep under consideration.
M$method[names(M$method) == "interest"] <- "pmm"
The other piece is the predictorMatrix object, which indicates the variables that will be used on the RHS of each model. More often than not, we will have to tune this matrix iteratively, after identifying issues that may affect our ability to run each of the conditional models.
## Do not use ideology as a predictor in any of the conditional models
pos <- which(colnames(M$predictorMatrix) == "ideology")
M$predictorMatrix[, pos] <- 0
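Before the real run, it is also worth checking whether the dry run flagged any problems (constant variables, collinearity, and so on); mice records them in the loggedEvents element, which is NULL when nothing went wrong:
M$loggedEvents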
We now run mice again and create 5 imputations, each taken after running 10 iterations of the full sequence of models. Notice that I am choosing 5 imputations here, although it is now common to use 20 or more.
imps <- mice(dta,
             m=5,
             maxit=10,
             method=M$method,
             predictorMatrix=M$predictorMatrix)
iter imp variable
1 1 education ideology
1 2 education ideology
1 3 education ideology
1 4 education ideology
1 5 education ideology
2 1 education ideology
2 2 education ideology
2 3 education ideology
2 4 education ideology
2 5 education ideology
3 1 education ideology
3 2 education ideology
3 3 education ideology
3 4 education ideology
3 5 education ideology
4 1 education ideology
4 2 education ideology
4 3 education ideology
4 4 education ideology
4 5 education ideology
5 1 education ideology
5 2 education ideology
5 3 education ideology
5 4 education ideology
5 5 education ideology
6 1 education ideology
6 2 education ideology
6 3 education ideology
6 4 education ideology
6 5 education ideology
7 1 education ideology
7 2 education ideology
7 3 education ideology
7 4 education ideology
7 5 education ideology
8 1 education ideology
8 2 education ideology
8 3 education ideology
8 4 education ideology
8 5 education ideology
9 1 education ideology
9 2 education ideology
9 3 education ideology
9 4 education ideology
9 5 education ideology
10 1 education ideology
10 2 education ideology
10 3 education ideology
10 4 education ideology
10 5 education ideology
Are our imputations any good? For that we have to ensure, at the very least, that the sequence of models has converged to a stationary distribution.
plot(imps)
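Beyond convergence, a quick sanity check compares the distribution of imputed and observed values; mice ships lattice-based helpers for this, for instance:
densityplot(imps)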
Let’s assume for now that we are satisfied with the imputations. What we need to do next is extract a list of imputed datasets from the imputed object with the function complete:
completed <- complete(imps, "all")
What remains now is to estimate a model on the imputed datasets. mice comes with a series of tools that simplify applying the model to each of the datasets and combining the results into a single estimate with the correct variance.
ideol <- with(imps, lm(interest ~ gender + education))
summary(ideol)
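summary applied to the object returned by with prints one fit per imputed dataset. To combine them into a single set of coefficients and standard errors via Rubin's rules, we pool the fits first:
summary(pool(ideol))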
Let’s compare things now with the complete-case analysis:
model <- lm(interest ~ gender + education, data=dta)  # complete-case fit
coefficients(model) - pool(ideol)$pooled$estimate     # difference from the pooled estimates
          (Intercept)           genderwoman      educationno educ
             0.004605             -0.001436             -0.002049
educationpossecondary      educationprimary    educationsecondary
            -0.004582              0.000937             -0.001964
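The point estimates barely move here, but the payoff of multiple imputation typically shows up in the standard errors, which now incorporate the between-imputation variance. Assuming a recent version of mice, where summary(pool(...)) returns a std.error column, we can compare them directly:
cbind(complete_case=summary(model)$coefficients[, "Std. Error"],
      pooled=summary(pool(ideol))$std.error)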