frame <- readRDS(file.path(DATADIR, "frame.RDS"))
## Shuffle the rows of the frame
oo <- sample(1:nrow(frame), replace=FALSE)
frame <- frame[oo, ]
ws <- readRDS(file.path(DATADIR, "weighted-sample.RDS"))
library(survey)
svysrs <- svydesign(id= ~1, weights= ~1, data=ws)
In the previous session we saw stratification as a way of introducing information into our design to ensure that some groups (strata) are represented in a given way. Poststratification follows the same idea but, as the name suggests, it is an adjustment that we make after the sample is collected. It is best thought of as a way of benchmarking our survey against known quantities from the population.
Let’s apply this idea to our sample with nonresponse. Remember that we have information in the frame about the gender and age of individuals. Think of these as our strata.
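Before adjusting anything, it is worth looking at how far the raw sample is from the frame on these variables; a quick sketch for gender (the same check works for fage):
## Raw sample vs. frame: how unbalanced is gender after nonresponse?
prop.table(table(ws$gender))
prop.table(table(frame$gender))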
gender_age <- data.frame(prop.table(table(frame$gender,
                                          frame$fage)))
names(gender_age)[c(1, 2)] <- c("gender", "fage")
## Scale the population shares to expected counts in a sample of this size
gender_age$Freq <- gender_age$Freq * nrow(ws)
What the gender_age object tells us is the share of cases that each gender-by-age cell has in the population. If we want the proportion of cases in each cell of our sample to match that of the population, we can divide one by the other and use the result as a correction factor.
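To see what that correction factor looks like, we can compute it by hand (a quick sketch of the same division, assuming the sample and the frame use the same gender and fage categories):
## Population share over sample share for each gender x age cell
pop_prop <- prop.table(table(frame$gender, frame$fage))
smp_prop <- prop.table(table(ws$gender, ws$fage))
round(pop_prop / smp_prop, 2) ## Values above 1 flag cells underrepresented in the sample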
With our data declared as a design object, we can use the postStratify function to do this for us. Notice that we declare the strata using a formula interface and pass as population a data.frame with the expected counts in the population.
svyps <- postStratify(svysrs,
                      strata=~ gender * fage,
                      population=gender_age)
We can now retrieve the weights from the new object:
head(weights(svyps))
And compare the estimates that we obtain from the poststratified sample against those from the raw sample.
svymean(~ gender, svyps)
svymean(~ gender, svysrs)
More importantly, we can see that not only do the marginals match, the cross-breaks match as well!
svymean(~ interaction(gender, fage), svyps)
svymean(~ interaction(gender, fage), svysrs)
prop.table(table(frame$gender, frame$fage))
Well, at least for the variables that we poststratified on:
svymean(~ feduc, svyps)
prop.table(table(frame$feduc))
Poststratification has one main limitation: namely, the assumption that we know the cross-breaks for the same variables in the population. Sometimes we only have partial information. Maybe we know the marginal distribution of one variable (say, the proportion of men and women) and the marginal distribution of another (say, the proportion of people in each age group), but not the joint distribution (how many men and women belong to each age group). This situation often arises when we have several variables that we want to benchmark. For those cases, we can use a different strategy.
Imagine that in the previous example we have access to the joint distribution of age and gender, but we only have the marginal distribution of education.
educ <- data.frame(prop.table(table(frame$feduc)))
names(educ)[1] <- "feduc"
## Scale the population shares to expected counts in a sample of this size
educ$Freq <- educ$Freq * nrow(ws)
The function rake is very similar to postStratify: we pass the sample margins and the population margins, and it returns a design object that incorporates the adjusted weights.
svyrake <- rake(svysrs,
                sample.margins=list(~ gender * fage, ~ feduc),
                population.margins=list(gender_age, educ))
We can look again at the weights
head(ww <- weights(svyrake))
but notice that, because the imbalance in education was relatively large, we now have a fairly skewed distribution of weights:
quantile(ww)
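What does that spread mean for us in practice? One rough gauge is Kish's approximate design effect, 1 + CV^2 of the weights, which tells us how much the variability of the weights inflates the variance of our estimates; a minimal sketch:
## Kish's approximate design effect due to unequal weighting
deff_kish <- 1 + var(ww) / mean(ww)^2
deff_kish
## Approximate effective sample size after weighting
sum(ww)^2 / sum(ww^2)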
To prevent a few individual units from having too much influence on the estimation, we can trim the weights:
svyrake <- trimWeights(svyrake, upper=5, lower=0)
We can now check the new estimates:
svymean(~ gender, svyrake)
svymean(~ gender, svyps)
prop.table(table(frame$gender))
What about continuous variables? One option is to use a GREG (calibration) estimator, which generalizes both poststratification and raking using an approach similar to a linear model. Calibration is a very flexible approach, but the basic idea is the same as before: adjust the weights so that the sample reproduces known population quantities.
pop_totals <- c(`(Intercept)`=nrow(frame),
                `log(age)`=sum(log(frame$age)),
                `gendermale`=sum(frame$gender == "male"))
svycal <- calibrate(svysrs,
                    formula=~ log(age) + gender,
                    population=pop_totals)
We can now compare some of the estimates:
svymean(~ log(age) + gender, svycal)
svymean(~ log(age) + gender, svyps)
svymean(~ log(age) + gender, svysrs)
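A useful sanity check (a quick sketch): by construction, the calibrated weights should reproduce the population totals we supplied in pop_totals.
## The calibrated design matches the benchmark totals it was given
svytotal(~ log(age) + gender, svycal)
pop_totals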
The methods that we discussed above take the view that the sample was extracted from a population and that some remaining imbalances may need to be addressed. These imbalances can be fixed by benchmarking the sample against known population statistics, so the only question is what type of information is available to the analyst. More recently, the survey literature has taken a different approach, borrowing from causal inference. There are two main methods in this tradition. One tries to address the problem of selection into the sample by estimating the inclusion probabilities, i.e., by creating propensity scores that can then be used as weights. The other, the “superpopulation” approach, is far less common.
The method of pseudo-inclusion probabilities is simple. All we need to do is estimate, for each individual, the probability with which it was selected into the sample. Of course, this is easier said than done. In our case, we do have a reasonable pseudo-frame for the sample, so the first thing we will do is assign a treatment arm to the frame and the sample. Here, we will take the sample as the treatment group.
ws$y <- 1
frame$y <- 0
## Subset the frame to reduce computation time
N <- 250000
frame <- frame[1:N, ]
## Stack the sample on top of the frame for the selection model
combined <- plyr::rbind.fill(ws, frame)
The next step is to estimate the model that assigns cases from the frame to the sample. The process here is identical to that of propensity score matching, for instance, and consists of identifying a model that produces reasonable balance between the covariates in the sample relative to the frame (Why?). Back in the day, one would try many different specifications of a choice model, but it is now more common to rely on machine learning models. For instance, tree-based models are popular because they also have the advantage of creating weighting cells if we need them.
library(partykit)
## Loose control settings so that the tree can grow deep
mod <- ctree(y ~ gender + age + feduc + foccup,
             data=combined,
             control=ctree_control(mincriterion=0.005,
                                   minsplit=0,
                                   minbucket=0))
Is this a good model? What would be a good way of testing whether we should keep searching?
Let’s assume for now that the model is what we want. The remaining steps consist of estimating the predicted probability of selection for each case in the sample and treating its inverse as a weight.
phat <- predict(mod, newdata=ws)
weights <- 1/phat
weights <- weights/mean(weights) ## Rescale: raw inverse probabilities are too large
We can now use those weights and check whether the distribution of a known variable has improved relative to our benchmark.
prop.table(xtabs(weights ~ fage, data=ws))
prop.table(xtabs( ~ fage, data=ws))
prop.table(xtabs( ~ fage, data=frame))
If the goal is to create weights such that some covariates are balanced between treatment and control, we can take the inverse approach: rather than starting with a model to find the weights, we can treat it as an optimization problem and find those weights directly. Seen from that point of view, the problem is very similar to covariate matching. There are many different methods to perform matching, and there is no clear reason to prefer one over the others.
One possibility is to use an entropy-balancing approach.
library(WeightIt)
## ATC is important: it reweights the sample (y = 1) to resemble the frame (y = 0)
out <- weightit(y ~ gender + fage + feduc + foccup, data=combined,
                method="ebal",
                estimand="ATC")
summary(out)
At this point we probably want to perform the same sanity checks as in the pseudo-inclusion approach, to make sure that the model balances correctly on the covariates that we are interested in.
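As a sketch of that check (assuming the cobalt package, which is designed to work with weightit objects, is available), we can ask for a balance table before and after weighting:
library(cobalt)
## Covariate balance: unadjusted vs. entropy-balanced weights
bal.tab(out, un=TRUE)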
Once we are satisfied with the approach, we can follow the same strategy as before:
## Keep the weights for the sample units (ws comes first in combined)
ww <- out$weights[combined$y == 1]
ww <- ww/mean(ww)
prop.table(xtabs(ww ~ fage, data=ws))
prop.table(xtabs( ~ fage, data=frame))
prop.table(xtabs( ~ fage, data=ws))
However, it is important to keep in mind that matching is still a model of sorts and that, therefore, it will only fix the issues that we tell it to fix.
## But careful: we only balanced the margins, not this joint distribution
prop.table(xtabs(ww ~ fage + feduc, data=ws))
prop.table(xtabs( ~ fage + feduc, data=frame))