frame <- readRDS(file.path(DATADIR, "frame.RDS"))
## Sort
oo <- sample(1:nrow(frame), replace=FALSE)
frame <- frame[oo, ]

svysrs <- svydesign(id= ~1, weights= ~1, data=ws)

## Poststratification

In the previous session we saw the concept of stratification as a way introducing some information into our design to ensure that some groups (strata) are represented in a given way. Poststratification follows the same idea but, as the name says, it is an adjustment that we make after the sample is collected. It is best thought of as a way of benchmarking our survey against known quantities from the population.

Let’s apply this idea to our sample with nonresponse. Remember that we have information in the frame about the gender and age of individuals. Think of these as our strata.

gender_age <- data.frame(prop.table(table(frame$gender, frame$fage)))
names(gender_age)[c(1, 2)] <- c("gender", "fage")
gender_age$Freq <- gender_age$Freq * nrow(srs)

What the gender_age object tells us is the proportion of cases in each cell that we have in the population. If we want to make the proportion of cases in each cell in our sample match that of the population we can divide one over the other and use that as a correction factor. With our data declared as a design object we can use the postStratify function. Notice that we declare the strata using a formula interface and we pass as population a data.frame with the expected counts in the population.

svyps <- postStratify(svysrs,
strata=~ gender * fage,
population=gender_age)

We can now retrieve the weights from the new object:

head(weights(svyps))

And compare the calculations that we would obtain from the poststratified sample and the raw sample.

svymean(~ gender, svyps)
svymean(~ gender, svysrs)

More importantly, we can see that not only the marginals match, the cross-breaks also match!

svymean(~ interaction(gender, fage), svyps)
svymean(~ interaction(gender, fage), svysrs)
prop.table(table(frame$gender, frame$fage))

Well, at least for the variables that we poststratified on:

svymean(~ feduc, svyps)
prop.table(table(frame$feduc)) ## Raking Poststratification presents one main limitation. Mainly, the assumption that we know the cross-breaks for the same variables in the population. Sometimes, we only have partial information. Maybe we know the marginal distribution of one variable (say, the proportion of men and women) and the marginal distribution of another variable (say, the proportion of people by age group) but not the joint distribution (how many men and women belong to each age group). This type of situation often appears when we have several variables that we want to benchmark. For those cases, we can use a different strategy. Imagine that in the previous example we have access to the joint distribution of age and gender, but we only have the marginal distribution of education. educ <- data.frame(prop.table(table(frame$feduc)))
names(educ)[1] <- "feduc"
educ$Freq <- educ$Freq * nrow(srs)

The function rake is very similar to postStratify: we pass the sample margins and the population margins and it returns back a design object that incorporates weights.

svyrake <- rake(svysrs,
sample.margins=list(~ gender * fage, ~ feduc),
population.margins=list(gender_age, educ))

We can look again at the weights

head(ww <- weights(svyrake))

but something to notice is that, because the unbalance in education was relatively large, we have now a fairly askew distribution of weights:

quantile(ww)

What does that mean for us? To avoid that some individual units have too much influence on the estimation we can trim the weights:

svyrake <- trimWeights(svyrake, upper=5, lower=0)

We can now check the new estimates

svymean(~ gender, svyrake)
svymean(~ gender, svyps)
frame$y <- 0 ## To reduce computation time N <- 250000 frame <- frame[1:N, ] combined <- plyr:::rbind.fill(ws, frame) The next step is to estimate the model that assigns cases from the frame to the sample. The process here is identical to that of propensity score matching, for instance and consists on identifying a model that produces a reasonable balance between covariates in the sample relative to the frame (Why?). Back in the day, one would try many different specifications of a choice model but it is more common now to rely on machine learning models. For instance, it is common to use tree-based models because they also have the advantage of creating weighting cells if we need them. library(partykit) mod <- ctree(y ~ gender + age + feduc + foccup, data=combined, control = ctree_control(mincriterion=0.005, minsplit=0, minbucket=0)) Is this a good model? What would be a good way of testing whether we should keep searching? Let’s assume for now that the model is what we want. The remaining steps consist on estimating the predicted probabilities for each case in the sample and treat them as weights. phat <- predict(mod, newdata=ws) weights <- 1/phat weights <- weights/mean(weights) ## Too large We can now use those weights and check whether the distribution of a known variable has improved relative to our benchmark. prop.table(xtabs(weights ~ fage, data=ws)) prop.table(xtabs( ~ fage, data=ws)) prop.table(xtabs( ~ fage, data=frame)) ## Calibration using matching If the goal is to create weights such that some covariates are balanced between treatment and control, we can take the inverse approach—rather than starting with a model to find the weights, we can just think of it as an optimization problem and find those weights directly. If we come at it from that point of view, the problem is very similar to that of covariate matching. There are many different methods to perform matching and there is no clear reason to prefer one over the other. One possibility is to use an entropy-based approach. library(WeightIt) ## ATC is important out <- weightit(y ~ gender + fage + feduc + foccup, data=combined, method="ebal", estimand="ATC") summary(out) At this point we probably want to perform the same sanity checks as in the pseudo-inclusion approach, to make sure that the model balances correctly on the covariates that we are interested in. Whenever we are satisfied with the approach, we can then follow the same strategy as before: ww <- out$weights[combined\$y == 1]
ww <- ww/mean(ww)

prop.table(xtabs(ww ~ fage, data=ws))
prop.table(xtabs(       ~ fage, data=frame))
prop.table(xtabs(       ~ fage, data=ws))

However, it is important to keep in mind that matching is still a sort of model and that, therefore, it will only fix the issues that we tell it to fix.

## But careful!
prop.table(xtabs(ww ~ fage + feduc, data=ws))
prop.table(xtabs(   ~ fage + feduc, data=frame))