The population file
All sampling starts with a population that we want to study. Let’s assume for the moment that we have access to a file that lists all the individuals in the population with some auxiliary information. This will be our sampling frame. Think for a second about the resources that we have available that most resemble this file.
frame <- readRDS(file.path(DATADIR, "frame.RDS"))
## Sort
oo <- sample(1:nrow(frame), replace=FALSE)
frame <- frame[oo, ]
The frame represents the Spanish population. We see that, for instance, we have one row per individual:
nrow(frame)
And this is how our data looks like:
head(frame)
Notice that we have very little information. Some information about their place of residence, their gender, age, education, occupation, and a weight factor called pbweight
, in addition to a unique identifier (id
).
Methods with known probabilities
Simple random sampling
Let’s start with the sampling that we all know where everybody has the same probability of being selected. We will start by selecting a sample of 1,000 individuals.
srs <- sample(1:nrow(frame), size=1000)
srs <- frame[srs, ]
We are now interested in seeing how well the sample represents the population where we are not very specific about the meaning of “representation.” We can see whether the gender split in our sample matches that of the population:
prop.table(table(srs$gender))
prop.table(table(frame$gender))
The calculation for the frame took longer (sampling saves costs everywhere!) but the two numbers are not that far apart.
To make the computation we used the Base
functions table
and prop.table
and we relied implicitly on the fact that we didn’t need weights because we were interested in a proportion. Let’s try to use survey
and see its value.
Let’s, first of all, create a weight variable.
srs$weight <- nrow(frame)/1000
In survey
, we always start by declaring a “design” object which captures the sampling design. In our case, it can be declared as:
svysrs <- svydesign(id= ~1, weights= ~weight, data=srs)
svysrs
With this object, we can now compute the same table for proportions:
svytable(~ gender, svysrs, Ntotal=100)
Why the Ntotal
argument? Because without it we would get an estimate of the totals, implied by the weights. This is something we are rarely interested in when doing public opinion (why?) but it is a convenient tool to known. Let’s just check the totals for a moment:
svytable(~ gender, svysrs)
table(frame$gender)
Margin of error
We said that the values we got from the survey were close enough. We should define proximity a little bit better, maybe through a measure of uncertainty:
1.96*SE(svymean(~ gender, svysrs))
Notice here that I am calculating the mean of the variable gender
using a formula and then extracting the standard error with with SE
function.
Let’s build upon this link between sample size and uncertainty. Let’s take several samples out of the frame with sizes ranging between 50 and 10,000:
N <- seq(50, 10000, by=50)
srs <- lapply(N, function(i) frame[sample(1:nrow(frame), size=i), ])
We are going to declare each of the samples as a svydesign
object. I am now omitting the weights because I do not really care about population totals anymore.
svysrs <- lapply(srs, function(i) svydesign(id= ~1, weights= ~1, data=i))
Let’s compute our margin of error around our estimate for the gender split in each of the surveys.
margins <- lapply(svysrs, function(i) 1.96*SE(svymean(~ gender, i)))
margins <- do.call(rbind, margins)
And voila! We have our relation between sample size and MoE:
plot(N, margins[, 1], type='l',
xlab="Sample size",
ylab="Margin of error",
main="Math works!")
Stratified random sampling
Simple random sampling is not too interesting. But more importantly, it carries some issues that we should consider before moving forward. Consider, for instance, our sample of 1,000 in all of Spain. We have been able to estimate gender with some accuracy but we wouldn’t be able to estimate gender for each of the CCAA. Is the problem that we don’t have enough cases?
Imagine that we take a much larger sample size, enough to get around 1,000 cases in each CCAA plus the two Autonomous Cities.
srs <- frame[sample(1:nrow(frame), size=1900), ]
svysrs <- svydesign(id= ~1, weights= ~1, data=srs)
Now let’s estimate the gender split. This is also a good opportunity to learn a new function in survey
:
svyby(~ gender, by=~ ccaa, design=svysrs, FUN=svymean)
These are the numbers in the population:
prop.table(table(frame$ccaa, frame$gender), 1)
The numbers are more or less fine but SE vary wildly. Yes, the problem is that we didn’t use our sample in a better way:
svytable(~ ccaa, svysrs)
A posible solution is to distribute the sample so that we had 1,000 cases in each CCAA (Is this always a good idea?). This is a key idea: we could have allocated the sample in a different way taking advantage of the information we know about the population: the allocation of units to CCAA.
Let’s just do that. Let’s take a 1,000 cases from each CCAA:
stratif <- split(frame, frame$ccaa)
stratif <- lapply(stratif, function(i) i[sample(1:nrow(i), size=100), ])
stratif <- do.call(rbind, stratif)
The survey
package let’s us declare this design following the same idea as before:
svystratif <- svydesign(id= ~1, weights= ~1, data=stratif, strata= ~ccaa)
Let’s now compute the gender split again:
svyby(~ gender, by=~ ccaa, design=svystratif, FUN=svymean)
We can also compute the split at the national level:
svymean(~ gender, design=svystratif)
Cluster sampling and PPS
Using stratification we solved one problem but the solution can become impractical. Imagine that we conduct our interviews in person and that we, for the sake of the argument, insist in having all 19 strata represented. We could very well end up in a situation in which all the individuals are distant from one another. Wouldn’t it be nice to sample individuals that are in close proximity to one another? We can accomplish this by dividing our stratum in “clusters” and selecting a sample of those clusters from which we then select individuals in a multistage design. If we do this, and if our clusters have different sizes, we often want to make sure that the larger clusters have a higher probability of selection (it is easier to see if you think about population totals and the relative contribution of large versus small units). This is what we call “probability proportional to size” sampling (PPS).
Methods with unknown probabilities
Quota methods
We have been relying on the idea that we had access to the population file. Wouldn’t that just be convenient? Think for a second on how the population file would need to look like for an opinion poll. Sometimes we do have a list of all the members of the population or we can proceed as if we had it, but more often than not this is not how polling is conducted. We will see more about why in the next section.
A common workaround is to use an idea that resembles stratified sampling. We will define quotas based on known population values and we will interview people until those quotas are filled. We first start by defining what are the cells that we want to fill out. For instance, age and gender (why these two?).
quotas <- prop.table(table(frame$fage, frame$gender))
quotas <- as.data.frame(quotas); names(quotas) <- c("fage", "gender", "freq")
quotas$n <- round(quotas$freq*1000)
quotas
Let’s now simulate the process of collecting this sample. We are going to make a key simplifying assumption. We are just going to run through the sample after sorting it at random and we will just add a person to our sample if the person doesn’t belong to a full quota. In other words, anyone we select will respond.
oo <- sample(1:nrow(frame))
xframe <- frame[oo, ]
xframe <- xframe[complete.cases(xframe), ]
N <- 1; i <- 1
squota <- list()
while (N < 1001) {
if (i %% 500 == 0) {
print(sprintf("Scanned %s cases. Collected %s interviews", i, N))
}
available <- quotas[quotas$fage %in% xframe$fage[i] &
quotas$gender %in% xframe$gender[i], "n"]
if (available > 0) {
squota[[N]] <- xframe[i, ]
N <- N + 1
quotas[quotas$fage %in% xframe$fage[i] &
quotas$gender %in% xframe$gender[i], "n"] <- available - 1
}
i <- i + 1
}
squota <- do.call(rbind, squota)
We can now declare a design for it. It is not exactly a stratified sample and it is definitely not a simple random sample but we can make an argument that our best guess given the random sorting is to assume that the probability of selection is constant (keep this idea in mind for a second).
svyquota <- svydesign(id= ~1, weights= ~1, data=squota)
With this design, we can now compute the distribution of, for instance, education.
svymean(~ feduc, svyquota)
prop.table(table(frame$feduc))
It doesn’t look bad at all, but let’s take a look again at our key assumption and how we can get around deviation from it in the different designs we have discussed.
Nonresponse
Let’s assume that not everybody cooperates with our interview and, even more realistically, that not all groups, as defined by their sociodemographic characteristics, answer at the same rate.
xb <- -.9 +
1.5*(frame$gender == "male") +
-.045*(frame$age - 18) +
.002*((frame$age - 18)^2) +
-.05*(frame$gender == "male")*(frame$age - 18) +
-.02*(frame$gender == "male")*(frame$age - 18) +
-2.7*(frame$feduc == "primary") +
-1.2*(frame$feduc == "secondary") +
1.4*(frame$feduc == "hs") +
2.1*(frame$feduc == "possecondary")
frame$p <- 1/(1 + exp(-xb))
What happens now? Let’s take another sample that uses those response probabilities.
frame_completes <- frame[complete.cases(frame), ]
srs <- frame_completes[sample(1:nrow(frame_completes),
size=1000,
prob=frame_completes$p), ]
saveRDS(srs, file.path(DATADIR, "weighted-sample.RDS"))
svysrs <- svydesign(id= ~1, weights= ~1, data=srs)
And let’s try now compute the same quantities as before.
svymean(~ gender, svysrs, na.rm=TRUE)
prop.table(table(frame_completes$gender))
Unsurprisingly, the numbers are off. In this case, we have all we need in order to make the necessary adjustments: we know the probability with which each group/individual didn’t answer. Let’s try it:
srs$weight <- 1/srs$p
srs$weight <- srs$weight/mean(srs$weight)
A thing to notice is the distribution of weights:
summary(srs$weight)
We now can use put these weights into a new design object and obtain our estimates again:
svysrs <- svydesign(id= ~1, weights= ~weight, data=srs)
svymean(~ gender, svysrs, na.rm=TRUE)
It works!
This last step, when we used as weights the response probability of individuals felt a bit like cheating. In real applications we don’t have this piece of information and, more important, we cannot estimate it by looking only at the people who did participate. However, when we think about samples extracted from an enumeration with some auxiliary variables we can start making reasonable guesses. Think now about our quota sample. All we see is that some people don’t answer but we cannot collect any attribute from those we selected (beyond guesses). Also, in the probability framework we know the probability with which each candidate was selected, which we can combine with our probability of nonresponse. But what about our quota method?
Think now about a specific type of sample, such as an Internet poll through the concepts that we have just discussed. Think, for instance, about the selection probability in the case of a survey link posted on Twitter. In particular, about the people with zero probability of selection (What are the different steps that make someone have a non-zero probability?) and about how much you know about those with non-zero probability. As a thought experiment, think about the ideal data you would need in order to recover something like the weight we used above.
--- 
title: "The very basics of sampling"
date: "`r format(Sys.time(), '%B %d, %Y')`"
---

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_chunk$set(fig.path = './assets/')
DATADIR <- "./../dta"
```

Let's start by loading the `survey` package. It gives us an easy way
to declare survey designs and to create and operate with weighted
data.

```{r}
library(survey)
```

# The population file

All sampling starts with a population that we want to study. Let's
assume for the moment that we have access to a file that lists all the
individuals in the population with some auxiliary information. This
will be our sampling frame. Think for a second about the resources
that we have available that most resemble this file.

```{r}
frame <- readRDS(file.path(DATADIR, "frame.RDS"))
## Sort
oo <- sample(1:nrow(frame), replace=FALSE)
frame <- frame[oo, ]
```

The frame represents the Spanish population. We see that, for
instance, we have one row per individual:

```{r}
nrow(frame)
```

And this is how our data looks like:

```{r}
head(frame)
```

Notice that we have very little information. Some information about
their place of residence, their gender, age, education, occupation,
and a weight factor called `pbweight`, in addition to a unique
identifier (`id`).

# Methods with known probabilities

## Simple random sampling

Let's start with the sampling that we all know where everybody has the
same probability of being selected. We will start by selecting a
sample of 1,000 individuals.

```{r}
srs <- sample(1:nrow(frame), size=1000)
srs <- frame[srs, ]
```

We are now interested in seeing how well the sample _represents_ the
population where we are not very specific about the meaning of
"representation." We can see whether the gender split in our sample
matches that of the population:

```{r}
prop.table(table(srs$gender))
prop.table(table(frame$gender))
```

The calculation for the frame took longer (sampling saves costs
everywhere!) but the two numbers are _not that far apart._

To make the computation we used the `Base` functions `table` and
`prop.table` and we relied implicitly on the fact that we didn't need
weights because we were interested in a proportion. Let's try to use
`survey` and see its value. 

Let's, first of all, create a weight variable. 

```{r}
srs$weight <- nrow(frame)/1000
```

In `survey`, we always start by declaring a "design" object which
captures the sampling design. In our case, it can be declared as:

```{r}
svysrs <- svydesign(id= ~1, weights= ~weight, data=srs)
svysrs
```

With this object, we can now compute the same table for proportions:

```{r}
svytable(~ gender, svysrs, Ntotal=100)
```

Why the `Ntotal` argument? Because without it we would get an estimate
of the totals, implied by the weights. This is something we are rarely
interested in when doing public opinion (why?) but it is a convenient
tool to known. Let's just check the totals for a moment:

```{r}
svytable(~ gender, svysrs)
table(frame$gender)
```

## Margin of error

We said that the values we got from the survey were _close enough._ We
should define proximity a little bit better, maybe through a measure
of uncertainty: 

```{r}
1.96*SE(svymean(~ gender, svysrs))
```

Notice here that I am calculating the mean of the variable `gender`
using a formula and then extracting the standard error with with `SE`
function.

Let's build upon this link between sample size and uncertainty. Let's
take several samples out of the frame with sizes ranging between 50
and 10,000:

```{r}
N <- seq(50, 10000, by=50)
srs <- lapply(N, function(i) frame[sample(1:nrow(frame), size=i), ])
```

We are going to declare each of the samples as a `svydesign` object. I
am now omitting the weights because I do not really care about
population totals anymore.

```{r}
svysrs <- lapply(srs, function(i) svydesign(id= ~1, weights= ~1, data=i))
```

Let's compute our margin of error around our estimate for the gender
split in each of the surveys.

```{r}
margins <- lapply(svysrs, function(i) 1.96*SE(svymean(~ gender, i)))
margins <- do.call(rbind, margins)
```

And voila! We have our relation between sample size and MoE:

```{r}
plot(N, margins[, 1], type='l',
     xlab="Sample size",
     ylab="Margin of error",
     main="Math works!")
```

## Stratified random sampling

Simple random sampling is not too interesting. But more importantly,
it carries some issues that we should consider before moving forward.
Consider, for instance, our sample of 1,000 in all of Spain. We have
been able to estimate gender with some accuracy but we wouldn't be
able to estimate gender for each of the CCAA. Is the problem that we
don't have enough cases?

Imagine that we take a much larger sample size, enough to get around
1,000 cases in each CCAA plus the two Autonomous Cities.

```{r}
srs <- frame[sample(1:nrow(frame), size=1900), ]
svysrs <- svydesign(id= ~1, weights= ~1, data=srs)
```

Now let's estimate the gender split. This is also a good opportunity
to learn a new function in `survey`:

```{r}
svyby(~ gender, by=~ ccaa, design=svysrs, FUN=svymean)
```

These are the numbers in the population:

```{r}
prop.table(table(frame$ccaa, frame$gender), 1)
```

The numbers are more or less fine but SE vary wildly. Yes, the problem
is that we didn't use our sample in a better way:

```{r}
svytable(~ ccaa, svysrs)
```

A posible solution is to distribute the sample so that we had 1,000
cases in each CCAA (Is this always a good idea?). This is a key idea:
we could have allocated the sample in a different way taking advantage
of the information we *know* about the population: the allocation of
units to CCAA.

Let's just do that. Let's take a 1,000 cases from each CCAA:

```{r}
stratif <- split(frame, frame$ccaa)
stratif <- lapply(stratif, function(i) i[sample(1:nrow(i), size=100), ])
stratif <- do.call(rbind, stratif)
```

The `survey` package let's us declare this design following the same
idea as before:

```{r}
svystratif <- svydesign(id= ~1, weights= ~1, data=stratif, strata= ~ccaa)
```

Let's now compute the gender split again:

```{r}
svyby(~ gender, by=~ ccaa, design=svystratif, FUN=svymean)
```

We can also compute the split at the national level:

```{r}
svymean(~ gender, design=svystratif)
```

## Cluster sampling and PPS

Using stratification we solved one problem but the solution can become
impractical. Imagine that we conduct our interviews in person and that
we, for the sake of the argument, insist in having all 19 strata
represented. We could very well end up in a situation in which all the
individuals are distant from one another. Wouldn't it be nice to
sample individuals that are in close proximity to one another? We can
accomplish this by dividing our stratum in "clusters" and selecting a
sample of those clusters from which we then select individuals in a
_multistage_ design. If we do this, and if our clusters have different
sizes, we often want to make sure that the larger clusters have a
higher probability of selection (it is easier to see if you think
about population totals and the relative contribution of large versus
small units). This is what we call "probability proportional to size"
sampling (PPS).

# Methods with unknown probabilities

## Quota methods

We have been relying on the idea that we had access to the population
file. Wouldn't that just be convenient? Think for a second on how the
population file would need to look like for an opinion poll. Sometimes
we do have a list of all the members of the population or we can
proceed _as if_ we had it, but more often than not this is not how
polling is conducted. We will see more about why in the next section. 

A common workaround is to use an idea that _resembles_ stratified
sampling. We will define quotas based on known population values and
we will interview people until those quotas are filled. We first start
by defining what are the cells that we want to fill out. For instance,
age and gender (why these two?).

```{r}
quotas <- prop.table(table(frame$fage, frame$gender))
quotas <- as.data.frame(quotas); names(quotas) <- c("fage", "gender", "freq")
quotas$n <- round(quotas$freq*1000)
quotas
```

Let's now simulate the process of collecting this sample. We are going
to make a key simplifying assumption. We are just going to run through
the sample after sorting it at random and we will just add a person to
our sample if the person doesn't belong to a full quota. In other
words, anyone we select will respond.

```{r}
oo <- sample(1:nrow(frame))
xframe <- frame[oo, ]
xframe <- xframe[complete.cases(xframe), ]

N <- 1; i <- 1
squota <- list()

while (N < 1001) {
    if (i %% 500 == 0) {
        print(sprintf("Scanned %s cases. Collected %s interviews", i, N))
    }
    available <- quotas[quotas$fage %in% xframe$fage[i] &
                        quotas$gender %in% xframe$gender[i], "n"]
    if (available > 0) {
        squota[[N]] <- xframe[i, ]
        N <- N + 1
        quotas[quotas$fage %in% xframe$fage[i] &
               quotas$gender %in% xframe$gender[i], "n"] <- available - 1
    }
    i <- i + 1
}
squota <- do.call(rbind, squota)
```

We can now declare a design for it. It is not exactly a stratified
sample and it is definitely not a simple random sample but we can make
an argument that our best guess given the random sorting is to assume
that the probability of selection is constant (keep this idea in mind
for a second).

```{r}
svyquota <- svydesign(id= ~1, weights= ~1, data=squota)
```

With this design, we can now compute the distribution of, for
instance, education.

```{r}
svymean(~ feduc, svyquota)
prop.table(table(frame$feduc))
```

It doesn't look bad at all, but let's take a look again at our key
assumption and how we can get around deviation from it in the
different designs we have discussed. 

# Nonresponse

Let's assume that not everybody cooperates with our interview and,
even more realistically, that not all groups, as defined by their
sociodemographic characteristics, answer at the same rate. 

```{r}
xb <- -.9 +
    1.5*(frame$gender == "male") +
    -.045*(frame$age - 18) +
    .002*((frame$age - 18)^2) +
    -.05*(frame$gender == "male")*(frame$age - 18) +
    -.02*(frame$gender == "male")*(frame$age - 18) +
    -2.7*(frame$feduc == "primary") +
    -1.2*(frame$feduc == "secondary") +
    1.4*(frame$feduc == "hs")  +
    2.1*(frame$feduc == "possecondary")

frame$p <- 1/(1 + exp(-xb))
```

What happens now? Let's take another sample that uses those response
probabilities.

```{r}
frame_completes <- frame[complete.cases(frame), ]
srs <- frame_completes[sample(1:nrow(frame_completes),
                              size=1000,
                              prob=frame_completes$p), ]
saveRDS(srs, file.path(DATADIR, "weighted-sample.RDS"))
svysrs <- svydesign(id= ~1, weights= ~1, data=srs)
```

And let's try now compute the same quantities as before.

```{r}
svymean(~ gender, svysrs, na.rm=TRUE)
prop.table(table(frame_completes$gender))
```

Unsurprisingly, the numbers are off. In this case, we have all we need
in order to make the necessary adjustments: we know the probability
with which each group/individual didn't answer. Let's try it:

```{r}
srs$weight <- 1/srs$p
srs$weight <- srs$weight/mean(srs$weight)
```

A thing to notice is the distribution of weights:

```{r}
summary(srs$weight)
```

We now can use put these weights into a new design object and obtain
our estimates again:

```{r}
svysrs <- svydesign(id= ~1, weights= ~weight, data=srs)
svymean(~ gender, svysrs, na.rm=TRUE)
```

It works! 

This last step, when we used as weights the response probability of
individuals felt a bit like cheating. In real applications we don't
have this piece of information and, more important, we cannot estimate
it by looking only at the people who did participate. However, when we
think about samples extracted from an enumeration with some auxiliary
variables we can start making reasonable guesses. Think now about our
quota sample. All we see is that some people don't answer but we
cannot collect any attribute from those we selected (beyond guesses).
Also, in the probability framework we know the probability with which
each candidate was selected, which we can combine with our probability
of nonresponse. But what about our quota method? 

Think now about a specific type of sample, such as an Internet poll
through the concepts that we have just discussed. Think, for instance,
about the selection probability in the case of a survey link posted on
Twitter. In particular, about the people with zero probability of
selection (What are the different steps that make someone have a
non-zero probability?) and about how much you know about those with
non-zero probability. As a thought experiment, think about the ideal
data you would need in order to recover something like the weight we
used above. 


