--- 
title: "Prediction"
date: "`r format(Sys.time(), '%B %d, %Y')`"
---

```{r setup, include=FALSE, cache=FALSE}
DATA_DIR <- "./../dta"
```

A common post-processing task in survey analysis is the "imputation"
of opinions or behaviors to respondents who have not provided that
information. Here I want to differentiate between "imputation," which
I will restrict to an auxiliary task for the estimation of parameters
of interest, and "prediction," which I will use for all tasks that
fill in gaps of direct interest. The distinction is somewhat
arbitrary, but it is useful for our purposes.

Let's start with prediction. In this case, we are going to take a
study from the CIS and we are going to do something fairly
straightforward. We will use all the information we can from the
questionnaire in order to assign a vote choice to all the respondents
in our data. For instance, we can take study 3242 from the CIS (the
pre-electoral study for the April 2019 elections) and fill in
question `P10` about vote intention. In this case, we are going to
assign a party choice to _all_ respondents, i.e., we are going to work
under the assumption of full turnout. In addition, we are not going to
edit the intended behavior of anyone. 

We have two possible strategies ahead of us. One is to follow a
theoretical model of voting behavior that can be used to make
reasonable, theoretically grounded predictions. The other, which we
will follow here, is a flexible, data-driven approach. In this case,
we have to be careful not to overfit
the data.

Our main model will be XGBoost, a gradient-boosting method that is
very common in the machine learning literature. We could use the
`caret` package to test
different approaches (never a bad idea!), but in this case we will go
straight to the `xgboost` package.

```{r}
library(caret)
library(xgboost)
```

First of all, we need to load and clean up the data a little bit. 

```{r}
CLEAN_DATA <- file.path(DATA_DIR, "cis-clean-data.RDS")
dta <- readRDS(CLEAN_DATA)
```

As we said above, we will reallocate those who are undecided and
those who have decided not to vote. We start by marking them as
missing; these are our _unknown_ observations.

```{r}
dta$voteintention[dta$voteintention == "undecided" |
                  dta$voteintention == "wont"] <- NA
dta$voteintention <- as.factor(as.character(dta$voteintention))

unknown <- dta[is.na(dta$voteintention), ]
known <- dta[!is.na(dta$voteintention), ]
```
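
As a quick check of what we just did, we can count how many
respondents fall in each group (this only tabulates the split we just
created):

```{r}
nrow(known)   ## Respondents with a stated vote intention
nrow(unknown) ## Undecided or non-voters, to be filled in
```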

Our goal now is to make inferences about the `unknown` sample using
whatever we can learn about the `known` one. The key concept here is
that of the _bias-variance_ trade-off. We will start by splitting our
`known` dataset into a training and a test set. In this case, we will
set aside 20% of the observations as test. 

```{r}
## 80/20 random split of the known respondents
training <- sample(1:nrow(known), floor(nrow(known) * .8))
testing <- setdiff(1:nrow(known), training)
training <- known[training, ]
testing <- known[testing, ]
```
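
Note that a purely random split like the one above can, by chance,
underrepresent the smaller parties in either set. As an alternative
sketch, `caret::createDataPartition` draws the split within each
level of the outcome (not run here, to keep the split above):

```{r, eval=FALSE}
## Stratified 80/20 split on vote intention
idx <- createDataPartition(known$voteintention, p=.8, list=FALSE)
training <- known[idx, ]
testing <- known[-idx, ]
```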

```{r include=FALSE}
## One-hot (dummy) encoder for all predictors except the outcome
dummies <- dummyVars(~ . - voteintention,
                     dta,
                     missing=na.pass)
nn <- names(as.data.frame(predict(dummies, dta))) ## Correct column order
lev <- levels(dta$voteintention) ## Party labels, in factor order

## Dummy expansion of the training dataset
dv <- predict(dummies, training)
dv <- data.frame(dv)
dv <- dv[, nn[nn %in% names(dv)]] ## Keep the reference column order
dv$id <- NULL ## The respondent id carries no information
```

Now the idea is the following. We are going to take our training set
and fit models of different complexity to it. In the case of XGBoost
there are several hyperparameters that we can tune in order to select
the combination with the best performance. The strategy we will use
here is the simplest one available: we will test different
combinations of hyperparameters on a regular grid and see how each of
them performs. For the sake of illustration, I am going to use a very
small grid:

```{r}
param_grid <- expand.grid(max.depth=c(1, 2, 3),
                          eta=c(.1)) 
``` 
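
In a real application one would typically search over more dimensions
of the grid. As a sketch (not evaluated here), a slightly fuller grid
could also vary the learning rate and the sampling parameters, which
are standard XGBoost knobs:

```{r, eval=FALSE}
param_grid <- expand.grid(max.depth=c(1, 2, 3, 4),
                          eta=c(.05, .1, .3),
                          subsample=c(.8, 1),
                          colsample_bytree=c(.8, 1))
```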

In doing so, there is a risk that our models will learn _too much_
about the training data (Why?). Thus, we will want to _fit_ and
_evaluate_ the model on different datasets. That's where the `xgb.cv`
function will help us. In this case, we will be using 5-fold
cross-validation over the training set (How many cases are we using
in each step?).

```{r}
best <- 1e6       ## Lowest cross-validated error seen so far
best_index <- NA  ## Number of boosting rounds at that error
best_param <- NA  ## Hyperparameter combination at that error

for (i in 1:nrow(param_grid)) {
    msg <- sprintf("Computation with (%s)",
                   paste(param_grid[i, ], collapse=","))
    cat(msg, "\n")
    mod_ <- xgb.cv(data=as.matrix(dv),
                  label=as.matrix(as.numeric(training$voteintention) - 1),
                  nfold=5,
                  params=param_grid[i, ],
                  nthread=4,
                  nrounds=500,
                  early_stopping_rounds=5,
                  objective="multi:softmax",
                  num_class=12)

    merror_mean_index_ <- mod_$best_iteration
    min_merror_ <- mod_$evaluation_log[merror_mean_index_]$test_merror_mean

    if (min_merror_ < best) {
        best <- min_merror_
        best_index <- merror_mean_index_
        best_param <- param_grid[i, ]
    }
}
```
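
Before moving on, we can take a quick look at the objects the loop
filled in. This just prints what we already computed:

```{r}
print(best_param)  ## Winning hyperparameter combination
print(best_index)  ## Boosting rounds at which it peaked
print(best)        ## Its cross-validated misclassification error
```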

Now that we have our best-performing combination of hyperparameters,
we can refit the model with them on the full training set.

```{r}
base <- xgboost(data=as.matrix(dv),
                label=as.matrix(as.numeric(training$voteintention) - 1),
                nthread=4,
                params=best_param,
                nrounds=best_index,
                objective="multi:softmax",
                num_class=12,
                verbose=FALSE)
```
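
Before evaluating, it can be instructive to check which survey items
the model leans on. A minimal sketch using the importance measure
built into `xgboost` (by default, features are ranked by gain):

```{r}
imp <- xgb.importance(model=base)
head(imp, 10) ## The ten most influential dummy-expanded features
```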

From this model we could try to say something about performance and
how well it works. However, that's not a good idea: we would be
grading the model on the very data it was fit to, which is overly
optimistic. Instead, we are going to evaluate the performance of the
model using the test data that we set aside.

```{r include=FALSE}
## Dummy expansion of the test set, using the same encoder and the
## feature order the model was trained on
dv <- predict(dummies, testing)
dv <- data.frame(dv)
dv <- dv[, base$feature_names]
```

The easiest way to make this evaluation is through the confusion
matrix: a comparison between the predictions we made and the known
responses. We can characterize the confusion matrix in different ways,
and `caret::confusionMatrix` offers many of the common statistics.

```{r}
## Map the numeric predictions (0 to 11) back to the party labels;
## fixing the levels avoids errors if some party is never predicted
preds <- factor(predict(base, as.matrix(dv)),
                levels=0:(length(lev) - 1),
                labels=lev)
cm <- confusionMatrix(data=preds, reference=testing$voteintention)
print(cm)
```
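
Finally, we can go back to the task that motivated all this and
assign a party choice to the _unknown_ respondents. A minimal sketch,
reusing the encoder from above; the `predicted_vote` column name is
just illustrative:

```{r}
## Dummy expansion of the unknown respondents, same feature order
dv_unknown <- data.frame(predict(dummies, unknown))
dv_unknown <- dv_unknown[, base$feature_names]

## Numeric predictions (0 to 11) mapped back to party labels
unknown$predicted_vote <- lev[predict(base, as.matrix(dv_unknown)) + 1]
table(unknown$predicted_vote)
```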
