Update: Improved simulations!

My previous post detailed a method to simulate predicted probabilities that each vice-presidential candidate would “win” a Pulse Asia survey should the survey be repeated an infinite number of times. I’ll put the survey results here again for reference:

Comparative-VP_FCA23F37C2544692B0859AA790615323

Peter Julian Cayton, UP Diliman School of Statistics professor and current Ph.D student at Australian National University in Canberra, had two criticisms of my approach which I wholeheartedly agree with:

1.) The approach treats the survey results as fixed, when in fact the survey results have some uncertainty about them. In retrospect, I admit it was strange that I spent half my post talking about the margin of error and then proceeded to completely ignore it in my simulation. It isn’t good to treat the survey results as fixed, because the simulation results are very sensitive to slight changes in the parameters.

2.) The approach assumes that there are no geographic or socio-economic variations in people’s preferences. This is obviously false. I used a simplified assumption so that the model wouldn’t be too complex.

Therefore, I now present two improved simulations of the same survey data. The first simulation will explicitly model the uncertainty around the survey results, while the second will do that AND model geographic variations as well.

Simulation 1: uncertainty around vote shares

First, some code in R:
set.seed(9999)
se <- sqrt(0.5 * 0.5 / 1800)
marcos <- rnorm(1000000, 0.25, se)
robredo <- rnorm(1000000, 0.23, se)
escudero <- rnorm(1000000, 0.23, se)
cayetano <- rnorm(1000000, 0.14, se)
honasan <- rnorm(1000000, 0.06, se)
honasan[honasan < 0] <- 0
trillanes <- rnorm(1000000, 0.05, se)
trillanes[trillanes < 0] <- 0
ewan <- rnorm(1000000, 0.04, se)
ewan[ewan < 0] <- 0
simulation <- matrix(nrow = 7, ncol = 1000000)
for (i in 1:1000000) {
simulation[, i] <- rmultinom(1, 1800, c(marcos[i], robredo[i], escudero[i], cayetano[i], honasan[i], trillanes[i], ewan[i]))
}
table(apply(simulation, MARGIN = 2, FUN = which.max))

This time, instead of using marcos = 0.25, robredo = 0.23, escudero = 0.23, cayetano = 0.14, honasan = 0.06, trillanes = 0.05, and ewan = 0.04, as I did in the previous simulation, I explicitly model the uncertainty around each of these proportions by treating each as a normally distributed random variable with the mean being the Pulse Asia result and the standard deviation being the maximum standard error for a sample survey with 1,800 respondents. In other words: Before simulating which candidate will win the most times, I also simulate each candidate’s vote share. Just to be clear, in my previous post I criticized the use of this standard error as a basis for comparison between candidates, but if we’re only looking at the uncertainty around each individual proportion, this standard error is perfectly fine.

If that sounded like gobbledygook to you, what I’m doing is letting each proportion vary in a reasonable way. For example, Robredo’s Pulse Asia result is 0.23. Instead of fixing that result at 0.23, I recognize that “0.23” has some uncertainty and I allow 0.23 to vary in order to look something like this:

robredouncertainty.png

Thus, Robredo’s vote share can be as low as 19% or as high as 27%, but it’ll still most likely be around 23%.

I do this for all candidates. However, for candidates whose vote shares are very low, modeling their uncertainty in this way will occasionally give us negative values, and there’s no such thing as a negative proportion. So I do truncation: if Trillanes’s simulated vote share ever dips below 0%, I set it to exactly 0%.

I then simulate 1,000,000 draws from a multinomial distribution, like before, only this time, each draw has slightly different parameters because the parameters are themselves simulated. For example, on draw 1, marcos = 0.262, robredo = 0.219, escudero = 0.245, cayetano = 0.149, honasan = 0.047, trillanes = 0.054, and ewan = 0.065. (Due to rounding, these may not add up exactly to 1). On draw 2, marcos = 0.26, robredo = 0.24, escudero = 0.223, cayetano = 0.138, honasan = 0.067, trillanes = 0.058, and ewan = 0.044.

Here are the results:

simtable2.PNG

Marcos still has the edge here, but he now only has a 70% chance of winning, while both Robredo and Escudero have a 15% chance each of winning. (Again, this is if the election were held that day and if the survey is properly representative of the voting population).

Simulation 2: uncertainty around geographic vote shares

Now, let’s discard the assumption that vote shares are the same across the whole country. The Pulse Asia survey results clearly show us results for NCR, “BL” (Balance of Luzon), Visayas and Mindanao. If Pulse Asia’s methodology is anything like SWS, then let’s assume that Pulse Asia surveyed 450 people in each of NCR, the rest of Luzon, Visayas, and Mindanao. This means that the standard deviation used in modeling the uncertainty around geographic vote shares will be larger, because the sample size in each area is smaller.

Then we can do the same thing as above, but this time, we will first do a separate simulation for each region, and then combine them together.

Here is some (inefficient and unnecessarily long – sorry) code that does all this. You can skip to the very end of it if you want, because it’s fairly uninteresting – except for the last line, which I’ll get to.


set.seed(9999)
geog <- data.frame(ncr = c(0.36, 0.12, 0.30, 0.12, 0.04, 0.03, 0.02),
bl = c(0.31, 0.22, 0.23, 0.08, 0.05, 0.05, 0.07),
vis = c(0.17, 0.35, 0.22, 0.13, 0.05, 0.06, 0.03),
min = c(0.18, 0.19, 0.19, 0.29, 0.08, 0.07, 0.02))
rownames(geog) <- c("marcos", "robredo", "escudero", "cayetano", "honasan", "trillanes", "ewan")
se_geog <- sqrt(0.5 * 0.5 / 450)
iter <- 1000000
marcos_ncr <- rnorm(iter, geog$ncr[1], se_geog)
marcos_bl <- rnorm(iter, geog$bl[1], se_geog)
marcos_vis <- rnorm(iter, geog$vis[1], se_geog)
marcos_min <- rnorm(iter, geog$min[1], se_geog)
robredo_ncr <- rnorm(iter, geog$ncr[2], se_geog)
robredo_bl <- rnorm(iter, geog$bl[2], se_geog)
robredo_vis <- rnorm(iter, geog$vis[2], se_geog)
robredo_min <- rnorm(iter, geog$min[2], se_geog)
escudero_ncr <- rnorm(iter, geog$ncr[3], se_geog)
escudero_bl <- rnorm(iter, geog$bl[3], se_geog)
escudero_vis <- rnorm(iter, geog$vis[3], se_geog)
escudero_min <- rnorm(iter, geog$min[3], se_geog)
cayetano_ncr <- rnorm(iter, geog$ncr[4], se_geog)
cayetano_bl <- rnorm(iter, geog$bl[4], se_geog)
cayetano_vis <- rnorm(iter, geog$vis[4], se_geog)
cayetano_min <- rnorm(iter, geog$min[4], se_geog)
honasan_ncr <- rnorm(iter, geog$ncr[5], se_geog)
honasan_bl <- rnorm(iter, geog$bl[5], se_geog)
honasan_vis <- rnorm(iter, geog$vis[5], se_geog)
honasan_min <- rnorm(iter, geog$min[5], se_geog)
trillanes_ncr <- rnorm(iter, geog$ncr[6], se_geog)
trillanes_bl <- rnorm(iter, geog$bl[6], se_geog)
trillanes_vis <- rnorm(iter, geog$vis[6], se_geog)
trillanes_min <- rnorm(iter, geog$min[6], se_geog)
ewan_ncr <- rnorm(iter, geog$ncr[7], se_geog)
ewan_bl <- rnorm(iter, geog$bl[7], se_geog)
ewan_vis <- rnorm(iter, geog$vis[7], se_geog)
ewan_min <- rnorm(iter, geog$min[7], se_geog)
sim_geog_p <- rbind(marcos_ncr, marcos_bl, marcos_vis, marcos_min, robredo_ncr, robredo_bl, robredo_vis, robredo_min,
escudero_ncr, escudero_bl, escudero_vis, escudero_min, cayetano_ncr, cayetano_bl, cayetano_vis, cayetano_min,
honasan_ncr, honasan_bl, honasan_vis, honasan_min, trillanes_ncr, trillanes_bl, trillanes_vis, trillanes_min,
ewan_ncr, ewan_bl, ewan_vis, ewan_min)
sim_geog_p[sim_geog_p < 0] <- 0
sim_ncr <- matrix(nrow = 7, ncol = iter)
sim_bl <- matrix(nrow = 7, ncol = iter)
sim_vis <- matrix(nrow = 7, ncol = iter)
sim_min <- matrix(nrow = 7, ncol = iter)
for (i in 1:iter) {
sim_ncr[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_ncr", i],
sim_geog_p["robredo_ncr", i],
sim_geog_p["escudero_ncr", i],
sim_geog_p["cayetano_ncr", i],
sim_geog_p["honasan_ncr", i],
sim_geog_p["trillanes_ncr", i],
sim_geog_p["ewan_ncr", i]))
sim_bl[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_bl", i],
sim_geog_p["robredo_bl", i],
sim_geog_p["escudero_bl", i],
sim_geog_p["cayetano_bl", i],
sim_geog_p["honasan_bl", i],
sim_geog_p["trillanes_bl", i],
sim_geog_p["ewan_bl", i]))
sim_vis[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_vis", i],
sim_geog_p["robredo_vis", i],
sim_geog_p["escudero_vis", i],
sim_geog_p["cayetano_vis", i],
sim_geog_p["honasan_vis", i],
sim_geog_p["trillanes_vis", i],
sim_geog_p["ewan_vis", i]))
sim_min[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_min", i],
sim_geog_p["robredo_min", i],
sim_geog_p["escudero_min", i],
sim_geog_p["cayetano_min", i],
sim_geog_p["honasan_min", i],
sim_geog_p["trillanes_min", i],
sim_geog_p["ewan_min", i]))
}
sim_geog <- 0.115*sim_ncr + 0.445*sim_bl + 0.208*sim_vis + 0.232*sim_min

In order to combine the results of separate simulations for NCR, the rest of Luzon, Visayas, and Mindanao, one approach would be to just add the four simulations together.

We know, however, that the Philippine voting population is not equally distributed across these four areas. Straight-up adding the four simulations together assumes that each area has exactly 25% of the voting population, and that’s patently false.

Commission on Elections records as of November 2015 show that NCR has 11.50% of voters nationwide, the rest of Luzon has 44.5%, while Visayas has 20.8% and Mindanao has 23.2%.

The last line, therefore, assigns to the variable sim_geog the combination of the four simulations together according to the percentage of the voting population each area has. This is called weighting.

With all this, we’re now ready to simulate! First, let’s look at NCR:

simncr.PNG

Marcos = 88.5%, Escudero = 11.5%. 0% for Robredo and everyone else.

What about the rest of Luzon?

simluzon

Marcos = 93.3%, Robredo = 2.5%, Escudero = 4.2%.

Visayas?

simvisayas.PNG

Robredo wins it big!

Marcos = 0% (well, 0.009%, but that’s pretty much 0%), Robredo = 99.5%, Escudero = 0.5%, Cayetano = 0% (again, 0.0001% is pretty much 0%).

And Mindanao?

simmin.PNG

Cayetano is a rockstar, no doubt due to being Duterte’s running mate.

Marcos = 1%, Robredo = 1.5%, Escudero = 1.5%, Cayetano = 96%

None of this should come as a surprise, considering that you could have guessed these results just by looking at the Pulse Asia table.

But here it comes: what of the combined results across the nation?

simgeog

Marcos = 78.5%, Robredo = 11.5%, Escudero = 10%.

Yeah, looks like Robredo and Cayetano could stand to gain some popularity outside Visayas and Mindanao, respectively.

Remember, though: These results are all obtained via information from a single survey conducted two months before the election. Public opinion can still change – not to mention that, as far as I know, the survey doesn’t cover overseas voters (Fun fact: overseas Filipinos can vote as early as April 9.) Unfortunately, the survey also doesn’t account for bloc voting and vote buying (which usually takes place just a couple of weeks before the elections).

A comment by UP College of Mass Communication’s Dr. Clarissa David puts it best:

“Odds cannot predict things like Binay tanking both debates, or the Kidapawan conflict, or any number of unimaginable events that can happen during an election period. That’s why odds change so dramatically with each round of survey. The odds as calculated assume that nothing out of the ordinary will happen. This is why predicted odds are not a thing in electoral studies. Surveys before the election are never intended to “predict” the winner, they give us a picture of current sentiment.”

I would argue that probabilities give us a more intuitive picture of current sentiment than do difficult-to-interpret proportions and margins of error, but I agree with Dr. David.

I am grateful to Peter for suggesting improvements to the simulations, as well as to Jan Carlo Punongbayan, who conducted follow-up analysis and spurred further discussion based on my original blog post.

(And to Jan Fredrick Cruz for letting me know about JC Punongbayan’s note in the first place.)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s