Uniformity and difference-in-differences

Earlier, I posted on Facebook about some problems with the analysis conducted by Yap and Contreras on 34 time periods sampled from 92,509 possible time periods, representing each and every precinct. In part of that post I discussed Contreras’s interpretation of high R-squared as representing uniform changes, and said that he was half right.

I gave Contreras too much credit here. Someone pointed out that an R-squared very close to 1 doesn’t even mean that the increase will always be close to 46,000, it means that the model explains almost all the variability around the mean of 46,000. A model fit to a tight zigzag can still have an R-squared near 1, which means that the line and the data are still pretty damn close to one another, but doesn’t tell you anything about whether the model systematically overpredicts or underpredicts at any given point. See this article for more information.

In order to directly check the claim of uniformity, here’s what should be done:

STEP 1. Get all 92,509 precincts in the order in which they transmitted.
STEP 2. Create a new set of 92,508 points. Yes, 92,508. Each point will be (difference in vote share after precinct 2 – difference in vote share after precinct 1), (difference in vote share after precinct 3 – difference in vote share after precinct 2), etc. That’s why it’ll be 1 less than 92,509.
STEP 3. Make a plot with the x-axis being 1 to 92,508 and the y-axis being the difference in differences (yeah, that’s confusing, but that’s what it is). Then run a linear regression model.

To reiterate: we are now plotting how much the gap changes every time a new precinct is added, rather than the gap itself.

IF each time a precinct transmits, the vote gaps are increasing uniformly, which would indeed indicate some major bullshit going on and prove Contreras’s point, you should expect the plot in Step 3 to be a flat horizontal line. The R-squared of the regression should be close to exactly 0.5.

IF each time a precinct transmits, the vote gaps are NOT increasing uniformly, which indicates a normal transmission process, you should expect the plot in Step 3 to have no discernible pattern. The R-squared of the regression should be close to 0.

I will demonstrate with some more simulated data, but what I do below is what should be done with the full data if we ever get it. First, let’s say that after we get all points in, it is indeed as Contreras says – every time a precinct transmits, everyone’s vote totals go up by exactly the same amount, which would be extremely suspicious. I’ll use 400 precincts (time periods) for illustrative purposes, and make two plots: one where precincts in order of transmission are on the x-axis and difference in votes are on the y-axis, and one where the difference in differences are on the y-axis instead.

uniformxyplot

uniformfdplot

Then run a regression model on the data that form the second plot:

lm(formula = z_unif ~ index)
            coef.est coef.se
(Intercept) 1.00     0.00   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 0.00, R-Squared = 0.50

The R-squared is exactly 50% as expected. I will now plot the estimated regression line on top of the second plot:

uniformfdplotwithfit

Remember that a regression model tries to fit a line to data as best as it can, so it’s not surprising that given uniform data, the regression line overlaps the first-differenced data entirely.

Now let’s say that the data look a lot more like we would expect from a normal transmission process – the gap is steadily increasing in favor of the leading candidate but we see dips and swings here and there as each precinct transmits. That data would look like this:

zigzagxyplot

And the first-differenced data would now look like this:

zigzagfdplot.png

Here’s a regression model fit to that plot:

lm(formula = z ~ index)
            coef.est coef.se
(Intercept) 1.03     0.66   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 6.55, R-Squared = 0.00

The R-squared is now pretty much 0 (it’ll be more like 0.003 or something, but the model summary rounds to two decimal places).

zigzagfdplotwithfit.png

Now you are trying to fit that volatile-looking zigzag with a single straight line. The model does the best it can, which ends up being a horizontal line drawn through the middle. But look at all the variation that the line doesn’t capture.

To conclude, a very good way to examine the claims of uniformity is to do what I did above on the actual data consisting of 92,509 precincts in order of transmission. You don’t even need to fit regression models, because the R-squared will plummet to near 0 at even the first hint of non-uniformity. You just need to eyeball the plot of difference in differences.

Code follows. Yeah, I know, I should be putting this on Github or something, but I’ll do that later.


install.packages("arm")
library(arm)
x_unif <- c(1:400)
y_unif <- c(101:500)
plot(x_unif, y_unif, type = "l", main = "Simulated data - uniform increase",
xlab = "Precincts in order of transmission",
ylab = "Difference in votes")
z_unif <- rep(NA, 399)
for (i in 1:399) {
z_unif[i] <- y_unif[i + 1] - y_unif[i]
}
index <- c(1:399)
plot(index, z_unif, type = "l", main = "Simulated data - first-differenced - uniform increase",
xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
ylab = "Difference in difference in votes")
display(lm(z_unif ~ index))
abline(lm(z_unif ~ index), col = "red")
x <- c(1:400)
y <- c(runif(10, 100, 110), runif(10, 110, 120), runif(10, 120, 130), runif(10, 140, 150), runif(10, 130, 140), runif(10, 150, 160), runif(10, 160, 170), runif(10, 170, 180), runif(10, 180, 190), runif(10, 190, 200), runif(10, 200, 210), runif(10, 210, 220), runif(10, 220, 230), runif(10, 240, 250), runif(10, 230, 240), runif(10, 250, 260), runif(10, 270, 280), runif(10, 260, 270), runif(10, 280, 290), runif(10, 290, 300), runif(10, 300, 310), runif(10, 320, 330), runif(10, 310, 320), runif(10, 330, 340), runif(10, 340, 350), runif(10, 350, 360), runif(10, 380, 390), runif(10, 370, 380), runif(10, 360, 370), runif(10, 390, 400), runif(10, 400, 410), runif(10, 410, 420), runif(10, 420, 430), runif(10, 450, 460), runif(10, 440, 450), runif(10, 430, 440), runif(10, 460, 470), runif(10, 490, 500), runif(10, 470, 480), runif(10, 480, 490))
plot(x, y, type = "l", main = "Simulated data - zigzaggy",
xlab = "Precincts in order of transmission",
ylab = "Difference in votes")
index <- c(1:399)
z <- rep(NA, 399)
for (i in 1:399) {
z[i] <- y[i + 1] - y[i]
}
plot(index, z, type = "l", main = "Simulated data - first-differenced - non-uniform movement",
xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
ylab = "Difference in difference in votes")
display(lm(z ~ index))
abline(lm(z ~ index), col = "red")

 

Did Leni cheat? We don’t know, but trendlines are insufficient evidence

The goal of the following exercise is to simulate a very simplified election in order to show that the apparent anomaly in which Marcos pulled ahead at first and then Robredo suddenly caught up and overtook him is consistent with a scenario in which no cheating occurred. In other words, there may or may not have been cheating, but the surge in Robredo’s vote totals doesn’t prove or even indicate it.

Suppose there are exactly 2000 election precincts in the Philippines.

Each precinct has 1,000,000 voters.

Further suppose that in 1,000, or exactly half of these precincts, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 55%, Robredo = 35%, Others = 10%

And suppose that in the other 1,000 or other half of these precincts, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 34.7%, Robredo = 55.3%, Others = 10%

As we can see, I’ve constructed simulated data that will give Robredo more votes in the final count. Before you accuse me of bias, you can just flip the numbers to give Marcos more votes if you want.

Finally, suppose we have two scenarios:

SCENARIO 1: Each precinct reports their vote tallies in random order.

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

random_tally

As we can see, Marcos (in red) and Robredo (in yellow) would hew VERY close to each other, as would be expected if the precincts that submitted their vote tallies to the COMELEC server did so randomly.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

random_difference.pngThe difference between Marcos and Robredo votes looks to be a fairly random process as well, trending towards favoring Robredo (because that’s how I simulated the data here) but alternating unpredictably between narrow and wide gaps/peaks and valleys.

If, in the actual Philippine election, the order in which each precinct reported in to COMELEC was completely randomized, we would expect graphs that look like the above. However, we know that the order in which each precinct reported in to COMELEC was not random; instead, certain regions reported earlier than others, so that overall, more precincts in the north reported first with precincts in the middle catching up later, for example. Let’s simulate something like this with our second scenario.

SCENARIO 2: All the precincts that favor Marcos report first, then the precincts that favor Robredo report last.

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:marcosfirst_tally

Now we see Marcos pulling away first, but after 1000 precincts report, Robredo begins closing the gap, and overtakes Marcos after the last few precincts come in.

In these two scenarios, the final count of Marcos vs. Robredo votes is identical, but depending on the order in which precincts reported their tallies, the trend lines will look very different.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:marcosfirst_difference.png

Now it looks like a very unnatural upside-down V! That’s because the gap widens first as all the precincts where Marcos leads report, then the gap narrows as all the precincts where Robredo leads report.

I have demonstrated here that the trendlines in both candidate total vote share and difference in votes as the number of precincts reporting increases, which some have pointed as evidence of cheating, are in fact consistent with a situation where no manipulation occurred while the votes were being reported.

As a statistician worth their salt would say, this is not to claim that no cheating occurred; it is to claim that that upside-down V is not sufficient evidence to claim that cheating did occur. This is analogous to hypothesis testing where we do not say that the alternative hypothesis is false, but rather we say that we failed to reject the null hypothesis.

Now, of course, the Philippines has way more than 2,000 precincts, they all have very different population sizes, and voters’ preferences are much more diverse than the super simplified scenario I depicted here. But we also do know that there are regional voting patterns. There are definitely clusters of precincts that are next to each other that all went for one candidate. And some of these clusters sent their tallies to COMELEC before others did. So given the actual election data and process, it wouldn’t be out of the ordinary to see a scenario like the above.

Some code in the R programming language to replicate the above follows. If you’re not interested in the code, you can stop reading here.

I didn’t set seeds this time, so your plots will look marginally different, but the patterns I pointed out will all hold.

marcos_leads_20p <- rmultinom(1000, 1000000, prob = c(0.55, 0.35, 0.05, 0.03, 0.01, 0.01))
robredo_leads_20.6p <- rmultinom(1000, 1000000, prob = c(0.347, 0.553, 0.04, 0.03, 0.02, 0.01))

combined <- cbind(marcos_leads_20p, robredo_leads_20.6p)
combined_order <- sample(1:2000, 2000)

marcos <- rep(NA, 2000)
robredo <- rep(NA, 2000)
marcos_total <- 0
robredo_total <- 0

for (i in 1:2000) {
marcos_total <- marcos_total + combined[1, combined_order[i]]
marcos[i] <- marcos_total
robredo_total <- robredo_total + combined[2, combined_order[i]]
robredo[i] <- robredo_total
}

plot(1:2000, marcos, col = “red”, type = “l”,
main = “Marcos vs. Robredo Running Tally, Random Precinct Reporting”,
xlab = “Number of Precincts Reported”, ylab = “Total Number of Votes”)
points(1:2000, robredo, col = “yellow”, type = “l”)

marcos_2 <- rep(NA, 2000)
robredo_2 <- rep(NA, 2000)
marcos_total <- 0
robredo_total <- 0
for (j in 1:1000) {
marcos_total <- marcos_total + marcos_leads_20p[1, j]
marcos_2[j] <- marcos_total
robredo_total <- robredo_total + marcos_leads_20p[2, j]
robredo_2[j] <- robredo_total

}
for (k in 1001:2000) {
marcos_total <- marcos_total + robredo_leads_20.6p[1, k – 1000]
marcos_2[k] <- marcos_total
robredo_total <- robredo_total + robredo_leads_20.6p[2, k – 1000]
robredo_2[k] <- robredo_total
}

plot(1:2000, marcos_2, col = “red”, type = “l”,
main = “Marcos vs. Robredo Running Tally, Marcos Precincts Report First”,
xlab = “Number of Precincts Reported”, ylab = “Total Number of Votes”)
points(1:2000, robredo_2, col = “yellow”, type = “l”)

plot(1:2000, marcos – robredo, type = “l”, main = “Marcos minus Robredo Running Tally, Random Precinct Reporting”,
xlab = “Number of Precincts Reported”, ylab = “Vote Difference”)
plot(1:2000, marcos_2 – robredo_2, type = “l”,
main = “Marcos minus Robredo Running Tally, Marcos Precincts Report First”,
xlab = “Number of Precincts Reported”, ylab = “Vote Difference”)