The goal of the following exercise is to simulate a very simplified election and show that the apparent anomaly, in which Marcos pulled ahead at first and Robredo then caught up and overtook him, is consistent with a scenario in which no cheating occurred. In other words, there may or may not have been cheating, but the surge in Robredo’s vote totals doesn’t prove or even indicate it.

Suppose there are exactly 2,000 election precincts in the Philippines.

Each precinct has 1,000,000 voters.

Further suppose that in 1,000 of these precincts, exactly half, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 55%, Robredo = 35%, Others = 10%

And suppose that in the other 1,000 precincts, the remaining half, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 34.7%, Robredo = 55.3%, Others = 10%

As we can see, I’ve constructed simulated data that will give Robredo more votes in the final count. Before you accuse me of bias, you can just flip the numbers to give Marcos more votes if you want.
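The final totals these assumptions imply can be checked with a few lines of R before running any simulation. This is only a sketch of the arithmetic; the variable names are mine and are not part of the simulation code further down.

```r
# Assumed setup from the text: 2,000 precincts, 1,000,000 voters each,
# split evenly between two preference profiles.
n_precincts_each <- 1000
voters_per_precinct <- 1e6

# Vote shares (Marcos, Robredo) in each half of the precincts
marcos_leaning  <- c(marcos = 0.55,  robredo = 0.35)
robredo_leaning <- c(marcos = 0.347, robredo = 0.553)

# Expected national totals are the sum over both halves
expected_totals <- n_precincts_each * voters_per_precinct *
  (marcos_leaning + robredo_leaning)
expected_totals

# Expected Robredo margin in votes
expected_margin <- unname(expected_totals["robredo"] - expected_totals["marcos"])
expected_margin
```

In expectation Marcos ends with about 897 million votes and Robredo with about 903 million, a margin of roughly 6 million out of 2 billion votes cast, or 0.3 percentage points.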

Finally, suppose we have two scenarios:

**SCENARIO 1: Precincts report their vote tallies in random order.**

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

As we can see, Marcos (in red) and Robredo (in yellow) would hew VERY close to each other, as would be expected if the precincts that submitted their vote tallies to the COMELEC server did so randomly.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

The difference between Marcos and Robredo votes looks to be a fairly random process as well, trending towards favoring Robredo (because that’s how I simulated the data here) but alternating unpredictably between narrow and wide gaps/peaks and valleys.
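To get a rough sense of how far the running difference should wander under random reporting, here is a back-of-the-envelope sketch. It assumes each precinct contributes exactly its expected margin (ignoring the small multinomial noise within a precinct) and uses the standard variance formula for a sum of draws without replacement from a finite population. The variable names are my own.

```r
# Expected per-precinct margin (Marcos minus Robredo) in each group,
# using the shares and precinct size assumed in the text
margin_marcos_group  <- (0.55  - 0.35)  * 1e6   # about +200,000
margin_robredo_group <- (0.347 - 0.553) * 1e6   # about -206,000

N <- 2000
margins <- c(rep(margin_marcos_group, N / 2), rep(margin_robredo_group, N / 2))

mu <- mean(margins)                     # average drift per precinct
sigma <- sqrt(mean((margins - mu)^2))   # population SD of precinct margins

# Mean and SD of the running margin after n precincts drawn without replacement
n <- c(500, 1000, 1500, 2000)
running_mean <- n * mu
running_sd <- sigma * sqrt(n * (N - n) / (N - 1))

data.frame(precincts_reported = n, running_mean, running_sd)
```

Under these assumptions the drift is about 3,000 votes per precinct toward Robredo, while the spread at the halfway point is roughly 4.5 million votes. That is small next to the roughly 900 million votes each candidate holds at that point, which is why the two lines stay so close together.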

If, in the actual Philippine election, the order in which each precinct reported in to COMELEC was completely randomized, we would expect graphs that look like the above. However, we know that the order in which each precinct reported in to COMELEC was not random; instead, certain regions reported earlier than others, so that overall, more precincts in the north reported first with precincts in the middle catching up later, for example. Let’s simulate something like this with our second scenario.

**SCENARIO 2: All the precincts that favor Marcos report first, then the precincts that favor Robredo report last.**

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

Now we see Marcos pulling away first, but after 1,000 precincts report, Robredo begins closing the gap and overtakes Marcos only as the last few precincts come in.

In these two scenarios, the final count of Marcos vs. Robredo votes is identical, but depending on the order in which precincts reported their tallies, the trend lines will look very different.
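The point about ordering can be checked on a toy version of the simulation. The sketch below is not the full code from the end of this post: it assumes only 20 precincts of 1,000 voters each, lumps the other candidates into one category, and fixes a seed purely so the example is repeatable.

```r
set.seed(42)  # hypothetical seed, only so this sketch is repeatable

# Smaller stand-in for the simulation: 20 precincts of 1,000 voters each
marcos_first  <- rmultinom(10, 1000, prob = c(0.55, 0.35, 0.10))
robredo_first <- rmultinom(10, 1000, prob = c(0.347, 0.553, 0.10))
votes <- cbind(marcos_first, robredo_first)   # rows: Marcos, Robredo, Others

random_order  <- sample(ncol(votes))          # Scenario 1 style ordering
blocked_order <- seq_len(ncol(votes))         # Scenario 2 style ordering

# Running Marcos-minus-Robredo gap under each ordering
gap_random  <- cumsum(votes[1, random_order]  - votes[2, random_order])
gap_blocked <- cumsum(votes[1, blocked_order] - votes[2, blocked_order])

# The paths differ, but the last value is the same sum in a different order
tail(gap_random, 1) == tail(gap_blocked, 1)
```

The last line evaluates to `TRUE` for any seed, because both running tallies add up the same set of precinct margins, just in a different order.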

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

Now it looks like a very unnatural upside-down V! That’s because the gap widens first as all the precincts where Marcos leads report, then the gap narrows as all the precincts where Robredo leads report.
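The location of the peak and of the crossover follows directly from the assumed vote shares. Here is a sketch that uses expected margins instead of simulated ones, so it has no randomness; the variable names are mine.

```r
# Expected per-precinct margins (Marcos minus Robredo) under the text's assumptions
margin_marcos_group  <- (0.55  - 0.35)  * 1e6   # about +200,000 per precinct
margin_robredo_group <- (0.347 - 0.553) * 1e6   # about -206,000 per precinct

# Scenario 2 ordering: all Marcos-leaning precincts first, then the rest
expected_gap <- cumsum(c(rep(margin_marcos_group, 1000),
                         rep(margin_robredo_group, 1000)))

peak_at <- which.max(expected_gap)           # precinct count where the lead peaks
peak_size <- max(expected_gap)               # size of that lead in votes
crossover_at <- which(expected_gap < 0)[1]   # first count at which Robredo leads
final_gap <- expected_gap[2000]

c(peak_at = peak_at, peak_size = peak_size,
  crossover_at = crossover_at, final_gap = final_gap)
```

In expectation, the Marcos lead peaks at about 200 million votes once the first 1,000 precincts are in, and Robredo first moves ahead at around precinct 1,971, with only about 30 precincts left to report. The simulated plots will differ slightly because of sampling noise.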

I have demonstrated here that the trendlines in both the candidates’ running vote totals and the difference in votes as the number of precincts reporting increases, which some have pointed to as evidence of cheating, are in fact consistent with a situation where no manipulation occurred while the votes were being reported.

As any statistician worth their salt would say, this is not a claim that no cheating occurred; it is a claim that the upside-down V is not sufficient evidence that cheating did occur. This is analogous to hypothesis testing, where we do not say that the alternative hypothesis is false, but rather that we failed to reject the null hypothesis.

Now, of course, the Philippines has way more than 2,000 precincts, they all have very different population sizes, and voters’ preferences are much more diverse than the super simplified scenario I depicted here. But we also do know that there are regional voting patterns. There are definitely clusters of precincts that are next to each other that all went for one candidate. And some of these clusters sent their tallies to COMELEC before others did. So given the actual election data and process, it wouldn’t be out of the ordinary to see a scenario like the above.
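An in-between case, where Marcos-leaning precincts merely tend to report earlier, can be sketched as follows. Everything specific here is an assumption of mine and not taken from real election data: the 3-to-1 weighting, the seed, and the use of expected margins in place of simulated counts.

```r
set.seed(1)  # hypothetical seed, only so this sketch is repeatable

# Expected per-precinct margins (Marcos minus Robredo), as assumed in the text
margins <- c(rep(200000, 1000), rep(-206000, 1000))

# Hypothetical weights: Marcos-leaning precincts are 3x as likely to be
# drawn early, but nothing forces them all to report first
early_weight <- c(rep(3, 1000), rep(1, 1000))
report_order <- sample(1:2000, 2000, prob = early_weight)

running_gap <- cumsum(margins[report_order])

plot(1:2000, running_gap, type = "l",
     main = "Marcos minus Robredo, Partially Clustered Reporting",
     xlab = "Number of Precincts Reported", ylab = "Vote Difference")
abline(h = 0, lty = 2)  # Robredo leads wherever the line is below zero
```

`sample()` with a `prob` vector and no replacement draws precincts one at a time in proportion to the remaining weights, so the early part of the sequence is dominated by Marcos-leaning precincts without excluding the others. The final value of `running_gap` is the same 6 million vote Robredo lead as in the other scenarios.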

Some code in the R programming language to replicate the above follows. If you’re not interested in the code, you can stop reading here.

I didn’t set seeds this time, so your plots will look marginally different, but the patterns I pointed out will all hold.

marcos_leads_20p <- rmultinom(1000, 1000000, prob = c(0.55, 0.35, 0.05, 0.03, 0.01, 0.01))

robredo_leads_20.6p <- rmultinom(1000, 1000000, prob = c(0.347, 0.553, 0.04, 0.03, 0.02, 0.01))

combined <- cbind(marcos_leads_20p, robredo_leads_20.6p)

combined_order <- sample(1:2000, 2000)

marcos <- rep(NA, 2000)

robredo <- rep(NA, 2000)

marcos_total <- 0

robredo_total <- 0

for (i in 1:2000) {

marcos_total <- marcos_total + combined[1, combined_order[i]]

marcos[i] <- marcos_total

robredo_total <- robredo_total + combined[2, combined_order[i]]

robredo[i] <- robredo_total

}

plot(1:2000, marcos, col = "red", type = "l",

main = "Marcos vs. Robredo Running Tally, Random Precinct Reporting",

xlab = "Number of Precincts Reported", ylab = "Total Number of Votes")

points(1:2000, robredo, col = "yellow", type = "l")

marcos_2 <- rep(NA, 2000)

robredo_2 <- rep(NA, 2000)

marcos_total <- 0

robredo_total <- 0

for (j in 1:1000) {

marcos_total <- marcos_total + marcos_leads_20p[1, j]

marcos_2[j] <- marcos_total

robredo_total <- robredo_total + marcos_leads_20p[2, j]

robredo_2[j] <- robredo_total

}

for (k in 1001:2000) {

marcos_total <- marcos_total + robredo_leads_20.6p[1, k - 1000]

marcos_2[k] <- marcos_total

robredo_total <- robredo_total + robredo_leads_20.6p[2, k - 1000]

robredo_2[k] <- robredo_total

}

plot(1:2000, marcos_2, col = "red", type = "l",

main = "Marcos vs. Robredo Running Tally, Marcos Precincts Report First",

xlab = "Number of Precincts Reported", ylab = "Total Number of Votes")

points(1:2000, robredo_2, col = "yellow", type = "l")

plot(1:2000, marcos - robredo, type = "l",

main = "Marcos minus Robredo Running Tally, Random Precinct Reporting",

xlab = "Number of Precincts Reported", ylab = "Vote Difference")

plot(1:2000, marcos_2 - robredo_2, type = "l",

main = "Marcos minus Robredo Running Tally, Marcos Precincts Report First",

xlab = "Number of Precincts Reported", ylab = "Vote Difference")

Wow! Thanks for the analysis and for this article. But I really do believe there was no massive cheating involved that made Leni’s numbers surge last Tuesday.

Hi. Why don’t you run an analysis on actual available election data? The insights might be better. :-)

Definitely, except actual election data isn’t available in a clean format that I know of, except if you manually copied out numbers from COMELEC/Rappler/GMA-7/etc.

Here’s one that did:

https://www.facebook.com/notes/jan-carlo-punongbayan/weighing-in-on-the-leni-bongbong-data-debate-a-growth-rate-perspective/10154707524170110

Hi. Why don’t you run a simulation on actual election data? :-)

the real question is… Duterte got 15M+ votes and Cayetano only got 5M+ votes. Where did the other 10M+ VP votes go? Obviously not to Leni. The majority of them are DU30/Marcos votes.

You’re forgetting Duterte/Cayetano votes. Cayetano is far behind but he was first place in Eastern Mindanao.

And how sure are you that Duterte votes also means majority are Marcos votes???

I like how you just went ahead and assumed.

“Obviously not to Leni”

As if you had watched over EVERYONE who voted and were absolutely sure they didn’t vote for Leni 🙂

If, for the sake of argument, BBM got all of Du30’s 10M, keep in mind that Leni cornered most, if not all, of the votes of Roxas, who placed 2nd with ~9M. Add to that the votes from those who voted for Poe, Miriam, Binay, and also a fraction of Du30’s. With that, do you think Leni will be far behind?

[…] Arnold Lau, masters program in quantitative methods in the social science at Columbia University: https://asintunado.wordpress.com/2016/05/11/did-leni-cheat-we-dont-know-but-trendlines-are-insuffici… […]
