Double-barreled responses: Congress poll on lowering minimum age of criminal responsibility

A one-question online poll on the website of the Philippine Congress gives me the opportunity to discuss questionnaire design a little bit.

[Screenshot: the Congress website poll on lowering the minimum age of criminal responsibility, showing its question and double-barreled response options]

This primer by Jon Krosnick outlines conventional wisdom on how to design a survey question from decades of research in survey methodology. Among the recommendations are to make response options exhaustive and to avoid leading or loaded word choices that push respondents towards an answer.

The problem with the response options above is that they are what are known as “double-barreled” response options: the question asks for one answer, but each response option actually contains two. You have the option of choosing Yes AND agreeing that the youth should be responsible for their actions and words as early as possible, or choosing No AND agreeing that punishing children violates child rights. The response options thus fail to exhaust the other possible ways of framing the issue. For example, a respondent might think that the youth should be responsible as early as possible but that nine years is too young, or that effective punishment would not necessarily deprive children of the chance to improve their lives.

To be fair, the question laudably tries to present balanced viewpoints on the issue. Allowing respondents to grapple with arguments for and against the proposal should certainly result in more informed responses than if the response options were just a straight-up Yes, No, or Undecided. To avoid the double-barreling discussed above, a better way to ask the question would be to move the arguments from the responses into the question itself, as follows:

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. [RANDOMIZE ORDER OF PARENTHESES] Some argue that (the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes), while others argue that (punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character). Do you favor or oppose this proposal, or are you undecided?

  1. Favor
  2. Oppose
  3. Undecided

By doing so, we still let respondents weigh the policy arguments in their heads, but we no longer railroad them into endorsing a fixed set of arguments just to express whether they favor or oppose the proposal.

I would also point out two changes that are already incorporated into the version above. The first is to remove the phrase about “unduly pampering” children with “impunity from criminal responsibility”. That wording is loaded with negativity and may influence respondents’ answers. Moreover, the argument it makes is effectively the same as arguing that children should take responsibility at an earlier age, so the statement is redundant.

The second is to randomize the order in which the arguments are presented to the respondent. The bracketed instruction in the question above is not meant to be shown to respondents; it is there to instruct whoever is in charge of programming the survey. Upon visiting the website, each respondent would have a 50-50 chance of seeing one of the two following versions:

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. Some argue that the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes, while others argue that punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character. Do you favor or oppose this proposal, or are you undecided?

vs.

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. Some argue that punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character, while others argue that the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes. Do you favor or oppose this proposal, or are you undecided?

There is a possibility that the argument respondents see first, or the one they read most recently, will stick out in their minds more. We do not want this to influence their answers, and randomizing the order of the arguments helps guard against it.
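For illustration, here is a minimal R sketch of how a survey programmer might implement this kind of 50-50 randomization. The object names and the abbreviated question text are placeholders of my own, not part of the actual Congress poll:

# randomly pick which argument a given respondent sees first
arg_responsibility <- "the youth should be responsible for their actions and words as early as possible"
arg_child_rights <- "punishing children in conflict with the law violates child rights"
first <- sample(c("responsibility", "child_rights"), size = 1)  # fair coin flip per respondent
shown <- if (first == "responsibility") c(arg_responsibility, arg_child_rights) else c(arg_child_rights, arg_responsibility)
question <- sprintf("Lawmakers have proposed [...]. Some argue that %s, while others argue that %s. Do you favor or oppose this proposal, or are you undecided?", shown[1], shown[2])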

There are of course broader issues to consider. The poll itself is an interesting exercise, but without any information on who is answering (i.e. who makes up the sample), the poll cannot provide any evidence on Philippine public opinion. Nearly 18,000 respondents as of 10:00 AM on 13 January 2017 may look more impressive than the 1,200 respondents we see from SWS or Pulse Asia surveys, but they may, for example, largely be highly-educated, English-speaking, middle to upper class Filipinos with Internet access who were interested enough in politics to answer a poll on the Congress website – or, in short, unrepresentative of the Filipino population as a whole. Hopefully no lawmaker or pundit will look to this poll to say that 90% of Filipinos oppose the proposal, because this poll does not provide sufficient evidence for that assertion.

Statistically adjusting the results of online polls so that they reflect the population is an active field of research, but the bare minimum required to even start doing so is demographic information on the respondents. Without that, the poll is nothing more than a curiosity.

 


Data collection and accessibility are the core of civic data analytics

Data science and data analytics are red-hot terms nowadays. You can’t get through a page of Google search results without finding some reference to how “sexy” data science is. And everybody wants to be sexy, right?

In the Philippines, a small but enthusiastic community has helped more and more startups mushroom around the business model of providing data science and analytics services to businesses and corporations; Thinking Machines Data Science and DataSeer are two examples.

Of particular interest to me is “civic” data analytics, or analytics as applied to civic problems such as health, infrastructure, agriculture, poverty, education, the environment, and all sorts of other things that are the ambit of government agencies and nonprofits. The international volunteer organization DataKind, with chapters in Singapore, Dublin, Bangalore, the United Kingdom, Washington, DC, and San Francisco, describes its mission as “bringing together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact.”

One homegrown example is this story published very recently in the Manila Bulletin. As the account goes, using data on dengue outbreaks in Dagupan City, Pangasinan, Wilson Chua and his collaborators were able to narrow down the source to specific stagnant pools of water near a couple of elementary schools, and then worked with the barangay (village) captain and the Bureau of Fisheries and Aquatic Resources to implement a targeted solution. No new cases of dengue have been reported since.

In meetups, conferences, training sessions, and press releases, a lot of attention is placed on the use of big data tools such as Hadoop and d3.js, which are used to easily organize massive amounts of possibly unstructured data and produce impressive-looking visualizations, such as this graph, also produced by Wilson Chua, comparing dengue outbreaks across Pangasinan barangays between 2014 and 2016.

[Graph by Wilson Chua: dengue cases across Pangasinan barangays, 2014 vs. 2016]

Or this blog post by Thinking Machines that visualizes 114 years of Philippine disasters. This is in line with data science being “sexy” – not only can you use it to do sexy stuff, you can also make sexy-looking graphics!

I feel, however, that the main takeaways from the dengue article above are about the thoroughly unsexy, fundamental, and undervalued activities that are at the core of data science: data collection and data access.

Before Wilson Chua could analyze the data, the data had to exist in the first place. Someone had to go out there and collect data on individual cases of dengue. According to the article, Wilson sourced his data from the Philippine Integrated Diseases Surveillance and Response team at the Department of Health. Someone also had to think about what sorts of variables to collect; one of the keys to Wilson’s insights was that the PIDSR data included not just the date of occurrence and the barangay but also the patient’s age, from which Chua noticed that more school-aged children were getting dengue than any other age group. That means that during the data collection process, someone had to have recognized that age was a relevant epidemiological covariate; without it, Chua would have been able to do far less.

They were then able to verify that specific pools of water, initially located via Google Maps, were spots where rainfall would accumulate and stagnate without exit points, because a separate person, Nicanor Melecio of the Dagupan City government, had LIDAR (Light Detection and Ranging) maps that were originally created to track flooding. This means that someone had to have recognized that creating LIDAR maps would be useful beyond their original purpose, and someone higher up had to have agreed to fund such a project.

Epidemiology (the study of how diseases spread and can be controlled in populations) is a fairly well-established field, and the dengue problem was fairly well-defined and narrow in scope. Most civic issues are much murkier; people recognize that disasters, poverty, crime, etc. are problems, but it is not as straightforward to drill down to a specific problem that can be solved. Even when a seemingly specific problem can be identified – e.g. how to reduce casualties from flooding in a particular barangay, how to improve the livelihoods of a particular group of people, or how to reduce recidivism rates among prisoners – there is still a wide range of possible approaches to consider. More to the point, it isn’t immediately clear what data needs to be collected in order to approach these problems from a “scientific” perspective.

Depending on the application domain, data collection might be a painstakingly long and slow effort. It will also probably be an expensive effort, and thus one whose expenses need to be justified. And it will all be for naught if we do not pay attention to proper measurement, or “the idea of considering the connection between the data you gather and the underlying object of your study.”

People who want to do data science for social good need to focus on working with agencies and organizations charged with data collection in order to identify the specific problems they want to help solve and the specific kinds of data that the solution needs. If the data hasn’t been collected yet, we need to push for efforts to collect it. If the data has been collected but is incomplete or of low quality, we need to push for efforts to improve it. For example, Matthew Cua’s Skyeye Inc. uses drones that can take aerial camera shots to collect data that can help resolve property disputes and land claims, and the company works with the Department of Agrarian Reform to help settle land reform issues.

The current approach is to focus only on problems for which data is already available. For example, data science startups are currently working with Waze, the traffic app, to use its data to try to come up with solutions to Metro Manila’s traffic problem, which affects millions of Filipinos every day, greatly reduces labor productivity, harms the environment, and makes Metro Manila less “liveable”. But the data scientists working on this specific problem did not choose it for its relative importance. They chose it because the data already existed.

Many social scientists are now interested in mining Twitter data to gauge public sentiment, despite the fact that we have little idea of how representative Filipinos on Twitter are of the Filipino population as a whole. Why, then, are we treating Twitter as a reliable source of public opinion data? Because it’s already there.

The very example that Wilson cites as his inspiration, John Snow’s approach to solving a cholera outbreak, did not involve Snow accessing an API or writing a web-scraping script. It involved going door-to-door, boots on the ground, identifying houses with cholera. For the vast majority of applications, proper data collection does not involve complex mathematical models or whiz-bang software engineering. It is not sexy, but it is fundamental to good data science.

Then there is the question of data access. In the past few years, the Philippine government has made great strides in setting up a web portal that allows public access to some government data. The new administration has announced its intention to continue this program.

Setting aside the question of whether the government data is reliable and measures the needed variables, or whether data on a specific domain even exists in the first place, not all government data is open access. For example, much of the “open data” on the web portal actually just redirects you to the website of the concerned government agency, where they will have summary tables and charts of the data but not the raw data itself. The Philippine Geoportal project combines geospatial data from multiple agencies and allows the visitor to view things such as the location of every hospital and health center in the Philippines on a map, but if the user wants the actual coordinates, they still have to course their request to the Department of Health in writing, which they are not obliged to fulfill. Going back to the dengue article, this quote is telling:

Using his credentials as a technology writer for Manila Bulletin, he wrote the Philippine Integrated Diseases Surveillance and Response team (PIDSR) of the Department of Health, requesting for three years worth of data on Pangasinan.

The DOH acquiesced and sent him back a litany of data on an Excel sheet: 81,000 rows of numbers or around 27,000 rows of data per year.

Wilson had to use his “credentials” to make a request for the data, and the DOH chose whether or not to “acquiesce”. If a less well-off, less well-connected, less prestigious private citizen from Dagupan City – perhaps a concerned elementary school teacher – were to make this request, would the DOH have acquiesced?

The burden should not be on the person making a request for data to somehow show that they have “credentials” or that they are “serious”. It should be just as easy for a street sweeper or a fish vendor to access the data as it is for a PhD or a businessman with decades of experience. The person should not even have to make a request. This data should have already been out there. The only data that should be locked behind requests for access are data that contain information that could directly identify individual people, and data that might compromise national security.

In a sense, the article is not merely a celebration of Wilson’s achievements, but a celebration of the good fortune that the DOH considered Wilson credible enough.

We are woefully lacking quality data on all manner of social problems in the Philippines. If data science in the Philippines is to advance, the community cannot merely sit back and hope that some unsexy, underpaid bureaucrats in government agencies or academics in research firms will be insightful enough to collect some good data, committed enough to justify the collection at a budget hearing or to multilateral funding organizations, and considerate enough to make this data as open to all as can be. All things considered, these bureaucrats and academics are data scientists too. They are part of what makes civic data science possible and they deserve the community’s support and advocacy.

 

M.A. Quantitative Methods in the Social Sciences at Columbia University: Things you should know

I am, as of right now (Fall 2016), in my third and final semester of a master’s degree in Quantitative Methods in the Social Sciences (QMSS) at Columbia University in New York City. There isn’t a whole lot of information about this program on the Internet, so to those who are giving this program some consideration, particularly students from outside the United States, here is some helpful information.

How long is QMSS?

One year, with the option to extend one more semester if you wish. Very rarely, some people will extend by two semesters – these are usually people on externally-funded scholarships that have a set timeframe.

Is it expensive?

Yep. This academic year, a full-time semester costs $28,780 for up to 20 units. (One full-time class is either 3 or 4 units, and the number of units isn’t necessarily a function of workload. QMSS classes are 4 units.) It’s expensive, though cheaper than other quantitative programs such as Mathematics of Finance or Statistics.

If you choose to extend your stay, the additional semester costs $10,944, with the restriction that only up to two non-QMSS classes can be taken that semester.

Scholarships?
Unfortunately, hardly any from within Columbia. There might be some merit-based scholarships that’ll give you like 5%. International students are best off seeking external funding. In general, master’s programs that aren’t in public policy tend to be very stingy with scholarships; they need your money to subsidize the Ph.Ds.

So I can take classes outside the program?
Hell yeah. The only required classes are “Theory and Methodology”, a survey class of approaches to quantitative social science that’s geared towards people who’ve never seen social science research before and is pretty boring and old hat otherwise; the “Research Seminar”, a two-semester sequence where you come in late at night to listen to an academic or practitioner give a talk (which may or may not be interesting), partly meant to expose you to some of the latest research and practice; and your master’s thesis. Everyone has to do a thesis. There’s no getting around it.

By the way, Statistics students don’t have a thesis. If theses are your thing (e.g. if you really want to go on to a Ph.D), go for QMSS.

Aside from those classes, you can take any classes you want outside the program as long as a certain number of them are quantitative. If, for example, you find QMSS’s course offerings too easy, or you really have a particular substantive topic you’re looking to learn more about, you can take classes in the Stats, CS, Engineering, Econ, PoliSci, Sociology, and whatever other departments exist at Columbia that’ll let you in. And most classes will let you in provided there’s enough room. It’s only the super-duper high demand ones such as Data Science that won’t let you in… except if you’re in QMSS’s Data Science track.

Track?

QMSS students are required to pick one of three tracks. The “Economics” track is exactly what it says on the tin; it used to function as basically an MA in Econ when Columbia didn’t have their MA in Econ yet (it was just launched last year). The “Data Science” track lets you into classes offered by the Institute for Data Science and Engineering such as Algorithms for Data Science, Machine Learning for Data Science, etc. And the “Traditional” track is where everybody else who isn’t specializing in the former two tracks goes.

(There’s also a fourth track, “Experiments”, but to the best of my knowledge practically no one has ever taken that track and it might not even functionally exist anymore.)

I suck at math – what do I do?

QMSS was specifically designed for people with a limited background in quantitative methods. I didn’t even know what a regression was before I entered the program – my own background was in development studies and history. Now I wouldn’t call myself an expert on anything, but I’m more comfortable with this stuff and know where to look to go deeper.

You do need to kinda not suck at math, but if you can nail the Graduate Record Examinations, you’re good. (Not the math-specific GRE, just the general quantitative portion.) I would even say that you can graduate from the program without knowing calculus and linear algebra if you just take all the applied data analysis classes, though how much you actually understand would be a valid question.

How big is the program, and what are the people in it like?

Each cohort ranges from about 60 to 75 students, including part-time students. About half are from East Asia and the other half are from everywhere else, though mostly North America. Backgrounds range from idiots like myself to people who are already data scientists of some sort.

For Chinese people in particular, especially those who are coming straight from undergrad in China, if you’re also looking for a program with a good mix of Mandarin and English speakers, QMSS is for you. By comparison, Columbia’s Stats program is around 90% Chinese.

QMSS has an associated student-led organization, Society for Quantitative Approaches to Social Research (QASR), that organizes social events, alumni networking sessions, outreach efforts, and the like.

[Photo: Thanksgiving dinner]
[Photo: QMSS students at Google’s offices]

So what is in QMSS?

QMSS classes focus on applied quantitative social science. There’s a three-course sequence taught by sociologist and program director Gregory Eirich that starts with applied regression analysis and moves through linear models, generalized linear models, causal inference methods, text mining, methods for longitudinal data, and time series processes. Those classes are light on the technical details; they focus on understanding the structure and assumptions of various models and techniques at a high level and on putting them to use with data via the open-source R programming language. This is in contrast with an econometrics class, which would cover the same material with much more theoretical grounding and mathematical proof and much less actual data analysis.

There’s a two-course sequence taught by geographical sociologist Jeremy Porter on geographic information systems & spatial analysis that essentially teaches you how to work with data with some sort of location information attached. It uses the open-source software Quantum GIS and the R programming language.

Other electives include Social Network Analysis, also taught by Eirich, and Data Visualization, which is either a really good or a really crappy class depending on who’s teaching it. If you’re familiar with the Harry Potter series, it’s like Defence Against the Dark Arts – the professor changes every year for some reason or another, and the syllabus completely changes every year in step.

There are also two classes taught by political scientist Benjamin Goodrich. In the fall, Data Mining for Social Science is an introduction to techniques such as tree-based models, neural networks, principal components analysis, etc., that are commonly used when the goal is to predict new data. It’s essentially an intro to machine learning class that’s decent preparation for more advanced classes, and it doubles as an actual introduction to R in the first couple of weeks. The other classes use R, but this class actually teaches it from the ground up at an accelerated pace.

In the spring, Bayesian Statistics for the Social Sciences is an introduction to Bayesian modelling, a different way of approaching statistical problems that puts heavy emphasis on probability distributions. It’s the most math-heavy class in the program and also doubles as a shameless plug for the statistical modelling language Stan, which is actively developed at Columbia by a team that includes Goodrich and which allows the user to specify flexible models for particular situations instead of relying on canned packages. Columbia is also the home of Andrew Gelman, one of the foremost Bayesian statisticians out there, and his influence looms large.

If none of those sounded exciting to you, remember that you don’t even have to take any of those classes if you don’t want to. Already a crackerjack who wants to go full-on into machine learning? Head over to Stats/CS/Engineering. Want to go into finance? Columbia Business School classes can be hard to get into but it’s doable. Not satisfied with the fairly high-level approach of Eirich’s classes and want to go really deep into the weeds? Either the Econ or PoliSci departments are for you.

There’s also one last thing. If you wish, QMSS can match you with professors from across the university who are looking for research assistants, which is a great opportunity to get your feet wet with social science research outside of a classroom setting.

I don’t want a Ph.D – I just want a better job or a change of career.

QMSS is perfect for that as well – while initially designed as a Ph.D preparation program, most people opt to go into industry for at least a while after graduation, and the flexible nature of the program allows for diverse interests. Data science and analytics are the flavor of the year, but policy research, tech, consulting and finance aren’t far behind. And for those last two industries in particular, there’s hardly a better place than New York City.

QMSS also qualifies as a Science, Technology, Engineering and Mathematics (STEM) program, which for international students who want to work in the United States means that they would qualify for special preferences given to STEM majors.

How easy would it be for me to find a job?

Jobs are everywhere for people skilled in quantitative methods. Many students in the program aren’t even particularly interested in the “social science” part. The barriers to entry are much higher for international students, however. Columbia University’s career office is a great resource, with lots of events such as job fairs, industry talks, and interview prep sessions. This being New York, though, their offerings are heavily skewed towards consulting, finance and tech. The good news is that larger firms are generally also more open to international hires.

Part of the reason some people stay for a third semester is to give themselves additional breathing room in the job hunt – instead of graduating in May and having to have a full-time job within three months or risk losing their visa status, they can graduate in December and hopefully score a summer internship beforehand.

Why shouldn’t I just major in Statistics?

Sure, you could. In fact, if your goal is to get a Ph.D in Statistics, or to get a top-level engineering job in some data science firm, then you would need much more quantitative stuff than the average QMSS student learns, although you could pursue those same classes within QMSS if you wanted to due to its flexibility.

Regarding Columbia specifically, the primary advantages of QMSS over the Statistics program are that it’s slightly cheaper and has 60 to 75 students as opposed to Stats’ 200+, meaning that it’s easier to raise concerns and get attention from the program administration.

If your goal is a Ph.D in a social science, QMSS is excellent preparation. Many social science Ph.Ds are actually somewhat behind when it comes to quantitative methods despite American social science’s heavy focus on it.

Why would QMSS not be for me?

The cost is a huge factor, honestly. Tuition aside, cost of living in New York City can be ridiculous. Food, transportation, and that kind of stuff isn’t very expensive, but rent…

Barring that, it does say “quantitative” in the title, which represents a particular approach to social science that isn’t by any means all-encompassing. If you’re interested in being the kind of researcher who can do detailed case studies, thick descriptions, in-depth interviews, etc., then the program probably isn’t for you. You might even encounter stuck-up people who will scoff at you for your ‘inferior’ methods, though it’s not as prevalent as it seems. Quantitative methods are generally useful for every social scientist to familiarize themselves with, but a program devoted to them isn’t for everybody.

If your bachelor’s degree was a fairly quant-heavy social science program, QMSS will be redundant.

Finally, if you intend to use the program’s flexibility to take courses in things like data science, it isn’t going to be immediately obvious to employers that you’ve taken those courses if they see QMSS on your resume. Your chosen track isn’t reflected on your transcript or diploma. So if you really do want to go into something more specific like stats/data science, you may want to consider going for the degree that actually says stats/data science. (Or you could self-study, build a portfolio, etc.)

On a lighter note, “Quantitative Methods in the Social Sciences” is also way too long a program name. Imagine telling someone that in an elevator.

 

 

Uniformity and difference-in-differences

Earlier, I posted on Facebook about some problems with the analysis conducted by Yap and Contreras on 34 time periods sampled from 92,509 possible time periods, representing each and every precinct. In part of that post I discussed Contreras’s interpretation of high R-squared as representing uniform changes, and said that he was half right.

I gave Contreras too much credit here. Someone pointed out that an R-squared very close to 1 doesn’t even mean that the increase will always be close to 46,000; it means that the model explains almost all of the variability around the mean of 46,000. A model fit to a tight zigzag can still have an R-squared near 1, which means that the line and the data are still pretty damn close to one another but tells you nothing about whether the model systematically overpredicts or underpredicts at any given point. See this article for more information.
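As a quick illustration (a toy example of my own, not taken from the Yap and Contreras analysis), a steadily increasing series with visible zigzags still produces an R-squared very close to 1:

# simulated running totals: a steady increase plus zigzaggy noise
set.seed(1)
x <- 1:400
y <- 100 * x + rnorm(400, sd = 500)
summary(lm(y ~ x))$r.squared  # about 0.998, despite the wiggles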

In order to directly check the claim of uniformity, here’s what should be done:

STEP 1. Get all 92,509 precincts in the order in which they transmitted.
STEP 2. Create a new set of 92,508 points. Yes, 92,508. Each point will be (difference in vote share after precinct 2 – difference in vote share after precinct 1), (difference in vote share after precinct 3 – difference in vote share after precinct 2), etc. That’s why it’ll be 1 less than 92,509.
STEP 3. Make a plot with the x-axis being 1 to 92,508 and the y-axis being the difference in differences (yeah, that’s confusing, but that’s what it is). Then run a linear regression model.

To reiterate: we are now plotting how much the gap changes every time a new precinct is added, rather than the gap itself.
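As a minimal sketch in R – assuming a hypothetical vector gap that holds the running vote gap for all 92,509 precincts in order of transmission, data we do not actually have – the whole procedure would look something like this:

# gap: running vote gap in transmission order (hypothetical; length 92,509)
dd <- diff(gap)       # Step 2: the 92,508 consecutive changes in the gap
idx <- seq_along(dd)
plot(idx, dd, type = "l",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in difference in votes")  # Step 3: plot the changes
summary(lm(dd ~ idx))$r.squared                   # Step 3: regression R-squared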

IF, each time a precinct transmits, the vote gaps are increasing uniformly – which would indeed indicate some major bullshit going on and prove Contreras’s point – you should expect the plot in Step 3 to be a flat horizontal line. The R-squared of the regression should come out at almost exactly 0.5.

IF, each time a precinct transmits, the vote gaps are NOT increasing uniformly – which indicates a normal transmission process – you should expect the plot in Step 3 to have no discernible pattern. The R-squared of the regression should be close to 0.

I will demonstrate with some more simulated data, but what I do below is what should be done with the full data if we ever get it. First, let’s say that after we get all the points in, it is indeed as Contreras says – every time a precinct transmits, everyone’s vote totals go up by exactly the same amount, which would be extremely suspicious. I’ll use 400 precincts (time periods) for illustrative purposes and make two plots: one where precincts in order of transmission are on the x-axis and the difference in votes is on the y-axis, and one where the difference in differences is on the y-axis instead.

[Plot: simulated data – uniform increase]

[Plot: simulated data, first-differenced – uniform increase]

Then run a regression model on the data that form the second plot:

lm(formula = z_unif ~ index)
            coef.est coef.se
(Intercept) 1.00     0.00   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 0.00, R-Squared = 0.50

The R-squared is exactly 50% as expected. I will now plot the estimated regression line on top of the second plot:

[Plot: first-differenced uniform data with fitted regression line]

Remember that a regression model tries to fit a line to data as best as it can, so it’s not surprising that given uniform data, the regression line overlaps the first-differenced data entirely.

Now let’s say that the data look a lot more like we would expect from a normal transmission process – the gap is steadily increasing in favor of the leading candidate but we see dips and swings here and there as each precinct transmits. That data would look like this:

[Plot: simulated data – zigzaggy]

And the first-differenced data would now look like this:

[Plot: simulated data, first-differenced – non-uniform movement]

Here’s a regression model fit to that plot:

lm(formula = z ~ index)
            coef.est coef.se
(Intercept) 1.03     0.66   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 6.55, R-Squared = 0.00

The R-squared is now pretty much 0 (it’ll be more like 0.003 or something, but the model summary rounds to two decimal places).

[Plot: first-differenced zigzag data with fitted regression line]

Now you are trying to fit that volatile-looking zigzag with a single straight line. The model does the best it can, which ends up being a horizontal line drawn through the middle. But look at all the variation that the line doesn’t capture.

To conclude, a very good way to examine the claims of uniformity is to do what I did above on the actual data consisting of 92,509 precincts in order of transmission. You don’t even need to fit regression models, because the R-squared will plummet to near 0 at even the first hint of non-uniformity. You just need to eyeball the plot of difference in differences.

Code follows. Yeah, I know, I should be putting this on Github or something, but I’ll do that later.


install.packages("arm")  # for display(); run once if not yet installed
library(arm)

# Uniform case: the vote gap grows by exactly 1 every time a precinct transmits
x_unif <- c(1:400)
y_unif <- c(101:500)
plot(x_unif, y_unif, type = "l", main = "Simulated data - uniform increase",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in votes")
# first-difference the uniform series
z_unif <- rep(NA, 399)
for (i in 1:399) {
  z_unif[i] <- y_unif[i + 1] - y_unif[i]
}
index <- c(1:399)
plot(index, z_unif, type = "l", main = "Simulated data - first-differenced - uniform increase",
     xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
     ylab = "Difference in difference in votes")
display(lm(z_unif ~ index))
abline(lm(z_unif ~ index), col = "red")

# Zigzag case: the gap trends upward but with dips and swings as precincts transmit
x <- c(1:400)
y <- c(runif(10, 100, 110), runif(10, 110, 120), runif(10, 120, 130), runif(10, 140, 150), runif(10, 130, 140), runif(10, 150, 160), runif(10, 160, 170), runif(10, 170, 180), runif(10, 180, 190), runif(10, 190, 200), runif(10, 200, 210), runif(10, 210, 220), runif(10, 220, 230), runif(10, 240, 250), runif(10, 230, 240), runif(10, 250, 260), runif(10, 270, 280), runif(10, 260, 270), runif(10, 280, 290), runif(10, 290, 300), runif(10, 300, 310), runif(10, 320, 330), runif(10, 310, 320), runif(10, 330, 340), runif(10, 340, 350), runif(10, 350, 360), runif(10, 380, 390), runif(10, 370, 380), runif(10, 360, 370), runif(10, 390, 400), runif(10, 400, 410), runif(10, 410, 420), runif(10, 420, 430), runif(10, 450, 460), runif(10, 440, 450), runif(10, 430, 440), runif(10, 460, 470), runif(10, 490, 500), runif(10, 470, 480), runif(10, 480, 490))
plot(x, y, type = "l", main = "Simulated data - zigzaggy",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in votes")
# first-difference the zigzag series
index <- c(1:399)
z <- rep(NA, 399)
for (i in 1:399) {
  z[i] <- y[i + 1] - y[i]
}
plot(index, z, type = "l", main = "Simulated data - first-differenced - non-uniform movement",
     xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
     ylab = "Difference in difference in votes")
display(lm(z ~ index))
abline(lm(z ~ index), col = "red")

 

Did Leni cheat? We don’t know, but trendlines are insufficient evidence

The goal of the following exercise is to simulate a very simplified election in order to show that the apparent anomaly in which Marcos pulled ahead at first and then Robredo suddenly caught up and overtook him is consistent with a scenario in which no cheating occurred. In other words, there may or may not have been cheating, but the surge in Robredo’s vote totals doesn’t prove or even indicate it.

Suppose there are exactly 2000 election precincts in the Philippines.

Each precinct has 1,000,000 voters.

Further suppose that in 1,000, or exactly half of these precincts, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 55%, Robredo = 35%, Others = 10%

And suppose that in the other 1,000 or other half of these precincts, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 34.7%, Robredo = 55.3%, Others = 10%

As we can see, I’ve constructed simulated data that will give Robredo more votes in the final count. Before you accuse me of bias, you can just flip the numbers to give Marcos more votes if you want.

Finally, suppose we have two scenarios:

SCENARIO 1: Each precinct reports their vote tallies in random order.

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

[Plot: running vote tallies – random precinct reporting order]

As we can see, Marcos (in red) and Robredo (in yellow) would hew VERY close to each other, as would be expected if the precincts that submitted their vote tallies to the COMELEC server did so randomly.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

[Plot: Marcos minus Robredo vote difference – random precinct reporting order]

The difference between Marcos and Robredo votes looks like a fairly random process as well, trending towards favoring Robredo (because that’s how I simulated the data here) but alternating unpredictably between narrow and wide gaps, peaks and valleys.

If, in the actual Philippine election, the order in which each precinct reported in to COMELEC was completely randomized, we would expect graphs that look like the above. However, we know that the order in which each precinct reported in to COMELEC was not random; instead, certain regions reported earlier than others, so that overall, more precincts in the north reported first with precincts in the middle catching up later, for example. Let’s simulate something like this with our second scenario.

SCENARIO 2: All the precincts that favor Marcos report first, then the precincts that favor Robredo report last.

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

[Plot: running vote tallies – Marcos-leaning precincts report first]

Now we see Marcos pulling away first, but after 1000 precincts report, Robredo begins closing the gap, and overtakes Marcos after the last few precincts come in.

In these two scenarios, the final count of Marcos vs. Robredo votes is identical, but depending on the order in which precincts reported their tallies, the trend lines will look very different.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

[Plot: Marcos minus Robredo vote difference – Marcos-leaning precincts report first]

Now it looks like a very unnatural upside-down V! That’s because the gap widens first as all the precincts where Marcos leads report, then the gap narrows as all the precincts where Robredo leads report.

I have demonstrated here that the trend lines in both candidates’ running vote totals and in the difference in votes as the number of precincts reporting increases – which some have pointed to as evidence of cheating – are in fact consistent with a situation where no manipulation occurred while the votes were being reported.

As any statistician worth their salt would say, this is not to claim that no cheating occurred; it is to claim that the upside-down V is not sufficient evidence to claim that cheating did occur. This is analogous to hypothesis testing, where we do not say that the alternative hypothesis is false; rather, we say that we failed to reject the null hypothesis.

Now, of course, the Philippines has way more than 2,000 precincts, they all have very different population sizes, and voters’ preferences are much more diverse than the super simplified scenario I depicted here. But we also do know that there are regional voting patterns. There are definitely clusters of precincts that are next to each other that all went for one candidate. And some of these clusters sent their tallies to COMELEC before others did. So given the actual election data and process, it wouldn’t be out of the ordinary to see a scenario like the above.

Some code in the R programming language to replicate the above follows. If you’re not interested in the code, you can stop reading here.

I didn’t set seeds this time, so your plots will look marginally different, but the patterns I pointed out will all hold.

# 1,000 precincts where Marcos leads and 1,000 where Robredo leads;
# each column is a precinct of 1,000,000 voters, each row a candidate
marcos_leads_20p <- rmultinom(1000, 1000000, prob = c(0.55, 0.35, 0.05, 0.03, 0.01, 0.01))
robredo_leads_20.6p <- rmultinom(1000, 1000000, prob = c(0.347, 0.553, 0.04, 0.03, 0.02, 0.01))

# Scenario 1: precincts report in random order
combined <- cbind(marcos_leads_20p, robredo_leads_20.6p)
combined_order <- sample(1:2000, 2000)

marcos <- rep(NA, 2000)
robredo <- rep(NA, 2000)
marcos_total <- 0
robredo_total <- 0

for (i in 1:2000) {
  marcos_total <- marcos_total + combined[1, combined_order[i]]
  marcos[i] <- marcos_total
  robredo_total <- robredo_total + combined[2, combined_order[i]]
  robredo[i] <- robredo_total
}

plot(1:2000, marcos, col = "red", type = "l",
     main = "Marcos vs. Robredo Running Tally, Random Precinct Reporting",
     xlab = "Number of Precincts Reported", ylab = "Total Number of Votes")
points(1:2000, robredo, col = "yellow", type = "l")

# Scenario 2: all Marcos-leaning precincts report first, then all Robredo-leaning precincts
marcos_2 <- rep(NA, 2000)
robredo_2 <- rep(NA, 2000)
marcos_total <- 0
robredo_total <- 0
for (j in 1:1000) {
  marcos_total <- marcos_total + marcos_leads_20p[1, j]
  marcos_2[j] <- marcos_total
  robredo_total <- robredo_total + marcos_leads_20p[2, j]
  robredo_2[j] <- robredo_total
}
for (k in 1001:2000) {
  marcos_total <- marcos_total + robredo_leads_20.6p[1, k - 1000]
  marcos_2[k] <- marcos_total
  robredo_total <- robredo_total + robredo_leads_20.6p[2, k - 1000]
  robredo_2[k] <- robredo_total
}

plot(1:2000, marcos_2, col = "red", type = "l",
     main = "Marcos vs. Robredo Running Tally, Marcos Precincts Report First",
     xlab = "Number of Precincts Reported", ylab = "Total Number of Votes")
points(1:2000, robredo_2, col = "yellow", type = "l")

# Difference in running tallies under each scenario
plot(1:2000, marcos - robredo, type = "l",
     main = "Marcos minus Robredo Running Tally, Random Precinct Reporting",
     xlab = "Number of Precincts Reported", ylab = "Vote Difference")
plot(1:2000, marcos_2 - robredo_2, type = "l",
     main = "Marcos minus Robredo Running Tally, Marcos Precincts Report First",
     xlab = "Number of Precincts Reported", ylab = "Vote Difference")

 

Update: Improved simulations!

My previous post detailed a method to simulate predicted probabilities that each vice-presidential candidate would “win” a Pulse Asia survey should the survey be repeated an infinite number of times. I’ll put the survey results here again for reference:

[Image: Pulse Asia vice-presidential survey results]

Peter Julian Cayton, UP Diliman School of Statistics professor and current Ph.D student at Australian National University in Canberra, had two criticisms of my approach which I wholeheartedly agree with:

1.) The approach treats the survey results as fixed, when in fact the survey results have some uncertainty about them. In retrospect, I admit it was strange that I spent half my post talking about the margin of error and then proceeded to completely ignore it in my simulation. It isn’t good to treat the survey results as fixed, because the simulation results are very sensitive to slight changes in the parameters.

2.) The approach assumes that there are no geographic or socio-economic variations in people’s preferences. This is obviously false. I used a simplified assumption so that the model wouldn’t be too complex.

Therefore, I now present two improved simulations of the same survey data. The first simulation will explicitly model the uncertainty around the survey results, while the second will do that AND model geographic variations as well.

Simulation 1: uncertainty around vote shares

First, some code in R:
set.seed(9999)
# maximum standard error for a proportion from an 1,800-person survey
se <- sqrt(0.5 * 0.5 / 1800)
# simulate each candidate's "true" vote share around the Pulse Asia figure
marcos <- rnorm(1000000, 0.25, se)
robredo <- rnorm(1000000, 0.23, se)
escudero <- rnorm(1000000, 0.23, se)
cayetano <- rnorm(1000000, 0.14, se)
honasan <- rnorm(1000000, 0.06, se)
honasan[honasan < 0] <- 0  # truncate negative shares to zero
trillanes <- rnorm(1000000, 0.05, se)
trillanes[trillanes < 0] <- 0
ewan <- rnorm(1000000, 0.04, se)
ewan[ewan < 0] <- 0
# one multinomial draw of 1,800 "respondents" per simulated set of vote shares
simulation <- matrix(nrow = 7, ncol = 1000000)
for (i in 1:1000000) {
  simulation[, i] <- rmultinom(1, 1800, c(marcos[i], robredo[i], escudero[i], cayetano[i], honasan[i], trillanes[i], ewan[i]))
}
# count how many times each candidate (row 1 = Marcos, ..., row 7 = ewan) gets the most votes
table(apply(simulation, MARGIN = 2, FUN = which.max))

This time, instead of using marcos = 0.25, robredo = 0.23, escudero = 0.23, cayetano = 0.14, honasan = 0.06, trillanes = 0.05, and ewan = 0.04, as I did in the previous simulation, I explicitly model the uncertainty around each of these proportions by treating each as a normally distributed random variable with the mean being the Pulse Asia result and the standard deviation being the maximum standard error for a sample survey with 1,800 respondents. In other words: Before simulating which candidate will win the most times, I also simulate each candidate’s vote share. Just to be clear, in my previous post I criticized the use of this standard error as a basis for comparison between candidates, but if we’re only looking at the uncertainty around each individual proportion, this standard error is perfectly fine.

If that sounded like gobbledygook to you, what I’m doing is letting each proportion vary in a reasonable way. For example, Robredo’s Pulse Asia result is 0.23. Instead of fixing that result at 0.23, I recognize that “0.23” has some uncertainty and I allow 0.23 to vary in order to look something like this:

[Histogram: simulated distribution of Robredo’s vote share, centered around 23%]

Thus, Robredo’s vote share can be as low as 19% or as high as 27%, but it’ll still most likely be around 23%.
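(For reference, a histogram like the one above can be reproduced with something along these lines, reusing the se defined in the code block above; this is my own reconstruction, not necessarily how the original figure was drawn:)

hist(rnorm(1000000, 0.23, se), breaks = 100,
     main = "Simulated Robredo vote share", xlab = "Vote share")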

I do this for all candidates. However, for candidates whose vote shares are very low, modeling their uncertainty in this way will occasionally give us negative values, and there’s no such thing as a negative proportion. So I do truncation: if Trillanes’s simulated vote share ever dips below 0%, I set it to exactly 0%.

I then simulate 1,000,000 draws from a multinomial distribution, like before, only this time, each draw has slightly different parameters because the parameters are themselves simulated. For example, on draw 1, marcos = 0.262, robredo = 0.219, escudero = 0.245, cayetano = 0.149, honasan = 0.047, trillanes = 0.054, and ewan = 0.065. (Due to rounding, these may not add up exactly to 1). On draw 2, marcos = 0.26, robredo = 0.24, escudero = 0.223, cayetano = 0.138, honasan = 0.067, trillanes = 0.058, and ewan = 0.044.

Here are the results:

[Table: simulated win counts per candidate, Simulation 1]

Marcos still has the edge here, but he now only has a 70% chance of winning, while both Robredo and Escudero have a 15% chance each of winning. (Again, this is if the election were held that day and if the survey is properly representative of the voting population).

Simulation 2: uncertainty around geographic vote shares

Now, let’s discard the assumption that vote shares are the same across the whole country. The Pulse Asia survey results clearly show us results for NCR, “BL” (Balance of Luzon), Visayas and Mindanao. If Pulse Asia’s methodology is anything like SWS, then let’s assume that Pulse Asia surveyed 450 people in each of NCR, the rest of Luzon, Visayas, and Mindanao. This means that the standard deviation used in modeling the uncertainty around geographic vote shares will be larger, because the sample size in each area is smaller.

Then we can do the same thing as above, but this time, we will first do a separate simulation for each region, and then combine them together.

Here is some (inefficient and unnecessarily long – sorry) code that does all this. You can skip to the very end of it if you want, because it’s fairly uninteresting – except for the last line, which I’ll get to.


set.seed(9999)
geog <- data.frame(ncr = c(0.36, 0.12, 0.30, 0.12, 0.04, 0.03, 0.02),
bl = c(0.31, 0.22, 0.23, 0.08, 0.05, 0.05, 0.07),
vis = c(0.17, 0.35, 0.22, 0.13, 0.05, 0.06, 0.03),
min = c(0.18, 0.19, 0.19, 0.29, 0.08, 0.07, 0.02))
rownames(geog) <- c("marcos", "robredo", "escudero", "cayetano", "honasan", "trillanes", "ewan")
se_geog <- sqrt(0.5 * 0.5 / 450)
iter <- 1000000
marcos_ncr <- rnorm(iter, geog$ncr[1], se_geog)
marcos_bl <- rnorm(iter, geog$bl[1], se_geog)
marcos_vis <- rnorm(iter, geog$vis[1], se_geog)
marcos_min <- rnorm(iter, geog$min[1], se_geog)
robredo_ncr <- rnorm(iter, geog$ncr[2], se_geog)
robredo_bl <- rnorm(iter, geog$bl[2], se_geog)
robredo_vis <- rnorm(iter, geog$vis[2], se_geog)
robredo_min <- rnorm(iter, geog$min[2], se_geog)
escudero_ncr <- rnorm(iter, geog$ncr[3], se_geog)
escudero_bl <- rnorm(iter, geog$bl[3], se_geog)
escudero_vis <- rnorm(iter, geog$vis[3], se_geog)
escudero_min <- rnorm(iter, geog$min[3], se_geog)
cayetano_ncr <- rnorm(iter, geog$ncr[4], se_geog)
cayetano_bl <- rnorm(iter, geog$bl[4], se_geog)
cayetano_vis <- rnorm(iter, geog$vis[4], se_geog)
cayetano_min <- rnorm(iter, geog$min[4], se_geog)
honasan_ncr <- rnorm(iter, geog$ncr[5], se_geog)
honasan_bl <- rnorm(iter, geog$bl[5], se_geog)
honasan_vis <- rnorm(iter, geog$vis[5], se_geog)
honasan_min <- rnorm(iter, geog$min[5], se_geog)
trillanes_ncr <- rnorm(iter, geog$ncr[6], se_geog)
trillanes_bl <- rnorm(iter, geog$bl[6], se_geog)
trillanes_vis <- rnorm(iter, geog$vis[6], se_geog)
trillanes_min <- rnorm(iter, geog$min[6], se_geog)
ewan_ncr <- rnorm(iter, geog$ncr[7], se_geog)
ewan_bl <- rnorm(iter, geog$bl[7], se_geog)
ewan_vis <- rnorm(iter, geog$vis[7], se_geog)
ewan_min <- rnorm(iter, geog$min[7], se_geog)
sim_geog_p <- rbind(marcos_ncr, marcos_bl, marcos_vis, marcos_min, robredo_ncr, robredo_bl, robredo_vis, robredo_min,
escudero_ncr, escudero_bl, escudero_vis, escudero_min, cayetano_ncr, cayetano_bl, cayetano_vis, cayetano_min,
honasan_ncr, honasan_bl, honasan_vis, honasan_min, trillanes_ncr, trillanes_bl, trillanes_vis, trillanes_min,
ewan_ncr, ewan_bl, ewan_vis, ewan_min)
sim_geog_p[sim_geog_p < 0] <- 0
sim_ncr <- matrix(nrow = 7, ncol = iter)
sim_bl <- matrix(nrow = 7, ncol = iter)
sim_vis <- matrix(nrow = 7, ncol = iter)
sim_min <- matrix(nrow = 7, ncol = iter)
for (i in 1:iter) {
sim_ncr[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_ncr", i],
sim_geog_p["robredo_ncr", i],
sim_geog_p["escudero_ncr", i],
sim_geog_p["cayetano_ncr", i],
sim_geog_p["honasan_ncr", i],
sim_geog_p["trillanes_ncr", i],
sim_geog_p["ewan_ncr", i]))
sim_bl[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_bl", i],
sim_geog_p["robredo_bl", i],
sim_geog_p["escudero_bl", i],
sim_geog_p["cayetano_bl", i],
sim_geog_p["honasan_bl", i],
sim_geog_p["trillanes_bl", i],
sim_geog_p["ewan_bl", i]))
sim_vis[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_vis", i],
sim_geog_p["robredo_vis", i],
sim_geog_p["escudero_vis", i],
sim_geog_p["cayetano_vis", i],
sim_geog_p["honasan_vis", i],
sim_geog_p["trillanes_vis", i],
sim_geog_p["ewan_vis", i]))
sim_min[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_min", i],
sim_geog_p["robredo_min", i],
sim_geog_p["escudero_min", i],
sim_geog_p["cayetano_min", i],
sim_geog_p["honasan_min", i],
sim_geog_p["trillanes_min", i],
sim_geog_p["ewan_min", i]))
}
sim_geog <- 0.115*sim_ncr + 0.445*sim_bl + 0.208*sim_vis + 0.232*sim_min

In order to combine the results of separate simulations for NCR, the rest of Luzon, Visayas, and Mindanao, one approach would be to just add the four simulations together.

We know, however, that the Philippine voting population is not equally distributed across these four areas. Straight-up adding the four simulations together assumes that each area has exactly 25% of the voting population, and that’s patently false.

Commission on Elections records as of November 2015 show that NCR has 11.50% of voters nationwide, the rest of Luzon has 44.5%, while Visayas has 20.8% and Mindanao has 23.2%.

The last line, therefore, assigns to the variable sim_geog the combination of the four simulations together according to the percentage of the voting population each area has. This is called weighting.
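(The win probabilities reported below can then be tabulated the same way as in Simulation 1. As a sketch, reusing the objects from the code block above, where the rows follow the order marcos, robredo, escudero, cayetano, honasan, trillanes, ewan:)

prop.table(table(apply(sim_ncr, MARGIN = 2, FUN = which.max)))   # NCR only
prop.table(table(apply(sim_geog, MARGIN = 2, FUN = which.max)))  # weighted nationwide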

With all this, we’re now ready to simulate! First, let’s look at NCR:

[Table: simulated win shares – NCR]

Marcos = 88.5%, Escudero = 11.5%. 0% for Robredo and everyone else.

What about the rest of Luzon?

[Table: simulated win shares – rest of Luzon]

Marcos = 93.3%, Robredo = 2.5%, Escudero = 4.2%.

Visayas?

[Table: simulated win shares – Visayas]

Robredo wins it big!

Marcos = 0% (well, 0.009%, but that’s pretty much 0%), Robredo = 99.5%, Escudero = 0.5%, Cayetano = 0% (again, 0.0001% is pretty much 0%).

And Mindanao?

[Table: simulated win shares – Mindanao]

Cayetano is a rockstar, no doubt due to being Duterte’s running mate.

Marcos = 1%, Robredo = 1.5%, Escudero = 1.5%, Cayetano = 96%

None of this should come as a surprise, considering that you could have guessed these results just by looking at the Pulse Asia table.

But here it comes: what of the combined results across the nation?

[Table: simulated win shares – nationwide, weighted]

Marcos = 78.5%, Robredo = 11.5%, Escudero = 10%.

Yeah, looks like Robredo and Cayetano could stand to gain some popularity outside Visayas and Mindanao, respectively.

Remember, though: These results are all obtained via information from a single survey conducted two months before the election. Public opinion can still change – not to mention that, as far as I know, the survey doesn’t cover overseas voters (Fun fact: overseas Filipinos can vote as early as April 9.) Unfortunately, the survey also doesn’t account for bloc voting and vote buying (which usually takes place just a couple of weeks before the elections).

A comment by UP College of Mass Communication’s Dr. Clarissa David puts it best:

“Odds cannot predict things like Binay tanking both debates, or the Kidapawan conflict, or any number of unimaginable events that can happen during an election period. That’s why odds change so dramatically with each round of survey. The odds as calculated assume that nothing out of the ordinary will happen. This is why predicted odds are not a thing in electoral studies. Surveys before the election are never intended to “predict” the winner, they give us a picture of current sentiment.”

I would argue that probabilities give us a more intuitive picture of current sentiment than do difficult-to-interpret proportions and margins of error, but I agree with Dr. David.

I am grateful to Peter for suggesting improvements to the simulations, as well as to Jan Carlo Punongbayan, who conducted follow-up analysis and spurred further discussion based on my original blog post.

(And to Jan Fredrick Cruz for letting me know about JC Punongbayan’s note in the first place.)

Stop talking about “statistical ties”: What are the actual chances your favorite VP will win?

Note: I do some statistics here. If that bores you, skip to the very end of the page for the answer to the question in the title. Then read the whole post.

[Image: Pulse Asia vice-presidential survey results]

Rappler reports that according to the latest Pulse Asia survey, Bongbong Marcos, Leni Robredo and Chiz Escudero are in a “statistical tie” with each other. The term “statistical tie” here refers to the three candidates’ proportions being within the margin of error of the poll when compared to each other. Marcos has 25% while Robredo and Escudero have 23%. Pulse Asia reports the margin of error for their 1,800-person survey to be 2% at the 95% confidence level, so due to sampling error alone, if the elections were held the day of the survey, there is a 95% chance that the interval between 23% and 27% will include Marcos’s true vote share, and there is a 95% chance that the interval between 21% and 25% will include Robredo’s or Escudero’s true vote share. Since these intervals overlap, the three are therefore said to be “statistically tied.”

This is bullshit.

There is no such thing as the “margin of error of a poll”.

Where does the 2% figure that Pulse Asia is calling the margin of error of the poll come from? It is a shorthand for 1.96 multiplied by the upper bound of the standard error of a proportion from a survey question with only two choices.

I’ll unpack that. Suppose we have only two candidates, Ruby and Sapphire, and we conduct a poll among 1,800 people asking whether they would vote for Ruby or Sapphire if the elections were to be held right now. Also suppose that everyone answers either Ruby or Sapphire; there are no “don’t know” or “abstain” answers, and no one refuses to answer the question.

Suppose our poll yields 1,100 people going for Ruby and 700 people going for Sapphire. This means approximately 61% of people are going for Ruby and 39% of people are going for Sapphire.

This looks good for Ruby. However, we’re not done yet. Knowing that we only polled 1,800 people out of tens of millions of eligible voters, we need some way to estimate how much our 61-39 results could change if we could do the exact same poll on the exact same day over and over again with a different group of 1,800 people. This estimate is known as the standard error. It is calculated as follows:

\sqrt{\frac{0.61 \times 0.39}{1800}}

This gives us 0.0115. We then multiply this standard error by 1.96 to get our margin of error. Why 1.96? 1.96 is the magic number if we specifically want a 95% confidence interval around our reported vote share. In this case, 1.96 multiplied by 0.0115 is 0.02254, which we can just round to 2.2%, or 2%. We can then say that there is a 95% chance that the interval between 59% and 63% contains the true percentage of voters who will opt for Ruby, and that there is a 95% chance that the interval between 37% and 41% contains the true percentage of voters who will opt for Sapphire. This looks pretty bad for Sapphire.
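For concreteness, here is the same arithmetic as a quick sketch in R (the variable names are mine, purely for illustration):

p_ruby <- 0.61                          # Ruby's observed vote share
n <- 1800                               # sample size
se <- sqrt(p_ruby * (1 - p_ruby) / n)   # standard error, about 0.0115
moe <- 1.96 * se                        # 95% margin of error, about 0.0225, i.e. roughly 2%
c(se = se, moe = moe)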

What if we want to know the margin of error for our question before we even ask it? Technically we can’t do this, but we can figure out what the highest possible margin of error would be. All we have to do is to assume an imaginary survey where Ruby and Sapphire both get exactly 50% of the vote. Then

\sqrt{\frac{0.5 \times 0.5}{1800}}

will be the highest possible standard error, and 1.96 multiplied by that will be the highest possible margin of error. It turns out that in relatively close races, there isn’t a huge difference between this upper bound and what we might actually get. The above equation evaluates to 0.0118, which for all intents and purposes is the same as 0.0115. We aren’t measuring airplane parts here – having precision to the fourth decimal place is an illusion. When all is said and done, this still gives us a margin of error of 2%. Voilà, that’s where Pulse Asia and other survey outfits such as SWS get their margin of error.
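And here is the corresponding worst-case calculation, again as a rough sketch:

n <- 1800
max_se <- sqrt(0.5 * 0.5 / n)   # highest possible standard error, about 0.0118
max_moe <- 1.96 * max_se        # about 0.023 -- the "2%" that gets reported
max_moe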

There are therefore two major problems with that 2% figure and other similar reported figures:

1.) The above is the margin of error for an estimated proportion, where there are only two choices – not the margin of error of an entire poll, which doesn’t make any conceptual sense unless your entire poll consists of nothing but questions with only two choices.

2.) The vice-presidential race – and the presidential race, for that matter – has more than two choices!

We have no less than six people in serious contention for vice-president: Marcos, Robredo, Escudero, Cayetano, Honasan and Trillanes. We also have a seventh choice: “don’t know/refused/none”, which was the answer of 4% of respondents in this survey.

When we’re playing with things like margins of error, what we’re really doing is testing whether the observed difference between two candidates’ vote shares is big enough for us to rule out a true difference of 0. This means that in the case of a question with seven choices, we need a separate margin of error for every pair of candidates.

Let’s work with our real data now. From the Pulse Asia survey, Marcos has 25%, or 0.25, and Robredo and Escudero have 23%, or 0.23. The formula for the standard error of the difference between Marcos and Robredo (or Escudero) if there are more than two choices is:

\sqrt{\frac{0.25(0.75) + 0.23(0.77) + 2(0.25)(0.23)}{1800}}

which gives us 0.0163. Multiplying that by 1.96 gives us 3.2%, or 3%.

This means that there is a 95% chance that the interval between 22% and 28% contains the true vote share for Marcos, and a 95% chance that the interval between 20% and 26% contains the true vote share for Robredo. Meanwhile, applying the formula to find the margin of error for the difference between Escudero and survey fourth-placer Cayetano gives us roughly 2.8%, so it seems safe to say that Escudero is far ahead of Cayetano.
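If you want to check these numbers yourself, here is a small R helper that implements the formula above (the function name moe_diff is my own, just for illustration):

moe_diff <- function(p1, p2, n) {
  # margin of error for the difference between two shares in a multi-candidate poll
  se_diff <- sqrt((p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n)
  1.96 * se_diff
}
moe_diff(0.25, 0.23, 1800)   # Marcos vs. Robredo or Escudero: about 0.032
moe_diff(0.23, 0.14, 1800)   # Escudero vs. Cayetano: about 0.028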

Okay, but your margins of error are larger than Pulse Asia’s. That means Marcos, Robredo and Escudero are in fact even more “statistically tied” than ever, right?

Yes, it does, and that’s meaningless. I haven’t actually gotten to why being “statistically tied” is bullshit yet – I just explained why the way we’ve been thinking about the margin of error has been wrong all along.

Here’s the thing, see – if two candidates are “statistically tied”, it just means that we can’t be at least 95% confident that the difference between their vote shares isn’t 0. It does not mean that two statistically tied candidates have an equal chance of winning. Marcos and Robredo/Escudero’s vote shares in this survey may be very close, but Marcos’s 2-point lead means that he does in fact have, at least according to this survey, a greater probability of winning the vice-presidential race if the election were to be held that day and if the survey is properly representative of the voting population.

Rather than mess around with complicated formulas, the best way to get a close approximation of each vice-presidential candidate’s chances of winning, given this Pulse Asia survey, is to simulate. Formally, the situation where we have 1,800 people each selecting one of 7 possible categories follows what is called a multinomial distribution. We can simulate 1,000,000 surveys of 1,800 people each, using the proportions reported by the survey, with the following code in the R programming language:

set.seed(9999)
simulation <- rmultinom(1000000, 1800, c(0.25, 0.23, 0.23, 0.14, 0.06, 0.05, 0.04))

where 0.25, 0.23, 0.23, 0.14, 0.06, 0.05 and 0.04 are the vote shares for Marcos, Robredo, Escudero, Cayetano, Honasan, Trillanes and “don’t know/refused/none” respectively. I’ve assigned the results to the variable called simulation. This will give us 1,000,000 possible outcomes for 1,800 people going to the polls where each individual’s chance of choosing one of the seven candidates is given by the survey results. The line with set.seed is there to ensure that the outcome of the simulation, which is a random process, can be reproduced if someone else runs this code.

One possible outcome might look like this, with Marcos winning:

[Figure: a simulated outcome in which Marcos wins]

While another possible outcome might look like this, with Robredo winning:

[Figure: a simulated outcome in which Robredo wins]

Or this, with Escudero winning:

[Figure: a simulated outcome in which Escudero wins]
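If you would like to poke at individual simulated polls yourself, each column of simulation is one outcome. A small sketch (the candidate labels are mine, added for readability):

candidates <- c("Marcos", "Robredo", "Escudero", "Cayetano",
                "Honasan", "Trillanes", "DK/refused/none")
first_poll <- setNames(simulation[, 1], candidates)   # vote counts in the first simulated poll
first_poll
candidates[which.max(first_poll)]                      # winner of that simulated poll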

The following code goes through each of the 1 million simulations, one by one, and returns a table of how often each candidate won:

table(apply(simulation, MARGIN = 2, FUN = which.max))

which results in the following:

     1      2      3
821550  89411  89039

Out of 1 million draws, Marcos won 821,550, Robredo won 89,411 and Escudero won 89,039. Translating these into probabilities, if the election were held that day and if the survey was properly representative of the voting population, Marcos has an 82% chance of winning the vice-presidency, while Robredo and Escudero have a 9% chance each of winning, and all of the other candidates have a 0% chance of winning.
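Equivalently, the win counts can be turned directly into estimated probabilities. Here is one way to do it, as a sketch (the factor levels make sure candidates who never won still show up with probability 0):

winners <- apply(simulation, MARGIN = 2, FUN = which.max)
win_probs <- table(factor(winners, levels = 1:7)) / 1000000
round(win_probs, 2)   # roughly 0.82, 0.09, 0.09, and 0 for everyone else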

This is good news or bad news depending on your preferences. My aim was to show, however, that talking about “statistical ties” is not a useful or realistic way of gauging candidates’ chances.

EDIT: Some people have been asking for more details on the simulation: if I simulate 1,000,000 draws from a probability distribution with certain parameters, won’t I just get those parameters back?

The answer is yes, I will get those parameters back. That’s why I do 1,000,000 trials, in order to ensure that the empirically simulated results are extremely close to the theoretical distribution.

What I do above is not to determine what the probabilities are from the simulation. The probabilities are the vote shares obtained through the survey. We already know that if I simulate enough times, we will get those probabilities back.

I am not computing vote shares – I am using a distribution derived from vote shares in order to compute the probability that each candidate will have the most votes.

In mathematical terms:

Let p_1, p_2, p_3, p_4, p_5, p_6, and p_7 be the candidates’ vote shares (0.25, 0.23, 0.23, 0.14, 0.06, 0.05, and 0.04), respectively.

Now let X_1, X_2, X_3, X_4, X_5, X_6, and X_7 be random variables obtained from a single draw of a multinomial distribution with size 1,800 and probabilities p_1, p_2, p_3, p_4, p_5, p_6, and p_7.

I use the simulation to compute P((X_1 > X_2) \cap (X_1 > X_3) \cap (X_1 > X_4) \cap (X_1 > X_5) \cap (X_1 > X_6) \cap (X_1 > X_7)),

which is the probability that X_1, here meaning Marcos, would lead all other candidates. For Robredo, it would be

P((X_2 > X_1) \cap (X_2 > X_3) \cap (X_2 > X_4) \cap (X_2 > X_5) \cap (X_2 > X_6) \cap (X_2 > X_7)). Etc.

Theoretically, these probabilities are what we would get if we repeated this survey an infinite number of times. In practice, 1,000,000 is large enough to approximate this very well.
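You can see both facts in the simulation itself. As a quick sanity check (a sketch, reusing the simulation object from above):

round(rowMeans(simulation) / 1800, 2)   # average simulated vote shares: recovers 0.25, 0.23, 0.23, ...
# ...whereas the win probabilities computed earlier are roughly 0.82, 0.09, 0.09, and 0 for the rest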

If the difference between vote share and probability of winning confuses you, think about it this way: If Trillanes gets 5% of the vote in a survey, it doesn’t mean that Trillanes’s probability of winning is 5%.