5 things to keep in mind about the uncertainty of Philippine survey numbers

The latest surveys by SWS and Pulse Asia are out. Here are 5 things that social scientists, journalists, policymakers, political strategists, and other people who care about these sorts of things should keep in mind when reading the press releases:

1. Technically, each question has its own margin of error.

The margin of sampling error is calculated using the formula for the standard error of a binomial proportion multiplied by 1.96:

\sqrt{\frac{p_1(1 - p_1)}{n}} * 1.96

where p_1 is the percentage of some response to a question and n is the sample size. For example, if 80% of people in a survey of size 1,500 said that they were satisfied with President Duterte, then p_1 is 0.8 and n is 1,500. The margin of sampling error would then be \sqrt{\frac{0.8 * 0.2}{1500}} * 1.96 = 0.02, or 2%.

Each question will have a different p_1, therefore each question has its own margin of sampling error. It is generally considered too bothersome for the purposes of a press release to give every single number its own margin of sampling error. Therefore, press releases err on the side of caution.

The largest possible value for the margin of sampling error is achieved when p_1 = 0.5. This will result in a margin of sampling error of 2.5% for a survey of size 1,500, as with SWS, or 3% for a survey of size 1,200, as with Pulse Asia. Since it is better to assume more uncertainty than less, pollsters simply take the maximum margin of sampling error and say that it’s the margin of sampling error for every question.
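
To check these numbers yourself, here is a minimal sketch in R (the helper function and its name are mine, not the pollsters'):

# 95% margin of sampling error for a single proportion, assuming simple random sampling
moe <- function(p, n) 1.96 * sqrt(p * (1 - p) / n)

moe(0.8, 1500)  # ~0.020, the satisfaction example above
moe(0.5, 1500)  # ~0.025, the maximum for SWS's sample size
moe(0.5, 1200)  # ~0.028, which Pulse Asia rounds up to 3%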

2. The margins of error for “net” questions and for changes over time are MUCH larger.


SWS in particular likes to report “net” statistics. For example, they report that President Duterte has a “net satisfaction rating” of 48, which is calculated by taking the % who said they were satisfied minus the % who said they were dissatisfied.

The formula for calculating the margin of error from a “net” statistic – technically, the standard error of the difference between two proportions from a multinomial distribution multiplied by 1.96 – is as follows:

\sqrt{\frac{p_1(1-p_1) + p_2(1-p_2) + 2p_1p_2}{n}} * 1.96

where, for example, p_1 would be % satisfied and p_2 would be % dissatisfied. Again, this means that each question would have its own margin of error; and again, for simplicity, we would just assume the maximum margin of error and apply that to all questions. The maximum margin of error is achieved if both p_1 and p_2 are 0.5. For a survey of size 1,500, then:

\sqrt{\frac{0.5 * 0.5 + 0.5 * 0.5 + 2*0.5*0.5}{1500}} * 1.96 = 0.05, or 5%.

The margin of error for a net statistic is twice the reported margin of sampling error. Keep this in mind when reading SWS reports. Pulse Asia isn’t a fan of reporting net statistics.

SWS also likes to report the change in the “net” over time. For example, it reports that President Duterte’s net satisfaction rating fell from 66 in June 2017 to 48 in Sept 2017. Guess what? The margin of sampling error for the change in the net statistic is even larger.

Let p_1 and p_2 be % satisfied and % dissatisfied in June 2017, and let p_3 and p_4 be % satisfied and % dissatisfied in Sept 2017. Both surveys have the same sample size of n = 1500. Then the margin of sampling error for the 18-point change in Duterte’s net satisfaction rating is:

\sqrt{\frac{p_1(1-p_1) + p_2(1 - p_2) + 2p_1p_2 + p_3(1-p_3) + p_4(1-p_4) + 2p_3p_4}{n}} * 1.96.

The maximum value of this margin of sampling error is when all of those p’s are 0.5, which comes out to 7.2%.

So if you want to look at the % of people in a single survey who were satisfied, then the margin of sampling error is 2.5%. If you want to look at the “net satisfaction”, the margin of sampling error is 5%. If you want to look at the change in the net satisfaction over two periods, the margin of sampling error is 7.2%.
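
As a quick check of all three figures, here is a sketch in R that evaluates the formulas above at their maxima (the function names are mine):

# 95% margin of error for a net statistic (difference of two multinomial proportions)
moe_net <- function(p1, p2, n) {
  1.96 * sqrt((p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n)
}
# 95% margin of error for the change in a net statistic across two surveys of size n
moe_net_change <- function(p1, p2, p3, p4, n) {
  1.96 * sqrt((p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2 +
               p3 * (1 - p3) + p4 * (1 - p4) + 2 * p3 * p4) / n)
}

moe_net(0.5, 0.5, 1500)                  # ~0.051, or 5%
moe_net_change(0.5, 0.5, 0.5, 0.5, 1500) # ~0.072, or 7.2%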

3. The reported margin of error is almost certainly too low.

As stated above, SWS reports a margin of sampling error of 2.5% for their surveys of size 1,500, while Pulse Asia reports a margin of sampling error of 3% for their surveys of size 1,200. These numbers are accurate according to the formulas above. They are also wrong.

The formulas above all assume independent observations from a simple random sample. That is, they imagine a process where we have a list of every single eligible adult in the Philippines, and we randomly pick 1,200 or 1,500 people from that list and interview them.

That list doesn’t exist.

Instead, what polling firms do is they take a list of every single municipality and city in the Philippines, and select some number of them to conduct interviews in. Municipalities and cities with more households in them are more likely to be selected.

Then they take a list of every single barangay in the selected areas, and select some number of them to conduct interviews in. Barangays with more households in them are more likely to be selected.

Then they select five households in each barangay to interview by choosing a random starting point and sampling every 7th house (or 5th, or whatever, depending on the size of the barangay).

Finally, they knock on the door of a household, ask how many people age 18 and above live there, and choose one of them at random to interview.
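
As a rough illustration of the within-barangay stage described above, here is a sketch of interval sampling in R; the house count, interval, and starting point are made-up numbers, not any firm's actual protocol:

# Sketch of systematic (interval) sampling within a barangay
set.seed(42)
houses <- 1:200          # hypothetical list of houses in the barangay
start <- sample(1:7, 1)  # random starting point
houses[seq(start, by = 7, length.out = 5)]  # every 7th house until 5 are chosen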

Now the issue is that people from the same area are more likely to share the same opinions than people who aren’t from the same area. In other words, a sample of 1,500 obtained like this doesn’t actually contain 1,500 independent observations; it contains 300 clusters of 5 people each, whose opinions are somewhat more similar to each other.

The formula for calculating the margin of sampling error from a multi-stage cluster sampling design like this is quite complicated. However, it will certainly be larger than what is reported. Unfortunately, I cannot recalculate the margin of sampling error myself, because I would need to know the mean response for every single cluster, and that isn’t happening without access to the raw data.

As an example, however, consider the survey conducted by Princeton Survey Research Associates International on behalf of the Pew Research Center in the Philippines. (Click the link and search “Philippines”). The survey had a sample size of 1,000. If SWS or Pulse Asia had done that survey, they would have reported a margin of sampling error of 3.1%; however, taking clustering into account, PSRAI reports the margin of sampling error to be 4.3%.
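
We can still get a feel for how much clustering inflates the margin of error using the common design-effect approximation DEFF = 1 + (m - 1)ρ, where m is the cluster size and ρ is the intraclass correlation. The sketch below is purely illustrative; the ρ is hypothetical, chosen to roughly reproduce PSRAI's figure, since the true value would have to come from the raw data:

# How clustering inflates the margin of error, via a design-effect approximation
# DEFF = 1 + (m - 1) * rho; rho here is hypothetical, not estimated from real data
moe_clustered <- function(n, m, rho, p = 0.5) {
  deff <- 1 + (m - 1) * rho
  1.96 * sqrt(deff * p * (1 - p) / n)
}

moe_clustered(n = 1000, m = 5, rho = 0)    # ~0.031, the simple random sampling figure
moe_clustered(n = 1000, m = 5, rho = 0.23) # ~0.043, in line with PSRAI's reported 4.3%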

4. There is no consistent, objective measure of socioeconomic class.

Both SWS and Pulse Asia like to report statistics for “Class ABC”, “Class D”, and “Class E”. There is no accepted definition of who falls into which class. I emailed Ronald Holmes, Pulse Asia Research President, asking how Pulse Asia determines whether a respondent is in class ABC, D, or E, and he replied:

A number of factors are used to classify households that are sampled. These factors include total household income; household facilities/furnishings; occupation of household head; educational attainment; home ownership; home maintenance; durability of the home; and, conditions of the neighborhood, among others. The indicators are culled from prior social science/market research and our enumerators document these indicators for subsequent classification into socio-economic classes of the sampled respondents/household.

There are exact criteria but there are also other criteria subject to the judgment of the enumerator. We do regularly ask about occupation of the household head, educational attainment and home ownership but not regularly on household income.

In other words, the field staff have to make some judgments about aspects of a person’s socioeconomic class via observation, and then SWS and Pulse Asia determine socioeconomic class via some index that may differ between the two firms.

The Class ABCDE construct is itself largely an invention of market research. For example, Nielsen divides the population of the Czech Republic into eight classes – A, B, C1, C2, C3, D1, D2 and E – and fits a regression model on variables such as household composition, occupation of the household head, household equipment, household income, education of the household head, and region of the country to assign each person a ‘score’. The top 12.5% are considered class A, the next 12.5% class B, and so on until the bottom 12.5% go to class E, such that an equal number of people are in each class.
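
Mechanically, a scheme like this amounts to scoring each household and cutting the scores into octiles. Here is a minimal sketch in R; the scores are simulated stand-ins, not Nielsen's actual model output:

# Sketch of octile-based class assignment (scores are simulated, not Nielsen's)
set.seed(1)
score <- rnorm(1000)  # stand-in for fitted scores from a regression model
breaks <- quantile(score, probs = seq(0, 1, by = 0.125))
class <- cut(score, breaks = breaks, include.lowest = TRUE,
             labels = c("E", "D2", "D1", "C3", "C2", "C1", "B", "A"))
table(class)  # 125 households per class, by construction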

However, SWS and Pulse Asia do not do it this way. I do not know exactly how they construct their socioeconomic class measure, but class ABC typically makes up less than 10% of the sample. This implies larger margins of error for statistics calculated over class ABC only, just like how Luzon, Visayas and Mindanao have larger margins of error.

For example, if we assume that only 150 out of 1,500 people in an SWS survey are class ABC, the margin of sampling error for class ABC would be 8%. The margin of sampling error for the change over time between two measures would be 11%.

As previously discussed, the margin of sampling error for a “net satisfaction” rating would be 8% doubled, or 16%. The margin of sampling error for the change over time in the net satisfaction rating would be a whopping 22.6%. This means that all but the most cataclysmic change in the net satisfaction rating of class ABC would be within sampling error.

Class D typically makes up about 60% of a sample, while class E typically makes up about 30% of a sample.
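
Putting those rough shares into the margin-of-error formula gives a sketch of the class-specific maxima for a 1,500-person sample (the exact class counts vary from survey to survey):

# Maximum 95% margin of error by class, assuming rough class shares of n = 1,500
moe_max <- function(n) 1.96 * sqrt(0.25 / n)

moe_max(150) # class ABC (~10% of sample): ~0.080
moe_max(900) # class D (~60% of sample): ~0.033
moe_max(450) # class E (~30% of sample): ~0.046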

5. The margin of sampling error is purely statistical; it does not include error that comes from contextual factors, from undercoverage, or from nonresponse.

All sorts of things can affect what response someone gives to a question. Here’s a (picture of a) slide from the University of Michigan summarizing these things:

[Slide: University of Michigan summary of factors affecting survey responses]

Survey research literature has found, for example, that:

  • Older people and people with less education tend to give more agreeable responses (satisfied, approve, trust, etc.), regardless of what the question is asking about.
  • Many respondents will give the “socially desirable” answer rather than their true answer when asked about topics such as how often they smoke, or their views about poor people, etc. This is exacerbated when a live interviewer is present, as is the case with SWS and Pulse Asia, but is also present when the respondent is speaking over the phone. It pretty much disappears online where the respondent has more anonymity.
  • Many respondents will also give “socially desirable” answers if the interviewer looks like they might prefer it, or if someone else is present in the room. Interviewers generally request that they be allowed to survey someone alone, but they can’t really force it. If you are interviewing someone about their trust in Duterte, and their spouse is in the room wearing a Duterte shirt, then you can expect that they will indicate trust in Duterte regardless of what their true beliefs are. Some LGUs also insist that surveys be conducted with LGU officials present, which makes political measures less trustworthy.
  • According to SWS and Pulse Asia, respondents tend to open up more to female interviewers. For this reason, their field interviewers are all female.
  • The longer a survey goes, the more a respondent wants to get it over with. (You don’t really need a lit review to figure this out). This may result in “default” answers.

Undercoverage is also a real problem. For obvious reasons, the list of barangays will not include heavily conflict-affected barangays where the interviewer’s life would be threatened. Filipinos who are abroad also have 0 probability of getting selected (though balikbayans who happen to be in the country could still get sampled). Time of day may also affect who is available. I do not know exactly what field protocols are, but if, for example, interviewers are only in the field during working hours, then the sample will largely consist of people who are unemployed or who work from home. And what about exclusive gated communities, where you can’t even wander around without getting past a security guard? The government might be able to pull it off with official census workers, but private firms will have less luck.

Nonresponse is another problem that we have almost no information on. According to Pulse Asia,

Respondents sampled who were not available during first attempt were visited again with a maximum of two valid call backs. If the respondent remained unavailable after two valid call backs, a substitute who possessed the same qualities (in terms of gender, age bracket, working status and socio-economic class) as the original respondent was interviewed. The substitute respondent was taken from another household beyond the covered intervals in the sample barangay by continuing the interval sampling.

The primary concern here is that people who are willing to respond to surveys may have different opinions compared to people who are not. To my knowledge, this has never been studied in the Philippines. I’m not even sure how you would do so. The Pew Research Center study of telephone nonresponse in the United States was able to do things like compare telephone surveys, which have very low response rates, to much more expensive face-to-face surveys with high response rates, or to publicly available voter records, in order to check whether the responding sample looked systematically different. We can’t do anything comparable in the Philippines. On the other hand, response rates in Philippine face-to-face surveys reportedly exceed 50%, so to my mind it isn’t as big a problem as undercoverage and contextual effects.

Here’s the summary: numbers from a survey have much greater uncertainty than the polling firms claim. Surveys are useful indicators of public opinion, but we should not give too much weight to, say, a 7-point decline in ‘net satisfaction’ between two surveys. (By contrast, the 18-point decline is worth taking into consideration.)


Double-barreled responses: Congress poll on lowering minimum age of criminal responsibility

A one-question online poll on the website of the Philippine Congress gives me the opportunity to discuss questionnaire design a little bit.

[Image: screenshot of the one-question poll on the Congress website]

This primer by Jon Krosnick outlines conventional wisdom on how to design a survey question from decades of research in survey methodology. Among the recommendations are to make response options exhaustive and to avoid leading or loaded word choices that push respondents towards an answer.

The problem with the response options above is that they are what are known as “double-barreled” response options: the question asks for one answer, but each response option actually contains two. You have the option of choosing Yes AND agreeing that the youth should be responsible for their actions and words as early as possible, or choosing No AND agreeing that punishing children violates child rights. The response options thus do not exhaust the possible ways of framing the issue. For example, a respondent might think that the youth should be responsible as early as possible but that nine years is too young, or that effective punishment would not necessarily deprive children of the chance to improve their lives.

Clearly, the question laudably tried to present balanced viewpoints on the issue. Certainly, allowing respondents to grapple with some arguments in favor of or against the question would result in more informed responses than if the response options were just straight-up Yes, No or Undecided. In order to avoid the double-barreling discussed above, a better way to ask the question would be to simply move the arguments from the responses into the question, as follows:

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. [RANDOMIZE ORDER OF PARENTHESES] Some argue that (the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes), while others argue that (punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character). Do you favor or oppose this proposal, or are you undecided?

  1. Favor
  2. Oppose
  3. Undecided

By doing so, we allow respondents to consider policy arguments in their heads while not railroading them into agreeing with a fixed set of arguments along with their opinion on whether they favor or oppose the proposal.

I would also suggest two changes which are already incorporated into the above question. First is to remove the phrase that mentions “unduly pampering” children with “impunity from criminal responsibility”. These word choices are loaded with negativity and may thus influence respondents’ answers. Furthermore, the argument being made in that phrase is effectively the same as arguing that children should take responsibility at an earlier age, thus making this statement redundant.

Second is to randomize the order in which the arguments are presented to the respondent. The above text would not be what is shown to the viewer; it would be what is shown to instruct whoever is in charge of programming the survey. Upon visiting the website, each respondent would have a 50-50 chance of seeing one of the two following questions:

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. Some argue that the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes, while others argue that punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character. Do you favor or oppose this proposal, or are you undecided?

vs.

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. Some argue that punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character, while others argue that the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes. Do you favor or oppose this proposal, or are you undecided?

There is a possibility that the argument that respondents see either first or most recently would stick out in their minds more. We do not want this to influence their answer, and switching up the order of the arguments should help prevent this.
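
If the poll were programmed in R, the randomization itself is a one-liner; this is just a sketch, since I do not know what the Congress website actually runs on:

# Sketch: give each respondent a 50-50 chance of seeing either ordering of the arguments
versions <- c("pro-first ordering", "con-first ordering")
sample(versions, 1)  # the version shown to this respondent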

There are of course broader issues to consider. The poll itself is an interesting exercise, but without any information on who is answering (i.e. who makes up the sample), the poll cannot provide any evidence on Philippine public opinion. Nearly 18,000 respondents as of 10:00 AM on 13 January 2017 may look more impressive than the 1,200 respondents we see from SWS or Pulse Asia surveys, but they may, for example, largely be highly-educated, English-speaking, middle to upper class Filipinos with Internet access who were interested enough in politics to answer a poll on the Congress website – or, in short, unrepresentative of the Filipino population as a whole. Hopefully no lawmaker or pundit will look to this poll to say that 90% of Filipinos oppose the proposal, because this poll does not provide sufficient evidence for that assertion.

Statistical adjustment of the results of online polls to reflect the population is an active field of research, but the bare minimum required to even begin would be demographic information on the respondents. Without that, the poll is nothing more than a curiosity.

 

Data collection and accessibility are the core of civic data analytics

Data science and data analytics are red-hot terms nowadays. You can’t go more than a page of Google search results without finding some reference to how “sexy” data science is. And everybody wants to be sexy, right?

In the Philippines, a small but enthusiastic community has given rise to more and more startups with the business model of providing data science and analytics services to businesses and corporations, such as Thinking Machines Data Science and DataSeer.

Of particular interest to me is “civic” data analytics, or analytics as applied to civic problems such as health, infrastructure, agriculture, poverty, education, the environment, and all sorts of other things that are the ambit of government agencies and nonprofits. The international volunteer organization DataKind, with chapters in Singapore, Dublin, Bangalore, the United Kingdom, Washington, DC, and San Francisco, describes its mission as “bringing together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact.”

One homegrown example would be this story that was published very recently in the Manila Bulletin. As the account goes, using data on dengue outbreaks in Dagupan City, Pangasinan, Wilson Chua and collaborators were able to narrow down the source to specific stagnant pools of water near a couple of elementary schools, and then work with the barangay (village) captain and the Bureau of Fisheries and Aquatic Resources to implement a targeted solution. No new cases of dengue have been reported since.

In meetups, conferences, training sessions, and press releases, a lot of attention is placed on big data tools such as Hadoop, used to organize massive amounts of possibly unstructured data, and d3.js, used to produce impressive-looking visualizations, such as this graph, also produced by Wilson Chua, comparing dengue outbreaks across Pangasinan barangays between 2014 and 2016.

[Graph: dengue outbreaks across Pangasinan barangays, 2014 vs. 2016]

or this blog post by Thinking Machines that visualizes 114 years of Philippine disasters. This is in line with data science being “sexy” – not only can you use it to do sexy stuff, you can also make sexy looking graphics!

I feel, however, that the main takeaways from the dengue article above are about the thoroughly unsexy, fundamental, and undervalued activities that are at the core of data science: data collection and data access.

Before Wilson Chua could analyze the data, the data had to exist in the first place. Someone had to go out there and collect data on individual incidences of dengue. According to the article, Wilson sourced his data from the Philippine Integrated Diseases Surveillance and Response team at the Department of Health. Someone also had to think about what sorts of variables to collect; one of the keys to Wilson’s insights was that the PIDSR data included not just the date of occurrence and the barangay but also the patient’s age, from which Chua noticed that more school-aged children were getting dengue than any other age group. That means that during the data collection process, someone had to have recognized that age was a relevant epidemiological covariate, without which Chua would have been able to do far less.

They were then able to verify that specific pools of water initially located via Google Maps were spots where rainfall would accumulate and stagnate without exit points, because a separate person, Nicanor Melecio of the Dagupan City government, had LIDAR (Light Detection and Ranging) maps that were originally created to track flooding. This means that someone had to have recognized that LIDAR maps would be useful beyond their original purpose, and someone higher up had to have agreed to fund such a project.

Epidemiology (the study of the patterns and causes of disease in populations) is a fairly well-established field, and the dengue problem was fairly well-defined and narrow in scope. Most civic issues are much murkier; people recognize that disasters, poverty, crime, etc. are problems, but it is not as straightforward to drill down to a specific problem that can be solved. Even when a seemingly specific problem can be identified, e.g. how to reduce casualties from flooding in a particular barangay, or how to improve the livelihoods of a particular group of people, or how to reduce recidivism rates among prisoners, there is still a wide range of possible approaches that must be considered – and more to the point, it isn’t immediately clear what data needs to be collected in order to approach these problems from a “scientific” perspective.

Depending on the application domain, data collection might be a painstakingly long and slow effort. It will also probably be an expensive effort, and thus one whose expenses need to be justified. And it will all be for naught if we do not pay attention to proper measurement, or “the idea of considering the connection between the data you gather and the underlying object of your study.”

People who want to do data science for social good need to focus on working with agencies and organizations charged with data collection in order to identify the specific problems they want to help solve and the specific kinds of data that the solution needs. If the data hasn’t been collected yet, we need to push for efforts to collect it. If the data has been collected but is incomplete or of low quality, we need to push for efforts to improve it. For example, Matthew Cua’s Skyeye Inc. uses drones that can take aerial camera shots to collect data that can help resolve property disputes and land claims, and the company works with the Department of Agrarian Reform to help settle land reform issues.

The current approach is to focus only on problems for which data is already available. For example, data science startups are now currently working with Waze, the traffic app, in order to use their data to try and come up with solutions to Metro Manila’s traffic problem, which affects millions of Filipinos every day, greatly reduces labor productivity, harms the environment, and makes Metro Manila less “liveable”. But the data scientists working on this specific problem did not choose it for its relative importance. They chose it because the data already existed.

Many social scientists are now interested in mining Twitter data to look at public sentiment, despite the fact that we have no clear picture of how representative Filipinos on Twitter are of the Filipino population as a whole. Why, then, are we treating Twitter as a reliable source of public opinion data? Because it’s already there.

The very example that Wilson uses as his inspiration, John Snow’s approach to solving a cholera outbreak, did not involve Snow accessing an API or writing a web-scraping script. It involved going door-to-door, boots on the ground, identifying houses with cholera. For the vast majority of applications, proper data collection does not involve complex mathematical models or whiz-bang software engineering. It is not sexy, but it is fundamental to good data science.

Then there is the question of data access. In the past few years, the Philippine government has taken great strides in setting up a web portal that allows for public access to some government data. The new administration has announced their intention to continue this program.

Setting aside the question of whether the government data is reliable and measures the needed variables, or whether data on a specific domain even exists in the first place, not all government data is open access. For example, much of the “open data” on the web portal actually just redirects you to the website of the concerned government agency, where they will have summary tables and charts of the data but not the raw data itself. The Philippine Geoportal project combines geospatial data from multiple agencies and allows the visitor to view things such as the location of every hospital and health center in the Philippines on a map, but if the user wants the actual coordinates, they still have to course their request to the Department of Health in writing, a request the DOH is not obliged to fulfill. Going back to the dengue article, this quote is telling:

Using his credentials as a technology writer for Manila Bulletin, he wrote the Philippine Integrated Diseases Surveillance and Response team (PIDSR) of the Department of Health, requesting for three years worth of data on Pangasinan.

The DOH acquiesced and sent him back a litany of data on an Excel sheet: 81,000 rows of numbers or around 27,000 rows of data per year.

Wilson had to use his “credentials” to make a request for the data, and the DOH chose whether or not to “acquiesce”. If a less well-off, less well-connected, less prestigious private citizen from Dagupan City, perhaps a concerned elementary school teacher, were to make this request, would the DOH have acquiesced?

The burden should not be on the person making a request for data to somehow show that they have “credentials” or that they are “serious”. It should be just as easy for a street sweeper or a fish vendor to access the data as it is for a PhD or a businessman with decades of experience. The person should not even have to make a request. This data should have already been out there. The only data that should be locked behind requests for access are data that contain information that could directly identify individual people, and data that might compromise national security.

In a sense, the article is not merely a celebration of Wilson’s achievements, but a celebration of the good fortune that the DOH considered Wilson credible enough.

We are woefully lacking quality data on all manner of social problems in the Philippines. If data science in the Philippines is to advance, the community cannot merely sit back and hope that some unsexy, underpaid bureaucrats in government agencies or academics in research firms will be insightful enough to collect some good data, committed enough to justify the collection at a budget hearing or to multilateral funding organizations, and considerate enough to make this data as open to all as can be. All things considered, these bureaucrats and academics are data scientists too. They are part of what makes civic data science possible and they deserve the community’s support and advocacy.

 

M.A. Quantitative Methods in the Social Sciences at Columbia University: Things you should know

I am, as of right now (Fall 2016), in my third and final semester of a master’s degree in Quantitative Methods in the Social Sciences (QMSS) at Columbia University in New York City. There isn’t a whole lot of information about this program on the Internet, so to those who are giving this program some consideration, particularly students from outside the United States, here is some helpful information.

How long is QMSS?

One year, with the option to extend one more semester if you wish. Very rarely, some people will extend by two semesters – these are usually people on externally-funded scholarships that have a set timeframe.

Is it expensive?

Yep. This academic year, a full-time semester costs $28,780 for up to 20 units. (One full-time class is either 3 or 4 units, and the number of units isn’t necessarily a function of workload. QMSS classes are 4 units.) It’s expensive, though cheaper than other quantitative courses such as Mathematics of Finance or Statistics.

If you choose to extend your stay, the additional semester costs $10,944, with the restriction that only up to two non-QMSS classes can be taken that semester.

Scholarships?
Unfortunately, hardly any from within Columbia. There might be some merit-based scholarships that’ll give you like 5%. International students are best off seeking external funding. In general, master’s programs that aren’t in public policy tend to be very stingy with scholarships; they need your money to subsidize the Ph.Ds.

So I can take classes outside the program?
Hell yeah. The only required classes are “Theory and Methodology”, a survey class of approaches to quantitative social science that’s generally geared towards people who’ve never seen social science research before (and pretty boring and old hat otherwise); the “Research Seminar”, a two-semester sequence where you come in late at night to listen to an academic or practitioner give a talk (which may or may not be interesting), partially meant to expose you as a student to some of the latest research and practice; and your master’s thesis. Everyone has to do a thesis. There’s no getting around it.

By the way, Statistics students don’t have a thesis. If theses are your thing (e.g. if you really want to go on to a Ph.D), go for QMSS.

Aside from those classes, you can take any classes you want outside the program as long as a certain number of them are quantitative. If, for example, you find QMSS’s course offerings too easy, or you really have a particular substantive topic you’re looking to learn more about, you can take classes in the Stats, CS, Engineering, Econ, PoliSci, Sociology, and whatever other departments exist at Columbia that’ll let you in. And most classes will let you in provided there’s enough room. It’s only the super-duper high demand ones such as Data Science that won’t let you in… except if you’re in QMSS’s Data Science track.

Track?

QMSS students are required to pick one of three tracks. The “Economics” track is exactly what it says on the tin; it used to function as basically an MA in Econ when Columbia didn’t have their MA in Econ yet (it was just launched last year). The “Data Science” track lets you into classes offered by the Institute for Data Science and Engineering such as Algorithms for Data Science, Machine Learning for Data Science, etc. And the “Traditional” track is where everybody else who isn’t specializing in the former two tracks goes.

(There’s also a fourth track, “Experiments”, but to the best of my knowledge practically no one has ever taken that track and it might not even functionally exist anymore.)

I suck at math – what do I do?

QMSS was specifically designed for people with a limited background in quantitative methods. I didn’t even know what a regression was before I entered the program – my own background was in development studies and history. Now I wouldn’t call myself an expert on anything, but I’m more comfortable with this stuff and know where to look to go deeper.

You do need to kinda not suck at math, but if you can nail the Graduate Record Examinations, you’re good. (Not the math-specific GRE, just the general quantitative portion.) I would even say that you can graduate from the program without knowing calculus and linear algebra if you just take all the applied data analysis classes, though how much you actually understand would be a valid question.

How big is the program, and what are the people in it like?

Each cohort ranges from about 60-75 students, including part-time students. About half or so are from East Asia and the other half are from everywhere else, though mostly North America. Backgrounds range from idiots like myself to people who are already data scientists of some sort.

For Chinese people in particular, especially those who are coming straight from undergrad in China, if you’re also looking for a program with a good mix of Mandarin and English speakers, QMSS is for you. By comparison, Columbia’s Stats program is around 90% Chinese.

QMSS has an associated student-led organization, Society for Quantitative Approaches to Social Research (QASR), that organizes social events, alumni networking sessions, outreach efforts, and the like.

[Photo: Thanksgiving dinner]

[Photo: QMSS students at Google’s offices]

So what is in QMSS?

QMSS classes focus on applied quantitative social science. There’s a three-course sequence taught by sociologist and program director Gregory Eirich that begins with applied regression analysis using linear models and moves through generalized linear models, causal inference methods, text mining, methods for longitudinal data, and time series processes. Those classes are light on the technical details and focus on understanding the structure and assumptions of various models and techniques at a high level and putting them to use with data via the open-source R programming language. This is in contrast with an econometrics class, which would cover the same material with much more theoretical grounding and mathematical proof and much less actual data analysis.

There’s a two-course sequence taught by geographical sociologist Jeremy Porter on geographic information systems & spatial analysis that essentially teaches you how to work with data with some sort of location information attached. It uses the open-source software Quantum GIS and the R programming language.

Other electives include Social Network Analysis, also taught by Eirich, and Data Visualization, which is either a really good or a really crappy class depending on who’s teaching it. If you’re familiar with the Harry Potter series, it’s like Defense Against the Dark Arts – the professor changes every year for some reason or another, and the syllabus completely changes in step.

There are also two classes taught by political scientist Benjamin Goodrich. In the fall, Data Mining for Social Science is an introduction to techniques such as tree-based models, neural networks, principal components analysis, etc., that are commonly used when the goal is to predict new data. It’s essentially an intro to machine learning class that’s decent preparation for more advanced classes, and it doubles as an actual introduction to R in the first couple of weeks. The other classes use R, but this class actually teaches it from the ground up at an accelerated pace.

In the spring, Bayesian Statistics for the Social Sciences is an introduction to Bayesian modelling, a different way of approaching statistical problems that puts heavy emphasis on probability distributions. It’s the most math-heavy class in the program and also doubles as a shameless plug for the statistical modelling language Stan, which is being actively developed at Columbia by a team including Goodrich and which allows the user to specify flexible models for particular situations instead of relying on canned packages. Columbia is also the home of Andrew Gelman, one of the foremost Bayesian statisticians out there, and his influence looms large.

If none of those sounded exciting to you, remember that you don’t even have to take any of those classes if you don’t want to. Already a crackerjack who wants to go full-on into machine learning? Head over to Stats/CS/Engineering. Want to go into finance? Columbia Business School classes can be hard to get into but it’s doable. Not satisfied with the fairly high-level approach of Eirich’s classes and want to go really deep into the weeds? Either the Econ or PoliSci departments are for you.

There’s also one last thing. If you wish, QMSS can match you with professors from across the university who are looking for research assistants, which is a great opportunity to get your feet wet with social science research outside of a classroom setting.

I don’t want a Ph.D – I just want a better job or a change of career.

QMSS is perfect for that as well – while initially designed as a Ph.D preparation program, most people opt to go into industry for at least a while after graduation, and the flexible nature of the program allows for diverse interests. Data science and analytics are the flavor of the year, but policy research, tech, consulting and finance aren’t far behind. And for those last two industries in particular, there’s hardly a better place than New York City.

QMSS also qualifies as a Science, Technology, Engineering and Mathematics (STEM) program, which for international students who want to work in the United States means qualifying for the special preferences given to STEM graduates, such as an extended period of post-study work authorization.

How easy would it be for me to find a job?

Jobs are everywhere for people skilled in quantitative methods. Many students in the program aren’t even particularly interested in the “social science” part. The barriers to entry are much higher for international students, however. Columbia University’s career office is a great resource, with lots of events such as job fairs, industry talks, and interview prep sessions. This being New York, though, their offerings are heavily skewed towards consulting, finance and tech. The good news is that larger firms are generally also more open to international hires.

Part of the reason some people stay for a third semester is to give themselves additional breathing room in the job hunt – instead of graduating in May and having to have a full-time job within three months or risk losing their visa status, they can graduate in December and hopefully score a summer internship beforehand.

Why shouldn’t I just major in Statistics?

Sure, you could. In fact, if your goal is to get a Ph.D in Statistics, or to get a top-level engineering job in some data science firm, then you would need much more quantitative stuff than the average QMSS student learns, although you could pursue those same classes within QMSS if you wanted to due to its flexibility.

Regarding Columbia specifically, the primary advantages of QMSS over the Statistics program are that it’s slightly cheaper and that it has 60-75 students as opposed to Stats’ 200+, meaning that it’s easier to raise concerns and get attention from the program administration.

If your goal is a Ph.D in a social science, QMSS is excellent preparation. Many social science Ph.Ds are actually somewhat behind when it comes to quantitative methods despite American social science’s heavy focus on it.

Why would QMSS not be for me?

The cost is a huge factor, honestly. Tuition aside, cost of living in New York City can be ridiculous. Food, transportation, and that kind of stuff isn’t very expensive, but rent…

Barring that, it does say “quantitative” in the title, which represents a particular approach to social science that isn’t by any means all-encompassing. If you’re interested in being the kind of researcher who can do detailed case studies, thick descriptions, in-depth interviews, etc., then the program probably isn’t for you. You might even encounter stuck-up people who will scoff at you for your ‘inferior’ methods, though it’s not as prevalent as it seems. Quantitative methods are generally useful for every social scientist to familiarize themselves with, but a program devoted to them isn’t for everybody.

If your bachelor’s degree was a fairly quant-heavy social science program, QMSS will be redundant.

Finally, if you intend to use the program’s flexibility to take courses in things like data science, it isn’t going to be immediately obvious to employers that you’ve taken those courses if they see QMSS on your resume. Your chosen track isn’t reflected on your transcript or diploma. So if you really do want to go into something more specific like stats/data science, you may want to consider going for the degree that actually says stats/data science. (Or you could self-study, build a portfolio, etc.)

On a lighter note, “Quantitative Methods in the Social Sciences” is also way too long of a program name. Imagine telling someone that in an elevator.

 

 

Uniformity and difference-in-differences

Earlier, I posted on Facebook about some problems with the analysis conducted by Yap and Contreras on 34 time periods sampled from 92,509 possible time periods, representing each and every precinct. In part of that post I discussed Contreras’s interpretation of high R-squared as representing uniform changes, and said that he was half right.

I gave Contreras too much credit here. Someone pointed out that an R-squared very close to 1 doesn’t even mean that the increase will always be close to 46,000; it means that the model explains almost all the variability around the mean of 46,000. A model fit to a tight zigzag can still have an R-squared near 1, which means that the line and the data are still pretty damn close to one another, but that doesn’t tell you anything about whether the model systematically overpredicts or underpredicts at any given point. See this article for more information.

In order to directly check the claim of uniformity, here’s what should be done:

STEP 1. Get all 92,509 precincts in the order in which they transmitted.
STEP 2. Create a new set of 92,508 points. Yes, 92,508. Each point will be (difference in vote share after precinct 2 – difference in vote share after precinct 1), (difference in vote share after precinct 3 – difference in vote share after precinct 2), etc. That’s why it’ll be 1 less than 92,509.
STEP 3. Make a plot with the x-axis being 1 to 92,508 and the y-axis being the difference in differences (yeah, that’s confusing, but that’s what it is). Then run a linear regression model.

To reiterate: we are now plotting how much the gap changes every time a new precinct is added, rather than the gap itself.

IF each time a precinct transmits, the vote gaps are increasing uniformly, which would indeed indicate some major bullshit going on and prove Contreras’s point, you should expect the plot in Step 3 to be a flat horizontal line. The R-squared of the regression should come out at almost exactly 0.5.

IF each time a precinct transmits, the vote gaps are NOT increasing uniformly, which indicates a normal transmission process, you should expect the plot in Step 3 to have no discernible pattern. The R-squared of the regression should be close to 0.

I will demonstrate with some more simulated data, but what I do below is what should be done with the full data if we ever get it. First, let’s say that after we get all points in, it is indeed as Contreras says – every time a precinct transmits, everyone’s vote totals go up by exactly the same amount, which would be extremely suspicious. I’ll use 400 precincts (time periods) for illustrative purposes, and make two plots: one where precincts in order of transmission are on the x-axis and difference in votes are on the y-axis, and one where the difference in differences are on the y-axis instead.

[Plot: simulated data with uniform increase – precincts in order of transmission vs. difference in votes]

[Plot: first-differenced uniform-increase data]

Then run a regression model on the data that form the second plot:

lm(formula = z_unif ~ index)
            coef.est coef.se
(Intercept) 1.00     0.00   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 0.00, R-Squared = 0.50

The R-squared is exactly 50% as expected. I will now plot the estimated regression line on top of the second plot:

[Plot: first-differenced uniform-increase data with fitted regression line]

Remember that a regression model tries to fit a line to data as best as it can, so it’s not surprising that given uniform data, the regression line overlaps the first-differenced data entirely.

Now let’s say that the data look a lot more like we would expect from a normal transmission process – the gap is steadily increasing in favor of the leading candidate but we see dips and swings here and there as each precinct transmits. That data would look like this:

[Plot: simulated zigzag data – precincts in order of transmission vs. difference in votes]

And the first-differenced data would now look like this:

[Plot: first-differenced zigzag data]

Here’s a regression model fit to that plot:

lm(formula = z ~ index)
            coef.est coef.se
(Intercept) 1.03     0.66   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 6.55, R-Squared = 0.00

The R-squared is now pretty much 0 (it’ll be more like 0.003 or something, but the model summary rounds to two decimal places).

[Plot: first-differenced zigzag data with fitted regression line]

Now you are trying to fit that volatile-looking zigzag with a single straight line. The model does the best it can, which ends up being a horizontal line drawn through the middle. But look at all the variation that the line doesn’t capture.

To conclude, a very good way to examine the claims of uniformity is to do what I did above on the actual data consisting of 92,509 precincts in order of transmission. You don’t even need to fit regression models, because the R-squared will plummet to near 0 at even the first hint of non-uniformity. You just need to eyeball the plot of difference in differences.

Code follows. Yeah, I know, I should be putting this on Github or something, but I’ll do that later.


install.packages("arm")
library(arm) # provides display() for compact regression summaries

# Simulated data: the gap grows by exactly 1 vote per transmitting precinct
x_unif <- c(1:400)
y_unif <- c(101:500)
plot(x_unif, y_unif, type = "l", main = "Simulated data - uniform increase",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in votes")

# First differences: how much the gap changes each time a precinct transmits
z_unif <- rep(NA, 399)
for (i in 1:399) {
  z_unif[i] <- y_unif[i + 1] - y_unif[i]
}
index <- c(1:399)
plot(index, z_unif, type = "l", main = "Simulated data - first-differenced - uniform increase",
     xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
     ylab = "Difference in difference in votes")
display(lm(z_unif ~ index))
abline(lm(z_unif ~ index), col = "red")

# Simulated data: the gap still trends upward but zigzags from precinct to precinct
x <- c(1:400)
y <- c(runif(10, 100, 110), runif(10, 110, 120), runif(10, 120, 130), runif(10, 140, 150), runif(10, 130, 140), runif(10, 150, 160), runif(10, 160, 170), runif(10, 170, 180), runif(10, 180, 190), runif(10, 190, 200), runif(10, 200, 210), runif(10, 210, 220), runif(10, 220, 230), runif(10, 240, 250), runif(10, 230, 240), runif(10, 250, 260), runif(10, 270, 280), runif(10, 260, 270), runif(10, 280, 290), runif(10, 290, 300), runif(10, 300, 310), runif(10, 320, 330), runif(10, 310, 320), runif(10, 330, 340), runif(10, 340, 350), runif(10, 350, 360), runif(10, 380, 390), runif(10, 370, 380), runif(10, 360, 370), runif(10, 390, 400), runif(10, 400, 410), runif(10, 410, 420), runif(10, 420, 430), runif(10, 450, 460), runif(10, 440, 450), runif(10, 430, 440), runif(10, 460, 470), runif(10, 490, 500), runif(10, 470, 480), runif(10, 480, 490))
plot(x, y, type = "l", main = "Simulated data - zigzaggy",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in votes")
index <- c(1:399)
z <- rep(NA, 399)
for (i in 1:399) {
  z[i] <- y[i + 1] - y[i]
}
plot(index, z, type = "l", main = "Simulated data - first-differenced - non-uniform movement",
     xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
     ylab = "Difference in difference in votes")
display(lm(z ~ index))
abline(lm(z ~ index), col = "red")

 

Did Leni cheat? We don’t know, but trendlines are insufficient evidence

The goal of the following exercise is to simulate a very simplified election in order to show that the apparent anomaly in which Marcos pulled ahead at first and then Robredo suddenly caught up and overtook him is consistent with a scenario in which no cheating occurred. In other words, there may or may not have been cheating, but the surge in Robredo’s vote totals doesn’t prove or even indicate it.

Suppose there are exactly 2000 election precincts in the Philippines.

Each precinct has 1,000,000 voters.

Further suppose that in 1,000, or exactly half of these precincts, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 55%, Robredo = 35%, Others = 10%

And suppose that in the other 1,000 or other half of these precincts, voters’ preferences for Vice-President followed a multinomial probability distribution as follows:

Marcos = 34.7%, Robredo = 55.3%, Others = 10%

As we can see, I’ve constructed simulated data that will give Robredo more votes in the final count. Before you accuse me of bias, you can just flip the numbers to give Marcos more votes if you want.

Finally, suppose we have two scenarios:

SCENARIO 1: Each precinct reports their vote tallies in random order.

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

[Plot: Marcos vs. Robredo running tally, random precinct reporting]

As we can see, Marcos (in red) and Robredo (in yellow) would hew VERY close to each other, as would be expected if the precincts that submitted their vote tallies to the COMELEC server did so randomly.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

[Plot: Marcos minus Robredo running tally, random precinct reporting]

The difference between Marcos and Robredo votes looks to be a fairly random process as well, trending towards favoring Robredo (because that’s how I simulated the data here) but alternating unpredictably between narrow and wide gaps, with peaks and valleys along the way.

If, in the actual Philippine election, the order in which each precinct reported in to COMELEC was completely randomized, we would expect graphs that look like the above. However, we know that the order in which each precinct reported in to COMELEC was not random; instead, certain regions reported earlier than others, so that overall, more precincts in the north reported first with precincts in the middle catching up later, for example. Let’s simulate something like this with our second scenario.

SCENARIO 2: All the precincts that favor Marcos report first, then the precincts that favor Robredo report last.

Under this scenario, let’s look at what a plot of how many precincts reported vote tallies vs. each candidate’s running total looks like:

[Plot: Marcos vs. Robredo running tally, Marcos precincts report first]

Now we see Marcos pulling away first, but after 1000 precincts report, Robredo begins closing the gap, and overtakes Marcos after the last few precincts come in.

In these two scenarios, the final count of Marcos vs. Robredo votes is identical, but depending on the order in which precincts reported their tallies, the trend lines will look very different.

Let’s also look at the number of precincts reporting tallies vs. the difference between Marcos and Robredo’s vote count:

[Plot: Marcos minus Robredo running tally, Marcos precincts report first]

Now it looks like a very unnatural upside-down V! That’s because the gap widens first as all the precincts where Marcos leads report, then the gap narrows as all the precincts where Robredo leads report.

I have demonstrated here that the trendlines in both candidate total vote share and difference in votes as the number of precincts reporting increases, which some have pointed as evidence of cheating, are in fact consistent with a situation where no manipulation occurred while the votes were being reported.

As a statistician worth their salt would say, this is not to claim that no cheating occurred; it is to claim that that upside-down V is not sufficient evidence to claim that cheating did occur. This is analogous to hypothesis testing where we do not say that the alternative hypothesis is false, but rather we say that we failed to reject the null hypothesis.

Now, of course, the Philippines has way more than 2,000 precincts, they all have very different population sizes, and voters’ preferences are much more diverse than the super simplified scenario I depicted here. But we also do know that there are regional voting patterns. There are definitely clusters of precincts that are next to each other that all went for one candidate. And some of these clusters sent their tallies to COMELEC before others did. So given the actual election data and process, it wouldn’t be out of the ordinary to see a scenario like the above.

Some code in the R programming language to replicate the above follows. If you’re not interested in the code, you can stop reading here.

I didn’t set seeds this time, so your plots will look marginally different, but the patterns I pointed out will all hold.

marcos_leads_20p <- rmultinom(1000, 1000000, prob = c(0.55, 0.35, 0.05, 0.03, 0.01, 0.01))
robredo_leads_20.6p <- rmultinom(1000, 1000000, prob = c(0.347, 0.553, 0.04, 0.03, 0.02, 0.01))

combined <- cbind(marcos_leads_20p, robredo_leads_20.6p)
combined_order <- sample(1:2000, 2000)

marcos <- rep(NA, 2000)
robredo <- rep(NA, 2000)
marcos_total <- 0
robredo_total <- 0

for (i in 1:2000) {
marcos_total <- marcos_total + combined[1, combined_order[i]]
marcos[i] <- marcos_total
robredo_total <- robredo_total + combined[2, combined_order[i]]
robredo[i] <- robredo_total
}

plot(1:2000, marcos, col = “red”, type = “l”,
main = “Marcos vs. Robredo Running Tally, Random Precinct Reporting”,
xlab = “Number of Precincts Reported”, ylab = “Total Number of Votes”)
points(1:2000, robredo, col = “yellow”, type = “l”)

# Scenario 2: the 1,000 Marcos-leaning precincts all report first,
# followed by the 1,000 Robredo-leaning precincts
marcos_2 <- rep(NA, 2000)
robredo_2 <- rep(NA, 2000)
marcos_total <- 0
robredo_total <- 0
for (j in 1:1000) {
  marcos_total <- marcos_total + marcos_leads_20p[1, j]
  marcos_2[j] <- marcos_total
  robredo_total <- robredo_total + marcos_leads_20p[2, j]
  robredo_2[j] <- robredo_total
}
for (k in 1001:2000) {
  marcos_total <- marcos_total + robredo_leads_20.6p[1, k - 1000]
  marcos_2[k] <- marcos_total
  robredo_total <- robredo_total + robredo_leads_20.6p[2, k - 1000]
  robredo_2[k] <- robredo_total
}

plot(1:2000, marcos_2, col = "red", type = "l",
     main = "Marcos vs. Robredo Running Tally, Marcos Precincts Report First",
     xlab = "Number of Precincts Reported", ylab = "Total Number of Votes")
points(1:2000, robredo_2, col = "yellow", type = "l")

plot(1:2000, marcos - robredo, type = "l",
     main = "Marcos minus Robredo Running Tally, Random Precinct Reporting",
     xlab = "Number of Precincts Reported", ylab = "Vote Difference")
plot(1:2000, marcos_2 - robredo_2, type = "l",
     main = "Marcos minus Robredo Running Tally, Marcos Precincts Report First",
     xlab = "Number of Precincts Reported", ylab = "Vote Difference")


Update: Improved simulations!

My previous post detailed a method to simulate predicted probabilities that each vice-presidential candidate would “win” a Pulse Asia survey should the survey be repeated an infinite number of times. I’ll put the survey results here again for reference:

[Table: Pulse Asia comparative VP survey results]

Peter Julian Cayton, a professor at the UP Diliman School of Statistics and currently a Ph.D. student at the Australian National University in Canberra, had two criticisms of my approach, both of which I wholeheartedly agree with:

1.) The approach treats the survey results as fixed, when in fact the survey results have some uncertainty about them. In retrospect, I admit it was strange that I spent half my post talking about the margin of error and then proceeded to completely ignore it in my simulation. It isn’t good to treat the survey results as fixed, because the simulation results are very sensitive to slight changes in the parameters (see the short sketch after this list).

2.) The approach assumes that there are no geographic or socio-economic variations in people’s preferences. This is obviously false. I used a simplified assumption so that the model wouldn’t be too complex.
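To see point 1 concretely, here is a minimal sketch (not from my original post) that reruns the earlier fixed-parameter simulation twice: once with Marcos at his published 25%, and once just one point lower. The estimated “win” probability shifts substantially for a one-point change in the input:

# Hypothetical sensitivity check: estimated probability that candidate 1
# (Marcos) tops a simulated survey of 1,800, under fixed vote shares
win_prob <- function(p, n_draws = 100000) {
  sims <- rmultinom(n_draws, 1800, p)                  # repeat the survey n_draws times
  mean(apply(sims, MARGIN = 2, FUN = which.max) == 1)  # share of draws Marcos tops
}
win_prob(c(0.25, 0.23, 0.23, 0.14, 0.06, 0.05, 0.04))  # shares as published
win_prob(c(0.24, 0.23, 0.23, 0.14, 0.06, 0.05, 0.04))  # Marcos one point lower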

Therefore, I now present two improved simulations of the same survey data. The first simulation will explicitly model the uncertainty around the survey results, while the second will do that AND model geographic variations as well.

Simulation 1: uncertainty around vote shares

First, some code in R:

set.seed(9999)
# Maximum standard error for a proportion from a survey of n = 1,800
se <- sqrt(0.5 * 0.5 / 1800)
# Simulate each candidate's "true" vote share as a normal draw centered on the
# Pulse Asia result, with the maximum standard error as the standard deviation
marcos <- rnorm(1000000, 0.25, se)
robredo <- rnorm(1000000, 0.23, se)
escudero <- rnorm(1000000, 0.23, se)
cayetano <- rnorm(1000000, 0.14, se)
honasan <- rnorm(1000000, 0.06, se)
honasan[honasan < 0] <- 0  # truncate: a vote share can't be negative
trillanes <- rnorm(1000000, 0.05, se)
trillanes[trillanes < 0] <- 0
ewan <- rnorm(1000000, 0.04, se)
ewan[ewan < 0] <- 0
# For each of the 1,000,000 simulated sets of shares, draw one survey of 1,800
# respondents; rmultinom rescales the probabilities so they sum to 1
simulation <- matrix(nrow = 7, ncol = 1000000)
for (i in 1:1000000) {
  simulation[, i] <- rmultinom(1, 1800, c(marcos[i], robredo[i], escudero[i],
                                          cayetano[i], honasan[i], trillanes[i], ewan[i]))
}
# Count how often each candidate (row 1 = Marcos, ..., row 7 = "ewan") wins
table(apply(simulation, MARGIN = 2, FUN = which.max))

This time, instead of using marcos = 0.25, robredo = 0.23, escudero = 0.23, cayetano = 0.14, honasan = 0.06, trillanes = 0.05, and ewan = 0.04, as I did in the previous simulation, I explicitly model the uncertainty around each of these proportions by treating each as a normally distributed random variable with the mean being the Pulse Asia result and the standard deviation being the maximum standard error for a sample survey with 1,800 respondents. In other words: Before simulating which candidate will win the most times, I also simulate each candidate’s vote share. Just to be clear, in my previous post I criticized the use of this standard error as a basis for comparison between candidates, but if we’re only looking at the uncertainty around each individual proportion, this standard error is perfectly fine.

If that sounded like gobbledygook to you, what I’m doing is letting each proportion vary in a reasonable way. For example, Robredo’s Pulse Asia result is 0.23. Instead of fixing that result at 0.23, I recognize that “0.23” has some uncertainty and I allow 0.23 to vary in order to look something like this:

[Figure: simulated distribution of Robredo’s vote share, centered at 23%]

Thus, Robredo’s vote share can be as low as 19% or as high as 27%, but it’ll still most likely be around 23%.
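For reference, here is a minimal sketch (not part of the original code) of how a figure like the one above can be generated:

# Plot the simulated distribution of Robredo's vote share
se <- sqrt(0.5 * 0.5 / 1800)  # maximum standard error for n = 1,800
hist(rnorm(1000000, 0.23, se), breaks = 100,
     main = "Simulated Uncertainty Around Robredo's 23%",
     xlab = "Simulated Vote Share")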

I do this for all candidates. However, for candidates whose vote shares are very low, modeling their uncertainty in this way will occasionally give us negative values, and there’s no such thing as a negative proportion. So I do truncation: if Trillanes’s simulated vote share ever dips below 0%, I set it to exactly 0%.

I then simulate 1,000,000 draws from a multinomial distribution, like before, only this time, each draw has slightly different parameters because the parameters are themselves simulated. For example, on draw 1, marcos = 0.262, robredo = 0.219, escudero = 0.245, cayetano = 0.149, honasan = 0.047, trillanes = 0.054, and ewan = 0.065. (These won’t sum to exactly 1, because each share is drawn independently; rmultinom rescales the probabilities so that they do.) On draw 2, marcos = 0.26, robredo = 0.24, escudero = 0.223, cayetano = 0.138, honasan = 0.067, trillanes = 0.058, and ewan = 0.044.
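If you run the code above yourself, you can inspect the simulated shares behind any particular draw, e.g.:

# Peek at the simulated vote shares that parameterize draw 1
round(c(marcos = marcos[1], robredo = robredo[1], escudero = escudero[1],
        cayetano = cayetano[1], honasan = honasan[1],
        trillanes = trillanes[1], ewan = ewan[1]), 3)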

Here are the results:

[Table: Simulation 1 win counts by candidate]

Marcos still has the edge here, but he now has only a 70% chance of winning, while Robredo and Escudero each have a 15% chance. (Again, this is if the election were held that day and if the survey is properly representative of the voting population.)

Simulation 2: uncertainty around geographic vote shares

Now, let’s discard the assumption that vote shares are the same across the whole country. The Pulse Asia release breaks the results down by area: NCR, “BL” (Balance of Luzon), Visayas, and Mindanao. If Pulse Asia’s methodology is anything like SWS’s, we can assume it surveyed 450 people in each of these four areas. This means that the standard deviation used to model the uncertainty around each geographic vote share will be larger, because the sample size in each area is smaller.
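To put a number on that, compare the two maximum standard errors directly, using the same formula as before:

sqrt(0.5 * 0.5 / 1800)  # national sample: ~0.0118, about a 1.2-point standard error
sqrt(0.5 * 0.5 / 450)   # regional subsample: ~0.0236, twice as large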

Then we can do the same thing as above, but this time we will first run a separate simulation for each area and then combine them.

Here is some (inefficient and unnecessarily long – sorry) code that does all this. You can skip to the very end of it if you want, because it’s fairly uninteresting – except for the last line, which I’ll get to.


set.seed(9999)
# Pulse Asia regional results: each column is an area (NCR, Balance of Luzon,
# Visayas, Mindanao); each row is a candidate, in the order named below
geog <- data.frame(ncr = c(0.36, 0.12, 0.30, 0.12, 0.04, 0.03, 0.02),
                   bl = c(0.31, 0.22, 0.23, 0.08, 0.05, 0.05, 0.07),
                   vis = c(0.17, 0.35, 0.22, 0.13, 0.05, 0.06, 0.03),
                   min = c(0.18, 0.19, 0.19, 0.29, 0.08, 0.07, 0.02))
rownames(geog) <- c("marcos", "robredo", "escudero", "cayetano", "honasan", "trillanes", "ewan")
# Maximum standard error for a proportion from a regional subsample of n = 450
se_geog <- sqrt(0.5 * 0.5 / 450)
iter <- 1000000
# Simulate each candidate's share in each area as a normal draw around the
# Pulse Asia regional result
marcos_ncr <- rnorm(iter, geog$ncr[1], se_geog)
marcos_bl <- rnorm(iter, geog$bl[1], se_geog)
marcos_vis <- rnorm(iter, geog$vis[1], se_geog)
marcos_min <- rnorm(iter, geog$min[1], se_geog)
robredo_ncr <- rnorm(iter, geog$ncr[2], se_geog)
robredo_bl <- rnorm(iter, geog$bl[2], se_geog)
robredo_vis <- rnorm(iter, geog$vis[2], se_geog)
robredo_min <- rnorm(iter, geog$min[2], se_geog)
escudero_ncr <- rnorm(iter, geog$ncr[3], se_geog)
escudero_bl <- rnorm(iter, geog$bl[3], se_geog)
escudero_vis <- rnorm(iter, geog$vis[3], se_geog)
escudero_min <- rnorm(iter, geog$min[3], se_geog)
cayetano_ncr <- rnorm(iter, geog$ncr[4], se_geog)
cayetano_bl <- rnorm(iter, geog$bl[4], se_geog)
cayetano_vis <- rnorm(iter, geog$vis[4], se_geog)
cayetano_min <- rnorm(iter, geog$min[4], se_geog)
honasan_ncr <- rnorm(iter, geog$ncr[5], se_geog)
honasan_bl <- rnorm(iter, geog$bl[5], se_geog)
honasan_vis <- rnorm(iter, geog$vis[5], se_geog)
honasan_min <- rnorm(iter, geog$min[5], se_geog)
trillanes_ncr <- rnorm(iter, geog$ncr[6], se_geog)
trillanes_bl <- rnorm(iter, geog$bl[6], se_geog)
trillanes_vis <- rnorm(iter, geog$vis[6], se_geog)
trillanes_min <- rnorm(iter, geog$min[6], se_geog)
ewan_ncr <- rnorm(iter, geog$ncr[7], se_geog)
ewan_bl <- rnorm(iter, geog$bl[7], se_geog)
ewan_vis <- rnorm(iter, geog$vis[7], se_geog)
ewan_min <- rnorm(iter, geog$min[7], se_geog)
# Stack the simulated shares into one matrix with named rows, then truncate
# any negative simulated shares at zero
sim_geog_p <- rbind(marcos_ncr, marcos_bl, marcos_vis, marcos_min, robredo_ncr, robredo_bl, robredo_vis, robredo_min,
                    escudero_ncr, escudero_bl, escudero_vis, escudero_min, cayetano_ncr, cayetano_bl, cayetano_vis, cayetano_min,
                    honasan_ncr, honasan_bl, honasan_vis, honasan_min, trillanes_ncr, trillanes_bl, trillanes_vis, trillanes_min,
                    ewan_ncr, ewan_bl, ewan_vis, ewan_min)
sim_geog_p[sim_geog_p < 0] <- 0
sim_ncr <- matrix(nrow = 7, ncol = iter)
sim_bl <- matrix(nrow = 7, ncol = iter)
sim_vis <- matrix(nrow = 7, ncol = iter)
sim_min <- matrix(nrow = 7, ncol = iter)
# For each iteration, draw one simulated survey of 450 respondents per area
for (i in 1:iter) {
  sim_ncr[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_ncr", i],
                                      sim_geog_p["robredo_ncr", i],
                                      sim_geog_p["escudero_ncr", i],
                                      sim_geog_p["cayetano_ncr", i],
                                      sim_geog_p["honasan_ncr", i],
                                      sim_geog_p["trillanes_ncr", i],
                                      sim_geog_p["ewan_ncr", i]))
  sim_bl[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_bl", i],
                                     sim_geog_p["robredo_bl", i],
                                     sim_geog_p["escudero_bl", i],
                                     sim_geog_p["cayetano_bl", i],
                                     sim_geog_p["honasan_bl", i],
                                     sim_geog_p["trillanes_bl", i],
                                     sim_geog_p["ewan_bl", i]))
  sim_vis[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_vis", i],
                                      sim_geog_p["robredo_vis", i],
                                      sim_geog_p["escudero_vis", i],
                                      sim_geog_p["cayetano_vis", i],
                                      sim_geog_p["honasan_vis", i],
                                      sim_geog_p["trillanes_vis", i],
                                      sim_geog_p["ewan_vis", i]))
  sim_min[, i] <- rmultinom(1, 450, c(sim_geog_p["marcos_min", i],
                                      sim_geog_p["robredo_min", i],
                                      sim_geog_p["escudero_min", i],
                                      sim_geog_p["cayetano_min", i],
                                      sim_geog_p["honasan_min", i],
                                      sim_geog_p["trillanes_min", i],
                                      sim_geog_p["ewan_min", i]))
}
# Combine the four areas into one national tally, weighting each area by its
# share of the registered voting population (see below)
sim_geog <- 0.115*sim_ncr + 0.445*sim_bl + 0.208*sim_vis + 0.232*sim_min

In order to combine the results of separate simulations for NCR, the rest of Luzon, Visayas, and Mindanao, one approach would be to just add the four simulations together.

We know, however, that the Philippine voting population is not equally distributed across these four areas. Straight-up adding the four simulations together assumes that each area has exactly 25% of the voting population, and that’s patently false.

Commission on Elections records as of November 2015 show that NCR has 11.5% of voters nationwide, the rest of Luzon has 44.5%, Visayas has 20.8%, and Mindanao has 23.2%.

The last line, therefore, assigns to the variable sim_geog the four simulations combined according to the percentage of the voting population each area holds. This is called weighting.
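One housekeeping note: the listing above stops at the weighting step. The tables below come from the same table(apply(...)) tabulation used in Simulation 1, applied to each area and to the weighted national matrix, along these lines:

# Tabulate how often each candidate (row 1 = Marcos, ..., row 7 = ewan)
# tops each simulated survey; the weights 0.115 + 0.445 + 0.208 + 0.232 sum to 1
table(apply(sim_ncr, MARGIN = 2, FUN = which.max))
table(apply(sim_bl, MARGIN = 2, FUN = which.max))
table(apply(sim_vis, MARGIN = 2, FUN = which.max))
table(apply(sim_min, MARGIN = 2, FUN = which.max))
table(apply(sim_geog, MARGIN = 2, FUN = which.max))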

With all this in place, we’re ready to look at the results! First, NCR:

[Table: simulated win counts for NCR]

Marcos = 88.5%, Escudero = 11.5%. 0% for Robredo and everyone else.

What about the rest of Luzon?

[Table: simulated win counts for the rest of Luzon]

Marcos = 93.3%, Robredo = 2.5%, Escudero = 4.2%.

Visayas?

[Table: simulated win counts for Visayas]

Robredo wins it big!

Marcos = 0% (well, 0.009%, but that’s pretty much 0%), Robredo = 99.5%, Escudero = 0.5%, Cayetano = 0% (again, 0.0001% is pretty much 0%).

And Mindanao?

[Table: simulated win counts for Mindanao]

Cayetano is a rockstar, no doubt due to being Duterte’s running mate.

Marcos = 1%, Robredo = 1.5%, Escudero = 1.5%, Cayetano = 96%.

None of this should come as a surprise, considering that you could have guessed these results just by looking at the Pulse Asia table.

But here’s the big one: what about the combined results across the whole nation?

[Table: simulated win counts, weighted national combination]

Marcos = 78.5%, Robredo = 11.5%, Escudero = 10%.

Yeah, looks like Robredo and Cayetano could stand to gain some popularity outside Visayas and Mindanao, respectively.

Remember, though: These results are all obtained via information from a single survey conducted two months before the election. Public opinion can still change – not to mention that, as far as I know, the survey doesn’t cover overseas voters (Fun fact: overseas Filipinos can vote as early as April 9.) Unfortunately, the survey also doesn’t account for bloc voting and vote buying (which usually takes place just a couple of weeks before the elections).

A comment by UP College of Mass Communication’s Dr. Clarissa David puts it best:

“Odds cannot predict things like Binay tanking both debates, or the Kidapawan conflict, or any number of unimaginable events that can happen during an election period. That’s why odds change so dramatically with each round of survey. The odds as calculated assume that nothing out of the ordinary will happen. This is why predicted odds are not a thing in electoral studies. Surveys before the election are never intended to “predict” the winner, they give us a picture of current sentiment.”

I would argue that probabilities give us a more intuitive picture of current sentiment than do difficult-to-interpret proportions and margins of error, but I agree with Dr. David.

I am grateful to Peter for suggesting improvements to the simulations, as well as to Jan Carlo Punongbayan, who conducted follow-up analysis and spurred further discussion based on my original blog post.

(And to Jan Fredrick Cruz for letting me know about JC Punongbayan’s note in the first place.)