Around 0.6% of Filipinos aged 15-50 are considered Class AB

Sometime in 2015, NEDA carried out what they called the “Filipino 2040 National Survey on the Aspirations, Values and Principles of Filipino People”, a survey of 10,000 Filipinos aged 15 to 50 from all across the country (except for high-risk areas like Basilan, but that’s neither here nor there). Summary tables of responses to all the questions can be found here, and a technical report can be found here.

What interests me the most about this survey, however, is that each of the summary tables just happens to provide the weighted sample sizes of five socio-economic groups – Class AB, “Upper C”, “Broad C”, D, and E – which means that we can get a rough idea of class inequality without having to use self-reported income, which is tricky to measure. The technical report also provides a handy matrix of what exactly is meant by “socio-economic class”, a term primarily used by market researchers:

[Image: the socio-economic classification matrix from the technical report]

I’ve discussed this before, but measuring socio-economic class is left to the interviewer’s discretion. It’s generally expected that if, say, someone has a fairly large income, they’ll also be well-educated, have new facilities, a durable home, etc., although this obviously isn’t always true – but ambiguous cases are left up to the interviewer to decide.

According to the NEDA survey, among Filipinos aged 15-50 in 2015:

  • Around 0.6% were class AB;
  • Around 5% were class “Upper C”;
  • Around 17% were class “Broad C”;
  • Around 48% were class D;
  • Around 29% were class E.

Some other very interesting, and silly, things about the above matrix:

  • Both class AB and “Upper C” contain college graduates, but they can be distinguished from one another by whether they went to an exclusive university, including UP, or to a state university.
  • One way to tell whether you’re “Upper C”: you’re a junior executive or a young professional… or a provincial town official. Meanwhile, “skilled overseas workers” get shunted into “Broad C”.
  • The more your house needs a coat of paint, the lower class you are.
  • You can’t be class AB if you don’t have a car newer than 5 years old. Better get to that dealership.

Jokes aside, public opinion polls such as those done by SWS and Pulse Asia use a similar methodology to determine social class, though not exactly the same groups. So the next time some major poll comes out and everyone’s scratching their heads over some perceived shift in class ABC or whatever, keep that matrix in mind.

Note: The above percentages are based on a survey of Filipinos aged 15-50. Public opinion polls usually survey Filipinos 18 and up, including people well older than 50, so don’t take the above percentages as being descriptive of all Filipino adults.

(Thanks to Geneve for letting me know about the NEDA survey.)


Does the “margin of error” tell me how correct a survey number is? | What is a non-probability sample?


Here’s an example: A survey reports that 50% of respondents were satisfied with President X, with a margin of error of 3%. As you may have learned in Stat 101, this means that theoretically, if we repeated this survey over and over, then 95% of the intervals constructed this way – [47%, 53%] in this case – would contain the “true value” of satisfaction with President X.

But what if, for example, most of President X’s supporters are young people, but the survey had far more old people in it? In other words, what if the survey was not representative of the population?

Then President X’s true support might be much higher, say 70% for example, in which case the margin of error tells you jack squat.

The difference between a survey estimate and the “true” value for the whole population of interest is called bias, and the margin of error tells you nothing about it. 70% isn’t anywhere near [47%, 53%].
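A quick simulation makes this concrete. The numbers below are purely hypothetical: a population that is half young and half old, with 70% and 30% support respectively, so true support is 50%. A sample that under-represents young people lands outside the margin of error almost every time:

```python
import random

random.seed(1)

def draw_sample(n, share_young):
    """Simulate n respondents; young people support X at 70%, old people at 30%."""
    support = 0
    for _ in range(n):
        is_young = random.random() < share_young
        p_support = 0.70 if is_young else 0.30
        support += random.random() < p_support
    return support / n

n = 1200
true_value = 0.50                 # population is half young, so true support is 50%
moe = 1.96 * (0.25 / n) ** 0.5   # maximum margin of sampling error, ~2.8%

def coverage(share_young, reps=200):
    """Fraction of repeated surveys whose interval [est - moe, est + moe] covers the truth."""
    estimates = [draw_sample(n, share_young) for _ in range(reps)]
    return sum(abs(est - true_value) <= moe for est in estimates) / reps

print("representative sample (50% young):", coverage(0.5))  # close to 0.95
print("biased sample (20% young):        ", coverage(0.2))  # close to 0
```

The biased samples cluster around 38% support, so an interval of ±2.8% around them essentially never reaches the true 50% – exactly the “jack squat” scenario above.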

A truly random sample should have no bias, but truly random samples of something as diffuse as the population of the Philippines are horrendously difficult, perhaps impossible, to get. So even the numbers that SWS and Pulse Asia put out, despite their best efforts, may be biased because of unknown factors.

The problem is that bias is difficult to measure. You’d have to know what the “true value” actually was, and you aren’t going to get that without a census – a survey of literally everybody in the population – whose numbers you could then use as benchmarks. For now, we can only trust that survey firms’ methodologies cover all bases.


The latest survey to come out in the news is a non-probability online survey of 1,200 registered voters aged 19-35, conducted by lobbying and campaign management firm Publicus Asia. I have previously made fun of Publicus in undiplomatic language, but in this case, while I would take the numbers with a grain of salt, there isn’t anything that strikes me as particularly deceptive about this latest effort. The term non-probability survey means that, rather than attempting a random sample (which would be considerably more expensive than what SWS and Pulse Asia usually do, considering that all 1,200 respondents have to be within a certain age group AND have to be registered to vote), they just put a poll somewhere online and let people sign up for the survey of their own volition. Another kind of non-probability survey would be those surveys you do in college where you stand somewhere with pleading Bambi eyes and try to get as many people as possible to fill out your paper form so you can meet your Psych 101 deadline.

Let’s assume that everyone who signed up for Publicus’s survey was indeed between age 19-35 and a registered voter (and a Filipino), which isn’t necessarily true because there isn’t any real way to check. The main issue with a non-probability sample is whether or not the resulting 1,200 people are fully representative of all Filipinos who are aged 19-35 and registered to vote. For example, do they have the same socio-economic breakdown? Is the ratio of males to females similar? Are they geographically distributed across the country? Do they have the same levels of education? And so on.

I bring this up because the term “margin of error” describes what would happen with repeated random samples all the way to infinity. Notice how Publicus Asia doesn’t report any margin of error? That’s as it should be – the margin of sampling error technically doesn’t apply to non-probability surveys, and pretending to have one would be deceptive. There is also, of course, uncertainty over what the numbers would be if they did the same exact survey again, but the extent of that uncertainty is unknown.

If the non-probability sample has too many of a particular type of person compared to the rest of the Philippines, then its estimates may be biased. For example, Publicus’s survey was administered online, so the sample could overrepresent people with Internet access, which may also be correlated with socioeconomic status. This is not political bias – it’s just a statistical term.

Theoretically, a random sample can grab an adult Filipino from anywhere in the country, and so a large enough random sample will be representative on all these points, and will be unbiased. We don’t have that same assurance with a non-probability sample, where we don’t really have a clear sense of what caused someone to opt into the survey. Were they feeling bored? Did they visit a particular website that hosted the banner ads recruiting people into the survey? Why’d they click? Why did they finish the entire thing?

In practice, non-probability samples are not useless. Plenty of market researchers use non-probability samples mostly because they’re easy to do, but also because they aren’t super interested in population inference – they just want anyone who could potentially buy what they’re hawking. For those who do care about population inference, or making a sample look like the population, techniques such as applying nonresponse adjustments, calibration, and poststratification can produce accurate estimates out of non-probability samples. These same techniques are, in fact, also applied to “probability-based” surveys, because most surveys aren’t really random, even the ones done as rigorously as possible. SWS and Pulse Asia, for example, do a basic adjustment where after sampling 300 people from each of NCR, the rest of Luzon, Visayas and Mindanao, they adjust their numbers to match the actual percentages of people who live in those places – according to the census.
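To give a flavor of how poststratification works, here is a minimal sketch with made-up numbers: an online sample with too many urban respondents is reweighted so that its urban/rural split matches hypothetical census shares. The strata, shares, and satisfaction rates are all invented for illustration; real poststratification uses many more cells (age, sex, region, education, etc.):

```python
# Hypothetical sample of 1,200: urban respondents are overrepresented
# (they make up 2/3 of the sample) and are more often "satisfied".
sample = ([("urban", True)] * 500 + [("urban", False)] * 300
          + [("rural", True)] * 100 + [("rural", False)] * 300)

# Hypothetical census shares: 45% urban, 55% rural.
population_share = {"urban": 0.45, "rural": 0.55}

# Shares of each stratum actually observed in the sample.
n = len(sample)
sample_share = {s: sum(1 for g, _ in sample if g == s) / n for s in population_share}

# Poststratification weight = population share / sample share.
weights = {s: population_share[s] / sample_share[s] for s in population_share}

# Unweighted vs weighted estimate of % satisfied.
unweighted = sum(y for _, y in sample) / n
weighted = (sum(weights[g] * y for g, y in sample)
            / sum(weights[g] for g, _ in sample))
print(f"unweighted: {unweighted:.1%}, weighted: {weighted:.1%}")
```

The unweighted estimate is 50%, but downweighting the overrepresented urban stratum pulls it down to about 41.9% – which is the point: the adjustment only helps if the census shares you calibrate to are themselves accurate.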

Many of these techniques require accurate census data. Really, the best way to advance survey research in the Philippines would be to start by beefing up the government’s capability to conduct the national census. Not only do you need that census to pull off fancy adjustments with your surveys, you also need that census to figure out whether what you’re doing is working in the first place.

5 things to keep in mind about the uncertainty of Philippine survey numbers

The latest surveys by SWS and Pulse Asia are out. Here are 5 things that social scientists, journalists, policymakers, political strategists, and other people who care about these sorts of things should keep in mind when reading the press releases:

1. Technically, each question has its own margin of error.

The margin of sampling error is calculated using the formula for the standard error of a binomial proportion multiplied by 1.96:

\sqrt{\frac{p_1(1 - p_1)}{n}} * 1.96

where p_1 is the percentage of some response to a question and n is the sample size. For example, if 80% of people in a survey of size 1,500 said that they were satisfied with President Duterte, then p_1 is 0.8 and n is 1,500. The margin of sampling error would then be \sqrt{\frac{0.8 * 0.2}{1500}} * 1.96 = 0.02 or 2%.

Each question will have a different p_1, therefore each question has its own margin of sampling error. It is generally considered too bothersome for the purposes of a press release to give every single number its own margin of sampling error. Therefore, press releases err on the side of caution.

The largest possible value for the margin of sampling error is achieved when p_1 = 0.5. This will result in a margin of sampling error of 2.5% for a survey of size 1,500, as with SWS, or 3% for a survey of size 1,200, as with Pulse Asia. Since it is better to assume more uncertainty than less, pollsters simply take the maximum margin of sampling error and say that it’s the margin of sampling error for every question.
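These calculations are easy to verify in a few lines of Python:

```python
import math

def moe(p, n):
    """95% margin of sampling error for a proportion from a simple random sample."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Worked example from above: 80% satisfied out of n = 1,500.
print(round(moe(0.80, 1500), 3))  # ~0.020, i.e. 2%

# Maximum margin of error, attained at p = 0.5:
print(round(moe(0.50, 1500), 3))  # ~0.025 (SWS, n = 1,500)
print(round(moe(0.50, 1200), 3))  # ~0.028, reported as 3% (Pulse Asia, n = 1,200)
```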

2. The margins of error for “net” questions and for changes over time are much larger.


SWS in particular likes to report “net” statistics. For example, they report that President Duterte has a “net satisfaction rating” of 48, which is calculated by taking the % who said they were satisfied minus the % who said they were dissatisfied.

The formula for calculating the margin of error from a “net” statistic – technically, the standard error of the difference between two proportions from a multinomial distribution multiplied by 1.96 – is as follows:

\sqrt{\frac{p_1(1-p_1) + p_2(1-p_2) + 2p_1p_2}{n}} * 1.96

where, for example, p_1 would be % satisfied and p_2 would be % dissatisfied. Again, this means that each question would have its own margin of error; and again, for simplicity, we would just assume the maximum margin of error and apply that to all questions. The maximum margin of error is achieved if both p_1 and p_2 are 0.5. For a survey of size 1,500, then:

\sqrt{\frac{0.5 * 0.5 + 0.5 * 0.5 + 2*0.5*0.5}{1500}} * 1.96 = 0.05, or 5%.

The margin of error for a net statistic is twice the reported margin of sampling error. Keep this in mind when reading SWS reports. Pulse Asia isn’t a fan of reporting net statistics.

SWS also likes to report the change in the “net” over time. For example, it reports that President Duterte’s net satisfaction rating fell from 66 in June 2017 to 48 in Sept 2017. Guess what? The margin of sampling error for the change in the net statistic is even larger.

Let p_1 and p_2 be % satisfied and % dissatisfied in June 2017, and let p_3 and p_4 be % satisfied and % dissatisfied in Sept 2017. Both surveys have the same sample size of n = 1500. Then the margin of sampling error for the 18-point change in Duterte’s net satisfaction rating is:

\sqrt{\frac{p_1(1-p_1) + p_2(1 - p_2) + 2p_1p_2 + p_3(1-p_3) + p_4(1-p_4) + 2p_3p_4}{n}} * 1.96.

The maximum value of this margin of sampling error is when all of those p’s are 0.5, which comes out to 7.2%.

So if you want to look at the % of people in a single survey who were satisfied, then the margin of sampling error is 2.5%. If you want to look at the “net satisfaction”, the margin of sampling error is 5%. If you want to look at the change in the net satisfaction over two periods, the margin of sampling error is 7.2%.
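Putting the formulas from this section into code (Python, mirroring the notation above):

```python
import math

def moe_net(p1, p2, n):
    """95% MOE for a net statistic (p1 - p2) from one multinomial sample of size n."""
    return 1.96 * math.sqrt((p1*(1-p1) + p2*(1-p2) + 2*p1*p2) / n)

def moe_net_change(p1, p2, p3, p4, n):
    """95% MOE for the change in a net statistic across two independent surveys of size n."""
    return 1.96 * math.sqrt(
        (p1*(1-p1) + p2*(1-p2) + 2*p1*p2 + p3*(1-p3) + p4*(1-p4) + 2*p3*p4) / n
    )

n = 1500
print(round(moe_net(0.5, 0.5, n), 3))                    # maximum: ~0.051, i.e. 5%
print(round(moe_net_change(0.5, 0.5, 0.5, 0.5, n), 3))   # maximum: ~0.072, i.e. 7.2%
```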

3. The reported margin of error is almost certainly too low.

As stated above, SWS reports a margin of sampling error of 2.5% for their surveys of size 1,500, while Pulse Asia reports a margin of sampling error of 3% for their surveys of size 1,200. These numbers are accurate according to the formulas above. They are also wrong.

The formulas above all assume independent observations from a simple random sample. That is, they imagine a process where we have a list of every single eligible adult in the Philippines, and we randomly pick 1,200 or 1,500 people from that list and interview them.

That list doesn’t exist.

Instead, what polling firms do is they take a list of every single municipality and city in the Philippines, and select some number of them to conduct interviews in. Municipalities and cities with more households in them are more likely to be selected.

Then they take a list of every single barangay in the selected areas, and select some number of them to conduct interviews in. Barangays with more households in them are more likely to be selected.

Then they select five households in each barangay to interview by choosing a random starting point and sampling every 7th house (or 5th, or whatever, depending on the size of the barangay).

Finally, they knock on the door of a household, ask how many people age 18 and above live there, and choose one of them at random to interview.

Now the issue is that people from the same area are more likely to share the same opinions than people who aren’t from the same area. In other words, a sample of 1,500 obtained like this doesn’t actually contain 1,500 independent observations; it contains 300 clusters of 5 people each, whose opinions are somewhat more similar to each other.

The formula for calculating the margin of sampling error from a multi-stage cluster sampling design like this is quite complicated. However, it will certainly be larger than what is reported. Unfortunately, I cannot recalculate the margin of sampling error myself, because I would need to know the mean response for every single cluster, and that isn’t happening without access to the raw data.

As an example, however, consider the survey conducted by Princeton Survey Research Associates International on behalf of the Pew Research Center in the Philippines. (Click the link and search “Philippines”). The survey had a sample size of 1,000. If SWS or Pulse Asia had done that survey, they would have reported a margin of sampling error of 3.1%; however, taking clustering into account, PSRAI reports the margin of sampling error to be 4.3%.
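One way to see how much clustering inflates uncertainty is to back out the design effect implied by PSRAI’s reported numbers. This is a rough calculation – the true design effect varies by question – but it shows that a clustered “sample of 1,000” can carry roughly the information of a much smaller simple random sample:

```python
import math

# PSRAI's Philippine survey: n = 1,000. Under simple random sampling the
# maximum MOE would be ~3.1%, but PSRAI reports 4.3% after accounting for clustering.
srs_moe = 1.96 * math.sqrt(0.25 / 1000)   # ~0.031
reported_moe = 0.043

deff = (reported_moe / srs_moe) ** 2      # design effect: variance inflation factor
n_effective = 1000 / deff                 # effective sample size under clustering

print(f"design effect ~ {deff:.2f}")
print(f"effective sample size ~ {n_effective:.0f}")
```

The implied design effect is roughly 1.9, i.e. the clustered sample of 1,000 is worth only around 500 independent observations.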

4. There is no consistent, objective measure of socioeconomic class.

Both SWS and Pulse Asia like reporting statistics for “Class ABC”, “Class D”, and “Class E”. There is no accepted definition as to who falls in which class. I emailed Ronald Holmes, Pulse Asia Research President, asking him how Pulse Asia determines whether a respondent is in class ABC, D, or E, and he replied:

A number of factors are used to classify households that are sampled. These factors include total household income; household facilities/furnishings; occupation of household head; educational attainment; home ownership; home maintenance; durability of the home; and, conditions of the neighborhood, among others. The indicators are culled from prior social science/market research and our enumerators document these indicators for subsequent classification into socio-economic classes of the sampled respondents/household.

There are exact criteria but there are also other criteria subject to the judgment of the enumerator. We do regularly ask about occupation of the household head, educational attainment and home ownership but not regularly on household income.

In other words, the field staff have to make some judgments about aspects of a person’s socioeconomic class via observation, and then SWS and Pulse Asia determine socioeconomic class via some index that may differ between the two firms.

The Class ABCDE construct is itself largely an invention of market research. For example, Nielsen divides the population of the Czech Republic into eight classes – A, B, C1, C2, C3, D1, D2 and E – and fits a regression model on variables such as household composition, occupation of the household head, household equipment, household income, education of the household head, and region of the country to assign each person a ‘score’. The top 12.5% are considered class A, the next 12.5% class B, and so on until the bottom 12.5% go to class E, such that an equal number of people are in each class.

However, SWS and Pulse Asia do not do it this way. I do not know exactly how they construct their socioeconomic class measure, but class ABC typically makes up less than 10% of the sample. This implies larger margins of error for statistics calculated over class ABC only, just like how Luzon, Visayas and Mindanao have larger margins of error.

For example, if we assume that only 150 out of 1,500 people in an SWS survey are class ABC, the margin of sampling error for class ABC would be 8%. The margin of sampling error for the change over time between two measures would be 11%.

As previously discussed, the margin of sampling error for a “net satisfaction” rating would be 8% doubled, or 16%. The margin of sampling error for the change over time in the net satisfaction rating would be a whopping 22.6%. This means that all but the most cataclysmic change in the net satisfaction rating of class ABC would be within sampling error.

Class D typically makes up about 60% of a sample, while class E typically makes up about 30% of a sample.
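The subgroup figures above follow from the same formulas as before. At the maximum p = 0.5, the net MOE is exactly double the single-proportion MOE, and comparing two independent surveys multiplies the MOE by √2:

```python
import math

def max_moe(n):
    """Maximum 95% MOE for a single proportion (attained at p = 0.5)."""
    return 1.96 * math.sqrt(0.25 / n)

n_abc = 150  # class ABC's rough share of an SWS sample of 1,500

# These shortcuts hold at the maximum p = 0.5:
single = max_moe(n_abc)            # single %:       ~8%
change = math.sqrt(2) * single     # change in %:    ~11%
net = 2 * single                   # net statistic:  ~16%
net_change = math.sqrt(2) * net    # change in net:  ~22.6%

print(f"{single:.1%} {change:.1%} {net:.1%} {net_change:.1%}")
```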

5. The margin of sampling error is purely statistical; it does not include error that comes from contextual factors, from undercoverage, or from nonresponse.

All sorts of things can affect what response someone gives to a question. Here’s a (picture of a) slide from the University of Michigan summarizing these things:


Survey research literature has found, for example, that:

  • Older people and people with less education tend to give more agreeable responses (satisfied, approve, trust, etc.), regardless of what the question is asking about.
  • Many respondents will give the “socially desirable” answer rather than their true answer when asked about topics such as how often they smoke, or their views about poor people, etc. This is exacerbated when a live interviewer is present, as is the case with SWS and Pulse Asia, but is also present when the respondent is speaking over the phone. It pretty much disappears online where the respondent has more anonymity.
  • Many respondents will also give “socially desirable” answers if the interviewer looks like they might prefer it, or if someone else is present in the room. Interviewers generally request that they be allowed to survey someone alone, but they can’t really force it. If you are interviewing someone about their trust in Duterte, and their spouse is in the room wearing a Duterte shirt, then you can expect that they will indicate trust in Duterte regardless of what their true beliefs are. Some LGUs also insist that surveys be conducted with LGU officials present, which makes political measures less trustworthy.
  • According to SWS and Pulse Asia, respondents tend to open up more to female interviewers. This is why their field interviewers are all female.
  • The longer a survey goes, the more a respondent wants to get it over with. (You don’t really need a lit review to figure this out). This may result in “default” answers.

Undercoverage is also a real problem. For obvious reasons, the list of barangays will not include heavily conflict-affected barangays where the interviewer’s life would be threatened. Filipinos who are abroad also have 0 probability of getting selected (though balikbayans who happen to be in the country could still get sampled). Time of day may also affect who is available. I do not know exactly what field protocols are, but if, for example, interviewers are only in the field during working hours, then the sample will largely consist of people who are unemployed or who work from home. And what about exclusive gated communities, where you can’t even wander around without getting past a security guard? The government might be able to pull it off with official census workers, but private firms will have less luck.

Nonresponse is another problem that we have almost no information on. According to Pulse Asia,

Respondents sampled who were not available during first attempt were visited again with a maximum of two valid call backs. If the respondent remained unavailable after two valid call backs, a substitute who possessed the same qualities (in terms of gender, age bracket, working status and socio-economic class) as the original respondent was interviewed. The substitute respondent was taken from another household beyond the covered intervals in the sample barangay by continuing the interval sampling.

The primary concern here is that people who are willing to respond to surveys may have different opinions compared to people who are not. To my knowledge, this has never been studied in the Philippines. I’m not even sure how you would do so. The Pew Research Center study of telephone nonresponse in the United States was able to do things like compare telephone surveys, which have very low response rates, to much more expensive face-to-face surveys with high response rates, or to publicly available voter records, in order to check whether the responding sample looked systematically different. We can’t do anything comparable in the Philippines. On the other hand, response rates in Philippine face-to-face surveys reportedly exceed 50%, so to my mind it isn’t as big of a problem compared to undercoverage and contextual effects.

Here’s the summary: Numbers from a survey have much greater uncertainty than the polling firms claim. Surveys are useful indicators of public opinion, but we should not assign too much weight to, say, a 7-point decline in “net satisfaction” between two surveys. (By contrast, the 18-point decline is worth taking into consideration.)

Double-barreled responses: Congress poll on lowering minimum age of criminal responsibility

A one-question online poll on the website of the Philippine Congress gives me the opportunity to discuss questionnaire design a little bit.



This primer by Jon Krosnick outlines conventional wisdom on how to design a survey question from decades of research in survey methodology. Among the recommendations are to make response options exhaustive and to avoid leading or loaded word choices that push respondents towards an answer.

The problem with the response options above is that they are “double-barreled”: the question asks you for one answer, but each response option actually contains two. You have the option of choosing Yes AND agreeing that the youth should be responsible for their actions and words as early as possible, or choosing No AND agreeing that punishing children violates child rights. The response options thus do not exhaust the possible ways of framing the issue. For example, a respondent might think that the youth should be responsible as early as possible but that nine years is too young, or that effective punishment would not necessarily deprive children of the chance to improve their lives.

The question laudably tries to present balanced viewpoints on the issue, and letting respondents grapple with some arguments for and against certainly produces more informed responses than if the response options were just straight-up Yes, No or Undecided. To avoid the double-barreling discussed above, a better way to ask the question would be to simply move the arguments from the responses into the question, as follows:

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. [RANDOMIZE ORDER OF PARENTHESES] Some argue that (the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes), while others argue that (punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character). Do you favor or oppose this proposal, or are you undecided?

  1. Favor
  2. Oppose
  3. Undecided

By doing so, we allow respondents to consider policy arguments in their heads while not railroading them into agreeing with a fixed set of arguments along with their opinion on whether they favor or oppose the proposal.

I would also suggest two other changes, both already incorporated into the question above. The first is to remove the phrase mentioning “unduly pampering” children with “impunity from criminal responsibility”. These word choices are loaded with negativity and may thus influence respondents’ answers. Furthermore, the argument being made in that phrase is effectively the same as arguing that children should take responsibility at an earlier age, making the statement redundant.

The second is to randomize the order in which the arguments are presented to the respondent. The text above is not what would be shown to the viewer; it is an instruction to whoever is in charge of programming the survey. Upon visiting the website, each respondent would have a 50-50 chance of seeing one of the two following questions:

Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. Some argue that the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes, while others argue that punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character. Do you favor or oppose this proposal, or are you undecided?


Lawmakers have proposed that the minimum age of criminal responsibility be lowered from 15 to 9 years. Some argue that punishing children in conflict with the law violates child rights and deprives them of the chance to rebuild their lives and improve their character, while others argue that the youth should be responsible for their actions and words as early as possible and that this would serve as a deterrent to the use of youth in the commission of crimes. Do you favor or oppose this proposal, or are you undecided?

There is a possibility that the argument that respondents see either first or most recently would stick out in their minds more. We do not want this to influence their answer, and switching up the order of the arguments should help prevent this.
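For an online poll, the [RANDOMIZE ORDER OF PARENTHESES] instruction takes only a few lines to implement. A sketch in Python, using the wording from the suggested question above (the function name is mine):

```python
import random

PROPOSAL = ("Lawmakers have proposed that the minimum age of criminal "
            "responsibility be lowered from 15 to 9 years. ")
ARGUMENTS = [
    "the youth should be responsible for their actions and words as early as "
    "possible and that this would serve as a deterrent to the use of youth in "
    "the commission of crimes",
    "punishing children in conflict with the law violates child rights and "
    "deprives them of the chance to rebuild their lives and improve their character",
]

def render_question():
    """Build the question text, showing the two arguments in a random order."""
    first, second = random.sample(ARGUMENTS, 2)  # 50-50 chance of either order
    return (PROPOSAL
            + f"Some argue that {first}, while others argue that {second}. "
            + "Do you favor or oppose this proposal, or are you undecided?")

print(render_question())
```

Each call produces one of the two orderings with equal probability, so across many respondents any primacy or recency effect washes out on average.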

There are of course broader issues to consider. The poll itself is an interesting exercise, but without any information on who is answering (i.e. who makes up the sample), the poll cannot provide any evidence on Philippine public opinion. Nearly 18,000 respondents as of 10:00 AM on 13 January 2017 may look more impressive than the 1,200 respondents we see from SWS or Pulse Asia surveys, but they may, for example, largely be highly-educated, English-speaking, middle to upper class Filipinos with Internet access who were interested enough in politics to answer a poll on the Congress website – or, in short, unrepresentative of the Filipino population as a whole. Hopefully no lawmaker or pundit will look to this poll to say that 90% of Filipinos oppose the proposal, because this poll does not provide sufficient evidence for that assertion.

Statistical adjustment of the results of online polls to reflect the population is an active field of research, but the bare minimum required to start doing so would be demographic information on the respondents. Without that, the poll is nothing more than a curiosity.


Data collection and accessibility are the core of civic data analytics

Data science and data analytics are red-hot terms nowadays. You can’t go more than a page of Google search results without finding some reference to how “sexy” data science is. And everybody wants to be sexy, right?

In the Philippines, more and more startups – such as Thinking Machines Data Science and DataSeer – are mushrooming around the business model of providing data science and analytics services to businesses and corporations, helmed by a small, enthusiastic community.

Of particular interest to me is “civic” data analytics, or analytics as applied to civic problems such as health, infrastructure, agriculture, poverty, education, the environment, and all sorts of other things that are the ambit of government agencies and nonprofits. The international volunteer organization DataKind, with chapters in Singapore, Dublin, Bangalore, the United Kingdom, Washington, DC, and San Francisco, describes its mission as “bringing together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact.”

One homegrown example would be this story that was published very recently in the Manila Bulletin. As the account goes, using data on dengue outbreaks in Dagupan City, Pangasinan, Wilson Chua and collaborators were able to narrow down the source to specific stagnant pools of water near a couple of elementary schools, and then work with the barangay (village) captain and the Bureau of Fisheries and Aquatic Resources to implement a targeted solution, and no new cases of dengue have been reported since then.

In meetups, conferences, training sessions, and press releases, a lot of attention is placed on the use of big data tools such as Hadoop and d3.js, which are used to easily organize massive amounts of possibly unstructured data and produce impressive-looking visualizations, such as this graph, also produced by Wilson Chua, comparing dengue outbreaks across Pangasinan barangays between 2014 and 2016.


or this blog post by Thinking Machines that visualizes 114 years of Philippine disasters. This is in line with data science being “sexy” – not only can you use it to do sexy stuff, you can also make sexy looking graphics!

I feel, however, that the main takeaways from the dengue article above are about the thoroughly unsexy, fundamental, and undervalued activities that are at the core of data science: data collection and data access.

Before Wilson Chua could analyze the data, the data had to exist in the first place. Someone had to go out there and collect data on individual cases of dengue. According to the article, Wilson sourced his data from the Philippine Integrated Diseases Surveillance and Response team at the Department of Health. Someone also had to think about what sorts of variables to collect; one of the keys to Wilson’s insights was that the PIDSR data included not just the date of occurrence and the barangay but also the patient’s age, from which Chua noticed that more school-aged children were getting dengue than any other age group. That means that during the data collection process, someone had to have recognized that age was a relevant epidemiological covariate, without which Chua would have been able to do far less.

They were then able to verify that specific pools of water initially located via Google Maps were places where rainfall would accumulate and stagnate without exit points, because a separate person, Nicanor Melecio, based in the Dagupan City government, had LIDAR (LIght Detection And Ranging) maps originally created to track flooding. This means that someone had to have recognized that LIDAR maps would be useful beyond flood tracking, and someone higher up had to have agreed to fund such a project.

Epidemiology (the study of the distribution and causes of disease in populations) is a fairly well-established field, and the dengue problem was fairly well-defined and narrow in scope. Most civic issues are much murkier; people recognize that disasters, poverty, crime, etc. are problems, but it is not as straightforward to drill down to a specific problem that can be solved. Even when a seemingly specific problem can be identified, e.g. how to reduce casualties from flooding in a particular barangay, or how to improve the livelihoods of a particular group of people, or how to reduce recidivism rates among prisoners, there is still a wide range of possible approaches that must be considered – and more to the point, it isn’t immediately clear what data needs to be collected in order to approach these problems from a “scientific” perspective.

Depending on the application domain, data collection might be a painstakingly long and slow effort. It will also probably be an expensive effort, and thus one whose expenses need to be justified. And it will all be for naught if we do not pay attention to proper measurement, or “the idea of considering the connection between the data you gather and the underlying object of your study.”

People who want to do data science for social good need to focus on working with agencies and organizations charged with data collection in order to identify the specific problems they want to help solve and the specific kinds of data that the solution needs. If the data hasn’t been collected yet, we need to push for efforts to collect it. If the data has been collected but is incomplete or of low quality, we need to push for efforts to improve it. For example, Matthew Cua’s Skyeye Inc. uses drones that can take aerial camera shots to collect data that can help resolve property disputes and land claims, and the company works with the Department of Agrarian Reform to help settle land reform issues.

The current approach is to focus only on problems for which data is already available. For example, data science startups are currently working with Waze, the traffic app, to use its data to try to come up with solutions to Metro Manila’s traffic problem, which affects millions of Filipinos every day, greatly reduces labor productivity, harms the environment, and makes Metro Manila less “liveable”. But the data scientists working on this specific problem did not choose it for its relative importance. They chose it because the data already existed.

Many social scientists are now interested in mining Twitter data to look at public sentiment, despite the fact that we have no clear picture of how representative Filipinos on Twitter are of the Filipino population as a whole. Why, then, are we treating Twitter as a reliable source of public opinion data? Because it’s already there.

The very example that Wilson uses as his inspiration, John Snow’s approach to solving a cholera outbreak, did not involve Snow accessing an API or writing a web-scraping script. It involved going door-to-door, boots on the ground, identifying houses with cholera. For the vast majority of applications, proper data collection does not involve complex mathematical models or whiz-bang software engineering. It is not sexy, but it is fundamental to good data science.

Then there is the question of data access. In the past few years, the Philippine government has taken great strides in setting up a web portal that allows for public access to some government data. The new administration has announced their intention to continue this program.

Setting aside the question of whether the government data is reliable and measures the needed variables, or whether data on a specific domain even exists in the first place, not all government data is open access. For example, much of the “open data” on the web portal actually just redirects you to the website of the concerned government agency, where they will have summary tables and charts of the data but not the raw data itself. The Philippine Geoportal project combines geospatial data from multiple agencies and allows the visitor to view things such as the location of every hospital and health center in the Philippines on a map, but if the user wants the actual coordinates, they still have to course their request to the Department of Health in writing, a request the DOH is not obliged to fulfill. Going back to the dengue article, this quote is telling:

Using his credentials as a technology writer for Manila Bulletin, he wrote the Philippine Integrated Diseases Surveillance and Response team (PIDSR) of the Department of Health, requesting for three years worth of data on Pangasinan.

The DOH acquiesced and sent him back a litany of data on an Excel sheet: 81,000 rows of numbers or around 27,000 rows of data per year.

Wilson had to use his “credentials” to make a request for the data, and the DOH chose whether or not to “acquiesce”. If a less well-off, less well-connected, less prestigious private citizen from Dagupan City, perhaps a concerned elementary school teacher, were to make this request of the DOH, would it have acquiesced?

The burden should not be on the person making a request for data to somehow show that they have “credentials” or that they are “serious”. It should be just as easy for a street sweeper or a fish vendor to access the data as it is for a PhD or a businessman with decades of experience. The person should not even have to make a request. This data should have already been out there. The only data that should be locked behind requests for access are data that contain information that could directly identify individual people, and data that might compromise national security.

In a sense, the article is not merely a celebration of Wilson’s achievements, but a celebration of the good fortune that the DOH considered Wilson credible enough.

We are woefully lacking quality data on all manner of social problems in the Philippines. If data science in the Philippines is to advance, the community cannot merely sit back and hope that some unsexy, underpaid bureaucrats in government agencies or researchers in academia and think tanks will be insightful enough to collect some good data, committed enough to justify the collection at a budget hearing or to multilateral funding organizations, and considerate enough to make this data as open to all as can be. All things considered, these bureaucrats and researchers are data scientists too. They are part of what makes civic data science possible and they deserve the community’s support and advocacy.


M.A. Quantitative Methods in the Social Sciences at Columbia University: Things you should know

I am, as of right now (Fall 2016), in my third and final semester of a master’s degree in Quantitative Methods in the Social Sciences (QMSS) at Columbia University in New York City. There isn’t a whole lot of information about this program on the Internet, so to those who are giving this program some consideration, particularly students from outside the United States, here is some helpful information.

How long is QMSS?

One year, with the option to extend one more semester if you wish. Very rarely, some people will extend by two semesters – these are usually people on externally-funded scholarships that have a set timeframe.

Is it expensive?

Yep. This academic year, a full-time semester costs $28,780 for up to 20 units. (One full-time class is either 3 or 4 units, and the number of units isn’t necessarily a function of workload. QMSS classes are 4 units.) It’s expensive, though cheaper than other quantitative courses such as Mathematics of Finance or Statistics.

If you choose to extend your stay, the additional semester costs $10,944, with the restriction that only up to two non-QMSS classes can be taken that semester.

What about scholarships?

Unfortunately, hardly any from within Columbia. There might be some merit-based scholarships that’ll give you like 5%. International students are best off seeking external funding. In general, master’s programs that aren’t in public policy tend to be very stingy with scholarships; they need your money to subsidize the Ph.Ds.

So I can take classes outside the program?

Hell yeah. The only required classes are “Theory and Methodology”, a survey class of approaches to quantitative social science that’s generally geared towards people who’ve never seen social science research before (and is pretty boring and old hat otherwise); the “Research Seminar”, a two-semester sequence where you come in late at night to listen to an academic or practitioner give a talk (which may or may not be interesting), partially meant to expose you as a student to some of the latest research and practice being done; and your master’s thesis. Everyone has to do a thesis. There’s no getting around it.

By the way, Statistics students don’t have a thesis. If theses are your thing (e.g. if you really want to go on to a Ph.D), go for QMSS.

Aside from those classes, you can take any classes you want outside the program as long as a certain number of them are quantitative. If, for example, you find QMSS’s course offerings too easy, or you really have a particular substantive topic you’re looking to learn more about, you can take classes in the Stats, CS, Engineering, Econ, PoliSci, Sociology, and whatever other departments exist at Columbia that’ll let you in. And most classes will let you in provided there’s enough room. It’s only the super-duper high demand ones such as Data Science that won’t let you in… except if you’re in QMSS’s Data Science track.


What are these “tracks”?

QMSS students are required to pick one of three tracks. The “Economics” track is exactly what it says on the tin; it used to function as basically an MA in Econ back when Columbia didn’t have its own MA in Econ (that was just launched last year). The “Data Science” track lets you into classes offered by the Institute for Data Science and Engineering such as Algorithms for Data Science, Machine Learning for Data Science, etc. And the “Traditional” track is where everybody else who isn’t specializing in the former two goes.

(There’s also a fourth track, “Experiments”, but to the best of my knowledge practically no one has ever taken that track and it might not even functionally exist anymore.)

I suck at math – what do I do?

QMSS was specifically designed for people with a limited background in quantitative methods. I didn’t even know what a regression was before I entered the program – my own background was in development studies and history. Now I wouldn’t call myself an expert on anything, but I’m more comfortable with this stuff and know where to look to go deeper.

You do need to kinda not suck at math, but if you can nail the Graduate Record Examinations, you’re good. (Not the math-specific GRE, just the general quantitative portion.) I would even say that you can graduate from the program without knowing calculus and linear algebra if you just take all the applied data analysis classes, though how much you actually understand would be a valid question.

How big is the program, and what are the people in it like?

Each cohort ranges from about 60-75 students, including part-time students. About half or so are from East Asia and the other half are from everywhere else, though mostly North America. Backgrounds range from idiots like myself to people who are already data scientists of some sort.

For Chinese people in particular, especially those who are coming straight from undergrad in China, if you’re also looking for a program with a good mix of Mandarin and English speakers, QMSS is for you. By comparison, Columbia’s Stats program is around 90% Chinese.

QMSS has an associated student-led organization, Society for Quantitative Approaches to Social Research (QASR), that organizes social events, alumni networking sessions, outreach efforts, and the like.

Thanksgiving dinner
QMSS students at Google’s offices

So what is in QMSS?

QMSS classes focus on applied quantitative social science. There’s a three-course sequence taught by sociologist and program director Gregory Eirich that runs from applied regression analysis with linear models through generalized linear models, causal inference methods, text mining, methods for longitudinal data, and time series processes. Those classes are light on the technical details and focus on understanding the structure and assumptions of various models and techniques at a high level and putting them to use with data via the open-source R programming language. This is in contrast with an econometrics class, which would cover the same material but with much more theoretical grounding and mathematical proof and much less actual data analysis.

There’s a two-course sequence taught by geographical sociologist Jeremy Porter on geographic information systems & spatial analysis that essentially teaches you how to work with data with some sort of location information attached. It uses the open-source software Quantum GIS and the R programming language.

Other electives include Social Network Analysis, also taught by Eirich, and Data Visualization, which is either a really good or a really crappy class depending on who’s teaching it. If you’re familiar with the Harry Potter series, it’s like Defence Against the Dark Arts – the professor changes every year for some reason or another, and the syllabus completely changes every year in step.

There are also two classes taught by political scientist Benjamin Goodrich. In the fall, Data Mining for Social Science is an introduction to techniques such as tree-based models, neural networks, principal components analysis, etc., that are commonly used when the goal is to predict new data. It’s essentially an intro to machine learning class that’s decent preparation for more advanced classes, and it doubles as an actual introduction to R in the first couple of weeks. The other classes use R, but this class actually teaches it from the ground up at an accelerated pace.

In the spring, Bayesian Statistics for the Social Sciences is an introduction to Bayesian modelling, a different way of approaching statistical problems that puts heavy emphasis on probability distributions. It’s the most math-heavy class in the program and also doubles as a shameless plug for the statistical modelling language Stan, actively developed at Columbia by a team including Goodrich, which allows the user to specify flexible models for particular situations instead of relying on canned packages. Columbia is also the home of Andrew Gelman, one of the foremost Bayesian statisticians out there, and his influence looms large.

If none of those sounded exciting to you, remember that you don’t even have to take any of those classes if you don’t want to. Already a crackerjack who wants to go full-on into machine learning? Head over to Stats/CS/Engineering. Want to go into finance? Columbia Business School classes can be hard to get into but it’s doable. Not satisfied with the fairly high-level approach of Eirich’s classes and want to go really deep into the weeds? Either the Econ or PoliSci departments are for you.

There’s also one last thing. If you wish, QMSS can match you with professors from across the university who are looking for research assistants, which is a great opportunity to get your feet wet with social science research outside of a classroom setting.

I don’t want a Ph.D – I just want a better job or a change of career.

QMSS is perfect for that as well – while initially designed as a Ph.D preparation program, most people opt to go into industry for at least a while after graduation, and the flexible nature of the program allows for diverse interests. Data science and analytics are the flavor of the year, but policy research, tech, consulting and finance aren’t far behind. And for those last two industries in particular, there’s hardly a better place than New York City.

QMSS also qualifies as a Science, Technology, Engineering and Mathematics (STEM) program, which for international students who want to work in the United States means that they would qualify for special preferences given to STEM majors.

How easy would it be for me to find a job?

Jobs are everywhere for people skilled in quantitative methods. Many students in the program aren’t even particularly interested in the “social science” part. The barriers to entry are much higher for international students, however. Columbia University’s career office is a great resource, with lots of events such as job fairs, industry talks, and interview prep sessions. This being New York, though, their offerings are heavily skewed towards consulting, finance and tech. The good news is that larger firms are generally also more open to international hires.

Part of the reason some people stay for a third semester is to give themselves additional breathing room in the job hunt – instead of graduating in May and having to have a full-time job within three months or risk losing their visa status, they can graduate in December and hopefully score a summer internship beforehand.

Why shouldn’t I just major in Statistics?

Sure, you could. In fact, if your goal is to get a Ph.D in Statistics, or to get a top-level engineering job at some data science firm, you would need much more quantitative material than the average QMSS student learns – although, thanks to the program’s flexibility, you could pursue those same classes within QMSS if you wanted to.

Regarding Columbia specifically, the primary advantages of QMSS over the Statistics program are that it’s slightly cheaper and that it has 60-75 students as opposed to Stats’ 200+, meaning it’s easier to raise concerns and get attention from the program administration.

If your goal is a Ph.D in a social science, QMSS is excellent preparation. Many social science Ph.Ds are actually somewhat behind when it comes to quantitative methods despite American social science’s heavy focus on it.

Why would QMSS not be for me?

The cost is a huge factor, honestly. Tuition aside, cost of living in New York City can be ridiculous. Food, transportation, and that kind of stuff isn’t very expensive, but rent…

Beyond that, it does say “quantitative” in the title, which represents a particular approach to social science that isn’t by any means all-encompassing. If you’re interested in being the kind of researcher who can do detailed case studies, thick descriptions, in-depth interviews, etc., then the program probably isn’t for you. You might even encounter stuck-up people who will scoff at you for your ‘inferior’ methods, though it’s not as prevalent as it seems. Quantitative methods are generally useful for every social scientist to familiarize themselves with, but a program devoted to them isn’t for everybody.

If your bachelor’s degree was a fairly quant-heavy social science program, QMSS will be redundant.

Finally, if you intend to use the program’s flexibility to take courses in things like data science, it isn’t going to be immediately obvious to employers that you’ve taken those courses if they see QMSS on your resume. Your chosen track isn’t reflected on your transcript or diploma. So if you really do want to go into something more specific like stats/data science, you may want to consider going for the degree that actually says stats/data science. (Or you could self-study, build a portfolio, etc.)

On a lighter note, “Quantitative Methods in the Social Sciences” is also way too long of a program name. Imagine telling someone that in an elevator.



Uniformity and difference-in-differences

Earlier, I posted on Facebook about some problems with the analysis conducted by Yap and Contreras on 34 time periods sampled from 92,509 possible time periods, one for each precinct. In part of that post I discussed Contreras’s interpretation of a high R-squared as representing uniform changes, and said that he was half right.

I gave Contreras too much credit here. Someone pointed out that an R-squared very close to 1 doesn’t even mean that the increase will always be close to 46,000; it means that the model explains almost all the variability around the mean of 46,000. A model fit to a tight zigzag can still have an R-squared near 1 – the line and the data are still pretty damn close to one another – but that tells you nothing about whether the model systematically overpredicts or underpredicts at any given point. See this article for more information.
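To see the distinction concretely, here’s a quick simulation in R (made-up numbers, not the actual transmission data): a series that trends strongly but zigzags at every step still produces an R-squared near 1, because R-squared measures variability explained around the mean, not uniformity of the step-to-step increments.

```r
set.seed(42)
index <- 1:400
# A strongly trending gap with visible zigzags around the trend
gap <- 100 * index + rnorm(400, mean = 0, sd = 2000)
fit <- lm(gap ~ index)
summary(fit)$r.squared  # near 1, despite the zigzag
# ...but the step-to-step increments are all over the place,
# some of them even negative:
range(diff(gap))
```

In other words, a high R-squared is perfectly compatible with a jagged, non-uniform series, which is why it can’t carry the uniformity claim on its own.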

In order to directly check the claim of uniformity, here’s what should be done:

STEP 1. Get all 92,509 precincts in the order in which they transmitted.
STEP 2. Create a new set of 92,508 points. Yes, 92,508. Each point will be (difference in vote share after precinct 2 minus difference in vote share after precinct 1), then (difference in vote share after precinct 3 minus difference in vote share after precinct 2), and so on; that’s why there’s one fewer point than there are precincts.
STEP 3. Make a plot with the x-axis being 1 to 92,508 and the y-axis being the difference in differences (yeah, that’s confusing, but that’s what it is). Then run a linear regression model.

To reiterate: we are now plotting how much the gap changes every time a new precinct is added, rather than the gap itself.
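As an R sketch (with a simulated placeholder for the running gap, since we don’t have the actual per-precinct feed), the three steps look like this:

```r
# Step 1 (simulated): the running vote gap after each of 92,509 precincts,
# in order of transmission. Swap in the real series if it ever surfaces.
set.seed(1)
gap <- cumsum(runif(92509, min = 0, max = 1000))

# Step 2: the 92,508 differences in differences
dd <- diff(gap)

# Step 3: plot them against transmission order, then fit a line
index <- seq_along(dd)
plot(index, dd, type = "l",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in differences")
summary(lm(dd ~ index))
```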

IF each time a precinct transmits, the vote gaps are increasing uniformly – which would indeed indicate some major bullshit going on and prove Contreras’s point – you should expect the plot in Step 3 to be a flat horizontal line. The R-squared of the regression is degenerate in this case: a perfectly constant series leaves no variation to explain, so the usual formula divides zero by zero (the 0.50 reported further down is floating-point noise, not a meaningful value).

IF each time a precinct transmits, the vote gaps are NOT increasing uniformly, which indicates a normal transmission process, you should expect the plot in Step 3 to have no discernible pattern. The R-squared of the regression should be close to 0.

I will demonstrate with some more simulated data; what I do below is what should be done with the full data if we ever get it. First, let’s say that after we get all the points in, it is indeed as Contreras says – every time a precinct transmits, everyone’s vote totals go up by exactly the same amount, which would be extremely suspicious. I’ll use 400 precincts (time periods) for illustrative purposes, and make two plots: one where precincts in order of transmission are on the x-axis and the difference in votes is on the y-axis, and one where the difference in differences is on the y-axis instead.



Then run a regression model on the data that form the second plot:

lm(formula = z_unif ~ index)
            coef.est coef.se
(Intercept) 1.00     0.00   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 0.00, R-Squared = 0.50

The model summary reports an R-squared of exactly 50%. I will now plot the estimated regression line on top of the second plot:


Remember that a regression model tries to fit a line to data as best as it can, so it’s not surprising that given uniform data, the regression line overlaps the first-differenced data entirely.

Now let’s say that the data look a lot more like we would expect from a normal transmission process – the gap is steadily increasing in favor of the leading candidate but we see dips and swings here and there as each precinct transmits. That data would look like this:


And the first-differenced data would now look like this:


Here’s a regression model fit to that plot:

lm(formula = z ~ index)
            coef.est coef.se
(Intercept) 1.03     0.66   
index       0.00     0.00   
---
n = 399, k = 2
residual sd = 6.55, R-Squared = 0.00

The R-squared is now pretty much 0 (it’ll be more like 0.003 or something, but the model summary rounds to two decimal places).


Now you are trying to fit that volatile-looking zigzag with a single straight line. The model does the best it can, which ends up being a horizontal line drawn through the middle. But look at all the variation that the line doesn’t capture.

To conclude, a very good way to examine the claims of uniformity is to do what I did above on the actual data: all 92,509 precincts in order of transmission. You don’t even need to fit regression models – the R-squared plummets to near 0 at even the first hint of non-uniformity – so eyeballing the plot of the difference in differences is enough.
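That claim is easy to check in simulation (again with made-up numbers): even when the increments are uniform except for a whisper of noise, the R-squared of the regression collapses toward 0.

```r
set.seed(7)
index <- 1:399
# Increments of exactly 1 vote per precinct, plus a tiny bit of noise
z_noisy <- 1 + rnorm(399, mean = 0, sd = 0.01)
summary(lm(z_noisy ~ index))$r.squared  # already near 0
```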

Code follows. Yeah, I know, I should be putting this on Github or something, but I’ll do that later.

library(arm)  # for display(); install.packages("arm") if you don't have it

# Simulated data: the gap increases by exactly 1 vote per precinct
x_unif <- 1:400
y_unif <- 101:500
plot(x_unif, y_unif, type = "l", main = "Simulated data - uniform increase",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in votes")

# First-difference it: how much the gap moves at each new precinct
z_unif <- rep(NA, 399)
for (i in 1:399) {
  z_unif[i] <- y_unif[i + 1] - y_unif[i]
}
index <- 1:399
plot(index, z_unif, type = "l", main = "Simulated data - first-differenced - uniform increase",
     xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
     ylab = "Difference in difference in votes")
display(lm(z_unif ~ index))
abline(lm(z_unif ~ index), col = "red")

# Simulated data: still trending upward, but zigzagging as precincts transmit
x <- 1:400
y <- c(runif(10, 100, 110), runif(10, 110, 120), runif(10, 120, 130), runif(10, 140, 150), runif(10, 130, 140), runif(10, 150, 160), runif(10, 160, 170), runif(10, 170, 180), runif(10, 180, 190), runif(10, 190, 200), runif(10, 200, 210), runif(10, 210, 220), runif(10, 220, 230), runif(10, 240, 250), runif(10, 230, 240), runif(10, 250, 260), runif(10, 270, 280), runif(10, 260, 270), runif(10, 280, 290), runif(10, 290, 300), runif(10, 300, 310), runif(10, 320, 330), runif(10, 310, 320), runif(10, 330, 340), runif(10, 340, 350), runif(10, 350, 360), runif(10, 380, 390), runif(10, 370, 380), runif(10, 360, 370), runif(10, 390, 400), runif(10, 400, 410), runif(10, 410, 420), runif(10, 420, 430), runif(10, 450, 460), runif(10, 440, 450), runif(10, 430, 440), runif(10, 460, 470), runif(10, 490, 500), runif(10, 470, 480), runif(10, 480, 490))
plot(x, y, type = "l", main = "Simulated data - zigzaggy",
     xlab = "Precincts in order of transmission",
     ylab = "Difference in votes")

# First-difference the zigzaggy series the same way
index <- 1:399
z <- rep(NA, 399)
for (i in 1:399) {
  z[i] <- y[i + 1] - y[i]
}
plot(index, z, type = "l", main = "Simulated data - first-differenced - non-uniform movement",
     xlab = "Precinct 2 minus Precinct 1, Precinct 3 minus Precinct 2, etc.",
     ylab = "Difference in difference in votes")
display(lm(z ~ index))
abline(lm(z ~ index), col = "red")