Data collection and accessibility are the core of civic data analytics

Data science and data analytics are red-hot terms nowadays. You can’t get past the first page of a Google search without finding some reference to how “sexy” data science is. And everybody wants to be sexy, right?

In the Philippines, driven by a small, enthusiastic community, more and more startups, such as Thinking Machines Data Science and DataSeer, are springing up to provide data science and analytics services to businesses and corporations.

Of particular interest to me is “civic” data analytics, or analytics as applied to civic problems such as health, infrastructure, agriculture, poverty, education, the environment, and all sorts of other things that are the ambit of government agencies and nonprofits. The international volunteer organization DataKind, with chapters in Singapore, Dublin, Bangalore, the United Kingdom, Washington, DC, and San Francisco, describes its mission as “bringing together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact.”

One homegrown example is this story, published very recently in the Manila Bulletin. As the account goes, using data on dengue outbreaks in Dagupan City, Pangasinan, Wilson Chua and collaborators were able to narrow down the source to specific stagnant pools of water near a couple of elementary schools. They then worked with the barangay (village) captain and the Bureau of Fisheries and Aquatic Resources to implement a targeted solution, and no new cases of dengue have been reported since.

In meetups, conferences, training sessions, and press releases, a lot of attention is placed on the use of big data tools such as Hadoop and d3.js, which are used to easily organize massive amounts of possibly unstructured data and produce impressive-looking visualizations, such as this graph, also produced by Wilson Chua, comparing dengue outbreaks across Pangasinan barangays between 2014 and 2016.

[Image: Wilson Chua’s graph of dengue outbreaks across Pangasinan barangays, 2014 to 2016]

or this blog post by Thinking Machines that visualizes 114 years of Philippine disasters. This is in line with data science being “sexy” – not only can you use it to do sexy stuff, you can also make sexy-looking graphics!

I feel, however, that the main takeaways from the dengue article above are about the thoroughly unsexy, fundamental, and undervalued activities that are at the core of data science: data collection and data access.

Before Wilson Chua could analyze the data, the data had to exist in the first place. Someone had to go out there and record individual cases of dengue. According to the article, Chua sourced his data from the Philippine Integrated Disease Surveillance and Response (PIDSR) team at the Department of Health. Someone also had to think about which variables to collect; one of the keys to Chua’s insights was that the PIDSR data included not just the date of occurrence and the barangay but also the patient’s age, from which he noticed that more school-aged children were getting dengue than any other age group. That means that during the data collection process, someone had to have recognized that age was a relevant epidemiological covariate, without which Chua would have been able to do far less.
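To make concrete why the age column mattered, here is a minimal sketch of the kind of tally that could surface such a pattern. It is not Chua’s actual method, which the article does not describe: the field names, the age brackets, and the sample records are all made up for illustration.

```python
from collections import Counter

# Hypothetical age brackets; the article does not say how ages were grouped.
AGE_BRACKETS = [
    (0, 4, "0-4"),
    (5, 14, "5-14 (school age)"),
    (15, 24, "15-24"),
    (25, 200, "25+"),
]


def age_bracket(age):
    """Return the label of the bracket that contains the given age."""
    for low, high, label in AGE_BRACKETS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


def cases_by_bracket(records):
    """Count case records per age bracket."""
    return Counter(age_bracket(record["age"]) for record in records)


# Made-up records, NOT real PIDSR data. The field names are assumptions.
sample_records = [
    {"date": "2016-07-01", "barangay": "A", "age": 7},
    {"date": "2016-07-02", "barangay": "A", "age": 9},
    {"date": "2016-07-03", "barangay": "B", "age": 12},
    {"date": "2016-07-03", "barangay": "B", "age": 3},
    {"date": "2016-07-05", "barangay": "A", "age": 31},
    {"date": "2016-07-06", "barangay": "C", "age": 18},
]

counts = cases_by_bracket(sample_records)
for label, n in counts.most_common():
    print(label, n)
```

Without the age field, this tally is simply impossible, no matter how sophisticated the tooling. Note also that raw counts are only a first look: a fair comparison across age groups would divide by the number of people in each group, which is yet another dataset someone has to collect.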

They were then able to verify that specific pools of water initially located via Google Maps were locations in which rainfall would accumulate and stagnate without exit points, because a separate person, Nicanor Melecio, based with the Dagupan City government, had LIDAR (Light Detection and Ranging) maps that were initially created to track flooding. This means that someone had to have recognized that creating LIDAR maps was not only useful but also more broadly applicable, and someone higher up had to have agreed to fund such a project.

Epidemiology (the study of how diseases are distributed and spread in populations) is a fairly well-established field, and the dengue problem was fairly well-defined and narrow in scope. Most civic issues are much murkier; people recognize that disasters, poverty, crime, etc. are problems, but it is not as straightforward to drill down to a specific problem that can be solved. Even when a seemingly specific problem can be identified, e.g. how to reduce casualties from flooding in a particular barangay, or how to improve the livelihoods of a particular group of people, or how to reduce recidivism rates among prisoners, there is still a wide range of possible approaches that must be considered – and more to the point, it isn’t immediately clear what data needs to be collected in order to approach these problems from a “scientific” perspective.

Depending on the application domain, data collection might be a painstakingly long and slow effort. It will also probably be an expensive effort, and thus one whose expenses need to be justified. And it will all be for naught if we do not pay attention to proper measurement, or “the idea of considering the connection between the data you gather and the underlying object of your study.”

People who want to do data science for social good need to focus on working with agencies and organizations charged with data collection in order to identify the specific problems they want to help solve and the specific kinds of data that the solution needs. If the data hasn’t been collected yet, we need to push for efforts to collect it. If the data has been collected but is incomplete or of low quality, we need to push for efforts to improve it. For example, Matthew Cua’s Skyeye Inc. uses drones that can take aerial camera shots to collect data that can help resolve property disputes and land claims, and the company works with the Department of Agrarian Reform to help settle land reform issues.

The current approach is to focus only on problems for which data is already available. For example, data science startups are currently working with Waze, the traffic app, to use its data to try to come up with solutions to Metro Manila’s traffic problem, which affects millions of Filipinos every day, greatly reduces labor productivity, harms the environment, and makes Metro Manila less “liveable”. But the data scientists working on this specific problem did not choose it for its relative importance. They chose it because the data already existed.

Many social scientists are now interested in mining Twitter data to look at public sentiment, despite the fact that we have no idea how representative Filipinos on Twitter are of the Filipino population as a whole. Why are we then treating Twitter as a reliable source of public opinion data? Because it’s already there.

The very example that Wilson uses as his inspiration, John Snow’s approach to solving a cholera outbreak, did not involve Snow accessing an API or writing a web-scraping script. It involved going door-to-door, boots on the ground, identifying houses with cholera. For the vast majority of applications, proper data collection does not involve complex mathematical models or whiz-bang software engineering. It is not sexy, but it is fundamental to good data science.

Then there is the question of data access. In the past few years, the Philippine government has made great strides in setting up a web portal that allows for public access to some government data. The new administration has announced its intention to continue this program.

Setting aside the question of whether the government data is reliable and measures the needed variables, or whether data on a specific domain even exists in the first place, not all government data is open access. For example, much of the “open data” on the web portal actually just redirects you to the website of the concerned government agency, which will have summary tables and charts of the data but not the raw data itself. The Philippine Geoportal project combines geospatial data from multiple agencies and allows the visitor to view things such as the location of every hospital and health center in the Philippines on a map, but if the user wants the actual coordinates, they still have to send a written request to the Department of Health, which the agency is not obliged to fulfill. Going back to the dengue article, this quote is telling:

Using his credentials as a technology writer for Manila Bulletin, he wrote the Philippine Integrated Diseases Surveillance and Response team (PIDSR) of the Department of Health, requesting for three years worth of data on Pangasinan.

The DOH acquiesced and sent him back a litany of data on an Excel sheet: 81,000 rows of numbers or around 27,000 rows of data per year.

Wilson had to use his “credentials” to make a request for the data, and the DOH chose whether or not to “acquiesce”. If a less well-off, less well-connected, less prestigious private citizen from Dagupan City, perhaps a concerned elementary school teacher, were to make this request of the DOH, would they have acquiesced?

The burden should not be on the person making a request for data to somehow show that they have “credentials” or that they are “serious”. It should be just as easy for a street sweeper or a fish vendor to access the data as it is for a PhD or a businessman with decades of experience. The person should not even have to make a request. This data should have already been out there. The only data that should be locked behind requests for access are data that contain information that could directly identify individual people, and data that might compromise national security.

In a sense, the article is not merely a celebration of Wilson’s achievements, but a celebration of the good fortune that the DOH considered Wilson credible enough.

We are woefully lacking quality data on all manner of social problems in the Philippines. If data science in the Philippines is to advance, the community cannot merely sit back and hope that some unsexy, underpaid bureaucrats in government agencies or academics in research firms will be insightful enough to collect some good data, committed enough to justify the collection at a budget hearing or to multilateral funding organizations, and considerate enough to make this data as open to all as can be. All things considered, these bureaucrats and academics are data scientists too. They are part of what makes civic data science possible and they deserve the community’s support and advocacy.