Data collection and accessibility are the core of civic data analytics

Data science and data analytics are red-hot terms nowadays. You can’t go more than a page of Google search results without finding some reference to how “sexy” data science is. And everybody wants to be sexy, right?

In the Philippines, a small but enthusiastic community has spawned a growing number of startups, such as Thinking Machines Data Science and DataSeer, whose business model is providing data science and analytics services to businesses and corporations.

Of particular interest to me is “civic” data analytics, or analytics as applied to civic problems such as health, infrastructure, agriculture, poverty, education, the environment, and all sorts of other things that are the ambit of government agencies and nonprofits. The international volunteer organization DataKind, with chapters in Singapore, Dublin, Bangalore, the United Kingdom, Washington, DC, and San Francisco, describes its mission as “bringing together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact.”

One homegrown example is this story published very recently in the Manila Bulletin. As the account goes, using data on dengue outbreaks in Dagupan City, Pangasinan, Wilson Chua and his collaborators narrowed the source down to specific stagnant pools of water near a couple of elementary schools, then worked with the barangay (village) captain and the Bureau of Fisheries and Aquatic Resources to implement a targeted solution. No new cases of dengue have been reported since.

In meetups, conferences, training sessions, and press releases, a lot of attention is placed on the use of big data tools such as Hadoop and d3.js, which are used to easily organize massive amounts of possibly unstructured data and produce impressive-looking visualizations, such as this graph, also produced by Wilson Chua, comparing dengue outbreaks across Pangasinan barangays between 2014 and 2016.

[Graph by Wilson Chua: dengue outbreaks across Pangasinan barangays, 2014 vs. 2016]

or this blog post by Thinking Machines that visualizes 114 years of Philippine disasters. This is in line with data science being “sexy” – not only can you use it to do sexy stuff, you can also make sexy-looking graphics!

I feel, however, that the main takeaways from the dengue article above are about the thoroughly unsexy, fundamental, and undervalued activities that are at the core of data science: data collection and data access.

Before Wilson Chua could analyze the data, the data had to exist in the first place. Someone had to go out there and collect data on individual cases of dengue. According to the article, Wilson sourced his data from the Philippine Integrated Diseases Surveillance and Response team at the Department of Health. Someone also had to think about which variables to collect; one of the keys to Wilson’s insights was that the PIDSR data included not just the date of occurrence and the barangay but also the patient’s age, from which Chua noticed that more school-aged children were getting dengue than any other age group. That means that during the data collection process, someone had to have recognized that age was a relevant epidemiological covariate, without which Chua would have been able to do far less.
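
The article does not show how the tally was done, but the kind of age-bracket count that surfaces such a pattern is simple. Here is a minimal sketch in Python, using made-up records standing in for the PIDSR rows (the dates, barangay names, ages, and bracket boundaries below are illustrative, not the actual data):

```python
from collections import Counter

# Hypothetical case records shaped like the article describes:
# (date of occurrence, barangay, patient age).
cases = [
    ("2016-07-01", "Barangay A", 8),
    ("2016-07-03", "Barangay A", 10),
    ("2016-07-05", "Barangay B", 35),
    ("2016-07-08", "Barangay A", 7),
    ("2016-07-09", "Barangay B", 12),
]

def age_bracket(age):
    """Bucket a patient's age into rough epidemiological groups."""
    if age < 5:
        return "0-4"
    if age <= 14:
        return "5-14 (school-aged)"
    if age <= 24:
        return "15-24"
    return "25+"

# Tally cases per bracket and pull out the dominant one.
counts = Counter(age_bracket(age) for _, _, age in cases)
dominant = counts.most_common(1)[0]
print(dominant)  # on this toy data: ('5-14 (school-aged)', 4)
```

Had the PIDSR not recorded age, no amount of tallying could have produced this signal; the covariate has to be collected before it can be counted.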

They were then able to verify that specific pools of water initially located via Google Maps were places where rainfall would accumulate and stagnate without exit points, because a separate person, Nicanor Melecio of the Dagupan City government, had LiDAR (Light Detection and Ranging) maps that were originally created to track flooding. This means that someone had to have recognized that creating LiDAR maps was not only useful but also more broadly applicable, and someone higher up had to have agreed to fund such a project.

Epidemiology (the study of how disease spreads through populations) is a fairly well-established field, and the dengue problem was fairly well-defined and narrow in scope. Most civic issues are much murkier; people recognize that disasters, poverty, crime, etc. are problems, but it is not as straightforward to drill down to a specific problem that can be solved. Even when a seemingly specific problem can be identified – e.g. how to reduce casualties from flooding in a particular barangay, how to improve the livelihoods of a particular group of people, or how to reduce recidivism rates among prisoners – there is still a wide range of possible approaches that must be considered. More to the point, it isn’t immediately clear what data needs to be collected in order to approach these problems from a “scientific” perspective.

Depending on the application domain, data collection might be a painstakingly long and slow effort. It will also probably be an expensive effort, and thus one whose expenses need to be justified. And it will all be for naught if we do not pay attention to proper measurement, or “the idea of considering the connection between the data you gather and the underlying object of your study.”

People who want to do data science for social good need to focus on working with agencies and organizations charged with data collection in order to identify the specific problems they want to help solve and the specific kinds of data that the solution needs. If the data hasn’t been collected yet, we need to push for efforts to collect it. If the data has been collected but is incomplete or of low quality, we need to push for efforts to improve it. For example, Matthew Cua’s Skyeye Inc. uses drones that can take aerial camera shots to collect data that can help resolve property disputes and land claims, and the company works with the Department of Agrarian Reform to help settle land reform issues.

The current approach is to focus only on problems for which data is already available. For example, data science startups are now working with Waze, the traffic app, to use its data to try to come up with solutions to Metro Manila’s traffic problem, which affects millions of Filipinos every day, greatly reduces labor productivity, harms the environment, and makes Metro Manila less “liveable”. But the data scientists working on this specific problem did not choose it for its relative importance. They chose it because the data already existed.

Many social scientists are now interested in mining Twitter data to gauge public sentiment, despite the fact that we have no clear picture of how representative Filipinos on Twitter are of the Filipino population as a whole. Why, then, are we treating Twitter as a reliable source of public opinion data? Because it’s already there.

The very example that Wilson uses as his inspiration, John Snow’s approach to solving a cholera outbreak, did not involve Snow accessing an API or writing a web-scraping script. It involved going door-to-door, boots on the ground, identifying houses with cholera. For the vast majority of applications, proper data collection does not involve complex mathematical models or whiz-bang software engineering. It is not sexy, and it is fundamental to good data science.

Then there is the question of data access. In the past few years, the Philippine government has taken great strides in setting up a web portal that allows for public access to some government data. The new administration has announced their intention to continue this program.

Setting aside the question of whether the government data is reliable and measures the needed variables, or whether data on a specific domain even exists in the first place, not all government data is open access. For example, much of the “open data” on the web portal actually just redirects you to the website of the relevant government agency, which will have summary tables and charts of the data but not the raw data itself. The Philippine Geoportal project combines geospatial data from multiple agencies and lets visitors view things such as the location of every hospital and health center in the Philippines on a map, but if a user wants the actual coordinates, they still have to submit a written request to the Department of Health, which the department is not obliged to fulfill. Going back to the dengue article, this quote is telling:

Using his credentials as a technology writer for Manila Bulletin, he wrote the Philippine Integrated Diseases Surveillance and Response team (PIDSR) of the Department of Health, requesting for three years worth of data on Pangasinan.

The DOH acquiesced and sent him back a litany of data on an Excel sheet: 81,000 rows of numbers or around 27,000 rows of data per year.

Wilson had to use his “credentials” to make a request for the data, and the DOH chose whether or not to “acquiesce”. If a less well-off, less well-connected, less prestigious private citizen from Dagupan City, perhaps a concerned elementary school teacher, were to make this request, would the DOH have acquiesced?

The burden should not be on the person making a request for data to somehow show that they have “credentials” or that they are “serious”. It should be just as easy for a street sweeper or a fish vendor to access the data as it is for a PhD or a businessman with decades of experience. The person should not even have to make a request. This data should have already been out there. The only data that should be locked behind requests for access are data that contain information that could directly identify individual people, and data that might compromise national security.

In a sense, the article is not merely a celebration of Wilson’s achievements, but a celebration of the good fortune that the DOH considered Wilson credible enough.

We are woefully lacking quality data on all manner of social problems in the Philippines. If data science in the Philippines is to advance, the community cannot merely sit back and hope that some unsexy, underpaid bureaucrats in government agencies or academics in research firms will be insightful enough to collect some good data, committed enough to justify the collection at a budget hearing or to multilateral funding organizations, and considerate enough to make this data as open to all as can be. All things considered, these bureaucrats and academics are data scientists too. They are part of what makes civic data science possible and they deserve the community’s support and advocacy.

 


4 thoughts on “Data collection and accessibility are the core of civic data analytics”

  1. Hey there. I think you are spot on with the “go to the source of the data” argument. In the social sciences domain, often census or data products from “official” or “regulated” or “centralized” collection methods have traditionally been the golden source.

    In the US, there is a mature data infrastructure around accessing, sharing and disseminating knowledge based on the US census. In developing countries these infrastructures are woefully inadequate and so I agree that social media is the “next best thing” to sample social phenomena across large populations.
    This is a pretty sad state of affairs though, and it creates problematic biases because of issues of representation, etc.

    I think social impact organizations in developing countries need to:
    1) create sustainable infrastructure around their own census departments to improve the accessibility and usability of such data so that it can be studied by folks like yourself

    2) In the event “source data” such as a census does not exist or is impossibly difficult to access, then leverage open hardware to create a “robust” data set where issues of privacy and representation are addressed adequately, and then employ analysis techniques to approach challenges in the civic space to effect positive outcomes.

    We started ARGO (argolabs.org) after graduating from NYU’s Center for Urban Science and Progress (CUSP) program around a similar intention and have begun framing and operationalizing a Civic Data Science encompassing Device (sensors, open hardware etc.), Data (Big Data Analytics, Visualization etc.) and Decision-making to perform service delivery better using data.

    We are currently focused

    – on Streets (Project: SQUID) where we developed a small device to conduct a census for street quality.

    – on Water (Project: California Data Collaborative) where we have developed decision-making tools to assist CA Water managers.

    You can read more at argolabs.org/projects. Love to hear your thoughts or catch up sometime.

    Some links to unpack our thinking:

    http://www.argolabs.org/blog-1/2015/8/17/towards-a-definition-of-civic-data-science
    http://www.argolabs.org/blog-1/2015/11/26/decision-making-within-the-civic-data-science-framework

    • Looks like your team has some very interesting ideas! I agree that one avenue by which organizations can create social impact is by building up data infrastructure. I wonder whether you have any ideas on how this might be achieved in an “immature” setting.

  2. Start simple and democratize meaningfully as much as possible. The main takeaway from reading about and observing the “open data” movement in the US is that the majority of this so-called “open data” does not translate to on-the-ground outcomes or decisions.

    Quite often open data is not clean and structured enough to make operational decisions for some basic social metrics such as “how many poor people are in this area vs that area and overall in the city” or “what is the quality of access to education in this area vs that area and overall in the city”.

    The complexity of making these decisions quickly is a function of many things, some of which include legacy systems and complex procurement, a fear of failure or experimentation, and middle managers who are not functional technology users and do not appreciate the “art of the possible”.

    The proper handling and preservation of personally identifiable data is also a source of complexity.

    This article in the Times lays out the barriers and opportunities quite nicely:
    http://www.nytimes.com/interactive/2016/11/13/magazine/design-issue-code-for-america.html?_r=2

    w.r.t. creating your own “social impact” data using hardware solutions – check out NYC-based HeatSeek (http://heatseek.org/), South Africa-based http://lumkani.com, and Japan-based http://blog.safecast.org/

    These are all bottom-up approaches that can be sustained over long periods of time by small teams.

  3. Hi Asintunado.

    I thoroughly agree with you on the fact that Big Data analysis is actually 80% data collection, cleansing and formatting… (the drudgery of it all.) The beauty comes once we can get this altogether in one spot and make the data visualization tell a powerful story. And give local officials the insight they need to do vector control.

    And yes, that is why we support FOI. So that any individual can have access to the data for public good. Keep up your good work too. We shouldn’t have to beg for the data. It is OUR data. We paid for it, di ba?

    But at the same time, I need to make a shout out to the public officials that dutifully compiled the data, without which, as you correctly pointed out, the analysis would not have been possible. Their data surprisingly included the age of the patients, which was given to me though I did not ask for it. That led to the insight that the victims were probably bitten in school.

    Thanks for creating awareness of this big data gap issue.
