In 2008, Google launched Google Flu Trends (GFT), which estimated the occurrence of flu cases by area. It emerged right at the first crest of the big data wave and was promoted by the tech giant as proof of how powerful and relevant big data applications could be.
Recently, however, an article published in the journal Science attacked the validity of the GFT report and called into question not the concept of big data itself but the integrity of Google’s data collection and reporting methods. The data in the GFT report was based on the volume of flu-related Google search queries; Google claimed that it “had found a close relationship between” flu-related search terms and actual flu cases, and so it used flu search terms to indicate areas with more flu cases.
Turns out, the relationship between the two was not close at all. The Science article reveals that Google overestimated the occurrence of flu cases by a whopping 50% on average.
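To make a figure like "50% on average" concrete, here is a minimal Python sketch of how an average overestimate can be computed from paired predicted and observed values. The function name and the numbers are made up for illustration; they are not taken from the Science article or from GFT.

```python
def mean_overestimate_pct(predicted, observed):
    """Average percentage by which predictions exceed observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    # Percentage error per period, relative to what was actually observed.
    errors = [(p - o) / o * 100 for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors)


# Hypothetical weekly flu-activity figures (illustrative only).
predicted = [3.0, 4.5, 6.0]
observed = [2.0, 3.0, 4.0]
print(mean_overestimate_pct(predicted, observed))
```

A positive result means the predictions ran high on average; a result near zero would mean the errors roughly cancelled out, which is not the same as every week being accurate.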
The responses to this news have varied. Some think that this big data malfunction was due to overhyping of big data’s abilities, while others believe that Google fell into the trap of simply collecting the wrong information and misinterpreting it.
Others, like me, believe that this “misunderstanding” is actually representative of something a bit more sinister and troubling than a simple case of misinterpreted data. In my opinion, this scandal shines a public light on the all-too-common (yet rarely publicized) danger of “big data wishful thinking,” which has little to do with errors in the data itself and instead highlights the irresponsibility of large companies like Google that release these “big data reports.”
Google basically collected a mass of data about the flu and thought to themselves, “How great would it be if this data could actually tell people who has the flu?” The potential for a logical connection existed between flu-related search terms and actual cases of the flu, so they decided to assert that connection without conducting proper research or due diligence to confirm the validity of their claim.
The solution would have been simple: Google could have tracked the Internet activities of a group of test subjects, with and without the flu, and compared how often each group searched for flu-related terms in order to paint a realistic picture of the relationship between searches and actual flu cases.
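The comparison I have in mind can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the subject records are invented, and the helper name search_rates is hypothetical, not anything Google used.

```python
def search_rates(subjects):
    """Return (search rate among flu subjects, search rate among non-flu subjects).

    Each subject is a (has_flu, searched_flu_terms) pair of booleans.
    """
    flu = [searched for has_flu, searched in subjects if has_flu]
    no_flu = [searched for has_flu, searched in subjects if not has_flu]
    if not flu or not no_flu:
        raise ValueError("need at least one subject in each group")
    # Booleans sum as 0/1, so sum / len is the share who searched.
    return sum(flu) / len(flu), sum(no_flu) / len(no_flu)


# Hypothetical test subjects (illustrative only).
subjects = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, True), (False, False),
]
flu_rate, no_flu_rate = search_rates(subjects)
print(f"searched while sick: {flu_rate:.0%}, searched while healthy: {no_flu_rate:.0%}")
```

If healthy people search for flu terms almost as often as sick people do, the searches are a weak signal of actual illness, and any report built on them will tend to overcount.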
Google is a massive company with the intelligence and the resources to conduct this sort of validation study. So why didn’t they?
Because for some companies, big data isn’t about the truth. Proper experimental processes dictate that data must be collected and objectively analyzed in order to come to a conclusion. However, it appears that in Google’s case they began with the end in mind: a desired conclusion that would improve their image, credibility, and ranking within the big data sphere. With their conclusion already chosen, they filled in the blanks with data that they surely knew was not completely suitable or relevant to support their claim.
This entire case highlights one of the main dangers of big data. Even just the term “big data” sounds so technical and sterile that it’s hard to believe that it could be manipulated to reflect a company’s agenda. We live in a technocracy where the public puts unquestioning faith in tech giants like Google and Microsoft, to the point that these companies have more than ample opportunity to delude the public for the sake of furthering their own agenda.
It’s difficult to believe that big data is a tool that can be manipulated by these companies, yet we must recognize that data collection methods can be flawed and analysis processes skewed in favor of a desired conclusion that benefits the analyzer.
Big data privacy laws pose a unique challenge in the sense that, while they’re meant to keep our personal information safe, they also limit the amount of information that companies are allowed to reveal about their reports. In the case of GFT, Google wasn’t allowed to reveal the search terms upon which the entire report was based. The problem was that we only saw Step 3, the conclusion of the report, while Steps 1 and 2, the raw data and analytics processes, remained shrouded in the darkness of stringent privacy regulations.
And with that, dear readers, I wish you a Happy Friday and advise you to always consider the source of your information before you take it at face value.
Data-Fully Yours, The Captain
The Science article can be viewed here: http://www.sciencemag.org/content/343/6176/1203