Big Data and Causation: The Curly Fries Case Study

Big data can be an extremely accurate predictive tool and often reveals patterns and outliers, but the relationships between the data and the reality it attempts to represent are often convoluted and misleading. To dig through the data, we must ask "why": figure out the reasons a particular big data finding exists and connect the numbers to what they represent in real life. Because "big data" is often open to interpretation, there is room for human error, so we must make a concerted effort to apply intelligence to the data in order to glean as much accurate value from it as possible.

An interesting case came to light in a recent TEDxMidAtlantic talk, in which scientist Jennifer Golbeck described a phenomenon dubbed the “Curly Fries” Case Study. The concept is simple but quite bizarre. When scientists were looking into the possibility of selling people’s personal social media activity data to future employers and marketers, they found that an astronomically high percentage of “smart people” had liked curly fries on Facebook. The question is why, and the ways we might try to answer that question reveal a common analytical error made when formulating explanations for big data findings.

A mechanical, robotic approach to analysis would draw a straight line from one point to the other and conclude that smart people like curly fries. The hypotheses that could follow: If my child develops a penchant for curly fries at a young age, does this predict his future intelligence? If I feed my child curly fries, will he become smarter?

In reality, we have to take the big data findings at face value and not form irrational conclusions. The study has told us only that a lot of smart people like curly fries on FACEBOOK. So what intelligent and realistic explanations could support this finding? If you think about the mechanics of Facebook and the way likes trickle down through friend networks on social media sites, a likely explanation emerges: a person with a high influence factor and a large friend network of equally smart people “liked” curly fries, and the rest followed.
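To make the trickle-down explanation concrete, here is a minimal Python sketch. It is purely illustrative: the population size, the 30% "smart" share, the friendship counts, and every probability are invented assumptions, not figures from the study, and the function name `simulate_like_cascade` is hypothetical. For simplicity the first liker is an ordinary smart person rather than an especially influential one. The key point is that nobody's chance of passing on the like depends on intelligence; only the friend network is clustered.

```python
import random

def simulate_like_cascade(n_people=2000, smart_share=0.3, friends_each=10,
                          homophily=0.9, pass_on_prob=0.3, hops=3, seed=42):
    """Spread one page like through a friend network where taste ignores intelligence.

    All parameter values are illustrative assumptions, not data from the study.
    """
    rng = random.Random(seed)
    n_smart = int(n_people * smart_share)
    # People 0..n_smart-1 are labeled "smart"; everyone else is not.
    is_smart = [i < n_smart for i in range(n_people)]
    smart_ids = list(range(n_smart))
    other_ids = list(range(n_smart, n_people))

    # Build an undirected friend graph with homophily: most friendships
    # stay inside a person's own group.
    friends = [set() for _ in range(n_people)]
    for person in range(n_people):
        if is_smart[person]:
            own, cross = smart_ids, other_ids
        else:
            own, cross = other_ids, smart_ids
        for _ in range(friends_each):
            pool = own if rng.random() < homophily else cross
            friend = rng.choice(pool)
            if friend != person:
                friends[person].add(friend)
                friends[friend].add(person)

    # Person 0 (smart) likes the page first. Each friend of a new liker then
    # likes it with the same probability, whatever their own intelligence.
    likers = {0}
    frontier = [0]
    for _ in range(hops):
        next_frontier = []
        for person in frontier:
            for friend in sorted(friends[person]):
                if friend not in likers and rng.random() < pass_on_prob:
                    likers.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier

    smart_among_likers = sum(is_smart[p] for p in likers) / len(likers)
    return smart_among_likers, n_smart / n_people, len(likers)

liker_share, population_share, n_likers = simulate_like_cascade()
print(f"{n_likers} likers: {liker_share:.0%} smart, "
      f"versus {population_share:.0%} smart in the whole population")
```

Under these assumptions, smart people should end up over-represented among the likers simply because the like started inside a cluster of smart friends and had only a few hops to travel. Curly fries never made anyone smarter, and being smart never made anyone fonder of curly fries.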

It’s important to distinguish between correlation and causation. It’s very easy to assume that a striking pattern in a dataset indicates a causal relationship. In reality, being smart does not necessarily make a person more likely to enjoy curly fries; the pattern can be due to a coincidence that has nothing to do with intelligence, such as who happened to like the page first.

Lesson learned, and enjoy your Tuesday!

Data-Fully Yours,

Captain Dash