New research suggests that big data, particularly social media data, can give a biased picture of the people it is taken to represent, because the underlying samples are skewed by societal factors.
Striking new research out of Princeton University’s Center for Information Technology Policy and the University of North Carolina at Chapel Hill suggests that inferences based on how people use social media platforms like Twitter and Facebook should be reconsidered. The reason? These platforms represent skewed samples from which it is difficult to draw accurate conclusions.
[ Thank you MIT Sloan Management Review]
[ By Renee Boucher Ferguson | 07.17.13 ]
In her draft paper, Big Data: Pitfalls, Methods and Concepts for an Emergent Field, UNC professor and Princeton CITP fellow Zeynep Tufekci (@zeynep) compares the methodological challenges of developing socially-based big data insights using Twitter to biological testing on Drosophila flies, better known as fruit flies. Drosophila flies are usually chosen because they’re relatively easy to use in lab settings, easy to breed, have rapid and “stereotypical” life cycles, and the adults are pretty small. The problem? They’re not necessarily representative of non-lab (read: real-life) scenarios. Tufekci posits that the dominance of Twitter as the “model organism” for social media in big data analyses similarly skews analysis:
Each social media platform carries with it certain affordances which structure its social norms and interactions and may not be representative of other social media platforms, or general human social behavior …
Tufekci says that one of the biggest methodological dangers of big data analysis is “insufficient understanding of the underlying samples.” In her words,
It’s not enough to understand how many people have “liked” a Facebook status update, clicked on a link, or “retweeted” a message, without having a sense of how many people saw and chose to, or not to, take that option. That kind of normalization is rarely done, or may even be actively decided against because the results start appearing more complex or more trivial.
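The normalization Tufekci describes can be made concrete with a small sketch. It assumes the number of people who saw each post (here called `impressions`) is available, which is often not the case in practice. The `PostStats` type, its field names, and all of the numbers are hypothetical and chosen only for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PostStats:
    """Hypothetical per-post counts; field names are illustrative only."""
    name: str
    impressions: int  # how many people saw the post (assumed to be known)
    retweets: int     # how many of those viewers chose to retweet it


def engagement_rate(post: PostStats) -> float:
    """Share of viewers who acted: the raw count normalized by exposure."""
    if post.impressions <= 0:
        raise ValueError("impressions must be positive to compute a rate")
    return post.retweets / post.impressions


# Made-up numbers for illustration, not data from the paper.
post_a = PostStats("A", impressions=100_000, retweets=500)
post_b = PostStats("B", impressions=2_000, retweets=100)

for post in (post_a, post_b):
    rate = engagement_rate(post)
    print(f"{post.name}: {post.retweets} retweets, rate {rate:.1%}")
```

In this made-up case post A has five times as many retweets as post B, yet post B was retweeted by a ten times larger share of the people who saw it. The raw count and the normalized rate rank the two posts in opposite order, which is the kind of complication the quote says is rarely accounted for.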
On the conceptual side of the big data analysis challenge, Tufekci posits that more in-depth research is needed to deepen the understanding of exactly what social media footprints mean, and what can legitimately be inferred from big data analysis of those footprints.
A case in point: while retweets or mentions are often treated as a measure of “influence,” the meaning of a retweet could actually be something quite different from influence, ranging from “affirmation to denunciation to sarcasm to approval to disgust.”
Tufekci makes three additional points regarding conceptual analysis of big data that can be applied in a business setting:
When I asked Tufekci how she thinks her research applies to business managers using online and social media data, she said it’s important to keep in mind that more data does not necessarily mean more insight.
“A lot of big data research is done in an isolated, one-shot, single-method manner with no way to assess, interpret or contextualize the findings,” she said. “There is great potential for error and misunderstanding; worse, with a lot of money flowing into this space, there is a lot of pressure to produce ‘results’ and overlook the fact that methods that were not developed to study humans, and do not necessarily work the same way, are being applied widely.
“The online imprints that create these large, aggregate datasets are not just mere ‘mirrors’ of human activity; rather, they are partial, filtered, distorted and complex reflections.”
More Reading: http://sloanreview.mit.edu/big-ideas/data-analytics/