The dark and dangerous side of big data

United Telekinetics

I love the “big data” trend: the possibilities of being able to fully collect and analyze user behavior are tremendous.

However, we must not take the human factor out of the equation. Big data is not all intelligent algorithms tirelessly seeking connections. Humans are very much the key part: they interpret the results and translate them into real-world meaning. What is more, the dark charm of statistics can very easily cause us to forget the “real world” meaning and arrive very quickly at very wrong conclusions.

A well-known example is “Simpson’s Paradox”, aptly named after its discoverer Udny Yule :-). The phenomenon basically shows that it is very easy to reach the wrong conclusion if you pool different samples without looking for causality at the same time: a trend that holds within every group can weaken or even reverse once the groups are combined.

A few easy examples are in the Wikipedia entry, and in Product Management we actually encounter similar effects all the time, when performing A/B testing or when observing user behavior.
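To make the effect concrete for the A/B testing case, here is a minimal sketch in Python. The variant names, the mobile/desktop segments and all the counts are invented for illustration; they are not taken from any real test.

```python
# Hypothetical A/B test results: (conversions, visitors) per variant and segment.
# All numbers are made up to illustrate Simpson's paradox.
results = {
    "A": {"mobile": (81, 87), "desktop": (192, 263)},
    "B": {"mobile": (234, 270), "desktop": (55, 80)},
}


def segment_rates(results):
    """Conversion rate of each variant inside each segment."""
    return {
        variant: {seg: conv / visits for seg, (conv, visits) in segments.items()}
        for variant, segments in results.items()
    }


def overall_rates(results):
    """Conversion rate of each variant with all segments pooled together."""
    pooled = {}
    for variant, segments in results.items():
        conversions = sum(conv for conv, _ in segments.values())
        visitors = sum(visits for _, visits in segments.values())
        pooled[variant] = conversions / visitors
    return pooled


by_segment = segment_rates(results)
overall = overall_rates(results)

for segment in ("mobile", "desktop"):
    print(segment, {v: round(by_segment[v][segment], 3) for v in results})
print("overall", {v: round(rate, 3) for v, rate in overall.items()})
```

With these numbers, variant A converts better on mobile and on desktop, yet variant B looks better once the segments are pooled, simply because most of B's traffic came from the segment that converts well anyway.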

I came across a recent example when I was alerted to strange behavior at a customer site, where the admins could not figure out why certain media content was far more popular than other content. We spent hours digging through website analytics, building funnels for inbound and outbound flows, and racking our brains.

Finally, we discovered that under certain conditions, the video player loads slowly. So slowly, in fact, that users simply abandoned the page before the video even started playing. This caused a dent in the data that was totally unrelated to the actual media content, the page styling, et cetera.

Being able to track all possible conditions, and all possible analytics, creates a huge pile of data. Machines will help sort through it and will present possible correlations, but it is up to the human at the helm to use common sense, rule out the irrelevant correlations, and dig for more inputs (e.g. video player load time) when the results are not satisfactory.
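As a rough sketch of what "digging for more inputs" can look like, the Python snippet below compares a naive popularity figure with one that controls for player load time. The log format, the video names, the counts and the 5-second cutoff are all assumptions made up for this example, not data from the customer site.

```python
from collections import defaultdict

# Hypothetical page-view log: (video_id, player_load_seconds, playback_started).
views = (
    [("video_a", 1.0, True)] * 45 + [("video_a", 1.0, False)] * 5
    + [("video_a", 9.0, True)] * 1 + [("video_a", 9.0, False)] * 9
    + [("video_b", 1.0, True)] * 9 + [("video_b", 1.0, False)] * 1
    + [("video_b", 9.0, True)] * 5 + [("video_b", 9.0, False)] * 45
)

SLOW_THRESHOLD_SECONDS = 5.0  # assumed cutoff between "fast" and "slow" loads


def start_rates(views, key):
    """Share of page views in which playback started, grouped by key(video, load)."""
    started = defaultdict(int)
    total = defaultdict(int)
    for video, load_seconds, did_start in views:
        group = key(video, load_seconds)
        total[group] += 1
        started[group] += int(did_start)
    return {group: started[group] / total[group] for group in total}


def speed_bucket(load_seconds):
    return "slow" if load_seconds > SLOW_THRESHOLD_SECONDS else "fast"


# Naive view: one number per video, load time ignored.
naive = start_rates(views, lambda video, load: video)
# Controlled view: compare videos only within the same load-time bucket.
controlled = start_rates(views, lambda video, load: (video, speed_bucket(load)))

print(naive)
print(controlled)
```

In this made-up log, video_a looks far more popular than video_b in the naive view, but inside each load-time bucket the two videos start at the same rate. The difference comes entirely from how often each video was served with a slow player.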

Be aware of the “lurking variables” and use intuition and imagination to flush them out.

Finally, an example on a related topic, of how being presented with numbers can drive perception, even if the whole picture is not revealed. From the Washington Post: Gas prices are indeed soaring… until you take into account the (not-so-lurking) variable of inflation.
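The adjustment itself is simple arithmetic: multiply the nominal price by the ratio of the base-period price index to the index at the time of the price. The Python sketch below uses invented prices and index values (not the Washington Post figures), and `to_real_price` is a hypothetical helper name.

```python
def to_real_price(nominal_price, index_then, index_base):
    """Express a historical nominal price in base-period money."""
    return nominal_price * index_base / index_then


# Hypothetical rows: (period, nominal price, price index). Invented numbers.
series = [
    ("year 1", 1.00, 100.0),
    ("year 2", 1.50, 150.0),
    ("year 3", 2.00, 200.0),
]

base_index = series[-1][2]  # express everything in "year 3" money
real = [
    (period, round(to_real_price(price, index, base_index), 2))
    for period, price, index in series
]
print(real)
```

Here the nominal price doubles over the three periods, but so does the price index, so the inflation-adjusted price does not move at all.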

One thought on “The dark and dangerous side of big data”

  1. When you build an algorithm to do some analysis on big data, you have to identify and eliminate the outliers before you can actually look at the results. The outliers, like the case you describe above, may shift your results significantly, to the point that they obscure the real results. I’ve had some strange cases where the weekly trends were looking almost too good to be true… I then found that they were indeed too good: I had a couple of days with a huge number of users coming from a couple of Northern Ireland towns, and a couple of other days with large numbers from a remote village in Malaysia. Bogus/worms/attacks or whatever you want to call them…
    One problem I found is that it’s very hard to eliminate the outliers when you use standard tools like Google Analytics and the like.
