Thoughts on Big Data

<< 2013-09-11 15:06 >>

Big Data has really caught on as a buzz word, even well outside the technology world, with journalists writing columns on its consequences for privacy, research, and so on. I'd argue that the Big Data buzz so far has underestimated the importance of this trend, and that its consequences for us all are far more profound than most people realize. I guess that requires a bit of explaining.

What is Big Data really?

Let's start with the term itself. Big Data is commonly defined as "data sets too big to process by conventional means," which is correct, but misleading, since what matters is not the size of the data sets, but the way they are used. Traditionally, data has been collected in databases and data warehouses to produce reports and nice visual dashboards for human analysis. The important new trend is going beyond that with machine learning, to have computers take over more (but not all) of the analysis, and thus make it possible to apply data analysis to entirely new uses.

The term "data science" has been coined for this new use of machine learning techniques to answer questions of key business importance, and this is really the crucial change. True Big Data is something only the biggest web companies and global corporations see, but data sets holding the answers to valuable business questions are something most organizations have. With Big Data techniques these can be put to use, to provide insights of considerable value.

Data science Venn diagram

Drew Conway made a data science Venn diagram showing what makes up data science. As you can see, Big Data doesn't come into it at all. I think that's right. Big Data is just about scale, but the important thing is the analysis of the data. Still, when speaking about these things more broadly, you'll probably find yourself forced to call it Big Data so people will know what you're talking about.

But what kinds of data analysis do I mean? What are these "questions of key business importance"? That's best explained with some real-world examples.

Examples of use

Below is a list of real-world uses of Big Data that I think illustrate quite well both how this differs from simply using old-fashioned reports, and also how it can yield very real business benefit.

Those are just the use cases I've read accounts of people actually doing. It's clear that there's a whole range of other applications that organizations either are doing or should be doing, like:

I could keep this list going more or less indefinitely, but I'd like to emphasize once again that what really makes this significant is that these uses of data are about answering key business questions. It's about working out how to either sell more by being slightly smarter, or reducing costs the same way. Basically, it boils down to being more productive, getting more out of the same resources, and that can have a huge impact both in business and government.


It should be pretty clear that as long as you have the necessary data sets there are potentially huge gains to be had here, provided you're able to successfully mine the data. And you have two potential hurdles right there. I predict that we'll see a number of changes in how companies do business in the future, motivated simply by a desire to gather more data. Loyalty cards are a trivial example of that, but I bet there will be more. Data is increasingly becoming a valuable commodity in its own right. Japan Rail, for example, recently proposed selling data about commuter habits to businesses.

The other hurdle is being able to actually mine the data. Doing so requires an understanding of business issues, the ability to massage and process data, and knowledge of maths and statistics. Each of these abilities is fairly rare, and people possessing all three are becoming worth their weight in gold. Maths and statistics skills in particular are becoming much more valuable, since they require significant effort to acquire. And we're not talking about simple algebra here, but graph theory, probability theory, linear algebra, calculus, etc. In fact, a key Big Data concept is what's known as the Big Data skills shortage: the gap between available and needed skills.

Implications for society

The implications for privacy from Google and Facebook and so on are generally recognized, but Big Data's implications for society are far wider. What if the recent upset over the IRS targeting conservative groups had been caused by a machine learning model putting huge weight on party affiliation? In that case, the IRS might have targeted Republicans without actually being aware of it. Would saying "but the numbers say Republicans really are more likely to cheat," be accepted as an explanation? Should it be?

This actually goes further. What if insurance companies were found to charge members of some vulnerable ethnic group significantly more, in ways that correspond with common prejudices against that group? Is it enough to say "the model came up with this, not us"? What if it's found that the features the model attached weight to are obvious indicators of that ethnicity? This could easily happen by accident, causing massive loss of reputation without anyone even being aware of the problem before the scandal breaks. In fact, similar things have already happened.
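To make the insurance scenario concrete, here is a minimal sketch of the kind of check that could catch it before a scandal does. Everything in it is an assumption for illustration: the records are made up, and the field names (`postcode`, `group`, `premium`) and the helper functions `mean_by_group` and `proxy_strength` are hypothetical, not taken from any real system. The idea is simply to compare average prices across groups, and to measure how well a feature the model did see (here, postcode) gives away a group it never saw.

```python
from collections import defaultdict


def mean_by_group(records, group_key, value_key):
    """Return {group: mean value} for a list of dict records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        totals[r[group_key]] += r[value_key]
        counts[r[group_key]] += 1
    return {g: totals[g] / counts[g] for g in totals}


def proxy_strength(records, feature_key, group_key):
    """Fraction of records whose group is guessed correctly when we
    predict the most common group for each value of the feature."""
    by_feature = defaultdict(lambda: defaultdict(int))
    for r in records:
        by_feature[r[feature_key]][r[group_key]] += 1
    correct = sum(max(c.values()) for c in by_feature.values())
    return correct / len(records)


# Made-up quotes. The (hypothetical) pricing model only saw "postcode",
# never "group"; we join the group back in purely for the audit.
quotes = [
    {"postcode": "A", "group": "x", "premium": 100.0},
    {"postcode": "A", "group": "x", "premium": 110.0},
    {"postcode": "A", "group": "y", "premium": 105.0},
    {"postcode": "B", "group": "y", "premium": 150.0},
    {"postcode": "B", "group": "y", "premium": 160.0},
    {"postcode": "B", "group": "y", "premium": 155.0},
]

premiums = mean_by_group(quotes, "group", "premium")
strength = proxy_strength(quotes, "postcode", "group")

print(premiums)  # average premium per group
print(strength)  # how well postcode alone predicts group
```

In this toy data, group "y" pays noticeably more on average, and postcode alone identifies the group for most records, which is exactly the "not us, the model" situation described above. A real audit would need proper statistics and far more care, but even a crude check like this makes the problem visible.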

Another issue is raised by Obama's second presidential campaign, in which staffers built a huge database of supporters and voters, called Project Narwhal. With 170 million people in it, and detailed information on the preferences of each person, it allowed the campaign to target marketing at these voters, mentioning only policies voters were likely to approve of. Policies less likely to find favour were passed over in silence. That's likely to be effective, but is it ethical? Similar techniques, on a smaller scale, were used in Norway this year.

As this type of analysis shows what it can do, all sorts of pressures are going to increase. One example is that if regulatory bodies decide they want to use Big Data techniques to work out which companies to investigate, they'll quickly discover that most of the interesting data is held by the companies themselves. How long will it be before the government demands that the companies share production data, so it can be used for prediction purposes? Similarly, how long before companies start demanding background data from their business partners?

I think society will find itself facing a number of major issues here, and these are only going to grow in importance as machine learning and deep learning techniques become more widely used. It's also obvious that business and government need to wake up to the potential that lies here, and start thinking about how it applies to them. For people in IT the consequences are huge, and threaten to cause major changes to the entire field, but that's a subject for another post.



David - 2013-09-11 11:35:47

A perfect example of Big Data usage, one that nearly anyone who uses the internet has come into contact with, is the ads that appear on websites. YouTube, for example, will show certain ads on their videos in one area and others in other areas. This is not such a big thing, because it's clear that the ads need to be in the user's native language, but what about when ads begin to pop up based on the videos you've been watching, videos you've searched for, likes on Google+ or maybe even Google searches? In a way, it is a good thing that users get ads that they may be interested in, but it can become a problem when privacy is lost, and other companies, or governments, or malicious people get their hands on information like browsing history, search history, Facebook posts and actions, etc.

Robert Barta - 2013-09-12 02:28:57

A lot of politics here, and rightly so.

On a technical tangent, I believe that the term "big data" has its merits, although it would probably be better coined as "annoyingly big and crappy data".

It is nothing new, really, but the problem of handling the workload/size of data chunks on your existing infrastructure (single core, single host, single compartment, single machine, single cluster, single Internet) will naturally hit some technical boundary (space/time). Overcoming that may rightly be called "big".
