Thoughts on Big Data
Big Data has really caught on as a buzzword, even well outside the technology world, with journalists writing columns on its consequences for privacy, research, and so on. I'd argue that the Big Data buzz so far has underestimated the importance of this trend, and that its consequences for us all are far more profound than most people realize. I guess that requires a bit of explaining.
What is Big Data really?
Let's start with the term itself. Big Data is commonly defined as "data sets too big to process by conventional means," which is correct, but misleading, since what matters is not the size of the data sets, but the way they are used. Traditionally, data has been collected in databases and data warehouses to produce reports and nice visual dashboards for human analysis. The important new trend is going beyond that with machine learning, to have computers take over more (but not all) of the analysis, and thus make it possible to apply data analysis to entirely new uses.
The term "data science" has been coined for this new use of machine learning techniques to answer questions of key business importance, and this is really the crucial change. True Big Data is something only the biggest web companies and global corporations see, but data sets holding the answers to valuable business questions are something most organizations have. With Big Data techniques these can be put to use, to provide insights of considerable value.
Data science Venn diagram
Drew Conway made a data science Venn diagram showing what makes up data science. As you can see, Big Data doesn't come into it at all. I think that's right. Big Data is just about scale, but the important thing is the analysis of the data. Still, when speaking about these things more broadly, you'll probably find yourself forced to call it Big Data so people will know what you're talking about.
But what kinds of data analysis do I mean? What are these "questions of key business importance"? That's best explained with some real-world examples.
Examples of use
Below is a list of real-world uses of Big Data that I think illustrate quite well both how this differs from simply using old-fashioned reports, and also how it can yield very real business benefit.
- Famously, the supermarket chain Target created a statistical model to predict when a customer became pregnant, based on nothing more than what she purchased.
- The US Department of Agriculture developed a statistical model to predict which bulls would have the most valuable sperm. The value of the sperm depends on qualities of the cows the bulls father, such as how much milk they produce, their longevity, and so on. Interestingly, the input to the model is DNA samples. Using the model saves having to test the bulls by doing test breeding and waiting for the offspring to grow up and produce their own milk, so bulls can start breeding years earlier with this approach.
- Scientists have used machine learning to work out what side-effects you get by using two drugs at the same time, by analyzing reports of side-effects caused by drugs.
- Visa uses Big Data analytics to identify transactions that are likely fraudulent, and according to Visa this has saved them billions.
- Rio Salado College tries to predict which students are most likely to drop out, so that it can take measures to help these students stick it out, with potentially huge savings for the college, the students, and society at large. Data science can take this even further, to judge which measures are most effective, and for which groups of students.
- Data analytics is also starting to creep into basketball, soccer, movie programming, and even design of movie scripts.
Those are just the use cases I've read accounts of people actually doing. It's clear that there's a whole range of other applications that organizations either are doing or should be doing, like:
- Consider government organizations charged with oversight of regulatory compliance by businesses or individuals. They conduct inspections, but they need to know who to inspect. What if they could process their databases with machine learning techniques to predict who is most likely in breach of regulations? The savings and efficiency gain could be huge.
- Law enforcement is starting to use this kind of technique to predict where and by whom crimes are most likely to be committed. Google even uncovered a ring of Chinese car thieves. I haven't seen it stated explicitly, but you can be certain that counter-terrorism agencies have already started doing the same with surveillance and other data.
- Retailers use these techniques to understand the patterns in what people buy, to know what products to offer customers, to estimate what prices different people are willing to pay for the same product, to guess how much of a certain product needs to be in stores by a certain date given the expected weather, and so on and so forth.
- Much of what drives the web, such as Google's PageRank, Amazon recommendations, Twitter's trending tags, and so on, derives from similar techniques.
I could keep this list going more or less indefinitely, but I'd like to emphasize once again that what really makes this significant is that these uses of data are about answering key business questions. It's about working out how to either sell more by being slightly smarter, or reducing costs the same way. Basically, it boils down to being more productive, getting more out of the same resources, and that can have a huge impact both in business and government.
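To make "a statistical model that predicts something useful" a little more concrete, here is a minimal sketch in the spirit of the dropout example. Everything in it is an assumption made for illustration: the two input features, the toy numbers, and the choice of logistic regression trained by plain gradient descent. It is not how Rio Salado College or anyone else mentioned here actually does it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))
            err = p - y  # gradient of the log loss with respect to the score
            grads[0] += err
            for j, xi in enumerate(x):
                grads[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def predict_risk(weights, x):
    """Estimated probability that a student with features x drops out."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

# Invented toy data: (logins per week, fraction of assignments submitted).
students = [(0.5, 0.2), (1.0, 0.3), (1.5, 0.1), (2.0, 0.4),
            (4.0, 0.8), (5.0, 0.9), (6.0, 0.7), (7.0, 1.0)]
dropped_out = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = dropped out, 0 = stayed

weights = train_logistic(students, dropped_out)
at_risk = predict_risk(weights, (1.0, 0.2))   # rarely logs in, submits little
engaged = predict_risk(weights, (6.0, 0.9))   # active, submits most work
print(f"at-risk student: {at_risk:.2f}, engaged student: {engaged:.2f}")
```

The point is not the algorithm, which is decades old, but the use: the output is a ranked list of who to help first rather than a report for someone to read.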
It should be pretty clear that as long as you have the necessary data sets there are potentially huge gains to be had here, provided you're able to successfully mine the data. That sentence contains two potential hurdles. The first is having the data at all. I predict that we'll see a number of changes in how companies do business in the future, motivated simply by a desire to gather more data. Loyalty cards are a trivial example of that, but I bet there will be more. Data is increasingly becoming a valuable commodity in its own right. Japan Rail, for example, recently proposed selling data about commuter habits to businesses.
The other hurdle is being able to actually mine the data. Doing so requires an understanding of business issues, the ability to massage and process data, and knowledge of maths and statistics. Each of these abilities is fairly rare, and people possessing all three are becoming worth their weight in gold. Maths and statistics skills in particular are becoming much more valuable, since they require significant effort to acquire. And we're not talking about simple algebra here, but graph theory, probability theory, linear algebra, calculus, etc. In fact, a key Big Data concept is what's known as the Big Data skills shortage, the gap between available and needed skills.
Implications for society
The implications for privacy from Google and Facebook and so on are generally recognized, but Big Data's implications for society are far wider. What if the recent upset over the IRS targeting conservative groups had been caused by a machine learning model putting huge weight on party affiliation? In that case, the IRS might have targeted Republicans without actually being aware of it. Would saying "but the numbers say Republicans really are more likely to cheat" be accepted as an explanation? Should it be?
This actually goes further. What if insurance companies were found to charge some vulnerable ethnic group significantly more, in ways that correspond with common prejudices against that group? Is it enough to say "the model came up with this, not us"? What if it's found that the features the model attached weight to are obvious indicators of that ethnicity? This could easily happen by accident, causing massive loss of reputation without anyone even being aware of the problem before the scandal breaks. In fact, similar things have already happened.
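One way to catch this kind of accident before the scandal breaks is to check whether any input feature is, in effect, standing in for group membership. The sketch below is only an illustration under loud assumptions: the feature names, the numbers, and the 0.8 threshold are all invented, and a simple correlation check like this will miss proxies that only emerge from combinations of features.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def flag_proxies(features, group, threshold=0.8):
    """Names of features whose |correlation| with group membership is high."""
    return [name for name, values in features.items()
            if abs(pearson(values, group)) >= threshold]

# Invented data: 1 = customer belongs to the group in question, 0 = does not.
group = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "postcode_cluster": [3, 3, 3, 2, 1, 1, 1, 1],     # hypothetical feature
    "years_licensed":   [5, 12, 8, 20, 6, 11, 9, 19],  # hypothetical feature
}

suspects = flag_proxies(features, group)
print(suspects)
```

In this made-up data the postcode feature tracks group membership closely while years licensed does not, so a model leaning heavily on postcode would be pricing by group without anyone having asked it to.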
Another issue is raised by Obama's second presidential campaign, in which staffers built a huge database of supporters and voters, called Project Narwhal. With 170 million people in it, and detailed information on the preferences of each person, it allowed the campaign to target marketing at these voters, mentioning only policies voters were likely to approve of. Policies less likely to find favour were passed over in silence. That's likely to be effective, but is it ethical? Similar techniques, on a smaller scale, were used in Norway this year.
As this type of analysis shows what it can do, all sorts of pressures are going to increase. One example is that if regulatory bodies decide they want to use Big Data techniques to work out which companies to check out, they'll quickly discover that most of the interesting data is held by the companies themselves. How long will it be before the government demands that the companies share production data, so it can be used for prediction purposes? Similarly, how long before companies start demanding background data from their business partners?
I think society will find itself facing a number of major issues here, and they are only going to grow in importance as machine learning and deep learning techniques become more widely used. It's also obvious that business and government need to wake up to the potential that lies here, and start thinking about how it applies to them. For people in IT the consequences are huge, and threaten to cause major changes to the entire field, but that's a subject for another post.
David - 2013-09-11 11:35:47
A perfect example of Big Data usage, one that nearly anyone who uses the internet has been in contact with, is the ads that appear on websites. YouTube, for example, will show certain ads on its videos in one area and others in other areas. This is not such a big thing, because it's clear that the ads need to be in the user's native language, but what about when ads begin to pop up relative to the videos you've been watching, videos searched, likes on Google+ or maybe even Google searches? In a way, it is a good thing that users get ads that they may be interested in, but it can become a problem when privacy is lost, and other companies, or governments, or malicious people get their hands on information like browsing history, search history, Facebook posts and actions, etc.
Robert Barta - 2013-09-12 02:28:57
A lot of politics here, and rightly so.
On a technical tangent I believe that the term "big data" has its merits, although it would probably be better coined as "annoyingly big and crappy data".
It is nothing new, really, but the problem of handling the workload/size of data chunks on your existing infrastructure (single core, single host, single compartment, single machine, single cluster, single Internet) will naturally hit some technical boundary (space/time). Overcoming that may rightly be called "big".