What’s All the Fuss About Big Data?

We excerpt an interview with big data pioneer and computer scientist, Alex Pentland, via the Edge. Pentland is a leading thinker in computational social science and currently directs the Human Dynamics Laboratory at MIT.

While there is no exact definition of “big data” it tends to be characterized quantitatively and qualitatively differently from data commonly used by most organizations. Where regular data can be stored, processed and analyzed using common database tools and analytical engines, big data refers to vast collections of data that often lie beyond the realm of regular computation. So, often big data requires vast and specialized storage and enormous processing capabilities. Data sets that fall into the big data area cover such areas as climate science, genomics, particle physics, and computational social science.

Big data holds true promise. However, while storage and processing power now enable quick and efficient crunching of tera- and even petabytes of data, tools for comprehensive analysis and visualization lag behind.

[div class=attrib]Alex Pentland via the Edge:[end-div]

Recently I seem to have become MIT’s Big Data guy, with people like Tim O’Reilly and “Forbes” calling me one of the seven most powerful data scientists in the world. I’m not sure what all of that means, but I have a distinctive view about Big Data, so maybe it is something that people want to hear.

I believe that the power of Big Data is that it is information about people’s behavior instead of information about their beliefs. It’s about the behavior of customers, employees, and prospects for your new business. It’s not about the things you post on Facebook, and it’s not about your searches on Google, which is what most people think about, and it’s not data from internal company processes and RFIDs. This sort of Big Data comes from things like location data off of your cell phone or credit card, it’s the little data breadcrumbs that you leave behind you as you move around in the world.

What those breadcrumbs tell is the story of your life. It tells what you’ve chosen to do. That’s very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you’re likely to get diabetes.

They can do this because the sort of person you are is largely determined by your social context, so if I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your crowd. You can tell all sorts of things about a person, even though it’s not explicitly in the data, because people are so enmeshed in the surrounding social fabric that it determines the sorts of things that they think are normal, and what behaviors they will learn from each other.

As a consequence analysis of Big Data is increasingly about finding connections, connections with the people around you, and connections between people’s behavior and outcomes. You can see this in all sorts of places. For instance, one type of Big Data and connection analysis concerns financial data. Not just the flash crash or the Great Recession, but also all the other sorts of bubbles that occur. What these are is these are systems of people, communications, and decisions that go badly awry. Big Data shows us the connections that cause these events. Big data gives us the possibility of understanding how these systems of people and machines work, and whether they’re stable.

The notion that it is connections between people that is really important is key, because researchers have mostly been trying to understand things like financial bubbles using what is called Complexity Science or Web Science. But these older ways of thinking about Big Data leaves the humans out of the equation. What actually matters is how the people are connected together by the machines and how, as a whole, they create a financial market, a government, a company, and other social structures.

Because it is so important to understand these connections Asu Ozdaglar and I have recently created the MIT Center for Connection Science and Engineering, which spans all of the different MIT departments and schools. It’s one of the very first MIT-wide Centers, because people from all sorts of specialties are coming to understand that it is the connections between people that is actually the core problem in making transportation systems work well, in making energy grids work efficiently, and in making financial systems stable. Markets are not just about rules or algorithms; they’re about people and algorithms together.

Understanding these human-machine systems is what’s going to make our future social systems stable and safe. We are getting beyond complexity, data science and web science, because we are including people as a key part of these systems. That’s the promise of Big Data, to really understand the systems that make our technological society. As you begin to understand them, then you can build systems that are better. The promise is for financial systems that don’t melt down, governments that don’t get mired in inaction, health systems that actually work, and so on, and so forth.

The barriers to better societal systems are not about the size or speed of data. They’re not about most of the things that people are focusing on when they talk about Big Data. Instead, the challenge is to figure out how to analyze the connections in this deluge of data and come to a new way of building systems based on understanding these connections.

Changing The Way We Design Systems

With Big Data traditional methods of system building are of limited use. The data is so big that any question you ask about it will usually have a statistically significant answer. This means, strangely, that the scientific method as we normally use it no longer works, because almost everything is significant! As a consequence the normal laboratory-based question-and-answering process, the method that we have used to build systems for centuries, begins to fall apart.

Big data and the notion of Connection Science is outside of our normal way of managing things. We live in an era that builds on centuries of science, and our methods of building of systems, governments, organizations, and so on are pretty well defined. There are not a lot of things that are really novel. But with the coming of Big Data, we are going to be operating very much out of our old, familiar ballpark.

With Big Data you can easily get false correlations, for instance, “On Mondays, people who drive to work are more likely to get the flu.” If you look at the data using traditional methods, that may actually be true, but the problem is why is it true? Is it causal? Is it just an accident? You don’t know. Normal analysis methods won’t suffice to answer those questions. What we have to come up with is new ways to test the causality of connections in the real world far more than we have ever had to do before. We no can no longer rely on laboratory experiments; we need to actually do the experiments in the real world.

The other problem with Big Data is human understanding. When you find a connection that works, you’d like to be able to use it to build new systems, and that requires having human understanding of the connection. The managers and the owners have to understand what this new connection means. There needs to be a dialogue between our human intuition and the Big Data statistics, and that’s not something that’s built into most of our management systems today. Our managers have little concept of how to use big data analytics, what they mean, and what to believe.

In fact, the data scientists themselves don’t have much of intuition either…and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn’t understand that they were overfitting the data. They didn’t have that dialogue between intuition and causal processes that generated the data. They just fit the model and got a good number and published it, and the reviewers didn’t catch it either. That’s pretty bad because if we start building our world on results like that, we’re going to end up with trains that crash into walls and other bad things. Management using Big Data is actually a radically new thing.

[div class=attrib]Read the entire article after the jump.[end-div]

[div class=attrib]Image courtesy of Techcrunch.[end-div]