Tag Archives: big data

The Internet 0f Th1ngs

Technologist Marc Goodman describes a not-too-distant future in which all our appliances, tools, products… anything and everything is plugged into the so-called Internet of Things (IoT). The IoT describes a world where all things are connected to everything else, making for a global mesh of intelligent devices, from your connected car and your WiFi-enabled sneakers to your smartwatch and home thermostat. You may well believe it advantageous to have your refrigerator ping the local grocery store when it runs out of fresh eggs and milk, or to have your toilet auto-call a local plumber when it gets stopped up.

But, as our current Internet shows us — let’s call it the Internet of People — not all is rosy in this hyper-connected, 24/7, always-on digital ocean. What are you to do when hackers attack all your home appliances in a “denial of home service attack (DohS)”, or when your every move inside your home is scrutinized, collected, analyzed and sold to the nearest advertiser, or when your cooktop starts taking and sharing selfies with the neighbors?

Goodman’s new book on this important subject, excerpted here, is titled Future Crimes.

From the Guardian:

If we think of today’s internet metaphorically as about the size of a golf ball, tomorrow’s will be the size of the sun. Within the coming years, not only will every computer, phone and tablet be online, but so too will every car, house, dog, bridge, tunnel, cup, clock, watch, pacemaker, cow, streetlight, pipeline, toy and soda can. Though in 2013 there were only 13bn online devices, Cisco Systems has estimated that by 2020 there will be 50bn things connected to the internet, with room for exponential growth thereafter. As all of these devices come online and begin sharing data, they will bring with them massive improvements in logistics, employee efficiency, energy consumption, customer service and personal productivity.

This is the promise of the internet of things (IoT), a rapidly emerging new paradigm of computing that, when it takes off, may very well change the world we live in forever.

The Pew Research Center defines the internet of things as “a global, immersive, invisible, ambient networked computing environment built through the continued proliferation of smart sensors, cameras, software, databases, and massive data centres in a world-spanning information fabric”. Back in 1999, when the term was first coined by MIT researcher Kevin Ashton, the technology did not exist to make the IoT a reality outside very controlled environments, such as factory warehouses. Today we have low-powered, ultra-cheap computer chips, some as small as the head of a pin, that can be embedded in an infinite number of devices, some for mere pennies. These miniature computing devices only need milliwatts of electricity and can run for years on a minuscule battery or small solar cell. As a result, it is now possible to make a web server that fits on a fingertip for $1.

The microchips will receive data from a near-infinite range of sensors, minute devices capable of monitoring anything that can possibly be measured and recorded, including temperature, power, location, hydro-flow, radiation, atmospheric pressure, acceleration, altitude, sound and video. They will activate miniature switches, valves, servos, turbines and engines – and speak to the world using high-speed wireless data networks. They will communicate not only with the broader internet but with each other, generating unfathomable amounts of data. The result will be an always-on “global, immersive, invisible, ambient networked computing environment”, a mere prelude to the tidal wave of change coming next.

In the future all objects may be smart

The broad thrust sounds rosy. Because chips and sensors will be embedded in everyday objects, we will have much better information and convenience in our lives. Because your alarm clock is connected to the internet, it will be able to access and read your calendar. It will know where and when your first appointment of the day is and be able to cross-reference that information against the latest traffic conditions. Light traffic, and you get to sleep an extra 10 minutes; heavy traffic, and you might find yourself waking up earlier than you had hoped.

When your alarm does go off, it will gently raise the lights in the house, perhaps turn up the heat or run your bath. The electronic pet door will open to let Fido into the backyard for his morning visit, and the coffeemaker will begin brewing your coffee. You won’t have to ask your kids if they’ve brushed their teeth; the chip in their toothbrush will send a message to your smartphone letting you know the task is done. As you walk out the door, you won’t have to worry about finding your keys; the beacon sensor on the key chain makes them locatable to within two inches. It will be as if the Jetsons era has finally arrived.

While the hype-o-meter on the IoT has been blinking red for some time, everything described above is already technically feasible. To be certain, there will be obstacles, in particular in relation to a lack of common technical standards, but a wide variety of companies, consortia and government agencies are hard at work to make the IoT a reality. The result will be our transition from connectivity to hyper-connectivity, and like all things Moore’s law related, it will be here sooner than we realise.

The IoT means that all physical objects in the future will be assigned an IP address and be transformed into information technologies. As a result, your lamp, cat or pot plant will be part of an IT network. Things that were previously silent will now have a voice, and every object will be able to tell its own story and history. The refrigerator will know exactly when it was manufactured, the names of the people who built it, what factory it came from, and the day it left the assembly line, arrived at the retailer, and joined your home network. It will keep track of every time its door has been opened and which one of your kids forgot to close it. When the refrigerator’s motor begins to fail, it can signal for help, and when it finally dies, it will tell us how to disassemble its parts and best recycle them. Buildings will know every person who has ever worked there, and streetlights every car that has ever driven by.

All of these objects will communicate with each other and have access to the massive processing and storage power of the cloud, further enhanced by additional mobile and social networks. In the future all objects may become smart, in fact much smarter than they are today, and as these devices become networked, they will develop their own limited form of sentience, resulting in a world in which people, data and things come together. As a consequence of the power of embedded computing, we will see billions of smart, connected things joining a global neural network in the cloud.

In this world, the unknowable suddenly becomes knowable. For example, groceries will be tracked from field to table, and restaurants will keep tabs on every plate, what’s on it, who ate from it, and how quickly the waiters are moving it from kitchen to customer. As a result, when the next E. coli outbreak occurs, we won’t have to close 500 eateries and wonder if it was the chicken or beef that caused the problem. We will know exactly which restaurant, supplier and diner to contact to quickly resolve the problem. The IoT and its billions of sensors will create an ambient intelligence network that thinks, senses and feels and contributes profoundly to the knowable universe.

Things that used to make sense suddenly won’t, such as smoke detectors. Why do most smoke detectors do nothing more than make loud beeps if your life is in mortal danger because of fire? In the future, they will flash your bedroom lights to wake you, turn on your home stereo, play an MP3 audio file that loudly warns, “Fire, fire, fire.” They will also contact the fire department, call your neighbours (in case you are unconscious and in need of help), and automatically shut off flow to the gas appliances in the house.

The byproduct of the IoT will be a living, breathing, global information grid, and technology will come alive in ways we’ve never seen before, except in science fiction movies. As we venture down the path toward ubiquitous computing, the results and implications of the phenomenon are likely to be mind-blowing. Just as the introduction of electricity was astonishing in its day, it eventually faded into the background, becoming an imperceptible, omnipresent medium in constant interaction with the physical world. Before we let this happen, and for all the promise of the IoT, we must ask critically important questions about this brave new world. For just as electricity can shock and kill, so too can billions of connected things networked online.

One of the central premises of the IoT is that everyday objects will have the capacity to speak to us and to each other. This relies on a series of competing communications technologies and protocols, many of which are eminently hackable. Take radio-frequency identification (RFID) technology, considered by many the gateway to the IoT. Even if you are unfamiliar with the name, chances are you have already encountered it in your life, whether it’s the security ID card you use to swipe your way into your office, your “wave and pay” credit card, the key to your hotel room, your Oyster card.

Even if you don’t use an RFID card for work, there’s a good chance you either have it or will soon have it embedded in the credit card sitting in your wallet. Hackers have been able to break into these as well, using cheap RFID readers available on eBay for just $50, tools that allow an attacker to wirelessly capture a target’s credit card number, expiration date and security code. Welcome to pocket picking 2.0.

More productive and more prison-like

A much rarer breed of hacker targets the physical elements that make up a computer system, including the microchips, electronics, controllers, memory, circuits, components, transistors and sensors – core elements of the internet of things. These hackers attack a device’s firmware, the set of computer instructions present on every electronic device we encounter, including TVs, mobile phones, game consoles, digital cameras, network routers, alarm systems, CCTVs, USB drives, traffic lights, gas station pumps and smart home management systems. Before we add billions of hackable things and communicate with hackable data transmission protocols, important questions must be asked about the risks for the future of security, crime, terrorism, warfare and privacy.

In the same way our every move online can be tracked, recorded, sold and monetised today, so too will that be possible in the near future in the physical world. Real space will become just like cyberspace. With the widespread adoption of more networked devices, what people do in their homes, cars, workplaces, schools and communities will be subjected to increased monitoring and analysis by the corporations making these devices. Of course these data will be resold to advertisers, data brokers and governments, providing an unprecedented view into our daily lives. Unfortunately, just like our social, mobile, locational and financial information, our IoT data will leak, providing further profound capabilities to stalkers and other miscreants interested in persistently tracking us. While it would certainly be possible to establish regulations and build privacy protocols to protect consumers from such activities, the greater likelihood is that every IoT-enabled device, whether an iron, vacuum, refrigerator, thermostat or lightbulb, will come with terms of service that grant manufacturers access to all your data. More troublingly, while it may be theoretically possible to log off in cyberspace, in your well-connected smart home there will be no “opt-out” provision.

We may find ourselves interacting with thousands of little objects around us on a daily basis, each collecting seemingly innocuous bits of data 24/7, information these things will report to the cloud, where it will be processed, correlated, and reviewed. Your smart watch will reveal your lack of exercise to your health insurance company, your car will tell your insurer of your frequent speeding, and your dustbin will tell your local council that you are not following local recycling regulations. This is the “internet of stool pigeons”, and though it may sound far-fetched, it’s already happening. Progressive, one of the largest US auto insurance companies, offers discounted personalised rates based on your driving habits. “The better you drive, the more you can save,” according to its advertising. All drivers need to do to receive the lower pricing is agree to the installation of Progressive’s Snapshot black-box technology in their cars and to having their braking, acceleration and mileage persistently tracked.

The IoT will also provide vast new options for advertisers to reach out and touch you on every one of your new smart connected devices. Every time you go to your refrigerator to get ice, you will be presented with ads for products based on the food your refrigerator knows you’re most likely to buy. Screens too will be ubiquitous, and marketers are already planning for the bounty of advertising opportunities. In late 2013, Google sent a letter to the Securities and Exchange Commission noting, “we and other companies could [soon] be serving ads and other content on refrigerators, car dashboards, thermostats, glasses and watches, to name just a few possibilities.”

Knowing that Google can already read your Gmail, record your every web search, and track your physical location on your Android mobile phone, what new powerful insights into your personal life will the company develop when its entertainment system is in your car, its thermostat regulates the temperature in your home, and its smart watch monitors your physical activity?

Not only will RFID and other IoT communications technologies track inanimate objects, they will be used for tracking living things as well. The British government has considered implanting RFID chips directly under the skin of prisoners, as is common practice with dogs. School officials across the US have begun embedding RFID chips in student identity cards, which pupils are required to wear at all times. In Contra Costa County, California, preschoolers are now required to wear basketball-style jerseys with electronic tracking devices built in that allow teachers and administrators to know exactly where each student is. According to school district officials, the RFID system saves “3,000 labour hours a year in tracking and processing students”.

Meanwhile, the ability to track employees, how much time they take for lunch, the length of their toilet breaks and the number of widgets they produce will become easy. Moreover, even things such as words typed per minute, eye movements, total calls answered, respiration, time away from desk and attention to detail will be recorded. The result will be a modern workplace that is simultaneously more productive and more prison-like.

At the scene of a suspected crime, police will be able to interrogate the refrigerator and ask the equivalent of, “Hey, buddy, did you see anything?” Child social workers will know there haven’t been any milk or nappies in the home, and the only thing stored in the fridge has been beer for the past week. The IoT also opens up the world for “perfect enforcement”. When sensors are everywhere and all data is tracked and recorded, it becomes more likely that you will receive a moving violation for going 26 miles per hour in a 25-mile-per-hour zone and get a parking ticket for being 17 seconds over on your meter.

The former CIA director David Petraeus has noted that the IoT will be “transformational for clandestine tradecraft”. While the old model of corporate and government espionage might have involved hiding a bug under the table, tomorrow the very same information might be obtained by intercepting in real time the data sent from your Wi-Fi lightbulb to the lighting app on your smart phone. Thus the devices you thought were working for you may in fact be on somebody else’s payroll, particularly that of Crime, Inc.

A network of unintended consequences

For all the untold benefits of the IoT, its potential downsides are colossal. Adding 50bn new objects to the global information grid by 2020 means that each of these devices, for good or ill, will be able to potentially interact with the other 50bn connected objects on earth. The result will be 2.5 sextillion potential networked object-to-object interactions – a network so vast and complex it can scarcely be understood or modelled. The IoT will be a global network of unintended consequences and black swan events, ones that will do things nobody ever planned. In this world, it is impossible to know the consequences of connecting your home’s networked blender to the same information grid as an ambulance in Tokyo, a bridge in Sydney, or a Detroit auto manufacturer’s production line.
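
The arithmetic behind that figure is easy to verify. A quick back-of-the-envelope check in Python, using the 50bn device count quoted above and counting every ordered device-to-device pairing:

```python
# Back-of-the-envelope check of the interaction count quoted above.
devices = 50_000_000_000      # Cisco's projected 50bn connected things by 2020
pairings = devices * devices  # every device can potentially interact with every other
print(f"{pairings:.1e}")      # 2.5e+21, i.e. 2.5 sextillion potential interactions
```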

The vast levels of cyber crime we currently face make it abundantly clear we cannot even adequately protect the standard desktops and laptops we presently have online, let alone the hundreds of millions of mobile phones and tablets we are adding annually. In what vision of the future, then, is it conceivable that we will be able to protect the next 50bn things, from pets to pacemakers to self-driving cars? The obvious reality is that we cannot.

Our technological threat surface area is growing exponentially and we have no idea how to defend it effectively. The internet of things will become nothing more than the Internet of things to be hacked.

Read the entire article here.

Image courtesy of Google Search.

Big Data Knows What You Do and When

Data scientists are getting to know more about you and your fellow urban dwellers as you move around your neighborhood and your city. As smartphones and cell towers become more ubiquitous and data collection and analysis gathers pace, researchers (and advertisers) will come to know your daily habits and schedule rather intimately. So, questions from a significant other along the lines of, “and, where were you at 11:15 last night?” may soon be consigned to history.

From Technology Review:

Mobile phones have generated enormous insight into the human condition thanks largely to the study of the data they produce. Mobile phone companies record the time of each call, the caller and receiver IDs, as well as the locations of the cell towers involved, among other things.

The combined data from millions of people produces some fascinating new insights into the nature of our society.

Anthropologists have crunched it to reveal human reproductive strategies, a universal law of commuting, and even the distribution of wealth in Africa.

Today, computer scientists have gone one step further by using mobile phone data to map the structure of cities and how people use them throughout the day. “These results point towards the possibility of a new, quantitative classification of cities using high resolution spatio-temporal data,” say Thomas Louail at the Institut de Physique Théorique in Paris and a few pals.

They say their work is part of a new science of cities that aims to objectively measure and understand the nature of large population centers.

These guys begin with a database of mobile phone calls made by people in the 31 Spanish cities that have populations larger than 200,000. The data consists of the number of unique individuals using a given cell tower (whether making a call or not) for each hour of the day over almost two months.

Given the area that each tower covers, Louail and co work out the density of individuals in each location and how it varies throughout the day. And using this pattern, they search for “hotspots” in the cities where the density of individuals passes some specially chosen threshold at certain times of the day.
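
The article gives only this high-level description, but the flavor of the computation can be sketched in a few lines of Python. The column names, numbers and choice of threshold below are invented for illustration and are not taken from Louail's study:

```python
import pandas as pd

# Hypothetical input: one row per (tower, hour) with a unique-user count and tower coverage area.
calls = pd.DataFrame({
    "tower_id": ["A", "A", "B", "B"],
    "hour":     [9, 18, 9, 18],
    "users":    [1200, 300, 150, 900],
    "area_km2": [2.0, 2.0, 5.0, 5.0],
})

# Density of individuals per square kilometre at each tower, for each hour of the day.
calls["density"] = calls["users"] / calls["area_km2"]

# Flag a tower as a "hotspot" in a given hour if its density exceeds a chosen threshold;
# here the threshold is simply the citywide mean density for that hour.
threshold = calls.groupby("hour")["density"].transform("mean")
calls["hotspot"] = calls["density"] > threshold
print(calls[["tower_id", "hour", "density", "hotspot"]])
```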

The results reveal some fascinating patterns in city structure. For a start, every city undergoes a kind of respiration in which people converge into the center and then withdraw on a daily basis, almost like breathing. And this happens in all cities. This “suggests the existence of a single ‘urban rhythm’ common to all cities,” say Louail and co.

During the week, the number of phone users peaks at about midday and then again at about 6 p.m. During the weekend the numbers peak a little later: at 1 p.m. and 8 p.m. Interestingly, the second peak starts about an hour later in western cities, such as Sevilla and Cordoba.

The data also reveals that small cities tend to have a single center that becomes busy during the day, such as the cities of Salamanca and Vitoria.

But it also shows that the number of hotspots increases with city size; so-called polycentric cities include Spain’s largest, such as Madrid, Barcelona, and Bilbao.

That could turn out to be useful for automatically classifying cities.

Read the entire article here.

Business Decision-Making Welcomes Science

It is likely that business will never eliminate gut instinct from the decision-making process. However, as data, now big data, increasingly pervades every crevice of every organization, the use of data-driven decisions will become the norm. As this happens, more and more businesses find themselves employing data scientists to help filter, categorize, mine and analyze these mountains of data in meaningful ways.

The caveat, of course, is that data, big data and an even bigger reliance on that data requires subject matter expertise and analysts with critical thinking skills and sound judgement — data cannot be used blindly.

From Technology Review:

Throughout history, innovations in instrumentation—the microscope, the telescope, and the cyclotron—have repeatedly revolutionized science by improving scientists’ ability to measure the natural world. Now, with human behavior increasingly reliant on digital platforms like the Web and mobile apps, technology is effectively “instrumenting” the social world as well. The resulting deluge of data has revolutionary implications not only for social science but also for business decision making.

As enthusiasm for “big data” grows, skeptics warn that overreliance on data has pitfalls. Data may be biased and is almost always incomplete. It can lead decision makers to ignore information that is harder to obtain, or make them feel more certain than they should. The risk is that in managing what we have measured, we miss what really matters—as Vietnam-era Secretary of Defense Robert McNamara did in relying too much on his infamous body count, and as bankers did prior to the 2007–2009 financial crisis in relying too much on flawed quantitative models.

The skeptics are right that uncritical reliance on data alone can be problematic. But so is overreliance on intuition or ideology. For every Robert McNamara, there is a Ron Johnson, the CEO whose disastrous tenure as the head of JC Penney was characterized by his dismissing data and evidence in favor of instincts. For every flawed statistical model, there is a flawed ideology whose inflexibility leads to disastrous results.

So if data is unreliable and so is intuition, what is a responsible decision maker supposed to do? While there is no correct answer to this question—the world is too complicated for any one recipe to apply—I believe that leaders across a wide range of contexts could benefit from a scientific mind-set toward decision making.

A scientific mind-set takes as its inspiration the scientific method, which at its core is a recipe for learning about the world in a systematic, replicable way: start with some general question based on your experience; form a hypothesis that would resolve the puzzle and that also generates a testable prediction; gather data to test your prediction; and finally, evaluate your hypothesis relative to competing hypotheses.

The scientific method is largely responsible for the astonishing increase in our understanding of the natural world over the past few centuries. Yet it has been slow to enter the worlds of politics, business, policy, and marketing, where our prodigious intuition for human behavior can always generate explanations for why people do what they do or how to make them do something different. Because these explanations are so plausible, our natural tendency is to want to act on them without further ado. But if we have learned one thing from science, it is that the most plausible explanation is not necessarily correct. Adopting a scientific approach to decision making requires us to test our hypotheses with data.

While data is essential for scientific decision making, theory, intuition, and imagination remain important as well—to generate hypotheses in the first place, to devise creative tests of the hypotheses that we have, and to interpret the data that we collect. Data and theory, in other words, are the yin and yang of the scientific method—theory frames the right questions, while data answers the questions that have been asked. Emphasizing either at the expense of the other can lead to serious mistakes.

Also important is experimentation, which doesn’t mean “trying new things” or “being creative” but quite specifically the use of controlled experiments to tease out causal effects. In business, most of what we observe is correlation—we do X and Y happens—but often what we want to know is whether or not X caused Y. How many additional units of your new product did your advertising campaign cause consumers to buy? Will expanded health insurance coverage cause medical costs to increase or decline? Simply observing the outcome of a particular choice does not answer causal questions like these: we need to observe the difference between choices.
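
A toy sketch of the distinction, assuming we can randomly assign customers to see or not see an ad campaign; all numbers and the purchase model are invented:

```python
import random

random.seed(0)

# Randomly assign 10,000 customers to treatment (sees the ad) or control (does not).
customers = range(10_000)
treated = {c for c in customers if random.random() < 0.5}

def bought(saw_ad):
    # Invented behaviour: a 10% baseline purchase rate, nudged up 2 points by the ad.
    return random.random() < (0.12 if saw_ad else 0.10)

purchases_treated = [bought(True) for c in customers if c in treated]
purchases_control = [bought(False) for c in customers if c not in treated]

# Because assignment was random, the difference in purchase rates between the two
# groups estimates the causal effect of the ad, which simply observing buyers cannot.
lift = (sum(purchases_treated) / len(purchases_treated)
        - sum(purchases_control) / len(purchases_control))
print(f"estimated causal lift: {lift:.3f}")
```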

Replicating the conditions of a controlled experiment is often difficult or impossible in business or policy settings, but increasingly it is being done in “field experiments,” where treatments are randomly assigned to different individuals or communities. For example, MIT’s Poverty Action Lab has conducted over 400 field experiments to better understand aid delivery, while economists have used such experiments to measure the impact of online advertising.

Although field experiments are not an invention of the Internet era—randomized trials have been the gold standard of medical research for decades—digital technology has made them far easier to implement. Thus, as companies like Facebook, Google, Microsoft, and Amazon increasingly reap performance benefits from data science and experimentation, scientific decision making will become more pervasive.

Nevertheless, there are limits to how scientific decision makers can be. Unlike scientists, who have the luxury of withholding judgment until sufficient evidence has accumulated, policy makers or business leaders generally have to act in a state of partial ignorance. Strategic calls have to be made, policies implemented, reward or blame assigned. No matter how rigorously one tries to base one’s decisions on evidence, some guesswork will be required.

Exacerbating this problem is that many of the most consequential decisions offer only one opportunity to succeed. One cannot go to war with half of Iraq and not the other just to see which policy works out better. Likewise, one cannot reorganize the company in several different ways and then choose the best. The result is that we may never know which good plans failed and which bad plans worked.

Read the entire article here.

Image: Screenshot of Iris, Ayasdi’s data-visualization tool. Courtesy of Ayasdi / Wired.

Meta-Research: Discoveries From Research on Discoveries

Discoveries through scientific research don’t just happen in the lab. Many of course do. Some discoveries now come through data analysis of research papers. Here, sophisticated data mining tools and semantic software sift through hundreds of thousands of research papers looking for patterns and links that would otherwise escape the eye of human researchers.

From Technology Review:

Software that read tens of thousands of research papers and then predicted new discoveries about the workings of a protein that’s key to cancer could herald a faster approach to developing new drugs.

The software, developed in a collaboration between IBM and Baylor College of Medicine, was set loose on more than 60,000 research papers that focused on p53, a protein involved in cell growth, which is implicated in most cancers. By parsing sentences in the documents, the software could build an understanding of what is known about enzymes called kinases that act on p53 and regulate its behavior; these enzymes are common targets for cancer treatments. It then generated a list of other proteins mentioned in the literature that were probably undiscovered kinases, based on what it knew about those already identified. Most of its predictions tested so far have turned out to be correct.
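
The article does not spell out IBM's pipeline, but the general idea (score proteins not yet labelled as kinases by how closely the language around them resembles the language around known kinases) might be sketched roughly as follows. The corpus, protein names and labels are invented, and the real system is far more sophisticated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented miniature corpus: the sentences in which each protein is mentioned, concatenated.
protein_contexts = {
    "CHK2":   "phosphorylates p53 on serine residues kinase activity ATP binding",
    "ATM":    "kinase that phosphorylates p53 after DNA damage",
    "MDM2":   "ubiquitin ligase that degrades p53",
    "PROT-X": "binds ATP and phosphorylates serine sites on p53",   # unlabelled candidate
}
known_kinase = {"CHK2": 1, "ATM": 1, "MDM2": 0}   # labels for proteins already characterized

vec = TfidfVectorizer()
X_all = vec.fit_transform(protein_contexts.values())
names = list(protein_contexts.keys())

# Train on the proteins whose kinase status is already known.
train_idx = [i for i, n in enumerate(names) if n in known_kinase]
clf = LogisticRegression().fit(X_all[train_idx], [known_kinase[names[i]] for i in train_idx])

# Score the unlabelled protein: a high probability suggests it reads like the known kinases.
candidate_idx = names.index("PROT-X")
print("kinase-likeness of PROT-X:", clf.predict_proba(X_all[candidate_idx])[0, 1])
```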

“We have tested 10,” Olivier Lichtarge of Baylor said Tuesday. “Seven seem to be true kinases.” He presented preliminary results of his collaboration with IBM at a meeting on the topic of Cognitive Computing held at IBM’s Almaden research lab.

Lichtarge also described an earlier test of the software in which it was given access to research literature published prior to 2003 to see if it could predict p53 kinases that have been discovered since. The software found seven of the nine kinases discovered after 2003.

“P53 biology is central to all kinds of disease,” says Lichtarge, and so it seemed to be the perfect way to show that software-generated discoveries might speed up research that leads to new treatments. He believes the results so far show that to be true, although the kinase-hunting experiments are yet to be reviewed and published in a scientific journal, and more lab tests are still planned to confirm the findings so far. “Kinases are typically discovered at a rate of one per year,” says Lichtarge. “The rate of discovery can be vastly accelerated.”

Lichtarge said that although the software was configured to look only for kinases, it also seems capable of identifying previously unidentified phosphatases, which are enzymes that reverse the action of kinases. It can also identify other types of protein that may interact with p53.

The Baylor collaboration is intended to test a way of extending a set of tools that IBM researchers already offer to pharmaceutical companies. Under the banner of accelerated discovery, text-analyzing tools are used to mine publications, patents, and molecular databases. For example, a company in search of a new malaria drug might use IBM’s tools to find molecules with characteristics that are similar to existing treatments. Because software can search more widely, it might turn up molecules in overlooked publications or patents that no human would otherwise find.

“We started working with Baylor to adapt those capabilities, and extend it to show this process can be leveraged to discover new things about p53 biology,” says Ying Chen, a researcher at IBM Research Almaden.

It typically takes between $500 million and $1 billion to develop a new drug, and 90 percent of candidates that begin the journey don’t make it to market, says Chen. The cost of failed drugs is cited as one reason that some drugs command such high prices (see “A Tale of Two Drugs”).

Lawrence Hunter, director of the Center for Computational Pharmacology at the University of Colorado Denver, says that careful empirical confirmation is needed for claims that the software has made new discoveries. But he says that progress in this area is important, and that such tools are desperately needed.

The volume of research literature both old and new is now so large that even specialists can’t hope to read everything that might help them, says Hunter. Last year over one million new articles were added to the U.S. National Library of Medicine’s Medline database of biomedical research papers, which now contains 23 million items. Software can crunch through massive amounts of information and find vital clues in unexpected places. “Crucial bits of information are sometimes isolated facts that are only a minor point in an article but would be really important if you can find it,” he says.

Read the entire article here.

Big Bad Data; Growing Discrimination

You may be an anonymous data point online, but that does not mean you cannot still be a victim of personal discrimination. As the technology to gather and track your every move online steadily improves, so do the opportunities to misuse that information. Many of us are already unwitting participants in the growing internet filter bubble — a phenomenon that amplifies our personal tastes, opinions and shopping habits by pre-screening and delivering only more of the same based on our online footprints. Many argue that this is benign and even beneficial — after all, isn’t it wonderful when Google’s ad network pops up product recommendations for you on “random” websites based on your previous searches, or isn’t it that much more effective when news organizations deliver only stories based on your previous browsing history, interests, affiliations or demographics?

Not so. We are in ever-increasing danger of allowing others to control what we see and hear online. So kiss discovery and serendipity goodbye. More troubling still, beyond delivering personalized experiences online, corporations that gather more and more data from and about you can decide whether you are of value. While your data may be aggregated and anonymized, the results can still help a business decide whether to target you, regardless of whether you are explicitly identified by name.

So, perhaps your previous online shopping history divulged a proclivity for certain medications; well, kiss goodbye to that pre-existing health condition waiver. Or, perhaps the online groups that you belong to are rather left-of-center or way out in left-field; well, say hello to a smaller annual bonus from your conservative employer. Perhaps, the news or social groups that you subscribe to don’t align very well with the values of your landlord or prospective employer. Or, perhaps, Amazon will not allow you to shop online any more because the company knows your annual take-home pay and that you are a potential credit risk. You get the idea.

Without adequate safeguards and controls, those who gather the data about you will be in the driver’s seat. Put simply, it should be the other way around — you should own the data that describes who you are and what you do, and you should determine who gets to see it and how it’s used. Welcome to the age of Big (Bad) Data and the new age of data-driven discrimination.

From Technology Review:

Data analytics are being used to implement a subtle form of discrimination, while anonymous data sets can be mined to reveal health data and other private information, a Microsoft researcher warned this morning at MIT Technology Review’s EmTech conference.

Kate Crawford, principal researcher at Microsoft Research, argued that these problems could be addressed with new legal approaches to the use of personal data.

In a new paper, she and a colleague propose a system of “due process” that would give people more legal rights to understand how data analytics are used in determinations made against them, such as denial of health insurance or a job. “It’s the very start of a conversation about how to do this better,” Crawford, who is also a visiting professor at the MIT Center for Civic Media, said in an interview before the event. “People think ‘big data’ avoids the problem of discrimination, because you are dealing with big data sets, but in fact big data is being used for more and more precise forms of discrimination—a form of data redlining.”

During her talk this morning, Crawford added that with big data, “you will never know what those discriminations are, and I think that’s where the concern begins.”

Health data is particularly vulnerable, the researcher says. Search terms for disease symptoms, online purchases of medical supplies, and even the RFID tags on drug packaging can provide websites and retailers with information about a person’s health.

As Crawford and Jason Schultz, a professor at New York University Law School, wrote in their paper: “When these data sets are cross-referenced with traditional health information, as big data is designed to do, it is possible to generate a detailed picture about a person’s health, including information a person may never have disclosed to a health provider.”

And a recent Cambridge University study, which Crawford alluded to during her talk, found that “highly sensitive personal attributes”—including sexual orientation, personality traits, use of addictive substances, and even parental separation—are highly predictable by analyzing what people click on to indicate they “like” on Facebook. The study analyzed the “likes” of 58,000 Facebook users.
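
As a very rough illustration of how such predictions are made, here is a toy model that fits a logistic classifier to an invented binary like matrix; it bears no relation to the Cambridge study's actual data or methodology beyond the basic approach:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: 200 users by 50 pages, 1 if the user "liked" the page.
rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(200, 50))

# Invented target attribute, loosely driven by the first five like columns.
attribute = (likes[:, :5].sum(axis=1) + rng.normal(0, 1, 200) > 2.5).astype(int)

# Fit on 150 users, then check predictive accuracy on the 50 held-out users.
model = LogisticRegression(max_iter=1000).fit(likes[:150], attribute[:150])
print("held-out accuracy:", model.score(likes[150:], attribute[150:]))
```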

Similarly, purchasing histories, tweets, and demographic, location, and other information gathered about individual Web users, when combined with data from other sources, can result in new kinds of profiles that an employer or landlord might use to deny someone a job or an apartment.

In response to such risks, the paper’s authors propose a legal framework they call “big data due process.” Under this concept, a person who has been subject to some determination—whether denial of health insurance, rejection of a job or housing application, or an arrest—would have the right to learn how big data analytics were used.

This would entail the sorts of disclosure and cross-examination rights that are already enshrined in the legal systems of the United States and many other nations. “Before there can be greater social acceptance of big data’s role in decision-making, especially within government, it must also appear fair, and have an acceptable degree of predictability, transparency, and rationality,” the authors write.

Data analytics can also get things deeply wrong, Crawford notes. Even the formerly successful use of Google search terms to identify flu outbreaks failed last year, when actual cases fell far short of predictions. Increased flu-related media coverage and chatter about the flu in social media were mistaken for signs of people complaining they were sick, leading to the overestimates.  “This is where social media data can get complicated,” Crawford said.

Read the entire article here.

Big Data and Your Career

If you’re a professional or like networking, but shun Facebook, then chances are good that you hang out on LinkedIn. And, as you do, the company is trawling through your personal data and that of hundreds of millions of other members to turn human resources and career planning into a science — all with the help of big data.

From the Washington Post:

Every second, more than two people join LinkedIn’s network of 238 million members.

They are head hunters in search of talent. They are the talent in search of a job. And sometimes, the career site for the professional class is just a hangout for the well-connected worker.

LinkedIn, using complex, carefully concocted algorithms, analyzes their profiles and site behavior to steer them to opportunity. And corporations parse that data to set business strategy. As the network grows moment by moment, LinkedIn’s rich trove of information also grows more detailed and more comprehensive.

It’s big data meeting human resources. And that data, core to LinkedIn’s potential, could catapult the company beyond building careers and into the realms of education, urban development and economic policy.

Chief executive Jeff Weiner put it this way in a recent blog post: “Our ultimate dream is to develop the world’s first economic graph,” a sort of digital map of skills, workers and jobs across the global economy.

Ambitions, in other words, that are a far cry from the industry’s early stabs at modernizing the old-fashioned jobs board (think Monster.com and CareerBuilder).

So far, LinkedIn’s data-driven strategy appears to be working: It turned its highest-ever profit in the second quarter, $364 million, and its stock price has grown sixfold since its 2011 initial public offering. Because its workforce has doubled in a year, it’s fast outgrowing its Mountain View headquarters, just down the street from Google. In 2014, it’ll move into Yahoo’s neighborhood with a new campus in Sunnyvale.

The company makes money three ways: members who pay for premium access; ad sales; and its gold mine, a suite of products created by its talent solutions division and sold to corporate clients, which accounted for $205 million in revenue last quarter.

When LinkedIn staffers talk about their network and products, they often refer to an “ecosystem.” It’s an apt metaphor, because the value of their offerings would seem to rely heavily on equilibrium.

LinkedIn’s usefulness to recruiters is deeply contingent on the quality and depth of its membership base. And its usefulness to members depends on the quality of their experience on the site. LinkedIn’s success, then, depends largely on its ability to do more than just amass new members. The company must get its users to maintain comprehensive, up-to-date profiles, and it must give them a reason to visit the site frequently.

To engage members, the company has deployed new strategies on all fronts: a redesigned site; stuff to read from the likes of Bill Gates, Jack Welch and Richard Branson; new mobile applications; status updates; targeted aggregated news stories and more.

By throwing more and more at users, of course, LinkedIn risks undermining the very thing that’s made it the go-to site for recruiters: a mass of high-quality candidates, sorted and evaluated and offered up.

“I think there’s a chance of people getting tired of it and checking out of it,” said Chris Collins, director of Cornell University’s Center for Advanced Human Resource Studies.

Read the entire article here.

Image courtesy of Telegraph / LinkedIn.

Leadership and the Tyranny of Big Data

“There are three kinds of lies: lies, damned lies, and statistics”, goes the adage popularized by author Mark Twain.

Most people take for granted that numbers can be persuasive — just take a look at your bank balance. Also, most accept the notion that data can be used, misused, misinterpreted, re-interpreted and distorted to support or counter almost any argument. Just listen to a politician quote polling numbers and then hear an opposing politician make a contrary argument using the very same statistics. Or, better still, familiarize yourself with the pseudo-science of economics.

Authors Kenneth Cukier (data editor for The Economist) and Viktor Mayer-Schönberger (professor of Internet governance) examine this phenomenon in their book Big Data: A Revolution That Will Transform How We Live, Work, and Think. They eloquently present the example of Robert McNamara, U.S. defense secretary during the Vietnam war, who (in)famously used his detailed spreadsheets — including a daily body count — to manage and measure progress. Following the end of the war, many U.S. generals described this over-reliance on numbers as a misguided dictatorship that led many to make ill-informed decisions based solely on the numbers, and to fudge their figures.

This classic example leads them to a timely and important caution: as the range and scale of big data becomes ever greater, and while it may offer us great benefits, it can and will be used to mislead.

From Technology Review:

Big data is poised to transform society, from how we diagnose illness to how we educate children, even making it possible for a car to drive itself. Information is emerging as a new economic input, a vital resource. Companies, governments, and even individuals will be measuring and optimizing everything possible.

But there is a dark side. Big data erodes privacy. And when it is used to make predictions about what we are likely to do but haven’t yet done, it threatens freedom as well. Yet big data also exacerbates a very old problem: relying on the numbers when they are far more fallible than we think. Nothing underscores the consequences of data analysis gone awry more than the story of Robert McNamara.

McNamara was a numbers guy. Appointed the U.S. secretary of defense when tensions in Vietnam rose in the early 1960s, he insisted on getting data on everything he could. Only by applying statistical rigor, he believed, could decision makers understand a complex situation and make the right choices. The world in his view was a mass of unruly information that—if delineated, denoted, demarcated, and quantified—could be tamed by human hand and fall under human will. McNamara sought Truth, and that Truth could be found in data. Among the numbers that came back to him was the “body count.”

McNamara developed his love of numbers as a student at Harvard Business School and then as its youngest assistant professor at age 24. He applied this rigor during the Second World War as part of an elite Pentagon team called Statistical Control, which brought data-driven decision making to one of the world’s largest bureaucracies. Before this, the military was blind. It didn’t know, for instance, the type, quantity, or location of spare airplane parts. Data came to the rescue. Just making armament procurement more efficient saved $3.6 billion in 1943. Modern war demanded the efficient allocation of resources; the team’s work was a stunning success.

At war’s end, the members of this group offered their skills to corporate America. The Ford Motor Company was floundering, and a desperate Henry Ford II handed them the reins. Just as they knew nothing about the military when they helped win the war, so too were they clueless about making cars. Still, the so-called “Whiz Kids” turned the company around.

McNamara rose swiftly up the ranks, trotting out a data point for every situation. Harried factory managers produced the figures he demanded—whether they were correct or not. When an edict came down that all inventory from one car model must be used before a new model could begin production, exasperated line managers simply dumped excess parts into a nearby river. The joke at the factory was that a fellow could walk on water—atop rusted pieces of 1950 and 1951 cars.

McNamara epitomized the hyper-rational executive who relied on numbers rather than sentiments, and who could apply his quantitative skills to any industry he turned them to. In 1960 he was named president of Ford, a position he held for only a few weeks before being tapped to join President Kennedy’s cabinet as secretary of defense.

As the Vietnam conflict escalated and the United States sent more troops, it became clear that this was a war of wills, not of territory. America’s strategy was to pound the Viet Cong to the negotiation table. The way to measure progress, therefore, was by the number of enemy killed. The body count was published daily in the newspapers. To the war’s supporters it was proof of progress; to critics, evidence of its immorality. The body count was the data point that defined an era.

McNamara relied on the figures, fetishized them. With his perfectly combed-back hair and his flawlessly knotted tie, McNamara felt he could comprehend what was happening on the ground only by staring at a spreadsheet—at all those orderly rows and columns, calculations and charts, whose mastery seemed to bring him one standard deviation closer to God.

In 1977, two years after the last helicopter lifted off the rooftop of the U.S. embassy in Saigon, a retired Army general, Douglas Kinnard, published a landmark survey called The War Managers that revealed the quagmire of quantification. A mere 2 percent of America’s generals considered the body count a valid way to measure progress. “A fake—totally worthless,” wrote one general in his comments. “Often blatant lies,” wrote another. “They were grossly exaggerated by many units primarily because of the incredible interest shown by people like McNamara,” said a third.

Read the entire article after the jump.

Image: Robert McNamara at a cabinet meeting, 22 Nov 1967. Courtesy of Wikipedia / Public domain.

Big Data and Even Bigger Problems

First, a definition. Big data: typically a collection of large and complex datasets that are too cumbersome to process and analyze using traditional computational approaches and database applications. Usually the big data moniker will be accompanied by an IT vendor’s pitch for a shiny new software (and possibly hardware) solution able to crunch through petabytes (one petabyte is a million gigabytes) of data and produce a visualizable result that mere mortals can decipher.

Many companies see big data and related solutions as a panacea for a range of business challenges: customer service, medical diagnostics, product development, shipping and logistics, climate change studies, genomic analysis and so on. A great example was the last U.S. election. Many political wonks — from both sides of the aisle — agreed that President Obama was significantly aided in winning re-election by big data. So, with that in mind, many are now looking at more important big data problems.

From Technology Review:

As chief scientist for President Obama’s reëlection effort, Rayid Ghani helped revolutionize the use of data in politics. During the final 18 months of the campaign, he joined a sprawling team of data and software experts who sifted, collated, and combined dozens of pieces of information on each registered U.S. voter to discover patterns that let them target fund-raising appeals and ads.

Now, with Obama again ensconced in the Oval Office, some veterans of the campaign’s data squad are applying lessons from the campaign to tackle social issues such as education and environmental stewardship. Edgeflip, a startup Ghani founded in January with two other campaign members, plans to turn the ad hoc data analysis tools developed for Obama for America into software that can make nonprofits more effective at raising money and recruiting volunteers.

Ghani isn’t the only one thinking along these lines. In Chicago, Ghani’s hometown and the site of Obama for America headquarters, some campaign members are helping the city make available records of utility usage and crime statistics so developers can build apps that attempt to improve life there. It’s all part of a bigger idea to engineer social systems by scanning the numerical exhaust from mundane activities for patterns that might bear on everything from traffic snarls to human trafficking. Among those pursuing such humanitarian goals are startups like DataKind as well as large companies like IBM, which is redrawing bus routes in Ivory Coast (see “African Bus Routes Redrawn Using Cell-Phone Data”), and Google, with its flu-tracking software (see “Sick Searchers Help Track Flu”).

Ghani, who is 35, has had a longstanding interest in social causes, like tutoring disadvantaged kids. But he developed his data-mining savvy during 10 years as director of analytics at Accenture, helping retail chains forecast sales, creating models of consumer behavior, and writing papers with titles like “Data Mining for Business Applications.”

Before joining the Obama campaign in July 2011, Ghani wasn’t even sure his expertise in machine learning and predicting online prices could have an impact on a social cause. But the campaign’s success in applying such methods on the fly to sway voters is now recognized as having been potentially decisive in the election’s outcome (see “A More Perfect Union”).

“I realized two things,” says Ghani. “It’s doable at the massive scale of the campaign, and that means it’s doable in the context of other problems.”

At Obama for America, Ghani helped build statistical models that assessed each voter along five axes: support for the president; susceptibility to being persuaded to support the president; willingness to donate money; willingness to volunteer; and likelihood of casting a vote. These models allowed the campaign to target door knocks, phone calls, TV spots, and online ads to where they were most likely to benefit Obama.
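
The five scores described above lend themselves to simple triage rules for deciding which voters get which contact. A minimal sketch, with invented weights and thresholds rather than the campaign's actual models:

```python
from dataclasses import dataclass

@dataclass
class VoterScores:
    support: float         # probability the voter supports the candidate
    persuadability: float  # probability a contact will move the voter toward support
    donation: float        # probability the voter will donate if asked
    volunteer: float       # probability the voter will volunteer if asked
    turnout: float         # probability the voter casts a ballot at all

def next_action(v: VoterScores) -> str:
    # Invented triage rules: spend persuasion contacts on persuadable fence-sitters,
    # turnout contacts on supporters unlikely to vote, and asks on likely donors.
    if 0.35 < v.support < 0.65 and v.persuadability > 0.5:
        return "persuasion door knock"
    if v.support > 0.65 and v.turnout < 0.5:
        return "get-out-the-vote call"
    if v.support > 0.65 and v.donation > 0.6:
        return "fund-raising email"
    return "no contact"

print(next_action(VoterScores(0.7, 0.2, 0.8, 0.1, 0.9)))  # -> "fund-raising email"
```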

One of the most important ideas he developed, dubbed “targeted sharing,” now forms the basis of Edgeflip’s first product. It’s a Facebook app that prompts people to share information from a nonprofit, but only with those friends predicted to respond favorably. That’s a big change from the usual scattershot approach of posting pleas for money or help and hoping they’ll reach the right people.

Edgeflip’s app, like the one Ghani conceived for Obama, will ask people who share a post to provide access to their list of friends. This will pull in not only friends’ names but also personal details, like their age, that can feed models of who is most likely to help.

Say a hurricane strikes the southeastern United States and the Red Cross needs clean-up workers. The app would ask Facebook users to share the Red Cross message, but only with friends who live in the storm zone, are young and likely to do manual labor, and have previously shown interest in content shared by that user. But if the same person shared an appeal for donations instead, he or she would be prompted to pass it along to friends who are older, live farther away, and have donated money in the past.
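
A rough sketch of that targeted-sharing logic, with invented friend records and scoring heuristics (the real app's models are not public):

```python
# Hypothetical friend records pulled in when a user authorizes the app.
friends = [
    {"name": "Ana",  "age": 24, "state": "FL", "engages_with_user": 0.8, "past_donor": False},
    {"name": "Ben",  "age": 61, "state": "OR", "engages_with_user": 0.4, "past_donor": True},
    {"name": "Cara", "age": 30, "state": "GA", "engages_with_user": 0.9, "past_donor": False},
]

STORM_ZONE = {"FL", "GA", "SC"}

def volunteer_score(f):
    # Invented heuristic: young, nearby, engaged friends are the best clean-up prospects.
    return (f["state"] in STORM_ZONE) * (f["age"] < 40) * f["engages_with_user"]

def donation_score(f):
    # Invented heuristic: older friends with a giving history, wherever they live.
    return (f["age"] >= 40) * (2.0 if f["past_donor"] else 0.5)

def targets(friends, score, top_n=2):
    # Share the appeal only with the friends predicted most likely to respond.
    ranked = sorted(friends, key=score, reverse=True)
    return [f["name"] for f in ranked if score(f) > 0][:top_n]

print("share volunteer appeal with:", targets(friends, volunteer_score))  # ['Cara', 'Ana']
print("share donation appeal with:", targets(friends, donation_score))    # ['Ben']
```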

Michael Slaby, a senior technology official for Obama who hired Ghani for the 2012 election season, sees great promise in the targeted sharing technique. “It’s one of the most compelling innovations to come out of the campaign,” says Slaby. “It has the potential to make online activism much more efficient and effective.”

For instance, Ghani has been working with Fidel Vargas, CEO of the Hispanic Scholarship Fund, to increase that organization’s analytical savvy. Vargas thinks social data could predict which scholarship recipients are most likely to contribute to the fund after they graduate. “Then you’d be able to give away scholarships to qualified students who would have a higher probability of giving back,” he says. “Everyone would be much better off.”

Ghani sees a far bigger role for technology in the social sphere. He imagines online petitions that act like open-source software, getting passed around and improved. Social programs, too, could get constantly tested and improved. “I can imagine policies being designed a lot more collaboratively,” he says. “I don’t know if the politicians are ready to deal with it.” He also thinks there’s a huge amount of untapped information out there about childhood obesity, gang membership, and infant mortality, all ready for big data’s touch.

Read the entire article here.

Infographic courtesy of visua.ly. See the original here.

Big Data at the Personal Level

Stephen Wolfram, physicist, mathematician and complexity theorist, has taken big data ideas to an entirely new level — he’s quantifying himself and his relationships. He calls this discipline personal analytics.

While a record of every phone call and computer keystroke he has made may be rather useful to the FBI or to marketers, such personal data becomes truly valuable only when it is tracked for physiological and medical purposes. But then again, who wants their every move tracked 24 hours a day, even for medical science?

From ars technica:

Don’t be surprised if Stephen Wolfram, the renowned complexity theorist, software company CEO, and night owl, wants to schedule a work call with you at 9 p.m. In fact, after a decade of logging every phone call he makes, Wolfram knows the exact probability he’ll be on the phone with someone at that time: 39 percent.
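
The arithmetic behind a figure like that 39 percent is easy to sketch, assuming a hypothetical log of call start and end times (the format of Wolfram's actual log is not described):

```python
# Hypothetical sketch: estimate the probability of being on a call at a given
# hour of the day from a personal call log. The log format is an assumption.
from datetime import datetime

calls = [  # (start, end) pairs from an imagined log
    (datetime(2012, 5, 1, 20, 45), datetime(2012, 5, 1, 21, 30)),
    (datetime(2012, 5, 2, 9, 0),   datetime(2012, 5, 2, 9, 20)),
    (datetime(2012, 5, 2, 21, 5),  datetime(2012, 5, 2, 21, 40)),
]

def on_call_probability(calls, hour, days_observed):
    # Count the days on which at least one call overlaps the given hour.
    days_on_call = {s.date() for s, e in calls if s.hour <= hour <= e.hour}
    return len(days_on_call) / days_observed

print(on_call_probability(calls, hour=21, days_observed=2))  # 1.0 for this toy log
```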

Wolfram, a British-born physicist who earned a doctorate at age 20, is obsessed with data and the rules that explain it. He is the creator of the software Mathematica and of Wolfram Alpha, the nerdy “computational knowledge engine” that can tell you the distance to the moon right now, in units including light-seconds.

Now Wolfram wants to apply the same techniques to people’s personal data, an idea he calls “personal analytics.” He started with himself. In a blog post last year, Wolfram disclosed and analyzed a detailed record of his life stretching back three decades, including documents, hundreds of thousands of e-mails, and 10 years of computer keystrokes, a tally of which is e-mailed to him each morning so he can track his productivity the day before.

Last year, his company released its first consumer product in this vein, called Personal Analytics for Facebook. In under a minute, the software generates a detailed study of a person’s relationships and behavior on the site. My own report was revealing enough. It told me which friend lives at the highest latitude (Wicklow, Ireland) and the lowest (Brisbane, Australia), the percentage who are married (76.7 percent), and everyone’s local time. More of my friends are Scorpios than any other sign of the zodiac.
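
A toy sketch of the friend statistics such a report might compute, given a list of friend records; the field names are assumptions, not the product's actual data model:

```python
# Hypothetical sketch of the statistics a "dashboard for your life" might
# compute from a list of friend records with assumed fields.
from collections import Counter

friends = [
    {"name": "A", "latitude": 53.0,  "city": "Wicklow",  "married": True,  "zodiac": "Scorpio"},
    {"name": "B", "latitude": -27.5, "city": "Brisbane", "married": False, "zodiac": "Leo"},
    {"name": "C", "latitude": 40.7,  "city": "New York", "married": True,  "zodiac": "Scorpio"},
]

northernmost = max(friends, key=lambda f: f["latitude"])
southernmost = min(friends, key=lambda f: f["latitude"])
pct_married = 100 * sum(f["married"] for f in friends) / len(friends)
top_sign = Counter(f["zodiac"] for f in friends).most_common(1)[0][0]

print(northernmost["city"], southernmost["city"], round(pct_married, 1), top_sign)
```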

The resulting report looks just like a dashboard for your life, which Wolfram says is exactly the point. In a phone call that was recorded and whose start and stop time was entered into Wolfram’s life log, he discussed why personal analytics will make people more efficient at work and in their personal lives.

What do you typically record about yourself?

E-mails, documents, and normally, if I was in front of my computer, it would be recording keystrokes. I have a motion sensor for the room that records when I pace up and down. Also a pedometer, and I am trying to get an eye-tracking system set up, but I haven’t done that yet. Oh, and I’ve been wearing a sensor to measure my posture.

Do you think that you’re the most quantified person on the planet?

I couldn’t imagine that that was the case until maybe a year ago, when I collected together a bunch of this data and wrote a blog post on it. I was expecting that there would be people who would come forward and say, “Gosh, I’ve got way more than you.” But nobody’s come forward. I think by default that may mean I’m it, so to speak.

You coined this term “personal analytics.” What does it mean?

There’s organizational analytics, which is looking at an organization and trying to understand what the data says about its operation. Personal analytics is what you can figure out applying analytics to the person, to understand the operation of the person.

Read the entire article after the jump.

Image courtesy of Stephen Wolfram.

Google’s AI

The collective IQ of Google, the company, inched up a few notches in January of 2013 when it hired Ray Kurzweil. Over the coming years, if the work of Kurzweil and many of his colleagues pays off, the company’s intelligence may surge significantly. This time, though, it will be thanks to their work on artificial intelligence (AI), machine learning and (very) big data.

From Technology Review:

When Ray Kurzweil met with Google CEO Larry Page last July, he wasn’t looking for a job. A respected inventor who’s become a machine-intelligence futurist, Kurzweil wanted to discuss his upcoming book How to Create a Mind. He told Page, who had read an early draft, that he wanted to start a company to develop his ideas about how to build a truly intelligent computer: one that could understand language and then make inferences and decisions on its own.

It quickly became obvious that such an effort would require nothing less than Google-scale data and computing power. “I could try to give you some access to it,” Page told Kurzweil. “But it’s going to be very difficult to do that for an independent company.” So Page suggested that Kurzweil, who had never held a job anywhere but his own companies, join Google instead. It didn’t take Kurzweil long to make up his mind: in January he started working for Google as a director of engineering. “This is the culmination of literally 50 years of my focus on artificial intelligence,” he says.

Kurzweil was attracted not just by Google’s computing resources but also by the startling progress the company has made in a branch of AI called deep learning. Deep-learning software attempts to mimic the activity in layers of neurons in the neocortex, the wrinkly 80 percent of the brain where thinking occurs. The software learns, in a very real sense, to recognize patterns in digital representations of sounds, images, and other data.

The basic idea—that software can simulate the neocortex’s large array of neurons in an artificial “neural network”—is decades old, and it has led to as many disappointments as breakthroughs. But because of improvements in mathematical formulas and increasingly powerful computers, computer scientists can now model many more layers of virtual neurons than ever before.

With this greater depth, they are producing remarkable advances in speech and image recognition. Last June, a Google deep-learning system that had been shown 10 million images from YouTube videos proved almost twice as good as any previous image recognition effort at identifying objects such as cats. Google also used the technology to cut the error rate on speech recognition in its latest Android mobile software. In October, Microsoft chief research officer Rick Rashid wowed attendees at a lecture in China with a demonstration of speech software that transcribed his spoken words into English text with an error rate of 7 percent, translated them into Chinese-language text, and then simulated his own voice uttering them in Mandarin. That same month, a team of three graduate students and two professors won a contest held by Merck to identify molecules that could lead to new drugs. The group used deep learning to zero in on the molecules most likely to bind to their targets.

Google in particular has become a magnet for deep learning and related AI talent. In March the company bought a startup cofounded by Geoffrey Hinton, a University of Toronto computer science professor who was part of the team that won the Merck contest. Hinton, who will split his time between the university and Google, says he plans to “take ideas out of this field and apply them to real problems” such as image recognition, search, and natural-language understanding.

All this has normally cautious AI researchers hopeful that intelligent machines may finally escape the pages of science fiction. Indeed, machine intelligence is starting to transform everything from communications and computing to medicine, manufacturing, and transportation. The possibilities are apparent in IBM’s Jeopardy!-winning Watson computer, which uses some deep-learning techniques and is now being trained to help doctors make better decisions. Microsoft has deployed deep learning in its Windows Phone and Bing voice search.

Extending deep learning into applications beyond speech and image recognition will require more conceptual and software breakthroughs, not to mention many more advances in processing power. And we probably won’t see machines we all agree can think for themselves for years, perhaps decades—if ever. But for now, says Peter Lee, head of Microsoft Research USA, “deep learning has reignited some of the grand challenges in artificial intelligence.”

Building a Brain

There have been many competing approaches to those challenges. One has been to feed computers with information and rules about the world, which required programmers to laboriously write software that is familiar with the attributes of, say, an edge or a sound. That took lots of time and still left the systems unable to deal with ambiguous data; they were limited to narrow, controlled applications such as phone menu systems that ask you to make queries by saying specific words.

Neural networks, developed in the 1950s not long after the dawn of AI research, looked promising because they attempted to simulate the way the brain worked, though in greatly simplified form. A program maps out a set of virtual neurons and then assigns random numerical values, or “weights,” to connections between them. These weights determine how each simulated neuron responds—with a mathematical output between 0 and 1—to a digitized feature such as an edge or a shade of blue in an image, or a particular energy level at one frequency in a phoneme, the individual unit of sound in spoken syllables.

Programmers would train a neural network to detect an object or phoneme by blitzing the network with digitized versions of images containing those objects or sound waves containing those phonemes. If the network didn’t accurately recognize a particular pattern, an algorithm would adjust the weights. The eventual goal of this training was to get the network to consistently recognize the patterns in speech or sets of images that we humans know as, say, the phoneme “d” or the image of a dog. This is much the same way a child learns what a dog is by noticing the details of head shape, behavior, and the like in furry, barking animals that other people call dogs.
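
Here is a minimal sketch of that train-and-adjust loop, using a single simulated neuron and a toy task invented for illustration; real networks stack many such neurons in layers and use far more elaborate training schemes:

```python
# Minimal sketch of weight adjustment in a single simulated neuron.
# The task (classify points by whether x + y > 1) is a toy stand-in for
# recognizing an object or phoneme.
import math
import random

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(2)]
bias = 0.0
lr = 0.5

def predict(x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 / (1 + math.exp(-z))            # output between 0 and 1

for _ in range(2000):                         # "blitz" the neuron with examples
    x = [random.random(), random.random()]
    target = 1.0 if x[0] + x[1] > 1 else 0.0
    out = predict(x)
    err = target - out
    # If the output is wrong, nudge each weight toward a better answer.
    weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    bias += lr * err

print(predict([0.9, 0.8]), predict([0.1, 0.2]))  # high vs. low
```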

But early neural networks could simulate only a very limited number of neurons at once, so they could not recognize patterns of great complexity. They languished through the 1970s.

In the mid-1980s, Hinton and others helped spark a revival of interest in neural networks with so-called “deep” models that made better use of many layers of software neurons. But the technique still required heavy human involvement: programmers had to label data before feeding it to the network. And complex speech or image recognition required more computer power than was then available.

Finally, however, in the last decade ­Hinton and other researchers made some fundamental conceptual breakthroughs. In 2006, Hinton developed a more efficient way to teach individual layers of neurons. The first layer learns primitive features, like an edge in an image or the tiniest unit of speech sound. It does this by finding combinations of digitized pixels or sound waves that occur more often than they should by chance. Once that layer accurately recognizes those features, they’re fed to the next layer, which trains itself to recognize more complex features, like a corner or a combination of speech sounds. The process is repeated in successive layers until the system can reliably recognize phonemes or objects.
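
A rough sketch of the greedy, one-layer-at-a-time idea, using simple tied-weight autoencoders as a stand-in; Hinton's 2006 method used restricted Boltzmann machines, so treat this only as an illustration of the layering, not his algorithm:

```python
# Sketch of greedy layer-wise training with crude tied-weight autoencoders.
# Each layer learns to reconstruct its input; its outputs then feed the next.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 64))          # toy "images": 500 samples, 64 raw features

def train_autoencoder(data, hidden, epochs=50, lr=0.01):
    n, d = data.shape
    W = rng.normal(0, 0.1, (d, hidden))
    for _ in range(epochs):
        H = np.tanh(data @ W)              # encode: this layer's features
        R = H @ W.T                        # decode: reconstruct the input
        err = R - data
        # Crude gradient step on the shared (tied) weights.
        grad = data.T @ (err @ W * (1 - H**2)) + err.T @ H
        W -= lr * grad / n
    return W

layers = []
data = X
for hidden in (32, 16):                    # train one layer, then the next
    W = train_autoencoder(data, hidden)
    layers.append(W)
    data = np.tanh(data @ W)               # outputs become the next layer's input

print([W.shape for W in layers])           # [(64, 32), (32, 16)]
```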

Read the entire fascinating article following the jump.

Image courtesy of Wired.

Tracking and Monetizing Your Every Move

Your movements are valuable — but not in the way you may think. Mobile technology companies are moving rapidly to exploit the vast amount of data collected from the billions of mobile devices. This data is extremely valuable to an array of organizations, including urban planners, retailers, and travel and transportation marketers. And, of course, this raises significant privacy concerns. Many believe that when the data is used collectively it preserves user anonymity. However, if correlated with other data sources it could be used to discover a range of unintended and previously private information, relating both to individuals and to groups.

From MIT Technology Review:

Wireless operators have access to an unprecedented volume of information about users’ real-world activities, but for years these massive data troves were put to little use other than for internal planning and marketing.

This data is under lock and key no more. Under pressure to seek new revenue streams (see “AT&T Looks to Outside Developers for Innovation”), a growing number of mobile carriers are now carefully mining, packaging, and repurposing their subscriber data to create powerful statistics about how people are moving about in the real world.

More comprehensive than the data collected by any app, this is the kind of information that, experts believe, could help cities plan smarter road networks, businesses reach more potential customers, and health officials track diseases. But even if shared with the utmost of care to protect anonymity, it could also present new privacy risks for customers.

Verizon Wireless, the largest U.S. carrier with more than 98 million retail customers, shows how such a program could come together. In late 2011, the company changed its privacy policy so that it could share anonymous and aggregated subscriber data with outside parties. That made possible the launch of its Precision Market Insights division last October.

The program, still in its early days, is creating a natural extension of what already happens online, with websites tracking clicks and getting a detailed breakdown of where visitors come from and what they are interested in.

Similarly, Verizon is working to sell demographics about the people who, for example, attend an event, how they got there or the kinds of apps they use once they arrive. In a recent case study, says program spokeswoman Debra Lewis, Verizon showed that fans from Baltimore outnumbered fans from San Francisco by three to one inside the Super Bowl stadium. That information might have been expensive or difficult to obtain in other ways, such as through surveys, because not all the people in the stadium purchased their own tickets and had credit card information on file, nor had they all downloaded the Super Bowl’s app.

Other telecommunications companies are exploring similar ideas. In Europe, for example, Telefonica launched a comparable program last October, and the head of this new business unit gave the keynote address at a new industry conference on “big data monetization in telecoms” in January.

“It doesn’t look to me like it’s a big part of their [telcos’] business yet, though at the same time it could be,” says Vincent Blondel, an applied mathematician who is now working on a research challenge from the operator Orange to analyze two billion anonymous records of communications between five million customers in Africa.

The concerns about making such data available, Blondel says, are not that individual data points will leak out or contain compromising information but that they might be cross-referenced with other data sources to reveal unintended details about individuals or specific groups (see “How Access to Location Data Could Trample Your Privacy”).

Already, some startups are building businesses by aggregating this kind of data in useful ways, beyond what individual companies may offer. For example, AirSage, an Atlanta, Georgia, company founded in 2000, has spent much of the last decade negotiating what it says are exclusive rights to put its hardware inside the firewalls of two of the top three U.S. wireless carriers and collect, anonymize, encrypt, and analyze cellular tower signaling data in real time. Since AirSage solidified the second of these major partnerships about a year ago (it won’t specify which carriers it works with), it has been processing 15 billion locations a day and can account for movement of about a third of the U.S. population in some places to within less than 100 meters, says marketing vice president Andrea Moe.

As users’ mobile devices ping cellular towers in different locations, AirSage’s algorithms look for patterns in that location data—mostly to help transportation planners and traffic reports, so far. For example, the software might infer that the owners of devices that spend time in a business park from nine to five are likely at work, so a highway engineer might be able to estimate how much traffic on the local freeway exit is due to commuters.
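
A toy sketch of that kind of inference, assuming a simplified log of (device, hour, tower) observations; AirSage's actual algorithms are proprietary and certainly more sophisticated:

```python
# Hypothetical sketch: label a device's frequent daytime location as "work"
# and its frequent nighttime location as "home" from cell-tower pings.
from collections import Counter

pings = [  # (device_id, hour_of_day, tower_id), an invented log format
    ("dev1", 10, "tower_bizpark"), ("dev1", 14, "tower_bizpark"),
    ("dev1", 16, "tower_bizpark"), ("dev1", 22, "tower_suburb"),
    ("dev1", 23, "tower_suburb"),  ("dev1", 7,  "tower_suburb"),
]

def infer_places(pings, device):
    day = Counter(t for d, h, t in pings if d == device and 9 <= h <= 17)
    night = Counter(t for d, h, t in pings if d == device and (h >= 21 or h <= 6))
    return {"work": day.most_common(1)[0][0] if day else None,
            "home": night.most_common(1)[0][0] if night else None}

print(infer_places(pings, "dev1"))  # {'work': 'tower_bizpark', 'home': 'tower_suburb'}
```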

Other companies are starting to add layers of information beyond cellular network data. One customer of AirSage is a relatively small San Francisco startup, Streetlight Data, which recently raised $3 million in financing backed partly by the venture capital arm of Deutsche Telekom.

Streetlight buys both cellular network and GPS navigation data that can be mined for useful market research. (The cellular data covers a larger number of people, but the GPS data, collected by mapping software providers, can improve accuracy.) Today, many companies already build massive demographic and behavioral databases on top of U.S. Census information about households to help retailers choose where to build new stores and plan marketing budgets. But Streetlight’s software, with interactive, color-coded maps of neighborhoods and roads, offers more practical information. It can be tied to the demographics of people who work nearby, commute through on a particular highway, or are just there for a visit, rather than just supplying information about who lives in the area.

Read the entire article following the jump.

Image: mobile devices. Courtesy of W3.org

Big Data Versus Talking Heads

With the election in the United States now decided, the dissection of the result is well underway. And, perhaps the biggest winner of all is the science of big data. Yes, mathematical analysis of vast quantities of demographic and polling data won out over the voodoo proclamations and gut-feel predictions of the punditocracy. Now, that’s a result truly worth celebrating.

From ReadWriteWeb:

Political pundits, mostly Republican, went into a frenzy when Nate Silver, a New York Times pollster and stats blogger, predicted that Barack Obama would win reelection.

But Silver was right and the pundits were wrong – and the impact of this goes way beyond politics.

Silver won because, um, science. As ReadWrite’s own Dan Rowinski noted, Silver’s methodology is all based on data. He “takes deep data sets and applies logical analytical methods” to them. It’s all just numbers.

Silver runs a blog called FiveThirtyEight, which is licensed by the Times. In 2008 he called the presidential election with incredible accuracy, getting 49 out of 50 states right. But this year he rolled a perfect score, 50 out of 50, even nailing the margins in many cases. His uncanny accuracy on this year’s election represents what Rowinski calls a victory of “logic over punditry.”
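
The aggregation idea is easy to caricature in a few lines. This is not Silver's model, just a sketch of averaging a state's polls and simulating outcomes, with invented numbers and an assumed polling error:

```python
# Rough sketch of poll aggregation: average a state's polls, then simulate
# many elections to turn the average into a win probability. Poll numbers
# and the error figure are invented for illustration.
import random

random.seed(0)
polls = {"Ohio": [50.2, 51.0, 49.5], "Florida": [49.8, 50.4, 49.1]}

def win_probability(shares, poll_error=2.0, trials=10_000):
    avg = sum(shares) / len(shares)
    wins = sum(random.gauss(avg, poll_error) > 50 for _ in range(trials))
    return wins / trials

for state, p in polls.items():
    print(state, round(win_probability(p), 2))
```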

In fact it’s bigger than that. Bear in mind that before turning his attention to politics in 2007 and 2008, Silver was using computer models to make predictions about baseball. What does it mean when some punk kid baseball nerd can just wade into politics and start kicking butt on all these long-time “experts” who have spent their entire lives covering politics?

It means something big is happening.

Man Versus Machine

This is about the triumph of machines and software over gut instinct.

The age of voodoo is over. The era of talking about something as a “dark art” is done. In a world with big computers and big data, there are no dark arts.

And thank God for that. One by one, computers and the people who know how to use them are knocking off these crazy notions about gut instinct and intuition that humans like to cling to. For far too long we’ve applied this kind of fuzzy thinking to everything, from silly stuff like sports to important stuff like medicine.

Someday, and I hope it’s soon, we will enter the age of intelligent machines, when true artificial intelligence becomes a reality, and when we look back on the late 20th and early 21st century, it will seem medieval in its simplicity and reliance on superstition.

What most amazes me is the backlash and freak-out that occurs every time some “dark art” gets knocked over in a particular domain. Watch Moneyball (or read the book) and you’ll see the old guard (in that case, baseball scouts) grow furious as they realize that computers can do their job better than they can. (Of course it’s not computers; it’s people who know how to use computers.)

We saw the same thing when IBM’s Deep Blue defeated Garry Kasparov in 1997. We saw it when Watson beat humans at Jeopardy.

It’s happening in advertising, which used to be a dark art but is increasingly a computer-driven numbers game. It’s also happening in my business, the news media, prompting the same kind of furor as happened with the baseball scouts in Moneyball.

Read the entire article following the jump.

Image: political pundits, left to right: Mark Halperin, David Brooks, Jon Stewart, Tim Russert, Matt Drudge, John Harris & Jim VandeHei, Rush Limbaugh, Sean Hannity, Chris Matthews, Karl Rove. Courtesy of Telegraph.

What’s All the Fuss About Big Data?

We excerpt an interview with big data pioneer and computer scientist, Alex Pentland, via the Edge. Pentland is a leading thinker in computational social science and currently directs the Human Dynamics Laboratory at MIT.

While there is no exact definition of “big data”, it tends to be characterized as quantitatively and qualitatively different from the data most organizations commonly use. Where regular data can be stored, processed and analyzed using common database tools and analytical engines, big data refers to collections so vast that they lie beyond the realm of regular computation. As a result, big data often requires specialized storage and enormous processing capability. Data sets in this category arise in fields such as climate science, genomics, particle physics, and computational social science.

Big data holds true promise. However, while storage and processing power now enable quick and efficient crunching of tera- and even petabytes of data, tools for comprehensive analysis and visualization lag behind.

From Alex Pentland via the Edge:

Recently I seem to have become MIT’s Big Data guy, with people like Tim O’Reilly and “Forbes” calling me one of the seven most powerful data scientists in the world. I’m not sure what all of that means, but I have a distinctive view about Big Data, so maybe it is something that people want to hear.

I believe that the power of Big Data is that it is information about people’s behavior instead of information about their beliefs. It’s about the behavior of customers, employees, and prospects for your new business. It’s not about the things you post on Facebook, and it’s not about your searches on Google, which is what most people think about, and it’s not data from internal company processes and RFIDs. This sort of Big Data comes from things like location data off of your cell phone or credit card, it’s the little data breadcrumbs that you leave behind you as you move around in the world.

What those breadcrumbs tell is the story of your life. It tells what you’ve chosen to do. That’s very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you’re likely to get diabetes.

They can do this because the sort of person you are is largely determined by your social context, so if I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your crowd. You can tell all sorts of things about a person, even though it’s not explicitly in the data, because people are so enmeshed in the surrounding social fabric that it determines the sorts of things that they think are normal, and what behaviors they will learn from each other.
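
A toy sketch of that compare-you-to-your-crowd inference, using nearest neighbors over made-up behavioral features; it is only an illustration of the principle, not Pentland's methods:

```python
# Toy sketch: infer an unknown attribute of a person from the people whose
# behavioral "breadcrumbs" look most like theirs. Features are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns: late-night activity score, cafe visits per week, commute distance (km)
behavior = np.array([
    [0.9, 5, 2.0], [0.8, 4, 1.5], [0.1, 1, 25.0], [0.2, 0, 30.0],
])
labels = np.array([1, 1, 0, 0])     # an outcome observed for these people

model = KNeighborsClassifier(n_neighbors=2).fit(behavior, labels)
print(model.predict([[0.85, 6, 1.0]]))   # inferred from the most similar "crowd"
```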

As a consequence, analysis of Big Data is increasingly about finding connections, connections with the people around you, and connections between people’s behavior and outcomes. You can see this in all sorts of places. For instance, one type of Big Data and connection analysis concerns financial data. Not just the flash crash or the Great Recession, but also all the other sorts of bubbles that occur. What these are is systems of people, communications, and decisions that go badly awry. Big Data shows us the connections that cause these events. Big data gives us the possibility of understanding how these systems of people and machines work, and whether they’re stable.

The notion that it is the connections between people that really matter is key, because researchers have mostly been trying to understand things like financial bubbles using what is called Complexity Science or Web Science. But these older ways of thinking about Big Data leave the humans out of the equation. What actually matters is how the people are connected together by the machines and how, as a whole, they create a financial market, a government, a company, and other social structures.

Because it is so important to understand these connections, Asu Ozdaglar and I have recently created the MIT Center for Connection Science and Engineering, which spans all of the different MIT departments and schools. It’s one of the very first MIT-wide Centers, because people from all sorts of specialties are coming to understand that it is the connections between people that are actually the core problem in making transportation systems work well, in making energy grids work efficiently, and in making financial systems stable. Markets are not just about rules or algorithms; they’re about people and algorithms together.

Understanding these human-machine systems is what’s going to make our future social systems stable and safe. We are getting beyond complexity, data science and web science, because we are including people as a key part of these systems. That’s the promise of Big Data, to really understand the systems that make our technological society. As you begin to understand them, then you can build systems that are better. The promise is for financial systems that don’t melt down, governments that don’t get mired in inaction, health systems that actually work, and so on, and so forth.

The barriers to better societal systems are not about the size or speed of data. They’re not about most of the things that people are focusing on when they talk about Big Data. Instead, the challenge is to figure out how to analyze the connections in this deluge of data and come to a new way of building systems based on understanding these connections.

Changing The Way We Design Systems

With Big Data, traditional methods of system building are of limited use. The data is so big that any question you ask about it will usually have a statistically significant answer. This means, strangely, that the scientific method as we normally use it no longer works, because almost everything is significant! As a consequence, the normal laboratory-based question-and-answering process, the method that we have used to build systems for centuries, begins to fall apart.
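
A quick illustration of the point: with enough observations, even a practically meaningless difference clears a conventional significance test. The numbers below are made up purely to show the effect of sample size.

```python
# Illustration: with very large samples, a negligible effect is still
# "statistically significant" by a conventional test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5_000_000
a = rng.normal(0.000, 1.0, n)      # group A
b = rng.normal(0.005, 1.0, n)      # group B: a tiny, practically trivial shift

t, p = stats.ttest_ind(a, b)
print(f"effect ~ {b.mean() - a.mean():.4f}, p-value ~ {p:.2g}")
# The p-value is minuscule even though the effect is of no practical importance.
```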

Big data and the notion of Connection Science are outside our normal way of managing things. We live in an era that builds on centuries of science, and our methods of building systems, governments, organizations, and so on are pretty well defined. There are not a lot of things that are really novel. But with the coming of Big Data, we are going to be operating very much out of our old, familiar ballpark.

With Big Data you can easily get false correlations, for instance, “On Mondays, people who drive to work are more likely to get the flu.” If you look at the data using traditional methods, that may actually be true, but the problem is why is it true? Is it causal? Is it just an accident? You don’t know. Normal analysis methods won’t suffice to answer those questions. What we have to come up with is new ways to test the causality of connections in the real world far more than we have ever had to do before. We can no longer rely on laboratory experiments; we need to actually do the experiments in the real world.
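
A companion illustration of the false-correlation problem: scan enough unrelated variables against an outcome and some of them will look significant purely by chance. Again, the data here is random by construction.

```python
# Illustration: test 1,000 variables that are unrelated to the outcome and
# roughly 5 percent of them will still pass a p < 0.05 threshold by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
outcome = rng.normal(size=500)                 # e.g., flu cases per day
spurious = 0
for _ in range(1000):                          # 1,000 unrelated variables
    variable = rng.normal(size=500)
    r, p = stats.pearsonr(variable, outcome)
    spurious += p < 0.05
print(spurious, "of 1000 unrelated variables look 'significant' at p < 0.05")
# Expect roughly 50 false hits; correlation alone cannot establish causation.
```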

The other problem with Big Data is human understanding. When you find a connection that works, you’d like to be able to use it to build new systems, and that requires having human understanding of the connection. The managers and the owners have to understand what this new connection means. There needs to be a dialogue between our human intuition and the Big Data statistics, and that’s not something that’s built into most of our management systems today. Our managers have little concept of how to use big data analytics, what they mean, and what to believe.

In fact, the data scientists themselves don’t have much intuition either… and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn’t understand that they were overfitting the data. They didn’t have that dialogue between intuition and causal processes that generated the data. They just fit the model and got a good number and published it, and the reviewers didn’t catch it either. That’s pretty bad because if we start building our world on results like that, we’re going to end up with trains that crash into walls and other bad things. Management using Big Data is actually a radically new thing.
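
One simple guard against the overfitting Pentland describes is to score a model on data it never saw during fitting; a sketch, using deliberately random data so there is no real signal to find:

```python
# Sketch: a model that looks impressive on the data it was fit to can be
# no better than chance on held-out data. That gap is the overfitting trap.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 50))                  # 50 noise features
y = rng.integers(0, 2, 200)                # labels unrelated to the features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("fit-set accuracy:  ", model.score(X_tr, y_tr))   # close to 1.0
print("held-out accuracy: ", model.score(X_te, y_te))   # close to 0.5, no real signal
```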

Read the entire article after the jump.

Image courtesy of Techcrunch.