Tag Archives: data

MondayMap: Food Rhythms

rhythm-of-food-screenshot

OK, I admit it. Today’s article is not strictly about a map, but I couldn’t resist these fascinating data visualizations. The graphics show some of the patterns and trends that can be derived from the vast mountains of data gathered from Google searches. A group of designers and data scientists from Truth & Beauty teamed up with Google News Labs to produce a portfolio of charts that show food- and drink-related searches over the last 12 years.

The visual above shows a clear spike in cocktail-related searches in December (for entertaining during the holiday season). Interestingly, searches for a “Tom Collins” have increased since 2004, whereas those for “Martini” have decreased in number. A more recent phenomenon on the cocktail scene seems to be the “Moscow Mule”.

Since most of the searches emanated from the United States, the resulting charts show some fascinating changes in the nation’s collective nutritional mood. While some visualizations confirm the obvious (fruit searches peak when in season; pizza is popular year round), some specific insights are more curious:

  • Orange Jell-O [“jelly” for my British readers] is popular for US Thanksgiving.
  • Tamale searches peak around Christmas.
  • Pumpkin spice latte searches increase in the fall, but searches are peaking earlier each year.
  • Superfood searches are up; fat-free searches are down.
  • Nacho searches peak around Super Bowl Sunday.
  • Cauliflower may be the new kale.

You can check out much more from this gorgeous data visualization project at The Rhythm of Food.

Image: Screenshot from Rhythm of Food. Courtesy: Rhythm of Food.

Nuclear Codes and Floppy Disks

Floppy_disks

Sometimes a good case can be made for remaining a technological Luddite; sometimes eschewing the latest-and-greatest technical gizmo may actually work for you.

 

Take the case of the United States’ nuclear deterrent. A recent report on CBS 60 Minutes showed us how part of the computer system responsible for launch control of US intercontinental ballistic missiles (ICBMs) still uses antiquated 8-inch floppy disks. This part of the national defense is so old and arcane it’s actually more secure than most contemporary computing systems and communications infrastructure. So, next time your internet-connected, cloud-based tablet or laptop gets hacked, consider reverting to a pre-1980s device.

From Ars Technica:

In a report that aired on April 27, CBS 60 Minutes correspondent Lesley Stahl expressed surprise that part of the computer system responsible for controlling the launch of the Minuteman III intercontinental ballistic missiles relied on data loaded from 8-inch floppy disks. Most of the young officers stationed at the launch control center had never seen a floppy disk before they became “missileers.”

An Air Force officer showed Stahl one of the disks, marked “Top Secret,” which is used with the computer that handles what was once called the Strategic Air Command Digital Network (SACDIN), a communication system that delivers launch commands to US missile forces. Beyond the floppies, a majority of the systems in the Wyoming US Air Force launch control center (LCC) Stahl visited dated back to the 1960s and 1970s, offering the Air Force’s missile forces an added level of cyber security, ICBM forces commander Major General Jack Weinstein told 60 Minutes.

“A few years ago we did a complete analysis of our entire network,” Weinstein said. “Cyber engineers found out that the system is extremely safe and extremely secure in the way it’s developed.”

However, not all of the Minuteman launch control centers’ aging hardware is an advantage. The analog phone systems, for example, often make it difficult for the missileers to communicate with each other or with their base. The Air Force commissioned studies on updating the ground-based missile force last year, and it’s preparing to spend $19 million this year on updates to the launch control centers. The military has also requested $600 million next year for further improvements.

Read the entire article here.

Image: Various floppy disks. Courtesy: George Chernilevsky, 2009 / Wikipedia.

Business Decision-Making Welcomes Science

data-visualization-ayasdi

It is likely that business will never eliminate gut instinct from the decision-making process. However, as data, now big data, increasingly pervades every crevice of every organization, the use of data-driven decisions will become the norm. As this happens, more and more businesses find themselves employing data scientists to help filter, categorize, mine and analyze these mountains of data in meaningful ways.

The caveat, of course, is that data, big data and an even bigger reliance on that data all require subject matter expertise and analysts with critical thinking skills and sound judgement; data cannot be used blindly.

From Technology Review:

Throughout history, innovations in instrumentation—the microscope, the telescope, and the cyclotron—have repeatedly revolutionized science by improving scientists’ ability to measure the natural world. Now, with human behavior increasingly reliant on digital platforms like the Web and mobile apps, technology is effectively “instrumenting” the social world as well. The resulting deluge of data has revolutionary implications not only for social science but also for business decision making.

As enthusiasm for “big data” grows, skeptics warn that overreliance on data has pitfalls. Data may be biased and is almost always incomplete. It can lead decision makers to ignore information that is harder to obtain, or make them feel more certain than they should. The risk is that in managing what we have measured, we miss what really matters—as Vietnam-era Secretary of Defense Robert McNamara did in relying too much on his infamous body count, and as bankers did prior to the 2007–2009 financial crisis in relying too much on flawed quantitative models.

The skeptics are right that uncritical reliance on data alone can be problematic. But so is overreliance on intuition or ideology. For every Robert McNamara, there is a Ron Johnson, the CEO whose disastrous tenure as the head of JC Penney was characterized by his dismissing data and evidence in favor of instincts. For every flawed statistical model, there is a flawed ideology whose inflexibility leads to disastrous results.

So if data is unreliable and so is intuition, what is a responsible decision maker supposed to do? While there is no correct answer to this question—the world is too complicated for any one recipe to apply—I believe that leaders across a wide range of contexts could benefit from a scientific mind-set toward decision making.

A scientific mind-set takes as its inspiration the scientific method, which at its core is a recipe for learning about the world in a systematic, replicable way: start with some general question based on your experience; form a hypothesis that would resolve the puzzle and that also generates a testable prediction; gather data to test your prediction; and finally, evaluate your hypothesis relative to competing hypotheses.

The scientific method is largely responsible for the astonishing increase in our understanding of the natural world over the past few centuries. Yet it has been slow to enter the worlds of politics, business, policy, and marketing, where our prodigious intuition for human behavior can always generate explanations for why people do what they do or how to make them do something different. Because these explanations are so plausible, our natural tendency is to want to act on them without further ado. But if we have learned one thing from science, it is that the most plausible explanation is not necessarily correct. Adopting a scientific approach to decision making requires us to test our hypotheses with data.

While data is essential for scientific decision making, theory, intuition, and imagination remain important as well—to generate hypotheses in the first place, to devise creative tests of the hypotheses that we have, and to interpret the data that we collect. Data and theory, in other words, are the yin and yang of the scientific method—theory frames the right questions, while data answers the questions that have been asked. Emphasizing either at the expense of the other can lead to serious mistakes.

Also important is experimentation, which doesn’t mean “trying new things” or “being creative” but quite specifically the use of controlled experiments to tease out causal effects. In business, most of what we observe is correlation—we do X and Y happens—but often what we want to know is whether or not X caused Y. How many additional units of your new product did your advertising campaign cause consumers to buy? Will expanded health insurance coverage cause medical costs to increase or decline? Simply observing the outcome of a particular choice does not answer causal questions like these: we need to observe the difference between choices.

Replicating the conditions of a controlled experiment is often difficult or impossible in business or policy settings, but increasingly it is being done in “field experiments,” where treatments are randomly assigned to different individuals or communities. For example, MIT’s Poverty Action Lab has conducted over 400 field experiments to better understand aid delivery, while economists have used such experiments to measure the impact of online advertising.
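
To make the idea of a randomized field experiment concrete, here is a minimal sketch in Python. It is an illustration only, not taken from the article: the function names, the store identifiers, the baseline sales figures and the "true lift" are all assumptions invented for the example. It randomly splits units into treatment and control groups and then estimates the causal effect as a simple difference in mean outcomes.

```python
import random
from statistics import mean


def assign_treatment(unit_ids, seed=0):
    """Randomly split units into treatment and control groups of (nearly) equal size."""
    rng = random.Random(seed)          # seeded so the split is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def estimate_effect(outcomes, treated, control):
    """Difference in mean outcome between treated and control units."""
    treated_mean = mean(outcomes[u] for u in treated)
    control_mean = mean(outcomes[u] for u in control)
    return treated_mean - control_mean


if __name__ == "__main__":
    # Hypothetical data: 200 stores, half of which see an ad campaign.
    rng = random.Random(42)
    stores = [f"store-{i}" for i in range(200)]
    treated, control = assign_treatment(stores, seed=42)

    TRUE_LIFT = 3.0  # assumed effect of the campaign, for simulation only
    sales = {
        s: rng.gauss(100, 10) + (TRUE_LIFT if s in treated else 0.0)
        for s in stores
    }
    print(f"estimated lift: {estimate_effect(sales, treated, control):.2f}")
```

Because assignment is random, the two groups should differ on average only in the treatment, which is what lets the difference in means be read as a causal estimate rather than a correlation. A real analysis would also report uncertainty, for example a confidence interval.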

Although field experiments are not an invention of the Internet era—randomized trials have been the gold standard of medical research for decades—digital technology has made them far easier to implement. Thus, as companies like Facebook, Google, Microsoft, and Amazon increasingly reap performance benefits from data science and experimentation, scientific decision making will become more pervasive.

Nevertheless, there are limits to how scientific decision makers can be. Unlike scientists, who have the luxury of withholding judgment until sufficient evidence has accumulated, policy makers or business leaders generally have to act in a state of partial ignorance. Strategic calls have to be made, policies implemented, reward or blame assigned. No matter how rigorously one tries to base one’s decisions on evidence, some guesswork will be required.

Exacerbating this problem is that many of the most consequential decisions offer only one opportunity to succeed. One cannot go to war with half of Iraq and not the other just to see which policy works out better. Likewise, one cannot reorganize the company in several different ways and then choose the best. The result is that we may never know which good plans failed and which bad plans worked.

Read the entire article here.

Image: Screenshot of Iris, Ayasdi’s data-visualization tool. Courtesy of Ayasdi / Wired.

5 Billion Infractions per Day

New reports suggest that the NSA (National Security Agency) is collecting and analyzing over 5 billion records per day from mobile phones worldwide. That’s a vast amount of data covering lots of people — presumably over 99.9999 percent innocent people.

Yet, the nation yawns and continues to soak in the latest shenanigans on Duck Dynasty. One wonders if Uncle Si and his cohorts are being tracked as well. Probably.

From the Washington Post:

The National Security Agency is gathering nearly 5 billion records a day on the whereabouts of cellphones around the world, according to top-secret documents and interviews with U.S. intelligence officials, enabling the agency to track the movements of individuals — and map their relationships — in ways that would have been previously unimaginable.

The records feed a vast database that stores information about the locations of at least hundreds of millions of devices, according to the officials and the documents, which were provided by former NSA contractor Edward Snowden. New projects created to analyze that data have provided the intelligence community with what amounts to a mass surveillance tool.

The NSA does not target Americans’ location data by design, but the agency acquires a substantial amount of information on the whereabouts of domestic cellphones “incidentally,” a legal term that connotes a foreseeable but not deliberate result.

One senior collection manager, speaking on the condition of anonymity but with permission from the NSA, said “we are getting vast volumes” of location data from around the world by tapping into the cables that connect mobile networks globally and that serve U.S. cellphones as well as foreign ones. Additionally, data are often collected from the tens of millions of Americans who travel abroad with their cellphones every year.

In scale, scope and potential impact on privacy, the efforts to collect and analyze location data may be unsurpassed among the NSA surveillance programs that have been disclosed since June. Analysts can find cellphones anywhere in the world, retrace their movements and expose hidden relationships among the people using them.

U.S. officials said the programs that collect and analyze location data are lawful and intended strictly to develop intelligence about foreign targets.

Robert Litt, general counsel for the Office of the Director of National Intelligence, which oversees the NSA, said “there is no element of the intelligence community that under any authority is intentionally collecting bulk cellphone location information about cellphones in the United States.”

The NSA has no reason to suspect that the movements of the overwhelming majority of cellphone users would be relevant to national security. Rather, it collects locations in bulk because its most powerful analytic tools — known collectively as CO-TRAVELER — allow it to look for unknown associates of known intelligence targets by tracking people whose movements intersect.

Still, location data, especially when aggregated over time, are widely regarded among privacy advocates as uniquely sensitive. Sophisticated mathematical techniques enable NSA analysts to map cellphone owners’ relationships by correlating their patterns of movement over time with thousands or millions of other phone users who cross their paths. Cellphones broadcast their locations even when they are not being used to place a call or send a text message.
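
The article does not describe how the actual tools work, so the following Python sketch is only a toy illustration of the general idea of finding devices whose movements intersect. Everything in it is an assumption made for the example: the record layout (device, cell, hour), the function name and the sample data. It simply counts how often each pair of devices appears in the same cell during the same hour.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_location_counts(sightings):
    """Count, for every pair of devices, how many (cell, hour) slots they share.

    sightings: iterable of (device_id, cell_id, hour) tuples (hypothetical format).
    """
    devices_at = defaultdict(set)
    for device, cell, hour in sightings:
        devices_at[(cell, hour)].add(device)

    counts = Counter()
    for devices in devices_at.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(devices), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    sightings = [
        ("A", "c1", 9), ("B", "c1", 9), ("C", "c1", 9),
        ("A", "c2", 10), ("B", "c2", 10),
        ("A", "c3", 11), ("B", "c3", 11), ("C", "c9", 11),
    ]
    for pair, n in co_location_counts(sightings).most_common():
        print(pair, n)
```

Even this crude count shows why aggregated location data is considered sensitive: pairs that repeatedly share a place and time stand out quickly, with no need to read any message content.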

Read the entire article here.

Image: Duck Dynasty show promotional still. Courtesy of Wikipedia / A&E.

UnGoogleable: The Height of Cool

So, it is no longer a surprise: our digital lives are tracked, correlated, stored and examined. The NSA (National Security Agency) does it to determine if you are an unsavory type; Google does it to serve you better information and ads; and a whole host of other companies do it to sell you more things that you probably don’t need and for a price that you can’t afford. This of course raises deep and troubling questions about privacy. With this in mind, some are taking ownership of the issue and seeking to erase themselves from the vast digital Orwellian eye. However, to some, being untraceable online is a fashion statement rather than a victory for privacy.

From the Guardian:

“The chicest thing,” said fashion designer Phoebe Philo recently, “is when you don’t exist on Google. God, I would love to be that person!”

Philo, creative director of Céline, is not that person. As the London Evening Standard put it: “Unfortunately for the famously publicity-shy London designer – Paris born, Harrow-on-the-Hill raised – who has reinvented the way modern women dress, privacy may well continue to be a luxury.” Nobody who is oxymoronically described as “famously publicity-shy” will ever be unGoogleable. And if you’re not unGoogleable then, if Philo is right, you can never be truly chic, even if you were born in Paris. And if you’re not truly chic, then you might as well die – at least if you’re in fashion.

If she truly wanted to disappear herself from Google, Philo could start by changing her superb name to something less diverting. Prize-winning novelist AM Homes is an outlier in this respect. Google “am homes” and you’re in a world of blah US real estate rather than cutting-edge literature. But then Homes has thought a lot about privacy, having written a play about the most famously private person in recent history, JD Salinger, and had him threaten to sue her as a result.

And Homes isn’t the only one to make herself difficult to detect online. UnGoogleable bands are 10 a penny. The New York-based band !!! (known verbally as “chick chick chick” or “bang bang bang” – apparently “Exclamation point, exclamation point, exclamation point” proved too verbose for their meagre fanbase) must drive their business manager nuts. As must the band Merchandise, whose name – one might think – is a nominalist satire of commodification by the music industry. Nice work, Brad, Con, John and Rick.

 

If Philo renamed herself online as Google Maps or @, she might make herself more chic.

Welcome to anonymity chic – the antidote to an online world of exhibitionism. But let’s not go crazy: anonymity may be chic, but it is no business model. For years XXX Porn Site, my confusingly named alt-folk combo, has remained undiscovered. There are several bands called Girls (at least one of them including, confusingly, dudes) and each one has worried – after a period of chic iconoclasm – that such a putatively cool name means no one can find them online.

But still, maybe we should all embrace anonymity, given this week’s revelations that technology giants cooperated in Prism, a top-secret system at the US National Security Agency that collects emails, documents, photos and other material for secret service agents to review. It has also been a week in which Lindsay Mills, girlfriend of NSA whistleblower Edward Snowden, has posted on her blog (entitled: “Adventures of a world-traveling, pole-dancing super hero” with many photos showing her performing with the Waikiki Acrobatic Troupe) her misery that her fugitive boyfriend has fled to Hong Kong. Only a cynic would suggest that this blog post might help the Waikiki Acrobatic Troupe veteran’s career at this – serious face – difficult time. Better the dignity of silent anonymity than using the internet for that.

Furthermore, as social media diminishes us with not just information overload but the 24/7 servitude of liking, friending and status updating, this going under the radar reminds us that we might benefit from withdrawing the labour on which the founders of Facebook, Twitter and Instagram have built their billions. “Today our intense cultivation of a singular self is tied up in the drive to constantly produce and update,” argues Geert Lovink, research professor of interactive media at the Hogeschool van Amsterdam and author of Networks Without a Cause: A Critique of Social Media. “You have to tweet, be on Facebook, answer emails,” says Lovink. “So the time pressure on people to remain present and keep up their presence is a very heavy load that leads to what some call the psychopathology of online.”

Internet evangelists such as Clay Shirky and Charles Leadbeater hoped for something very different from this pathologised reality. In Shirky’s Here Comes Everybody and Leadbeater’s We-Think, both published in 2008, the nascent social media were to echo the anti-authoritarian, democratising tendencies of the 60s counterculture. Both men revelled in the fact that new web-based social tools helped single mothers looking online for social networks and pro-democracy campaigners in Belarus. Neither sufficiently realised that these tools could just as readily be co-opted by The Man. Or, if you prefer, Mark Zuckerberg.

Not that Zuckerberg is the devil in this story. Social media have changed the way we interact with other people in line with what the sociologist Zygmunt Bauman wrote in Liquid Love. For us “liquid moderns”, who have lost faith in the future, cannot commit to relationships and have few kinship ties, Zuckerberg created a new way of belonging, one in which we use our wits to create provisional bonds loose enough to stop suffocation, but tight enough to give a needed sense of security now that the traditional sources of solace (family, career, loving relationships) are less reliable than ever.

Read the entire article here.

Big Data and Even Bigger Problems

First, a definition. Big data: typically a collection of large and complex datasets that are too cumbersome to process and analyze using traditional computational approaches and database applications. Usually the big data moniker will be accompanied by an IT vendor’s pitch for a shiny new software (and possibly hardware) solution able to crunch through petabytes (one petabyte is a million gigabytes) of data and produce a visualizable result that mere mortals can decipher.
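
As a quick check on the petabyte figure, here is the arithmetic in Python. The constant names are my own; the decimal (SI) values are standard, and the binary variants are included only for comparison.

```python
# Decimal (SI) units: powers of ten.
GIGABYTE = 10**9
PETABYTE = 10**15
GB_PER_PB = PETABYTE // GIGABYTE      # one million

# Binary units (gibibyte, pebibyte): powers of two.
GIBIBYTE = 2**30
PEBIBYTE = 2**50
GIB_PER_PIB = PEBIBYTE // GIBIBYTE    # 2**20

print(f"{GB_PER_PB:,} GB per PB")
print(f"{GIB_PER_PIB:,} GiB per PiB")
```

Either way, a petabyte is roughly a million gigabytes; the binary definition just makes it about 5 percent larger.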

Many companies see big data and related solutions as a panacea for a range of business challenges: customer service, medical diagnostics, product development, shipping and logistics, climate change studies, genomic analysis and so on. A great example was the last U.S. election. Many political wonks, from both sides of the aisle, agreed that President Obama’s re-election was significantly aided by big data. So, with that in mind, many are now looking at more important big data problems.

From Technology Review:

As chief scientist for President Obama’s reëlection effort, Rayid Ghani helped revolutionize the use of data in politics. During the final 18 months of the campaign, he joined a sprawling team of data and software experts who sifted, collated, and combined dozens of pieces of information on each registered U.S. voter to discover patterns that let them target fund-raising appeals and ads.

Now, with Obama again ensconced in the Oval Office, some veterans of the campaign’s data squad are applying lessons from the campaign to tackle social issues such as education and environmental stewardship. Edgeflip, a startup Ghani founded in January with two other campaign members, plans to turn the ad hoc data analysis tools developed for Obama for America into software that can make nonprofits more effective at raising money and recruiting volunteers.

Ghani isn’t the only one thinking along these lines. In Chicago, Ghani’s hometown and the site of Obama for America headquarters, some campaign members are helping the city make available records of utility usage and crime statistics so developers can build apps that attempt to improve life there. It’s all part of a bigger idea to engineer social systems by scanning the numerical exhaust from mundane activities for patterns that might bear on everything from traffic snarls to human trafficking. Among those pursuing such humanitarian goals are startups like DataKind as well as large companies like IBM, which is redrawing bus routes in Ivory Coast (see “African Bus Routes Redrawn Using Cell-Phone Data”), and Google, with its flu-tracking software (see “Sick Searchers Help Track Flu”).

Ghani, who is 35, has had a longstanding interest in social causes, like tutoring disadvantaged kids. But he developed his data-mining savvy during 10 years as director of analytics at Accenture, helping retail chains forecast sales, creating models of consumer behavior, and writing papers with titles like “Data Mining for Business Applications.”

Before joining the Obama campaign in July 2011, Ghani wasn’t even sure his expertise in machine learning and predicting online prices could have an impact on a social cause. But the campaign’s success in applying such methods on the fly to sway voters is now recognized as having been potentially decisive in the election’s outcome (see “A More Perfect Union”).

“I realized two things,” says Ghani. “It’s doable at the massive scale of the campaign, and that means it’s doable in the context of other problems.”

At Obama for America, Ghani helped build statistical models that assessed each voter along five axes: support for the president; susceptibility to being persuaded to support the president; willingness to donate money; willingness to volunteer; and likelihood of casting a vote. These models allowed the campaign to target door knocks, phone calls, TV spots, and online ads to where they were most likely to benefit Obama.

One of the most important ideas he developed, dubbed “targeted sharing,” now forms the basis of Edgeflip’s first product. It’s a Facebook app that prompts people to share information from a nonprofit, but only with those friends predicted to respond favorably. That’s a big change from the usual scattershot approach of posting pleas for money or help and hoping they’ll reach the right people.

Edgeflip’s app, like the one Ghani conceived for Obama, will ask people who share a post to provide access to their list of friends. This will pull in not only friends’ names but also personal details, like their age, that can feed models of who is most likely to help.

Say a hurricane strikes the southeastern United States and the Red Cross needs clean-up workers. The app would ask Facebook users to share the Red Cross message, but only with friends who live in the storm zone, are young and likely to do manual labor, and have previously shown interest in content shared by that user. But if the same person shared an appeal for donations instead, he or she would be prompted to pass it along to friends who are older, live farther away, and have donated money in the past.
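
The hurricane example can be sketched as a simple filter in Python. This is not Edgeflip’s implementation: the article says the real product relies on predictive models, whereas this stand-in uses hand-written rules. The `Friend` fields, the age cutoff and the function name are all hypothetical, chosen only to mirror the criteria the article lists.

```python
from dataclasses import dataclass


@dataclass
class Friend:
    name: str
    age: int
    in_storm_zone: bool
    engaged_before: bool   # has shown interest in this user's past shares
    donated_before: bool   # has given money in the past


YOUNG_MAX_AGE = 35  # assumed cutoff, for illustration only


def pick_recipients(friends, appeal):
    """Return names of friends a given appeal should be shared with."""
    if appeal == "volunteer":
        # Clean-up help: nearby, young, and already responsive to this user.
        return [f.name for f in friends
                if f.in_storm_zone and f.age <= YOUNG_MAX_AGE and f.engaged_before]
    if appeal == "donate":
        # Donations: older, farther away, with a history of giving.
        return [f.name for f in friends
                if not f.in_storm_zone and f.age > YOUNG_MAX_AGE and f.donated_before]
    raise ValueError(f"unknown appeal type: {appeal!r}")


if __name__ == "__main__":
    friends = [
        Friend("Ana", 24, True, True, False),
        Friend("Ben", 58, False, False, True),
        Friend("Cy", 30, True, False, False),
        Friend("Dee", 47, False, True, True),
    ]
    print(pick_recipients(friends, "volunteer"))
    print(pick_recipients(friends, "donate"))
```

A model-based version would replace the fixed rules with a predicted probability of responding, but the shape of the decision (score each friend, share only with the likely responders) stays the same.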

Michael Slaby, a senior technology official for Obama who hired Ghani for the 2012 election season, sees great promise in the targeted sharing technique. “It’s one of the most compelling innovations to come out of the campaign,” says Slaby. “It has the potential to make online activism much more efficient and effective.”

For instance, Ghani has been working with Fidel Vargas, CEO of the Hispanic Scholarship Fund, to increase that organization’s analytical savvy. Vargas thinks social data could predict which scholarship recipients are most likely to contribute to the fund after they graduate. “Then you’d be able to give away scholarships to qualified students who would have a higher probability of giving back,” he says. “Everyone would be much better off.”

Ghani sees a far bigger role for technology in the social sphere. He imagines online petitions that act like open-source software, getting passed around and improved. Social programs, too, could get constantly tested and improved. “I can imagine policies being designed a lot more collaboratively,” he says. “I don’t know if the politicians are ready to deal with it.” He also thinks there’s a huge amount of untapped information out there about childhood obesity, gang membership, and infant mortality, all ready for big data’s touch.

Read the entire article here.

Infographic courtesy of visual.ly. See the original here.

You Are a Google Datapoint

At first glance, Google’s aim to make all known information accessible and searchable seems to be a fundamentally worthy goal, and in keeping with its “Don’t Be Evil” mantra. Surely, giving all people access to the combined knowledge of the human race can do nothing but good, intellectually, politically and culturally.

However, what if that information includes you? After all, you are information: from the sequence of bases in your DNA, to the food you eat and the products you purchase, to your location and your planned vacations, your circle of friends and colleagues at work, to what you say and write and hear and see. You are a collection of datapoints, and if you don’t market and monetize them, someone else will.

Google continues to extend its technology boundaries and its vast indexed database of information. Now, with the introduction of Google Glass, the company extends its domain to a much more intimate level. Glass gives Google access to data on your precise location; it can record what you say and the sounds around you; it can capture what you are looking at and make it instantly shareable over the internet. Not surprisingly, this raises numerous concerns over privacy and security, and not only for the wearer of Google Glass. While active opt-in / opt-out features would allow a user a fair degree of control over what data is collected and how it is shared with Google, they do nothing for those being observed.

So, beware: the next time you are sitting in a Starbucks or shopping in a mall or riding the subway, you may be being recorded and your digital essence distributed over the internet. Perhaps someone somewhere will even be making money from you. While the Orwellian dystopia of government surveillance and control may still be a nightmarish fiction, corporate snooping and monetization is no less troubling. Remember, to some, you are merely a datapoint (care of Google), a publication (via Facebook), and a product (courtesy of Twitter).

From the Telegraph:

In the online world – for now, at least – it’s the advertisers that make the world go round. If you’re Google, they represent more than 90% of your revenue and without them you would cease to exist.

So how do you reconcile the fact that there is a finite amount of data to be gathered online with the need to expand your data collection to keep ahead of your competitors?

There are two main routes. Firstly, try as hard as is legally possible to monopolise the data streams you already have, and hope regulators fine you less than the profit it generated. Secondly, you need to get up from behind the computer and hit the streets.

Google Glass is the first major salvo in an arms race that is going to see increasingly intrusive efforts made to join up our real lives with the digital businesses we have become accustomed to handing over huge amounts of personal data to.

The principles that underpin everyday consumer interactions – choice, informed consent, control – are at risk in a way that cannot be healthy. Our ability to walk away from a service depends on having a choice in the first place and knowing what data is collected and how it is used before we sign up.

Imagine if Google or Facebook decided to install their own CCTV cameras everywhere, gathering data about our movements, recording our lives and joining up every camera in the land in one giant control room. It’s Orwellian surveillance with fluffier branding. And this isn’t just video surveillance – Glass uses audio recording too. For added impact, if you’re not content with Google analysing the data, the person can share it to social media as they see fit too.

Yet that is the reality of Google Glass. Everything you see, Google sees. You don’t own the data, you don’t control the data and you definitely don’t know what happens to the data. Put another way – what would you say if instead of it being Google Glass, it was Government Glass? A revolutionary way of improving public services, some may say. Call me a cynic, but I don’t think it’d have much success.

More importantly, who gave you permission to collect data on the person sitting opposite you on the Tube? How about collecting information on your children’s friends? There is a gaping hole in the middle of the Google Glass world and it is one where privacy is not only seen as an annoying restriction on Google’s profit, but as something that simply does not even come into the equation. Google has empowered you to ignore the privacy of other people. Bravo.

It’s already led to reactions in the US. ‘Stop the Cyborgs’ might sound like the rallying cry of the next Terminator film, but this is the start of a campaign to ensure places of work, cafes, bars and public spaces are no-go areas for Google Glass. They’ve already produced stickers to put up informing people that they should take off their Glass.

They argue, rightly, that this is more than just a question of privacy. There’s a real issue about how much decision making is devolved to the display we see, in exactly the same way as the difference between appearing on page one or page two of Google’s search can spell the difference between commercial success and failure for small businesses. We trust what we see, it’s convenient and we don’t question the motives of a search engine in providing us with information.

The reality is very different. In abandoning critical thought and decision making, allowing ourselves to be guided by a melee of search results, social media and advertisements we do risk losing a part of what it is to be human. You can see the marketing already – Glass is all-knowing. The issue is that to be all-knowing, it needs you to help it be all-seeing.

Read the entire article after the jump.

Image: Google’s Sergey Brin wearing Google Glass. Courtesy of CBS News.

You as a Data Strip Mine: What Facebook Knows

China, India, Facebook. With its 900 million member-citizens, Facebook is the third largest country on the planet, ranked by population. This country has some benefits: no taxes, freedom to join and/or leave, and of course there’s freedom to assemble and a fair degree of free speech.

However, Facebook is no democracy. In fact, its data privacy policies and personal data mining might well put it in the same league as the Stalinist Soviet Union or Cold War East Germany.

A fascinating article by Tom Simonite excerpted below sheds light on the data collection and data mining initiatives underway or planned at Facebook.

From Technology Review:

If Facebook were a country, a conceit that founder Mark Zuckerberg has entertained in public, its 900 million members would make it the third largest in the world.

It would far outstrip any regime past or present in how intimately it records the lives of its citizens. Private conversations, family photos, and records of road trips, births, marriages, and deaths all stream into the company’s servers and lodge there. Facebook has collected the most extensive data set ever assembled on human social behavior. Some of your personal information is probably part of it.

And yet, even as Facebook has embedded itself into modern life, it hasn’t actually done that much with what it knows about us. Now that the company has gone public, the pressure to develop new sources of profit (see “The Facebook Fallacy”) is likely to force it to do more with its hoard of information. That stash of data looms like an oversize shadow over what today is a modest online advertising business, worrying privacy-conscious Web users (see “Few Privacy Regulations Inhibit Facebook”) and rivals such as Google. Everyone has a feeling that this unprecedented resource will yield something big, but nobody knows quite what.

Heading Facebook’s effort to figure out what can be learned from all our data is Cameron Marlow, a tall 35-year-old who until recently sat a few feet away from Zuckerberg. The group Marlow runs has escaped the public attention that dogs Facebook’s founders and the more headline-grabbing features of its business. Known internally as the Data Science Team, it is a kind of Bell Labs for the social-networking age. The group has 12 researchers but is expected to double in size this year. They apply math, programming skills, and social science to mine our data for insights that they hope will advance Facebook’s business and social science at large. Whereas other analysts at the company focus on information related to specific online activities, Marlow’s team can swim in practically the entire ocean of personal data that Facebook maintains. Of all the people at Facebook, perhaps even including the company’s leaders, these researchers have the best chance of discovering what can really be learned when so much personal information is compiled in one place.

Facebook has all this information because it has found ingenious ways to collect data as people socialize. Users fill out profiles with their age, gender, and e-mail address; some people also give additional details, such as their relationship status and mobile-phone number. A redesign last fall introduced profile pages in the form of time lines that invite people to add historical information such as places they have lived and worked. Messages and photos shared on the site are often tagged with a precise location, and in the last two years Facebook has begun to track activity elsewhere on the Internet, using an addictive invention called the “Like” button. It appears on apps and websites outside Facebook and allows people to indicate with a click that they are interested in a brand, product, or piece of digital content. Since last fall, Facebook has also been able to collect data on users’ online lives beyond its borders automatically: in certain apps or websites, when users listen to a song or read a news article, the information is passed along to Facebook, even if no one clicks “Like.” Within the feature’s first five months, Facebook catalogued more than five billion instances of people listening to songs online. Combine that kind of information with a map of the social connections Facebook’s users make on the site, and you have an incredibly rich record of their lives and interactions.

“This is the first time the world has seen this scale and quality of data about human communication,” Marlow says with a characteristically serious gaze before breaking into a smile at the thought of what he can do with the data. For one thing, Marlow is confident that exploring this resource will revolutionize the scientific understanding of why people behave as they do. His team can also help Facebook influence our social behavior for its own benefit and that of its advertisers. This work may even help Facebook invent entirely new ways to make money.

Contagious Information

Marlow eschews the collegiate programmer style of Zuckerberg and many others at Facebook, wearing a dress shirt with his jeans rather than a hoodie or T-shirt. Meeting me shortly before the company’s initial public offering in May, in a conference room adorned with a six-foot caricature of his boss’s dog spray-painted on its glass wall, he comes across more like a young professor than a student. He might have become one had he not realized early in his career that Web companies would yield the juiciest data about human interactions.

In 2001, undertaking a PhD at MIT’s Media Lab, Marlow created a site called Blogdex that automatically listed the most “contagious” information spreading on weblogs. Although it was just a research project, it soon became so popular that Marlow’s servers crashed. Launched just as blogs were exploding into the popular consciousness and becoming so numerous that Web users felt overwhelmed with information, it prefigured later aggregator sites such as Digg and Reddit. But Marlow didn’t build it just to help Web users track what was popular online. Blogdex was intended as a scientific instrument to uncover the social networks forming on the Web and study how they spread ideas. Marlow went on to Yahoo’s research labs to study online socializing for two years. In 2007 he joined Facebook, which he considers the world’s most powerful instrument for studying human society. “For the first time,” Marlow says, “we have a microscope that not only lets us examine social behavior at a very fine level that we’ve never been able to see before but allows us to run experiments that millions of users are exposed to.”

Marlow’s team works with managers across Facebook to find patterns that they might make use of. For instance, they study how a new feature spreads among the social network’s users. They have helped Facebook identify users you may know but haven’t “friended,” and recognize those you may want to designate mere “acquaintances” in order to make their updates less prominent. Yet the group is an odd fit inside a company where software engineers are rock stars who live by the mantra “Move fast and break things.” Lunch with the data team has the feel of a grad-student gathering at a top school; the typical member of the group joined fresh from a PhD or junior academic position and would rather talk about advancing social science than about Facebook as a product or company. Several members of the team have training in sociology or social psychology, while others began in computer science and started using it to study human behavior. They are free to use some of their time, and Facebook’s data, to probe the basic patterns and motivations of human behavior and to publish the results in academic journals, much as Bell Labs researchers advanced both AT&T’s technologies and the study of fundamental physics.

It may seem strange that an eight-year-old company without a proven business model bothers to support a team with such an academic bent, but Marlow says it makes sense. “The biggest challenges Facebook has to solve are the same challenges that social science has,” he says. Those challenges include understanding why some ideas or fashions spread from a few individuals to become universal and others don’t, or to what extent a person’s future actions are a product of past communication with friends. Publishing results and collaborating with university researchers will lead to findings that help Facebook improve its products, he adds.

Social Engineering

Marlow says his team wants to divine the rules of online social life to understand what’s going on inside Facebook, not to develop ways to manipulate it. “Our goal is not to change the pattern of communication in society,” he says. “Our goal is to understand it so we can adapt our platform to give people the experience that they want.” But some of his team’s work and the attitudes of Facebook’s leaders show that the company is not above using its platform to tweak users’ behavior. Unlike academic social scientists, Facebook’s employees have a short path from an idea to an experiment on hundreds of millions of people.

In April, influenced in part by conversations over dinner with his med-student girlfriend (now his wife), Zuckerberg decided that he should use social influence within Facebook to increase organ donor registrations. Users were given an opportunity to click a box on their Timeline pages to signal that they were registered donors, which triggered a notification to their friends. The new feature started a cascade of social pressure, and organ donor enrollment increased by a factor of 23 across 44 states.

Marlow’s team is in the process of publishing results from the last U.S. midterm election that show another striking example of Facebook’s potential to direct its users’ influence on one another. Since 2008, the company has offered a way for users to signal that they have voted; Facebook promotes that to their friends with a note to say that they should be sure to vote, too. Marlow says that in the 2010 election his group matched voter registration logs with the data to see which of the Facebook users who got nudges actually went to the polls. (He stresses that the researchers worked with cryptographically “anonymized” data and could not match specific users with their voting records.)

This is just the beginning. By learning more about how small changes on Facebook can alter users’ behavior outside the site, the company eventually “could allow others to make use of Facebook in the same way,” says Marlow. If the American Heart Association wanted to encourage healthy eating, for example, it might be able to refer to a playbook of Facebook social engineering. “We want to be a platform that others can use to initiate change,” he says.

Advertisers, too, would be eager to know in greater detail what could make a campaign on Facebook affect people’s actions in the outside world, even though they realize there are limits to how firmly human beings can be steered. “It’s not clear to me that social science will ever be an engineering science in a way that building bridges is,” says Duncan Watts, who works on computational social science at Microsoft’s recently opened New York research lab and previously worked alongside Marlow at Yahoo’s labs. “Nevertheless, if you have enough data, you can make predictions that are better than simply random guessing, and that’s really lucrative.”

Read the entire article after the jump.

Image courtesy of thejournal.ie / abracapocus_pocuscadabra (Flickr).

Google: Please Don’t Be Evil

Google has been variously praised and derided for its corporate mantra, “Don’t Be Evil”. For those who like to believe that Google has good intentions, recent events strain that assumption. The company was found to have been snooping on and collecting data from personal Wi-Fi routers. Was this the work of a lone wolf or a corporate strategy?

From Slate:

Was Google’s snooping on home Wi-Fi users the work of a rogue software engineer? Was it a deliberate corporate strategy? Was it simply an honest-to-goodness mistake? And which of these scenarios should we wish for—which would assuage your fears about the company that manages so much of our personal data?

These are the central questions raised by a damning FCC report on Google’s Street View program that was released last weekend. The Street View scandal began with a revolutionary idea—Larry Page wanted to snap photos of every public building in the world. Beginning in 2007, the search company’s vehicles began driving on streets in the United States (and later Europe, Canada, Mexico, and everywhere else), collecting a stream of images to feed into Google Maps.

While developing its Street View cars, Google’s engineers realized that the vehicles could also be used for “wardriving.” That’s a sinister-sounding name for the mainly noble effort to map the physical location of the world’s Wi-Fi routers. Creating a location database of Wi-Fi hotspots would make Google Maps more useful on mobile devices—phones without GPS chips could use the database to approximate their physical location, while GPS-enabled devices could use the system to speed up their location-monitoring systems. As a privacy matter, there was nothing unusual about wardriving. By the time Google began building its system, several startups had already created their own Wi-Fi mapping databases.
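To make the idea of locating a phone from a Wi-Fi database concrete, here is a minimal sketch. It is not Google's method, which the article does not describe; it assumes a simple weighted-centroid estimate, and the router identifiers, coordinates, and the `estimate_location` function are all hypothetical.

```python
# Hypothetical database built by wardriving: router identifier -> (latitude, longitude).
KNOWN_ROUTERS = {
    "aa:aa:aa:aa:aa:01": (40.0000, -74.0000),
    "aa:aa:aa:aa:aa:02": (40.0010, -74.0000),
    "aa:aa:aa:aa:aa:03": (40.0000, -74.0010),
}


def estimate_location(observations, database=KNOWN_ROUTERS):
    """Estimate a position as the weighted average of known router positions.

    observations maps a router identifier to a positive weight, where a
    stronger signal gets a larger weight. Routers missing from the database
    are ignored. Returns (latitude, longitude), or None if nothing matched.
    """
    total_weight = 0.0
    lat_sum = 0.0
    lon_sum = 0.0
    for router_id, weight in observations.items():
        if router_id not in database or weight <= 0:
            continue  # unknown router or unusable reading
        lat, lon = database[router_id]
        lat_sum += weight * lat
        lon_sum += weight * lon
        total_weight += weight
    if total_weight == 0:
        return None
    return (lat_sum / total_weight, lon_sum / total_weight)


# A phone that hears two known routers equally well lands halfway between them.
print(estimate_location({"aa:aa:aa:aa:aa:01": 1.0, "aa:aa:aa:aa:aa:02": 1.0}))
```

The point of the sketch is only that such a database needs router identifiers and positions, nothing more; none of the traffic passing through the routers is required.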

But Google, unlike other companies, wasn’t just recording the location of people’s Wi-Fi routers. When a Street View car encountered an open Wi-Fi network—that is, a router that was not protected by a password—it recorded all the digital traffic traveling across that router. As long as the car was within the vicinity, it sucked up a flood of personal data: login names, passwords, the full text of emails, Web histories, details of people’s medical conditions, online dating searches, and streaming music and movies.

Imagine a postal worker who opens and copies one letter from every mailbox along his route. Google’s sniffing was pretty much the same thing, except instead of one guy on one route it was a whole company operating around the world. The FCC report says that when French investigators looked at the data Google collected, they found “an exchange of emails between a married woman and man, both seeking an extra-marital relationship” and “Web addresses that revealed the sexual preferences of consumers at specific residences.” In the United States, Google’s cars collected 200 gigabytes of such data between 2008 and 2010, and they stopped only when regulators discovered the practice.

Why did Google collect all this data? What did it want to do with people’s private information? Was collecting it a mistake? Was it the inevitable result of Google’s maximalist philosophy about public data—its aim to collect and organize all of the world’s information?

Google says the answer to that final question is no. In its response to the FCC and its public blog posts, the company says it is sorry for what happened, and insists that it has established a much stricter set of internal policies to prevent something like this from happening again. The company characterizes the collection of Wi-Fi payload data as the idea of one guy, an engineer who contributed code to the Street View program. In the FCC report, he’s called Engineer Doe. On Monday, the New York Times identified him as Marius Milner, a network programmer who created Network Stumbler, a popular Wi-Fi network detection tool. The company argues that Milner—for reasons that aren’t really clear—slipped the snooping code into the Street View program without anyone else figuring out what he was up to. Nobody else on the Street View team wanted to collect Wi-Fi data, Google says—they didn’t think it would be useful in any way, and, in fact, the data was never used for any Google product.

Should we believe Google’s lone-coder theory? I have a hard time doing so. The FCC report points out that Milner’s “design document” mentions his intention to collect and analyze payload data, and it also highlights privacy as a potential concern. Though Google’s privacy team never reviewed the program, many of Milner’s colleagues closely reviewed his source code. In 2008, Milner told one colleague in an email that analyzing the Wi-Fi payload data was “one of my to-do items.” Later, he ran a script to count the Web addresses contained in the collected data and sent his results to an unnamed “senior manager.” The manager responded as if he knew what was going on: “Are you saying that these are URLs that you sniffed out of Wi-Fi packets that we recorded while driving?” Milner responded by explaining exactly where the data came from. “The data was collected during the daytime when most traffic is at work,” he said.

Read the entire article after the jump.

Image courtesy of Fastcompany.

Job of the Future: Personal Data Broker

Pause for a second, and think of all the personal data that companies have amassed about you. Then think about the billions that these companies make in trading this data to advertisers, information researchers and data miners. There are credit bureaus with details of your financial history since birth; social networks with details of everything you and your friends say and (dis)like; GPS-enabled services that track your every move; search engines that trawl your searches; medical companies with your intimate health data; security devices that monitor your movements; and online retailers with all your purchase transactions and wish-lists.

Now think of a business model that puts you in charge of your own personal data. This may not be as far-fetched as it seems, especially as the backlash grows against the increasing consolidation of personal data in the hands of an ever smaller cadre of increasingly powerful players.

From Technology Review:

Here’s a job title made for the information age: personal data broker.

Today, people have no choice but to give away their personal information—sometimes in exchange for free networking on Twitter or searching on Google, but other times to third-party data-aggregation firms without realizing it at all.

“There’s an immense amount of value in data about people,” says Bernardo Huberman, senior fellow at HP Labs. “That data is being collected all the time. Anytime you turn on your computer, anytime you buy something.”

Huberman, who directs HP Labs’ Social Computing Research Group, has come up with an alternative—a marketplace for personal information—that would give individuals control of and compensation for the private tidbits they share, rather than putting it all in the hands of companies.

In a paper posted online last week, Huberman and coauthor Christina Aperjis propose something akin to a New York Stock Exchange for personal data. A trusted market operator could take a small cut of each transaction and help arrive at a realistic price for a sale.

“There are two kinds of people. Some people who say, ‘I’m not going to give you my data at all, unless you give me a million bucks.’ And there are a lot of people who say, ‘I don’t care, I’ll give it to you for little,’ ” says Huberman. He’s tested this the academic way, through experiments that involved asking men and women to share how much they weigh for a payment.

On his proposed market, a person who highly values her privacy might choose an option to sell her shopping patterns for $10, but at a big risk of not finding a buyer. Alternatively, she might sell the same data for a guaranteed payment of 50 cents. Or she might opt out and keep her privacy entirely.
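The trade-off between the $10 listing and the guaranteed 50 cents can be sketched as a simple expected-value comparison. This is an illustration only, not the pricing mechanism from Huberman's paper; the 3% chance of finding a buyer is an invented figure, and the function names are hypothetical.

```python
def expected_payout(price, sale_probability):
    """Average payout of a listing that sells with the given probability."""
    return price * sale_probability


def break_even_probability(risky_price, guaranteed_price):
    """Chance of a sale at which the risky listing matches the sure payment."""
    return guaranteed_price / risky_price


# Prices come from the article; the 3% chance of a buyer is an assumption.
risky = expected_payout(10.00, 0.03)
guaranteed = expected_payout(0.50, 1.0)

print(f"risky listing pays about ${risky:.2f} on average")
print(f"guaranteed option pays ${guaranteed:.2f}")
print(f"break-even chance of a sale: {break_even_probability(10.00, 0.50):.0%}")
```

On these numbers the $10 listing only comes out ahead on average if the seller expects to find a buyer more than one time in twenty, which is one way such a menu of options could reveal how a person weighs privacy against risk.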

You won’t find any kind of opportunity like this today. But with Internet companies making billions of dollars selling our information, fresh ideas and business models that promise users control over their privacy are gaining momentum. Startups like Personal and Singly are working on these challenges already. The World Economic Forum recently called an individual’s data an emerging “asset class.”

Huberman is not the first to investigate a personal data marketplace, and there would seem to be significant barriers—like how to get companies that already collect data for free to participate. But, he says, since the pricing options he outlines gauge how a person values privacy and risk, they address at least two big obstacles to making such a market function.

Read the entire article after the jump.

Culturomics

From the Wall Street Journal:

Can physicists produce insights about language that have eluded linguists and English professors? That possibility was put to the test this week when a team of physicists published a paper drawing on Google’s massive collection of scanned books. They claim to have identified universal laws governing the birth, life course and death of words.

The paper marks an advance in a new field dubbed “Culturomics”: the application of data-crunching to subjects typically considered part of the humanities. Last year a group of social scientists and evolutionary theorists, plus the Google Books team, showed off the kinds of things that could be done with Google’s data, which include the contents of five-million-plus books, dating back to 1800.

Published in Science, that paper gave the best-yet estimate of the true number of words in English—a million, far more than any dictionary has recorded (the 2002 Webster’s Third New International Dictionary has 348,000). More than half of the language, the authors wrote, is “dark matter” that has evaded standard dictionaries.

The paper also tracked word usage through time (each year, for instance, 1% of the world’s English-speaking population switches from “sneaked” to “snuck”). It also showed that we seem to be putting history behind us more quickly, judging by the speed with which terms fall out of use. References to the year “1880” dropped by half in the 32 years after that date, while the half-life of “1973” was a mere decade.
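The half-life figures can be turned into a rough yearly rate of forgetting. The sketch below assumes references fade along a smooth exponential curve, which is a simplification; the article gives only the two half-lives, and the function names are hypothetical.

```python
def remaining_fraction(years_elapsed, half_life_years):
    """Fraction of the original reference rate left after the given time."""
    return 0.5 ** (years_elapsed / half_life_years)


def annual_decline(half_life_years):
    """Share of references lost each year under steady exponential decay."""
    return 1.0 - 0.5 ** (1.0 / half_life_years)


# Half-lives reported in the article: 32 years for "1880", 10 years for "1973".
for label, half_life in (("1880", 32), ("1973", 10)):
    print(
        f'"{label}": about {annual_decline(half_life):.1%} fewer mentions per year, '
        f"{remaining_fraction(20, half_life):.0%} left after 20 years"
    )
```

Under that assumption, mentions of "1880" would have faded by roughly 2% a year, while mentions of "1973" would have faded by nearly 7% a year.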

In the new paper, Alexander Petersen, Joel Tenenbaum and their co-authors looked at the ebb and flow of word usage across various fields. “All these different words are battling it out against synonyms, variant spellings and related words,” says Mr. Tenenbaum. “It’s an inherently competitive, evolutionary environment.”

When the scientists analyzed the data, they found striking patterns not just in English but also in Spanish and Hebrew. There has been, the authors say, a “dramatic shift in the birth rate and death rates of words”: Deaths have increased and births have slowed.

English continues to grow: the 2011 Culturomics paper suggested a rate of 8,500 new words a year. The new paper, however, says that the growth rate is slowing. Partly because the language is already so rich, the “marginal utility” of new words is declining: Existing things are already well described. This led them to a related finding: The words that manage to be born now become more popular than new words used to get, possibly because they describe something genuinely new (think “iPod,” “Internet,” “Twitter”).

Higher death rates for words, the authors say, are largely a matter of homogenization. The explorer William Clark (of Lewis & Clark) spelled “Sioux” 27 different ways in his journals (“Sieoux,” “Seaux,” “Souixx,” etc.), and several of those variants would have made it into 19th-century books. Today spell-checking programs and vigilant copy editors choke off such chaotic variety much more quickly, in effect speeding up the natural selection of words. (The database does not include the world of text- and Twitter-speak, so some of the verbal chaos may just have shifted online.)

Read the entire article here.

Data, data, data: It’s Everywhere

Cities are one of the most remarkable and peculiar inventions of our species. They provide billions in the human family a framework for food, shelter and security. Increasingly, cities are becoming hubs in a vast data network where public officials and citizens mine and leverage vast amounts of information.

Krystal D’Costa for Scientific American:

Once upon a time there was a family that lived in homes raised on platforms in the sky. They had cars that flew and sorta drove themselves. Their sidewalks carried them to where they needed to go. Video conferencing was the norm, as were appliances which were mostly automated. And they had a robot that cleaned and dispensed sage advice.

I was always a huge fan of the Jetsons. The family dynamics I could do without—Hey, Jane, you clearly had outside interests. You totally could have pursued them, and rocked at it too!—but they were a social reflection of the times even while set in the future, so that is what it is. But their lives were a technological marvel! They could travel by tube, electronic arms dressed them (at the push of the button), and Rosie herself was astounding. If it rained, the Superintendent could move their complex to a higher altitude to enjoy the sunshine! Though it’s a little terrifying to think that Mr. Spacely could pop up on video chat at any time. Think about your boss having that sort of access. Scary, right?

The year 2062 used to seem impossibly far away. But as the setting for the space-age family’s adventures looms on the horizon, even the tech-expectant Jetsons would have to agree that our worlds are perhaps closer than we realize. The moving sidewalks and push button technology (apps, anyone?) have been realized, we’re developing cars that can drive themselves, and we’re on our way to building more Rosie-like AI. Heck, we’re even testing the limits of personal flight. No joke. We’re even working to build a smarter electrical grid, one that would automatically adjust home temperatures and more accurately measure usage.

Sure, we have a ways to go just yet, but we’re more than peering over the edge. We’ve taken the first big step in revolutionizing our management of data.

The September special issue of Scientific American focuses on the strengths of urban centers. Often disparaged for congestion, pollution, and perceived apathy, cities have a history of being vilified. And yet, they’re also seats of innovation. The Social Nexus explores the potential waiting to be unleashed by harnessing data.

If there’s one thing cities have an abundance of, it’s data. Number of riders on the subway, parking tickets given in a certain neighborhood, number of street fairs, number of parking facilities, broken parking meters: if you can imagine it, chances are the City has the data available, and it’s now open for you to review, study, compare, and shape, so that you can help build a city that’s responsive to your needs.

More from theSource here.

Image courtesy of Wikipedia / Creative Commons.