The title of this post is not a political statement, but will simply be a statistical truth, if the not-so-recent revelations lead to a step away from “just-in-case” wholesale data collection by the NSA.
Following the last post, about high-performance computing, it’s fair to ask where cheaper data storage and computing power will lead in terms of applications. And here, the NSA is no different from the 60% (some say more than 90%) of businesses that will use the massive amounts of storage and computing capability to end up with quack science.
When you analyse terabytes of data, you will definitely find something. The question isn’t just whether that’s useful. It’s whether you would have learned more if you had done a lot less analysis on a lot less data. As a practical matter, most often you will end up learning less.
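To see why you will "definitely find something", here is a minimal sketch using nothing but random noise. It is an illustration under stated assumptions, not a model of any real data set: the function names, the 30 observations per series and the seed are all made up for the example. It searches a growing pile of unrelated noise series for the one that best "explains" a noise target.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_series, n_obs=30, seed=0):
    """Largest |correlation| between a noise target and n_series noise series.

    Everything here is independent random noise (an assumption of the
    sketch), so any correlation it reports is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_series):
        series = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, series)))
    return best


if __name__ == "__main__":
    # The more series we search, the stronger the best "finding" looks.
    for n_series in (10, 100, 1000, 10000):
        print(n_series, round(strongest_spurious_correlation(n_series), 2))
```

Because the seed is fixed, each larger search contains the smaller one, so the best correlation can only go up as more series are added. None of it means anything.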
The first phenomenon is that correlation often gets misinterpreted as causation. If it rains whenever the roads are wet, we know that wet roads don’t cause rain, but we know so from other data (not from the correlation!). Nothing in the correlation between the incidence of rain and of wet roads tells you which is the cause and which is the effect (or in fact whether they are both caused by a third phenomenon). When your client wants you to find ‘something’, this is easy to forget, and Big Data gives you lots of opportunity to show stuff that is ‘there’ in the data: Real correlations that explain nothing. We see it all the time: Newspaper articles suggesting poverty (rather than the consequences thereof) increases cancer risk, academics insisting that strong political institutions cause economic growth (who knows if it’s not the other way round?), and business plans stating that a large and growing market must be a great opportunity for start-ups (often true, often false, never questioned). If you are tempted to please rather than being brutally intellectually honest, big data gives you all the tools you will ever need to pass the wool to your client, so they can pull it over their own eyes. You can tell them whatever they want to hear and will have the data to support it, using real correlations of no explanatory value. Most people will never pick you up on it.
The other problem is that even where correlation is meaningful, its marginal predictive value can be zero once we have a lot of data already. Collecting correlated data is an activity with rapidly diminishing returns.
For example, the best basketball players are, with overwhelming probability, taller than 6’10”. However, this information is not useful, not even if you are the team manager. Because you will have already looked at statistics like points, assists, rebounds, and win shares, player size will not add any information (in other words, after looking at the other data, you aren’t going to discount a player because of his size). Size will correlate with performance, and mostly the player you will have selected by other means will turn out to be taller than 6’10”, but size should never be used as a predictor of basketball skill. It’s a mathematical fact. Using it would rather be like saying “I don’t hire any female mathematicians”, or “I don’t hire any male art directors” based on national exam, hiring, earnings, or whatever other statistics. Of course you know better than to do that. Instead, you look at each individual candidate’s CV and grades. Simple.
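The basketball point can be sketched with simulated players. This is a toy model, not real league data: the coefficients, the sample size and the names `height`, `stats` and `future` are all assumptions made up for illustration. The key assumption is that height influences future performance only through the playing statistics you have already looked at. Under that assumption, height correlates clearly with performance on its own, yet its partial correlation, once the statistics are accounted for, is close to zero.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


def simulate(n_players=20000, seed=1):
    """Return (raw, partial) correlation of height with future performance."""
    rng = random.Random(seed)
    # Standardised height (hypothetical units).
    height = [rng.gauss(0, 1) for _ in range(n_players)]
    # Assumption: playing statistics depend partly on height, partly on
    # everything else (skill, speed, training).
    stats = [0.6 * h + rng.gauss(0, 0.8) for h in height]
    # Assumption: future performance depends on height ONLY via the stats.
    future = [s + rng.gauss(0, 1) for s in stats]
    raw = pearson(height, future)
    # Partial correlation: correlate what the stats leave unexplained.
    partial = pearson(residuals(height, stats), residuals(future, stats))
    return raw, partial


if __name__ == "__main__":
    raw, partial = simulate()
    print("height vs performance:", round(raw, 3))
    print("same, given the stats:", round(partial, 3))
```

If height mattered in ways the statistics failed to capture, the partial correlation would not vanish. Whether it does is exactly the kind of question that needs a prior model of the real world.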
The problem starts once abstract statistics form part of “your approach” without a mental model of the real world. You stop looking at which correlations are actually useful and which ones add no information to what you already have. Every time, you will overlook the 5-foot midfielder who would have been so easy to spot (because he outruns and out-dribbles everyone)!
More data isn’t better data, unless you know a priori that you can’t do with less. And whether you can or can’t is a question of having a prior model. Without that, looking at more data is not likely to lead you to a better model, and the more you add, the worse it gets.
The best way of looking for truffles is not to start by scouting all the forests of the world. Because if you do, all the false positives (things that look like truffles but aren’t) will at best waste your time, and at worst prevent you from ever finding the real thing. The problem is that once you start down this wrong track, the false positives will keep the people at the top excited, and will keep the funding going.
So we should not be surprised at what the NSA Chief says, “You Need the Haystack To Find the Needle”, even though it is obviously not just wrong, but even counter-productive to the NSA’s own objectives. (a) To find the needle, you need to first spend a lot of time guessing what is the smallest possible haystack that could hide the needle. (b) Once you have started on the wrong track and have an army of needle-searchers to feed, they have an interest in telling you that they need all the hay in the world to rummage around in. Not because they are devious, but because they love that kind of job.
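The arithmetic behind the smallest-possible-haystack argument is the familiar base-rate calculation. The sketch below uses entirely hypothetical numbers: the population sizes, the count of needles, the detector's sensitivity and its false-positive rate are all assumptions chosen for illustration, as is the function name `precision`. It also assumes the narrowed haystack still contains every needle, which is precisely what a good prior model has to deliver.

```python
def precision(haystack_size, needles, sensitivity, false_positive_rate):
    """Share of flagged items that are real needles (positive predictive value)."""
    true_hits = needles * sensitivity
    false_hits = (haystack_size - needles) * false_positive_rate
    return true_hits / (true_hits + false_hits)


if __name__ == "__main__":
    # Hypothetical detector: catches 99% of needles, wrongly flags 1% of hay.
    whole_world = precision(300_000_000, 1_000, 0.99, 0.01)
    small_stack = precision(100_000, 1_000, 0.99, 0.01)
    print(f"all the hay:    {whole_world:.4%} of flags are real")
    print(f"small haystack: {small_stack:.4%} of flags are real")
```

With these made-up inputs, roughly 0.03% of the flags from the full haystack are real, against 50% from the small one. The detector is identical in both cases. Only the amount of hay changed.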
The movie Minority Report, where psychics could get people arrested simply because the psychics believed those people were likely to commit certain crimes, thankfully is science fiction. But the threat is always there. When the “psychics” don’t find anything, they get carried away, dangling “false positive” carrots in front of the bosses, who, sitting in the same boat of having to find something, laud them for their ‘good work’ of finding false positives (“well, we didn’t find anything this time, but with so many signals, we must be onto something”). It’s easy to picture the meeting where the “Big Data” collection was decided: Isn’t it obvious that “more data is better, and if it’s cheap, we’ll have it”? The only people in the room who could have told them otherwise were the serious statisticians (who also have a healthy dose of understanding of organisational behavior, so “psychics” may be as good a word as any), but their own stature in the organisation stood to benefit from the (wrong) decision to start by gathering all the hay there is. Perhaps a small minority spoke up nonetheless, but not enough to convince the career generals that feeding on too much information may be institutionally unhealthy, not to mention deeply disconcerting to the public if they ever found out.
Ironically, if the NSA stopped wholesale metadata collection, people wouldn’t just feel safer for not being snooped on wholesale, they would also be justified in feeling safer in believing that the NSA knows how to construct models and read numbers on potential threats.