The title of this post is not a political statement, but will simply be a statistical truth, if the not-so-recent revelations lead to a step away from “just-in-case” wholesale data collection by the NSA.
Following the last post, about high-performance computing, it’s fair to ask where cheaper data storage and computing power will lead in terms of applications. And here, the NSA is no different from the 60% (some say more than 90%) of businesses that will use the massive amounts of storage and computing capability to end up with quack science.
When you analyse terabytes of data, you will definitely find something. The question isn’t simply whether what you have found is useful. It’s whether you would have found something more useful if you had done less analysis on less data. As a practical matter, the answer is often yes: beyond a certain point, more data steers you away from the truth you are after, not towards it.
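This is easy to see in a toy simulation (a hypothetical sketch with made-up numbers, not real business or NSA data): generate a couple of thousand columns of pure noise, correlate each against a target, and some of them will look like “findings” purely by chance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars = 100, 2000   # hypothetical: 2000 completely unrelated "signals"

# Every column is pure noise, independent of the target.
target = rng.normal(size=n_obs)
data = rng.normal(size=(n_obs, n_vars))

# Sample correlation of each noise column with the target.
corrs = np.array([np.corrcoef(data[:, j], target)[0, 1] for j in range(n_vars)])

# With 100 observations, dozens of noise columns clear |r| > 0.2,
# and the single "best" one looks genuinely impressive.
print(f"strongest 'finding': r = {corrs[np.abs(corrs).argmax()]:.2f}")
print(f"columns with |r| > 0.2: {(np.abs(corrs) > 0.2).sum()}")
```

The more columns you scan, the more striking your best spurious correlation becomes; the data set guarantees “something” will be found.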
It isn’t just that correlation often gets misinterpreted as causation, although it kind of starts with that, so we’ll repeat the often-preached truth here: roads are wet whenever it rains. We know that wet roads don’t cause rain, but we know so from experience that lies outside the data set which captured the correlation. The correlation itself does not tell you which is the cause and which is the effect (or in fact whether both are caused by a third phenomenon). It is convenient to forget this when your client wants you to find ‘something’. Big Data gives you lots of opportunity to show stuff that is ‘there’: real correlations which explain nothing. We see this incentive misguiding people all the time: newspaper articles suggesting poverty (rather than the consequences thereof) increases cancer risk, academics insisting that strong political institutions cause economic growth (who knows if it’s not the other way round?), and business plans stating that a large and growing market must be a great opportunity for start-ups (often true, often false, never questioned). If you are tempted to please rather than to be brutally intellectually honest, big data gives you all the tools you will ever need to pass the wool to your client, so they can pull it over their own eyes. You can tell them a good story, using real correlations of no explanatory value. You then suggest there is some kind of causal connection reflected in the correlation. This causal connection may or may not be there. No-one will ever know, and most people will never pick you up on that hidden but invalid assumption.
The other problem is to do with the information content of correlation. Even where there is no causation, knowing one variable gives you some marginal information about a correlated variable. We just tend to forget that this marginal value of information is often zero, and it tends to be zero precisely when you already know a lot of other things about these variables. In other words, adding more data to an already gargantuan stash often has zero marginal value, while the cost of evaluating all possible connections in the data increases non-linearly with the size of the data mountain.
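The non-linear cost is simple combinatorics: even if you restrict yourself to pairwise connections, the number of correlations to evaluate grows quadratically with the number of variables. A minimal illustration:

```python
from math import comb

# Pairwise "connections" alone grow quadratically with the number of variables:
# 10x the variables means roughly 100x the pairs to evaluate.
for k in (10, 100, 1000, 10_000):
    print(f"{k:>6} variables -> {comb(k, 2):>10} pairs")
# 10 variables -> 45 pairs; 10,000 variables -> 49,995,000 pairs
```

And pairs are the cheap case; connections among triples and larger subsets grow faster still.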
For example, the best basketball players are taller on average. So a tall person might be a better basketball player on average. Saying that is fine if you know nothing else about that person. However, if you have statistics like points, assists, rebounds, and win shares, player size will not add any information at all. Size correlates with performance, and the players you select by other means will turn out to be taller than average, but size should never be used as a predictor of basketball skill when you already know a lot of other things about the person. Using it would be rather like saying “I don’t hire any female mathematicians”, or “I don’t hire any male art directors”, based on national exam, hiring, earnings, or other statistics. Of course we do not do that, and instead look at data other than gender to find out about the person. And — this is key — thereafter, i.e. once you have looked at the person closely, gender does not tell you anything about that person’s skill any more. Have you noticed how many sensationalist newspaper articles quoting statistics might have misguided you in this particular way (i.e. admitting it’s not about causation, but claiming that the correlation contains valuable information even when it does not)?
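The basketball point can be sketched numerically. The generative model below is invented purely for illustration: a common “athleticism” factor drives both height and on-court stats, while skill depends only on the stats. Height alone then predicts skill, but adds essentially nothing once the stats are known.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical model: a latent "athleticism" factor drives both height
# and on-court stats; skill depends only on the stats.
athleticism = rng.normal(size=n)
height = 0.6 * athleticism + 0.8 * rng.normal(size=n)
stats  = 0.9 * athleticism + 0.4 * rng.normal(size=n)   # points/rebounds proxy
skill  = 2.0 * stats + rng.normal(size=n)

def r2(columns, y):
    """R-squared of an ordinary least-squares fit of y on the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

print(f"height alone:   R^2 = {r2([height], skill):.3f}")   # clearly positive
print(f"stats alone:    R^2 = {r2([stats], skill):.3f}")
print(f"stats + height: R^2 = {r2([stats, height], skill):.3f}")  # ~no gain
```

Height carries real marginal information on its own, yet its marginal value collapses to (sampling noise around) zero once the better predictors are in hand, which is exactly the conditional-independence point in the text.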
More data isn’t better data, unless you know a priori that you can’t do with less, meaning you know that the marginal information is non-zero. And to know this, you need a prior model of how the world may work. Without that, looking at more data is likely to lead you to a posterior model which is simply wrong. And the more data you look at, the more likely you are to end up with a wrong posterior model, simply because the marginal value of any part of the data converges to zero.
The best way of looking for truffles is not to start by scouting all the forests of the world. If you do, all the false positives (things that look like truffles but aren’t) will at best waste your time and consume your physical and mental resources, actually reducing your probability of finding the real thing. The problem is that once you start down this wrong track, the false positives will keep the people at the top excited (along the lines of “where there’s smoke, there’s fire”, they will argue “where there’s a correlation, there is information”), and the top guys will keep the funding flowing.
We should not be surprised when the NSA Chief says, “You Need the Haystack To Find the Needle”, but we should know that this is not just wrong, but counter-productive to the NSA’s own objectives. (a) To find the needle, you first need to spend a lot of time guessing which part of the haystack doesn’t add value, and simply throw it away. (b) Once you have started on the wrong track and have an army of needle-searchers to feed, they have an interest in telling you that they are onto something, and that all they need is more hay and more people to sift through it.
The movie Minority Report, where psychics could get people arrested simply because they believed those people were likely to commit certain crimes, is thankfully science fiction. But the threat is always there. When the “psychics” don’t find anything, they get carried away, dangling “false positive” carrots in front of their bosses’ faces, who — sitting in the same boat of having to find something — laud them for their ‘good work’ of finding false positives (“well, we didn’t find anything this time, but with so many signals, we must be onto something”). When your job is to prevent terrorism, things are particularly pernicious, because the claim that “we prevented an attack which would otherwise have happened” can never be verified or falsified. In other words, you can never prove that their model was wrong, their data misinterpreted, and their effort wasted, because you can never prove that there was going to be no terrorist attack anyway.
It’s easy to picture the meeting where the “Big Data” collection was decided: isn’t it obvious that “more data is better, and if it’s cheap, we’ll have it”? The only people in the room who could have told them otherwise were the serious statisticians (those with a healthy dose of understanding of organisational behaviour, too). But once their numbers are drowned out by the “data scientists”, things get tricky. Perhaps a small minority spoke up nonetheless, but not enough to convince the career generals that feeding on too much information may be counterproductive.
Ironically, if the NSA stopped the wholesale meta-data collection which (amongst other things) Snowden goes on about, people wouldn’t just feel safer for not being snooped on wholesale; they would also be justified in feeling safer, believing that the NSA knows how to construct models and read the numbers on potential threats.