Part 1: Why Big Data can spell Big Trouble
The confusion of significant-looking results with meaningful results, and how it gets worse in the age of Big Data, is a difficult issue to get across. And because equations don’t usually do it, I will be sticking mostly to examples.
But do allow me one equation to start with. It’s not a very complex one, but it is an interesting one to contemplate nonetheless:
1+1=2. Now the thing is, 1+1=2 is not some deep, universal truth. It is a trivial consequence of how we have defined these symbols. What do I mean by that? In the world of symbols this equation is true, not because one thing and another thing equal two things, but because the meaning of the symbol “2” is defined by that equation. We have simply decided to call the result “two” and defined the meaning of “two” precisely that way. If we had decided to call it tnejn (Maltese for “two”), the equation would still be true, because now tnejn is the name of the result of the operation 1+1. If we had decided to call it Viltvodl, the equation would still be true, by definition of Viltvodl. It becomes a tautology. The problems only start when we ask what it actually means. Stanisław Lem goes only as far as calling mathematical truths ‘empty’. I would go further and call them meaningless, because they all ultimately follow from how we defined the symbols, and from nothing else. The fact that some of these truths can be applied very, very often to the real world is actually an amazing fact. Witness the title of a famous essay by Nobel Laureate Eugene Wigner: “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”.
In the real world, the word “one” has a slightly different meaning from that in pure mathematics, and the same goes for the word “two”. In the real world, therefore, one plus one does not always equal two. We don’t always get two things when we add two things to each other. One cloud and one other cloud may result in any number of clouds. One river added to another river may give any number of rivers, often just one.
We know, intuitively, when adding two things will produce two things in the real world, and when it may not. We learn this from such early time in childhood that we stop thinking about when exactly the equation is applicable or not. We just know. One apple and another apple equals two apples, practically always.
However, when it comes to statistics, our brains are not only not trained to intuitively know what is applicable and what is not, they will be actively misled by the way we are hardwired to react by evolution. The crux is: things that have an element of mathematical truth can become dangerous when misapplied to the wrong situation. Fundamentally, this talk is about how often we fail to understand what statistical significance really means in a situation, if we even remember to ask the question. Worse, our intuition actively misleads us. But enough with the theory, from now we will focus on examples.
Clearly, Big Data has become ubiquitous. It is easy to collect, and we all do it. We save all our e-mails, our music and reading libraries are often too large to ever listen to or read everything even once. Yet sometimes when we are looking for things we know are there, we still can’t find them.
Big Data promises to solve that and bigger issues for big business and big government, in virtually any domain – crime, public health, predicting what songs or books you may like, or helping you through the perils of dating – just by crunching the numbers.
It has created celebrated successes, from Google’s search engine to the Jeopardy-winning Watson of IBM. So doesn’t a title like “The Trouble with Big Data” make me look like a Luddite? Like the people who were mortally afraid of steam locomotives? Mankind, after all, was ready for the steam locomotive pretty much from day one.
Of course the steam locomotive was made for a purpose, and Big Data is here for a purpose too, namely to make you find and discover things. It will definitely live up to that promise but Big Data creates illusions:
- Big Data can find us too many things, even things that aren’t real effects – ‘false positives’.
- The false positives become more seductive, appearing much more statistically significant than they are.
- We will miss more of those things we were hoping to find – ‘false negatives’.
Add to that the messy issue of human incentives.
Positives are much harder to spot and accept as false when they look like compelling outliers. The confluence of Big Data and big pressure for conclusions creates skewed incentives. We may badly want to believe the results, especially when media attention, funding, profits, and jobs are riding on us finding results. There is a serious temptation to fool ourselves and our bosses and customers into believing the compelling outliers tell us something important.
I will show you examples from drugs testing, sales management, investment management, and gambling. In each case,
- the more we search, the more likely our positives are false;
- the more we search, the more compelling the false positives;
- the more we search, the more the true positives are hidden by the loud noise of the compelling false ones;
- human incentives reinforce the first three problems.
And the biggest temptation as we enter the Big Data age is to let the machines do the hard work of the analysis for us. When we do that, the ‘false positives’ and ‘false negatives’ can increase exponentially. I want to illustrate how such misuse can, and most likely will, happen.
This is going to have implications in our daily lives – in terms of the effectiveness of the medication we take, the success of our investments, and in the everyday choices we make in buying products. Even in terms of national security. The public debate is often framed in terms of a trade-off between privacy and national security. Yet it misses a central point: As the surveillance data grows, the incidence of false negatives can grow quickly, and false negatives cost lives.
Big Data heightens other risks too, including the risk of making the most elementary statistical mistake of using data that was collected for a different purpose, simply because it’s there. We know that this leads to sample error, sample bias, and the confusion between correlation and causation.
The elimination of sample error and sample bias, as well as the introduction of proper ‘control variables’ requires us to design proper statistical experiments, and to collect specific data for specific questions. Generic data, simply because it is more and more available, will almost always tempt us in the other direction.
Why spears are better than drag-nets
Fishing with a spear is time-consuming, requires human labor, practice, and planning. Fishing with a drag net allows us to let a heavy-duty trawler do the hard work. When fishing for things to learn from data, Big Data lets us cast our nets far and wide in the search for knowledge we don’t yet have, and let the computer do the work. Simply having the ability to crunch bigger amounts of data naturally tempts us to do so.
But there are consequences. The most important principle I want to explain is that while Big Data does not yield more wisdom the wider you cast your drag-net, Big Data can yield more wisdom, more correct results, when you fish with a spear.
As an example of a drag-net approach, drug companies can save money by cheaply simulating millions of compounds by computer, instead of testing dozens in the laboratory at large expense. The one drug that has come top of the pile against a million other candidates must be far better than one that has come up best in a test against only a dozen alternatives, right? Surely it’s going to be a blockbuster and help everyone.
It’s a tempting conclusion, but precisely wrong. Remember that in a drugs trial there are three possible outcomes:
- We end up choosing the right drug for the right reasons
- We end up choosing the right drug for the wrong reasons
- We end up choosing the wrong drug for the wrong reasons (and perhaps, to boot, perfectly good drugs were discarded).
To see how our intuition deceives us into thinking a bigger experiment is always better, let’s construct a parlor game which works very similarly to the drug trials:
The crowd gets to ‘test’ 100 coins, by flipping them 100 times each. Consider this a simple model for a drugs test. We are looking for a loaded coin, one that produces ‘heads’ more often than 50% of the time. That coin, if we find it, we will consider an ‘effective drug’.
So we are testing 100 coins to see if any are loaded, in the same way the drugs company tests 1000s of drugs to see if any of them work. A white-coat is taking notes and presents the statistics in the end. Let’s say it turns out that one coin comes up ‘heads’ 62 times and ‘tails’ 38 times. Can I persuade people to ‘buy’ the idea that the coin is loaded, that the outcome will likely be ‘heads’ again? What should people’s confidence be that the drug will continue to work?
In this example, people should in fact not be misled into thinking that the best coin was in any way special. Mathematically the ‘best’ coin out of 100 should come up heads approximately 62% of the time. It does so by sheer luck, because we tried so many coins that an outlier on that scale is expected even if all the coins are fair. The ‘best’ coin in this example is like an ineffective drug, chosen for the wrong and only reason that we tested too many. By choosing the best out of many after the experiment, we choose a coin that ‘seems’ more effective than it is. For a single coin, 62/100 heads is a 2.33σ event, highly significant. For the best out of 100 tries, it’s entirely expected, and tells us nothing about that particular coin.
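This ‘best of 100’ intuition is easy to check with a short simulation. Here is a quick pure-Python sketch (the exact numbers will wobble a little from run to run):

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def best_coin_heads(n_coins=100, n_flips=100):
    """Flip n_coins fair coins n_flips times each; return the highest head count."""
    return max(bin(random.getrandbits(n_flips)).count("1")
               for _ in range(n_coins))

# Repeat the whole 100-coin experiment many times and average the 'best' count.
trials = 500
avg_best = sum(best_coin_heads() for _ in range(trials)) / trials
print(avg_best)  # typically lands near 61-62 heads out of 100
```

In other words, a 62/100 ‘winner’ is exactly what a room full of perfectly fair coins should produce.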
Now I take the experiment one step further. In the next version, let’s say I am God, and I actually know that one coin is in fact loaded, and will – on average – show heads 62% of the time, even if we tested it a million times. But it’s hidden amongst 99 ‘wrong’ fair ones (ineffective drugs). It turns out that in this experiment the chance that the right one will also come out best in the test is only 44%. That’s because the accidentally best out of the remaining 99 fair coins will ‘trump’ the biased one in this experiment more often than not.
The chance is more than 50:50 that we will confuse the drug producing the ‘best outcome’ with the ‘best drug’. So far so bad. But if I am a drugs company, and my incentive is to sell that coin to someone, can I make the wrong coin ‘look’ even better?
Sure. What if we test the ‘right’ coin hidden in a sample of 999 others? There is a perfectly good drug in the sample of 1,000. But the chance of actually finding it is now down to 19%, because the best coin of 1,000 will score even higher on average than the best out of 100. It will very likely be ‘heads’ approximately 66% of the time. Remember, the ‘true’ good one will average only 62%. So this coin has to get a little lucky in the trial to beat the best of the 999 fair ones. Of course it gets worse if the good coin is hidden in a sample of 9,999 fair ones. All this means that – as I increase my search-space – my chances of selling a drug for good money go up, while the chances that I am selling an effective one actually go down. The worst outcome – selecting the wrong drug for the wrong reason, and discarding a perfectly effective one – is becoming more and more likely, and the seeming significance of the fluke performer will actually increase, the larger I make the haystack within which I am searching for it.
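The 44% and 19% figures can be reproduced with a similar simulation. This is a rough sketch; the percentages are approximate and depend slightly on how ties are broken (here a tie counts against the loaded coin):

```python
import random

random.seed(1)

N_FLIPS = 100

def heads_fair():
    # Heads in 100 fair flips, using 100 random bits for speed.
    return bin(random.getrandbits(N_FLIPS)).count("1")

def heads_loaded(p=0.62):
    # Heads in 100 flips of the loaded coin, P(heads) = 62%.
    return sum(random.random() < p for _ in range(N_FLIPS))

def win_rate(n_fair, trials=1000):
    """Fraction of experiments in which the single loaded coin strictly
    beats the best of n_fair fair coins."""
    wins = sum(heads_loaded() > max(heads_fair() for _ in range(n_fair))
               for _ in range(trials))
    return wins / trials

print(win_rate(99))   # roughly 0.44: best-of-100 picks the real coin less than half the time
print(win_rate(999))  # roughly 0.19: in a bigger haystack, even less often
```

The bigger the haystack of fair coins, the lower the chance that the genuinely loaded one comes out on top.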
So, if your instinct was to prefer the drug that was tested against a million others over the one that was tested in expensive tests against a dozen, you fell for a mathematical confidence trick. You are paying more for a less effective drug. You are betting on a past winner who was just lucky. When selecting investment managers, terror suspects, or product sales strategies, the same game is played all the time.
Remember the last time you sat down with your investment advisor. Perhaps he had screened 3,000 funds by past performance statistics, and has discarded all that didn’t come out close to the top. If so, would you be more convinced that he has done his homework than if he told you he had personally done extensive, deep-dive research on only 20 funds, precisely because he did not want to be swayed by performance alone? What if he told you that it is really hard in practice to separate skill from luck, and that the only way of doing so was to go and meet the managers, read their CVs, get references, listen to their approach, interview their risk manager, and check their police records? Of course you, as professionals, know the right answer to that question, but as humans we are easily deceived. It’s easy to say to our customer, and to ourselves: “But this one came out top of 3,000, not 20”.
On this slide, I have created totally random funds. They are all monkeys. By construction, I have built them to have zero skill. The ‘best’ out of 113 is really nothing but the luckiest out of 113. And yet he starts to look like he really got something, doesn’t he?
Do you actually ask the question “what confidence level can I have that the results were not down to luck”?
Selecting promising candidates before you do a search for performance is a really big trick: If you only ever looked at one manager who you think should be promising, and then you check and he turns out to have an information ratio of 1.0 over five years, that’s very significant. If you look at 20 with no prior belief about whether they are any good, and the best shows that very same track record of an information ratio of 1.0 maintained over five years, it actually means nothing any more. That very same record is now indistinguishable from monkey’s luck. If you look at 3,000 instead of 20, you are almost guaranteed that for every one who is actually consistently good, you have 30 monkeys with a similar track record, and you won’t be able to statistically distinguish between skill and luck at all. The more you search, the less you find.
Evolution has hardwired us to see anything as significant, if the stakes are high enough
In horse-racing, we see a disproportionate amount of money riding on a favorite, in investment management, we see a lot of investors chasing past returns. What it all points to is an evolutionary program that is at work in our emotional responses to past experiences. “A burnt child dreads the fire”, “Once bitten, twice shy”, “Fool me once, shame on you; fool me twice, shame on me”. These and similar aphorisms attest to that human trait that we are ready to learn from one single bad experience, as long as the stakes are high enough.
We repeat what has worked, and avoid things that caused us harm. Clearly, this works well for animals and toddlers. A child who has burnt its fingers on the stove isn’t going back there, and that’s a good thing. A dog that has seen another dog coming out of a tunnel looking really stressed isn’t going into that tunnel (yes, this has been experimentally tested!). But this intuition, honed by evolution, isn’t so good any more when we look at really, really large amounts of data. What works well on a small sample can fail on a larger one. Remember the coins. Remember the drugs.
Perhaps you know superstitious people who, having gotten really lucky one day, try to replay that day step by step every time. Getting up on the same foot, having the same breakfast, leaving the lights on (like they did that day) and so on. They don’t know what caused them to be lucky, but it must have been something they did. So they try to do everything precisely the way they did on the day they got lucky. When the stakes are high, we do the same: somebody took that drug and got cured. It doesn’t matter that there may be thousands of other reasons why they, and no one else, got cured: we must have that drug. And our brains shut out the other 1,000 factors that may have cured that other person. As long as our level of desperation is high enough, any one of us will demand that drug – about which the statistics teach us nothing at all.
Evolution has also hardwired us to see anything as significant, if it’s unlikely by itself
Remember the coins. Perhaps you have an ‘intuition’ that 62/100 ‘heads’ is significant – roughly a 1-in-100 event. And it is significant if it happens the one time I flip a coin, but not when I have tried 100 coins.
I want you to think of something incredibly unlikely that happened to you once. Can you think of a one-in-a-million event that you were party to or witnessed? Perhaps you met someone from your school-days in a far-away place. Perhaps someone you haven’t talked to in a long time suddenly called you on the phone right the minute you happened to think of them. Whatever it is, it should be something that made you think “it’s amazing I was there”, or “why me?” Don’t worry, I’m not going to ask any of you what that experience was. I’m going to pause for a little because I want you to really try hard and think of your own one-in-a-million event.
Remember the coins? A one-in-a-million coin is quite likely if you have tested a million. Equally, the fact that unlikely events occur, even to you specifically, isn’t that unlikely, since millions and millions of other possible unlikely events did not happen to you. It’s like you are flipping millions of coins all the time, but because you don’t think about things that way, you are amazed when one comes up heads 22 times in a row. Our personal experience, the sensation of surprise (and perhaps delight), obscures the simple fact that millions of other possible coincidences we didn’t even think of did not happen. And one-in-a-million events have to happen when there are millions that don’t. You don’t have to have supernatural powers to attract incredibly improbable events. Think of how many people you know. Is it likely that you would never meet any of them ever in a place you didn’t expect? It just happened to be that one, but it had to be someone, and it definitely does not mean it was not random chance.
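The ‘22 heads in a row’ remark can be made precise. For a single fair coin, a run of 22 heads has probability 2^-22, about one in 4.2 million – but if you are implicitly ‘flipping’ many millions of coins, seeing it somewhere becomes the expected outcome. A small back-of-the-envelope calculation (the figure of ten million coins is an illustrative assumption):

```python
# Chance that at least one of n fair coins shows 22 heads in a row.
p_single = 0.5 ** 22          # ~1 in 4.2 million for any one coin
n = 10_000_000                # illustrative: "millions of coins" flipped over a lifetime
p_at_least_one = 1 - (1 - p_single) ** n
print(round(p_at_least_one, 3))   # ~0.91: the 'miracle' is more likely than not
```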
What is important here is how subjectively significant these events seem to us when they occur. Why is it important? Because it explains why we are prone to fall victim to the drugs trial and coin flipping deception. If something specific is unlikely to happen in isolation (a coin showing heads 65 times out of 100), we see it as unlikely to be by chance, and we attribute significance to it, even if it appeared in a sample where the same thing had already failed to happen thousands of times. Simply put, because with Big Data we can test millions of hypotheses at once, a few random ones will be outliers that seem like they just can’t be coincidence, yet they are. Unlike the divergence between the mathematical ‘definition’ of “two” and our everyday ‘meaning’ of “two”, the divergence between what we perceive as significant and what actually is significant (in terms of being something that was unlikely to be random) is very, very large.
I will tell you my own amazing experience: I was in the army, on guard in a field exercise in the middle of literally nowhere. At 3 a.m. I witnessed a meteor, bright as a half-moon, brilliantly coloured, with a flickering tail, lasting for about five seconds. The sighting was later confirmed by astronomers. Perhaps fewer than 1 in a million people ever get to see a meteor of that size with their naked eye in their entire lifetime. I was lucky enough to have been (a) awake at 3 a.m. that day, (b) within a 100km radius of the event, (c) outdoors, (d) in a place where it’s dark enough to notice, and (e) assigned to be out there specifically looking for unusual stuff.
I hope by now you realize that it is not at all amazing that something as unlikely as that should happen to me. All of you will have had similarly or even more unlikely events occur to you in your life – just different ones. I bet if some of you told your stories right now, many of them would be much more amazing.
Had I predicted the day before that I’m going to see that meteor, noted down what I was expecting to see, and then had actually seen it, that would be truly amazing. Like flipping one coin once and it ending up resting on its edge. But I hadn’t seen it coming, millions and millions of amazingly unlikely things could have occurred to me for a long time, until one of them finally did.
It does not matter if it’s drugs testing, terrorist search, investment advice, or something else, we think that the best (or worst) of a million samples just has to have something that the other things don’t have. Worse, when the stakes are high and your job, funding, or your health are on the line, the temptation to believe easily gets the better of you, your boss, your family, or your customers.
In summary, our brains and hormones make us susceptible to these misconceptions:
- Anything unlikely that happens strikes us as significant, even if it had already failed to happen millions of times before finally happening once.
- If the stakes are really high, we play it ‘safe’ and see connections where they are unlikely. It’s our survival instinct. We do things ‘just in case’ it turns out like last time. We become superstitious, often without noticing.
A bigger haystack doesn’t make finding needles easier
Our subjective perception of the significance of the outliers we find isn’t the only problem. With Big Data we do not just get more outliers, and more significant-looking ones, but also more false negatives than we would in a smaller search – and we get tempted to believe that the whole thing is very scientific, because it’s mathematical. Alas, it is a case of mathematics wrongly applied. “A little knowledge is a dangerous thing”.
With Big Data we can stage ‘experiments’ in which something will always appear to work – yet, with overwhelming probability, will have done so by dumb luck – and in which the stuff we are really after is hidden behind impressive outliers that are statistical flukes.
The ‘p-Value’ as luck-value
The mathematical handle on our perceptual biases is called p-Value. It can be between zero and 100%, and it expresses the probability that an event might have occurred by dumb luck alone. It’s a ‘luck-value’.
And here’s the thing: the p-Value of you getting the Nobel prize is near zero, because that doesn’t happen by luck alone. The p-Value of you winning the lottery some time is also close to zero – not because, if it happened, it wouldn’t be by luck, but because it is unlikely to happen in the first place. So the p-Value is the probability that an event does happen and happens by luck alone. However, the p-Value of something – anything – as unlikely as a lottery win happening to you is near one. Some event that improbable is almost bound to happen, by random luck alone.
It’s not important here how we calculate it, the point is we can. If, for example, the p-Value of a drug showing very high effectiveness in a trial is high, it means it’s probable that it was effective only by coincidence, only that one time. The main reason would be that we considered too many candidates, so one of them just had to be an outlier in that experiment.
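For the coin example, the calculation is short enough to do exactly. Here is a sketch of both the single-coin p-Value and its correction for having tested 100 coins (the multiplicity adjustment used here, 1-(1-p)^N, assumes the tests are independent):

```python
from math import comb

def p_value_heads(k, n=100):
    """One-sided p-Value: probability a fair coin shows at least k heads in n flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

p_single = p_value_heads(62)
print(p_single)                       # about 0.01: 'significant' for a single coin

# But we looked at 100 coins, so the chance that at least one does this well:
p_best_of_100 = 1 - (1 - p_single) ** 100
print(p_best_of_100)                  # about 0.65: nothing remarkable at all
```

The same 62/100 result goes from a roughly 1% fluke to an unremarkable near-certainty, purely because of how many coins were in the room.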
The fact that the probability of a successfully tested drug being ineffective depends on how many other drugs we have checked is quite counter-intuitive. A completely different example is a lot more intuitive: you don’t trust the advice of a person who comes to you with 100 ideas each day and claims fame when one of them has worked incredibly well. But where’s the difference? There isn’t one. The idea wasn’t ‘good’. It got lucky. One of them had to. But how counter-intuitive does this statement sound?: The fact that I did not expect the meteor is the very reason that witnessing it is not amazing. But by now you will understand what I mean by that, even if your intuition protests against it. It’s the same with Big Data searches: When we find really amazing things we didn’t expect in really large amounts of data, there’s a good chance it’s not significant.
The bigger the haystack, the more likely we are to find really lucky monkeys. Lucky monkeys sell. Just ask your investment adviser. Ask the NSA analyst who is under pressure to produce ‘leads’. Ask a drugs researcher. The temptation to produce lucky monkeys simply by increasing the sample is very real in almost any enterprise.
If I had to name just one problem with Big Data, it would be that we lack an intuition for the p-Value: the probability that something that appears significant may have been expected by chance alone, simply because we considered too many competing possibilities.
p-Value and Big Data errors: Three supermarket managers
Imagine three supermarket managers. Their boss will meet each one of them the next day, and they all feel under pressure to come up with strategies on how to improve sales.
The first, on the night before his meeting, goes into his data-records and finds he has weekly sales numbers for all 1,000 products he stocks, dating back for a full year.
He’s hot at Excel, and with 52,000 data points he’s got some stuff to work with. He works out all the correlations – which products’ sales go up and down together and which don’t – in the hope of placing products in proximity to each other in a clever way. This takes about half an hour on a home PC, and the guy thinks he is in for a real discovery, because the others aren’t so scientifically minded and haven’t got as much data to work with.
He finds that the sales of nappies and beer are incredibly correlated. The correlation is 56%, and he works out that even a correlation as low as 23% is less than 5% likely, so this is incredibly statistically significant to him. In fact, 56% should only happen once in a million times. It is a 4σ-event, incredibly unlikely to occur by chance alone.
Clearly, in this supermarket, there is something about nappies and beer being sold together, so from now on he will place nappies and beer together.
What the guy is not so hot on is realising that there are 499,500 ways of pairing 1,000 products, and there is therefore roughly a 50:50 chance of finding at least one correlation that would only happen once in a million tries. The point is that the store in the next village will have a 56% correlation on something completely different (perhaps brooms and donuts). It is literally telling us nothing about nappies and beer (or brooms and donuts). One pair had to have a correlation that high, or near it, because so many other pairs were considered and didn’t. It’s totally arbitrary which pair comes out on top.
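The arithmetic behind the first manager’s mistake takes only a few lines (the one-in-a-million per-pair probability is the figure quoted in the story):

```python
n_products = 1000
n_pairs = n_products * (n_products - 1) // 2   # unordered pairs of products
print(n_pairs)                                  # 499500

p_pair = 1e-6   # per-pair chance of a 'once in a million' correlation
p_somewhere = 1 - (1 - p_pair) ** n_pairs
print(round(p_somewhere, 2))   # ~0.39: close to even odds of one spurious 'discovery'
```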
Of course his boss isn’t conversant on p-Value, is incredibly impressed, struggling a little to understand the possible connection between nappies and beer, but hey, he remembers his MBA course and has learned not to argue with the significance of a 4σ event. He authorizes the beer-and-nappy promotion, and promotes the store manager to local area manager.
The problem with this store manager is manifold:
- He didn’t design an experiment. Instead, he used data that was there already, collected for an unspecific purpose.
- He expected the data to provide intelligence, but there was no thinking of his own that could have been confirmed or rejected by the data.
- He didn’t work out what would actually constitute statistical significance. The p-Value of finding a correlation of around 56% was way too high (which is a bad thing). If he had known that, he would have (correctly) assigned no special significance to the pair of nappies and beer, and would have realized that a sea of significant correlations may be hidden by this and other outliers.
The second manager thought ahead. He saw the meeting coming a year earlier, and implemented the following idea. He ran weekly promotions, of a different product every week, throughout the whole year. Then he checked in which week the overall store sales were highest. He concluded that the product promoted as a loss-leader in the best-selling week was the one that attracted the most people into the store, and suggested to his manager keeping that product on permanent discount. It happened to be yoghurt, which was on promotion just before Christmas.
This store manager did a few things better:
- He designed an experiment, with the intention to record data that was specific to the question he wanted answered. He only had to collect 52 data points, all relevant to his ‘search’. He knew what he was looking for before gathering the data.
- Unlike the previous manager, who looked for 499,500 possible answers in a data-set of only 52,000 data-points, this guy was looking for the best of 52 out of 52. It’s quite a bit more likely that a real outlier (say two standard deviations from the normal sales numbers) can actually be there for a real reason. He lowered the p-Value of any potential outlier number, and therefore increased the chance of finding something that wasn’t simply there by dumb luck.
However, there are still quite a few problems with his ‘experiment’:
- He has 52 data points in terms of weekly overall store sales, and is looking for one out of 52 possible answers. Most likely, there is still luck involved in which was the best-selling week. The p-Value of his answer still isn’t great.
- He has a sample problem. Before Christmas or Easter people buy a lot more anyway. He ends up seeing a causal connection where there most probably isn’t any.
The second store manager ends up putting yoghurt on permanent promotion, because that is what happened to be on promotion the week before Christmas when sales sky-rocketed. He also decides to put a very high margin on milk, because it was on special discount the week before New Year, and store sales were rubbish, so he concludes that supermarket shoppers don’t care about the price of milk. His manager, knowing that milk and bread are things people watch carefully when deciding where to shop, is not impressed. Also, the numbers are not nearly as impressive as the first store manager’s: Sales in the pre-Christmas yoghurt-week were only 2.5 standard deviations above normal. And because a 2.5σ event is a far cry from the incredibly surprising 4σ of a 56% correlation between nappies and beer, the boss feels the second guy is fiddling with 52 numbers and trying to make a big story out of it. The store manager leaves the meeting somewhat deflated. How can his boss not appreciate the foresight of designing an experiment one year earlier and then seeing it through faithfully for the whole year?
The third store manager has a panic-attack three months before the meeting. He realises that he hasn’t collected any data at all. He has 12 weeks to make up for it. So he designs a fiendishly clever experiment: Every week he promotes a different product as a loss-leader. Half the time (six weeks in total) he places whatever product that happens to be in its usual position, and half the time he confuses people by placing the promotional product in a corner far back in the store, so people have to look for it. He has one question, and one question only: “Do promotions work better when the promoted product is placed where people expect it, or when they have to search for it?” He has a hunch that the searching ‘buys’ him more in-store-time of shoppers, who then end up browsing the store for longer and buying things they hadn’t planned on buying.
He finds that the average of the sales figures in the weeks when the featured, discounted product is placed at a new location differs from the average of the sales figures when it is normally placed by an amount that is 1.8 times larger than the standard deviation of the sales figures during the 12 weeks. Because we are effectively looking at the averages of six different experiments, this is highly significant: it can happen by random chance only 0.1% of the time (once in 1000 trials). That’s a very convincing p-Value. What he has done is to ask a very specific question from very small data (six weeks one way, six weeks the other), and he got a statistically very significant answer. More importantly, he got an answer to a precise hypothesis to which the only possible answer was yes or no. This guy is a true statistician.
- He collected data highly specific to the question. He couldn’t do otherwise. He didn’t have any data, and not much time left.
- He asked a precise yes/no question, testing the hypothesis that random, prominent placement of the featured product would enhance overall store sales.
- He eliminated sample bias to an extent by mixing up products: six weeks were constructed one way and six weeks the other, each week with a different featured product, so the influence that any specific choice of promotional product might have had is reduced.
- He only ran one test, so there was no crowd of failed searches to inflate his luck factor.
There were still shortcomings:
- He had to admit he got lucky. He only had time for one experiment, to test one hypothesis. True, it was one he strongly believed in, but if the numbers had turned out differently, he would have had nothing to show.
- He could not explain to his boss that a 1.8 standard deviations’ difference in sales is highly significant, or that what he was running really amounted to six independent experiments. He enraged his boss by ‘getting all pseudo-scientific and uptight about twelve numbers’. Worse, his boss confused 1.8 standard deviations with a 1.8σ event, concluding that the third guy’s findings were the least statistically significant of all. In fact, there is a 7% chance of a 1.8σ event occurring in a single experiment, but if the average of 6 experiments is 1.8σ, that chance is only 0.1%.
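The arithmetic the store manager failed to get across can be sketched in a few lines. I am reading his twelve weeks as a two-sample comparison, six weeks against six (an assumption about the setup, but one consistent with the numbers in the story):

```python
import math

def two_sided_p(z):
    # Probability that a standard normal variable lands more than z
    # standard deviations from the mean, in either direction.
    return math.erfc(z / math.sqrt(2))

# A single observation 1.8 sigma out: roughly a 7% luck factor.
p_single = two_sided_p(1.8)

# Six weeks against six weeks: the standard error of the difference of
# the two six-week averages is sigma * sqrt(1/6 + 1/6), so a 1.8-sigma
# gap between the averages is a much rarer event.
z_avg = 1.8 / math.sqrt(1 / 6 + 1 / 6)   # about 3.12
p_avg = two_sided_p(z_avg) / 2           # one-tailed: about 0.1%

print(f"single 1.8-sigma event: {p_single:.1%}")
print(f"1.8-sigma gap between six-week averages: {p_avg:.2%}")
```

The boss was comparing the first number when he should have been looking at the second.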
His boss thought the guy was utterly uncommercial and promoted him into an office job, collecting invoices for the accounts department, where his obvious lack of common sense could do no further damage.
Proper statisticians are more obsessed with proper design of experiments than with almost anything else, and there is a reason for that. They are on a spear-fishing expedition, and even if not every spear strike will produce fish, you don’t want to strike out randomly.
Part 2: The Rules of How to avoid Big Data Problems
- First ask the question, then collect the data
Let’s look at how to use Big Data effectively. For example: assume we want to know the efficacy of a certain drug. Is that really the whole question? Most unlikely. We want to know the efficacy of a certain drug against a certain ailment, compared to alternative forms of treatment, other drugs, and even just a placebo. Most likely we also want to know the side effects. We want to know if the world is a better place with the drug than without it, and whether the difference is meaningful. This means that we have a lot of variables that we need to ‘control’ in our experiment so that they don’t cause a sample bias. If you haven’t got the data on the ‘control variables’ which might influence the outcomes independently of the effectiveness of the drug, you will draw wrong conclusions, licensing ineffective drugs and banning effective ones.
The classic way to go wrong in collecting data is, of course, anything that allows sample bias. The Literary Digest in 1936 tried to predict the outcome of the presidential election by polling 10 million people. This huge sample predicted that Landon would beat Roosevelt with a 55% to 41% majority. The election turned out 61% to 37% in Roosevelt’s favour. When it comes to data, size isn’t everything. The editors were clever enough to know that they should not just poll their own readers (who might be relatively well-to-do, biasing the sample towards the Republicans). So they went to great lengths to draft their sample of 10 million from automobile registrations and the phone book. Alas, in 1936 that was still a sample heavily biased towards either the better-off part or the rural part of the population, both of them more likely to vote Republican than the average American.
Other examples are ‘studies’ which claim that race influences life expectancy, income, or fertility rates. Others claim that wealth, religion, location, education, diet, or some other factor is to blame. The only hope of getting a meaningful result is to consider as many factors as necessary, and as few as possible. And that choice, namely which factors to include and exclude, fundamentally remains a human judgment we have to make before we design the experiment. Computers don’t dream, and they don’t have judgment. The type of data you collect and feed into the analysis matters a great deal.
The act of recycling data for different uses is the original sin of statistics. It is prone to make you selective about which data you include or exclude, without you even noticing you are doing it. In this way you can prove almost anything.
Big Data will not eliminate the experimenter’s bias. In fact, it is prone to confirm it. The way to avoid this is to first decide which data you need and then collect it.
When someone has used recycled data, there is no way that we can distinguish after the fact whether the experimenter’s bias has crept into the experiment, simply because we cannot see all the ways in which the experimenter chose not to look at the data. And the larger the data-set is, the more of it you end up throwing away, introducing bias before you even run the test.
This has always been true, and remains true in the age of Big Data. You have to ask a specific question (test a specific hypothesis), and you get an answer which tells you in some probabilistic sense whether your hypothesis might be true or not. That’s all you will ever get. It starts with the specific question, and you collect the data only once you know the question.
- Don’t look for the Big Truth. Look for many small but specific truths
The problem of looking for the big truth comes in two guises: People who look for the big truth they don’t yet know, and people who look for the big truth they think they already know. Let’s start with the latter species.
From a fragment by the ancient Greek poet Archilochus, we have the proverb “the fox knows many things, but the hedgehog knows one big thing”. “The Hedgehog and the Fox” is one of the most popular essays by the philosopher Isaiah Berlin. Berlin himself said “I never meant it seriously. I meant it as a kind of enjoyable intellectual game, but it was taken seriously. Every classification throws light on something”.
Berlin classifies writers and thinkers into two categories: hedgehogs who view the world through the lens of a single defining idea (Plato, Lucretius, Dante, Pascal, Hegel, Dostoevsky, Nietzsche, Proust, and Braudel) and foxes who draw on a wide variety of experiences for whom the world cannot be boiled down to a single idea (Herodotus, Aristotle, Erasmus, Shakespeare, Moliere, Goethe, Pushkin, Balzac, Joyce).
Philip E. Tetlock, a political psychology professor in the Haas Business School at UC Berkeley, draws heavily on this distinction in his exploration of the accuracy of experts and forecasters in various fields (especially politics) in his 2005 book Expert Political Judgment: How Good Is It? How Can We Know? He observed that hedgehogs are more consistently wrong. Simply put, the accuracy of your forecast depends on your cognitive style.
Nate Silver explains it in detail in his highly commendable, recent book, The Signal and the Noise, but in essence, hedgehogs are hungry for data to confirm what they are saying, while foxes are hungry for data to find out whether their expectations need to be adjusted.
Tetlock’s conclusions then seem trivial. The quality of your forecasts depends on your willingness to have them updated by new data coming in.
The journalist who says that “following our campaign, the government has changed its course” assumes a connection without knowing whether the government might have changed course anyway. And the government, for its part, will ascribe any positive statistic to its own actions. You have to be a hedgehog to get ahead in those jobs, but it’s bad news for your judgment. You have to stick out, stick your neck out, make bold statements and predictions, and stick to them. You have to ignore new data coming in, unless it confirms what you were saying.
Hedgehogs will go as far back in history as is necessary to find an example to prove their point, even if circumstances were totally different then. They hold firm beliefs that will not be swayed. Their grand idea defines them, and Big Data will deliver the examples that ‘prove’ to them that they are right. The fact that one coin out of 100 came up heads 62 times will be held up as ‘proof’ that the game is rigged, even though that particular outcome is entirely consistent with the assumption that all the coins were fair.
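The arithmetic behind that coin is worth making concrete. A short sketch, using an exact binomial tail rather than an approximation:

```python
from math import comb

def p_at_least(heads, flips=100):
    # Exact probability that a fair coin shows at least `heads` heads
    # in `flips` flips.
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

p_one_coin = p_at_least(62)                  # about 1% for one given fair coin
p_any_of_100 = 1 - (1 - p_one_coin) ** 100   # well over half across 100 fair coins

print(f"one given coin reaches 62 heads: {p_one_coin:.1%}")
print(f"at least one of 100 coins does:  {p_any_of_100:.0%}")
```

For any single, pre-chosen coin, 62 heads would indeed be suspicious. Across 100 fair coins, it is more likely than not that at least one of them does it.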
Foxes go about the concept of Big Data in a different way: they test what they believe to be true, and then adapt their expectations after the test. Foxes can play out in their minds arguments that would otherwise require a dozen hedgehogs in a room. They question their beliefs, and in questioning them with the help of data, one specific belief at a time, they learn.
Yet if foxes look for many small things to learn from, they must also be careful to fish with a spear while doing so: To look for specific truths. The store manager who learned most was the one who asked a specific question which he already had an educated hunch about. He had a belief, and designed an experiment because he was prepared to have his belief tested by the data. Hedgehogs don’t design experiments. They look through existing data to find the piece that confirms their bias.
Now let’s look at the second problem with Big Truths: When we look for the big truth we don’t already know. The first store manager was neither a fox nor a hedgehog. He didn’t have any belief to test. When you don’t even know what you are looking for, you have an even more fundamental problem. You should never look for unknown unknowns.
So the next trick we need to always have at hand when trying to avoid Big Data problems is: write down what you already know before you search. Be specific about what you are testing.
The first supermarket manager could have tested ten specific correlations that he had a strong ‘hunch’ about. If even just a fraction of those had turned out to be statistically significant, his intuition would likely have been onto something. But because he put no intuition into his search to test against, he learned nothing useful. In fact, many significant correlations may have been truly meaningful (jam and butter, cakes and whipped cream, plates and cutlery). Perhaps they are correlated at 30% or so, something which happens less than once in 1,000 years by chance alone, but he can no longer identify them as significant, because they were drowned out by all the flukes he found, including the random, meaningless, but much stronger beer-and-nappy signal. He wasn’t looking for anything specific. All it took for this manager to miss the chance of learning something was that he didn’t sit down, think ahead about what the outcome might be, and then test it. He had no idea what he expected to find, and searched for the unknown unknowns.
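How easily do 30% flukes appear when you scan everything against everything? A small simulation with made-up, deliberately independent sales data (not the store’s actual numbers) makes the point:

```python
import random

random.seed(1)

def pearson(x, y):
    # Plain Pearson correlation coefficient of two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# 100 products whose weekly sales are, by construction, completely
# independent of each other: 52 weeks of pure noise per product.
products = [[random.gauss(0, 1) for _ in range(52)] for _ in range(100)]

# Count the product pairs that nevertheless correlate at 30% or more.
flukes = sum(1
             for i in range(100)
             for j in range(i + 1, 100)
             if pearson(products[i], products[j]) >= 0.30)

print(f"{flukes} fluke correlations of 30%+ among 4,950 pairs of products")
```

Dozens of pairs pass the 30% bar by luck alone, so a genuine 30% correlation sitting among them is indistinguishable from the noise.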
So our toolbox has expanded: (1) ask a small, specific question, and (2) then design an experiment and collect the data.
It’s instructive in this context to remember one of the conclusions of the 9/11 Commission report, which identified four types of systemic failures that contributed to our inability to appreciate the importance of the signals. The failures included failures of policy, capabilities, and management. The most important category, however, was failure of imagination. The 9/11 plot was less a hypothesis that we evaluated and rejected as unlikely than one that we had failed to consider in the first place. Big Data was little help where the imagination failed. Computers don’t dream.
- Even if you can’t calculate it, at least think in terms of p-Value
As I said earlier, think of the p-Value as the luck factor. It tells you how likely some outcome was to occur by chance alone.
And here is the trick: the probability that you find something, anything, that looks really unlikely in a very, very large set of data is very high. When you find it in a really large sample of candidate answers, it is no longer statistically significant.
It’s unlikely that a mammogram will give a wrong reading. And yet mammograms are a very distressing real-life example of what happens when people can’t be taught to think in terms of p-Value. People don’t internalise the fact that the overwhelming majority of women who go for the test are healthy, and that therefore even a small error rate in the device will still produce more false positives than true positives, simply because we test so many healthy women. Even some doctors don’t understand this, so how do the affected patients stand a chance? Are we ready to police the practitioners of Big Data?
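A back-of-the-envelope sketch makes the base-rate effect visible. The numbers below are illustrative assumptions chosen for roundness, not clinical figures:

```python
# Illustrative numbers only, chosen for roundness, not clinical accuracy:
women = 1000
prevalence = 0.01           # 1% of those tested actually have the disease
sensitivity = 0.90          # the test catches 90% of the real cases
false_positive_rate = 0.09  # 9% of healthy women get a wrong positive

sick = women * prevalence                                # 10 women
true_positives = sick * sensitivity                      # 9 women
false_positives = (women - sick) * false_positive_rate   # about 89 women

ppv = true_positives / (true_positives + false_positives)
print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"chance that a positive result is real: {ppv:.0%}")
```

Even with a test that is right about nine times out of ten, roughly nine out of every ten positive results belong to healthy women, simply because the healthy so vastly outnumber the sick.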
Your investment advisor may well deceive you, unknowingly, into ignoring p-Values, by picking the fund that was best in the past and ignoring the question of how likely it is that this fund got there by luck rather than skill. Yet in fund management the p-Value happens to be quite simple mathematically. I offer a tenner to every audience member who knows an investment advisor who actually calculates the answer. Most don’t even ask themselves the question.
Mathematically, the problem the investment advisor faces (and perhaps the problem you face as a risk professional, deciding how to allocate capital to different stocks, strategies, or business lines) is called the multi-armed bandit problem. The way it is traditionally set up, you imagine playing 20 slot machines simultaneously in a casino, trying to work out which one gives you the best ‘value-for-money’ (wins, or time-on-device) and then harvesting that value-for-money. All you can do is test them with real money. But how will you ever know which one is best? It is a trade-off between imperfect exploration of what you don’t know about the probabilities of their pay-offs, and exploitation of what you have, again imperfectly, learned about these probabilities through observation. You will know their pay-off probabilities better and better over time, but never perfectly. When we choose investment managers, we are really the players in a multi-armed bandit problem, and we must therefore ask what we do and do not know when all we know is past performance.
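The explore-exploit trade-off can be sketched with one of the simplest bandit heuristics, epsilon-greedy (a standard textbook approach, not something the essay prescribes; all pay-off numbers below are invented):

```python
import random

random.seed(7)

N_ARMS = 20
true_p = [random.uniform(0.02, 0.10) for _ in range(N_ARMS)]  # hidden from the player
wins = [0] * N_ARMS
pulls = [0] * N_ARMS

def play(arm):
    pulls[arm] += 1
    if random.random() < true_p[arm]:
        wins[arm] += 1

# Epsilon-greedy: mostly exploit the machine that currently looks best,
# but keep exploring the others some of the time.
EPSILON = 0.1
for arm in range(N_ARMS):   # one pull each, so every estimate exists
    play(arm)
for _ in range(5000):
    if random.random() < EPSILON:
        arm = random.randrange(N_ARMS)                              # explore
    else:
        arm = max(range(N_ARMS), key=lambda a: wins[a] / pulls[a])  # exploit
    play(arm)

best_guess = max(range(N_ARMS), key=lambda a: wins[a] / pulls[a])
print(f"machine we now believe is best: #{best_guess}, "
      f"estimated {wins[best_guess] / pulls[best_guess]:.1%}, "
      f"true {true_p[best_guess]:.1%}")
```

Even after thousands of pulls, the estimate and the true pay-off probability need not coincide, which is exactly the predicament of anyone choosing fund managers on past performance.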
What is worse, big mutual fund companies have hundreds of funds you have never heard of. They only show you the ones that performed well in the past, and they even merge bad funds into good ones to erase the track record of the former, creating a deliberate sample bias in their performance statistics.
Here is a classic example of how you can innocently set precisely the wrong incentives for your staff: you are the chief investment officer of a large insurance company. You have many analysts who design and experiment with investment strategies. Every time you see a proposal from one of these analysts, you naturally ask how that strategy would have performed in the past. Your staff, anticipating that, never show you the strategies that won’t pass your back-test hurdle. But their best chance to impress you is to plough through a lot of candidate strategies until they find one that did really well in the past. Do you notice you have just turned them into the equivalent of a drug company testing 1,000,000 ideas, or a coin conman flipping 10,000 coins trying to sell you the best one? If you want to reduce the luck factor, you have to ask each analyst not just how well their strategy performed in the past, but how many strategies they actually tested before showing you a proposal. If one analyst spends most of his time thinking about what might work and only had time to test five of his ideas, while another tested thousands by computer, then comparing their strategies by past performance is no longer an apples-to-apples comparison. The lesson: if you can think of a way to force your analysts to reveal all their negative results so you can count them, implement it.
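The effect is easy to demonstrate. Here is a sketch in which 1,000 strategies with zero skill are back-tested and only the best is shown to the boss (all numbers are invented for the illustration):

```python
import random

random.seed(42)

DAYS = 252          # one trading year of daily up/down calls
N_STRATEGIES = 1000

def win_rate(days):
    # A strategy with zero skill: every daily call is a coin flip.
    return sum(random.random() < 0.5 for _ in range(days)) / days

# In-sample: back-test 1,000 skill-free strategies, keep only the best.
best_backtest = max(win_rate(DAYS) for _ in range(N_STRATEGIES))

# Out-of-sample: the "winning" strategy faces fresh data, and its
# edge evaporates, because there never was one.
next_year = win_rate(DAYS)

print(f"best back-tested win rate: {best_backtest:.1%}")
print(f"same strategy next year:   {next_year:.1%}")
```

The best of a thousand coin-flippers reliably shows a back-tested edge of several percentage points; on fresh data it drifts straight back to fifty-fifty.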
This is similar to another very clever way drug companies can game the system: Instead of simulating drugs by computer, they can also look through millions of actual patients’ records, searching for what appears to have worked in real life. When they do that, they are looking for the unknown unknowns. They are very likely to come up with a lot of ‘type I’ errors. Worse, the data they thus produce will make them pass the licensing tests, because the licensing tests don’t ask how many patients’ data correlations you have discarded before you found one that ‘worked’. The licensing authorities are letting people buy the coin that performed best once, which is not the same as the best coin.
When the NSA looks for terrorists by checking all phone records of all people, tens of thousands of innocent people will show more suspicious communication patterns than the real terrorists. The number of real terrorists hasn’t changed just because we cast our net wider. If instead they collected a more promising, much smaller haystack before even starting the data analysis, they would find far fewer false positives and, more importantly, would be more likely to avoid the far more deadly false negatives. Besides, it has the advantage of keeping the population far happier with regard to their right to privacy.
So our toolbox has expanded again: (1) ask a small, specific question; (2) then design an experiment and collect the data; (3) evaluate your hypothesis; (4) if your hypothesis is rejected, repeat steps (1-3) with a new hypothesis; (5) be honest about how many hypotheses you have tested. The more often you had to repeat the search with new hypotheses, the lower your confidence should be in anything that you eventually find.
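Step (5) can even be made quantitative. A sketch of why honesty about the number of hypotheses matters, using the conventional 5% significance threshold:

```python
# If each individual test uses a 5% significance threshold, the chance
# that at least one of m tests comes up 'significant' by pure luck
# grows quickly with m:
for m in (1, 5, 20, 100):
    p_any_fluke = 1 - (1 - 0.05) ** m
    print(f"{m:>3} hypotheses tested -> {p_any_fluke:.0%} chance of a fluke")

# A simple Bonferroni-style repair: demand p < 0.05 / m from each of
# the m tests, which pulls the overall luck factor back to about 5%.
m = 20
overall = 1 - (1 - 0.05 / m) ** m
print(f"with the adjusted threshold for m = 20: {overall:.1%}")
```

Twenty hypotheses at the usual threshold already give you roughly a two-in-three chance of at least one fluke, which is why the count of tested hypotheses must travel with the result.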
These rules aren’t new. And if you are the data analyst, that’s all you need to do to stay safe, even in the age of Big Data.
- Beware of people’s incentives
But what if you are not the analyst? Often, you are in the role of the consumer, the boss, the patient, the reader, or the citizen. We may buy drugs that appeared to work but worked by chance, our government may end up suspecting the wrong people based on phone records and then pursue more intrusive intelligence gathering on everyone, we bet on past winners in the mutual fund markets, and we promote people who found amazing things because they were simply data-mining.
Recognise that people are under pressure to produce results. Always ask how many things they have tested, how many coins they have flipped before finding a good one. People’s promotion, status within their organisation, and the funding of their department depends on finding something every once in a while.
We never ask drug companies how many compounds they tested, and how many results they never published, before finding one that ‘worked’ and got published and approved. If you work at the NSA, your boss is unlikely to ask why you were looking through terabytes of data once you have found something that seems to point to a connection. The funding of his agency depends on finding something promising, and as far as he is concerned, more data, especially if it is cheap to collect, has to be better than less data.
We must learn to be openly wary of people who don’t follow the five steps of our toolbox. But it’s hard. People don’t willingly present all the failed things they checked. People don’t have the time or money to collect data specifically for each question they may want to check. And people may genuinely believe that ‘the answer is out there’ when they don’t even know the question.
Chances are very high that a boss will never ask how many other hypotheses were tested before one that works was found. Chances are that no-one will deny their loved ones the medicine that worked in a published study on real people, no matter how many studies never made the cut. The age of Big Data is the age of many promising opportunities, but we must deal with this issue and avoid the trap of letting our own intuitions make us fall for badly-mined data.
To fight our impulses, and to combat the cheats, we must get into the habit of asking about the thousands or millions of things the data miners could have found but didn’t, and why they were using the particular data that they used.
We must make it understood that the temptation to design experiments badly or not at all is bigger with Bigger Data. It’s the original sin of statistics, and yet it seems inconspicuous most of the time.
And we must remember that the rules of statistics haven’t changed. The five steps in our toolbox aren’t new, but the number of ways in which we can violate the rules has multiplied incredibly.
In an age when we can let the computer search so much more than ever before, we need to remember what I said in the very beginning:
- the more we search, the more likely our positives are false;
- the more we search, the more compelling the false positives;
- the more we search, the more the true positives are hidden by the loud noise of the compelling false ones;
- human incentives reinforce the first three problems.
As a society, we already seem quite good at ignoring this, and as humans we lack some of the intuition that should make us naturally suspicious of these problems. This is why I believe Big Data will be misused most of the time.
Big Data can help solve anything. Equally, it can prevent anything from being solved. We have to apply the same old rules of data analysis even more rigorously these days.