"Everything you know is wrong" - and I've got proof!
In an article worthy of Firesign Theatre, Noah "Rain-On-Your-Parade" Smith, writing for Bloomberg View, informs us that "Bad Data Can Make Us Smarter" (October 23, 2014). Excellent! I consider myself invited to do my best to do my worst in my future scholarly publications. But why is this true?
The number of published papers has exploded over the past century, but the statistical techniques used to judge the significance of a finding haven't evolved very much. The standard test of a scientific hypothesis is the so-called t-test. A t-test will give you a p-value, which is supposed to be the percent chance that the finding was random. So if you run a test and get a p-value of 0.04, many people will take that to mean that there is only a 4 percent chance that the finding was a fluke. Because 4 percent sounds like a low-ish number, most researchers would call such a finding "statistically significant."

But "most researchers" would be wrong! Har-har-har!
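To make the quoted definition concrete, here is a minimal sketch of one hypothesis test, using only the Python standard library. A permutation test stands in for the t-test (the underlying question is the same: how often would chance alone produce a gap between groups at least this large?), and the data are invented for illustration. Both groups are drawn from the same distribution, so whatever p-value comes out, the "finding" is a fluke by construction.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: two groups drawn from the SAME distribution,
# so any difference in their means is pure noise.
group_a = [random.gauss(0, 1) for _ in range(30)]
group_b = [random.gauss(0, 1) for _ in range(30)]

observed = abs(statistics.mean(group_a) - statistics.mean(group_b))

# Permutation test: repeatedly shuffle the pooled data into two
# random groups and count how often a random split produces a gap
# at least as large as the one we observed.
pooled = group_a + group_b
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(statistics.mean(pooled[:30]) - statistics.mean(pooled[30:]))
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(p_value)
```

If the printed p-value happens to land under 0.05, the quoted convention would call this "statistically significant" even though we built the data to contain no real effect at all, which is exactly the trap the article is describing.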
Now, if there were only one scientific test of one hypothesis in all of human history, a p-value of 0.04 might be just as interesting as it looks. But in the real world, there are many many thousands of published p-values, meaning that a substantial number must be flukes. Worse, since only the tests with significant p-values tend to get published, there's a huge selection bias at work - for every significant p-value you see in a paper, there were a bunch of tests that didn't yield an interesting-looking p-value, and hence weren't able to make it into published papers in the first place! This is known as publication bias. It means the publication system selects for false positives.

Uh-oh . . . I've been regularly publishing a couple of times a year for over ten years.
But it gets worse. Because the set of tests that researchers run isn't fixed - since researchers need to publish papers - they will keep running tests until they get some that look significant.

Gee . . . I don't even run tests.
Suppose I ran 1,000 tests on 1,000 different totally wrong hypotheses. With computers, this is easy to do. Statistically, maybe about 50 of these will look significant with the traditional cutoff of 5 percent. I'll be able to publish the 50 false positives, but not the 950 correct negative results!

As I noted above, I don't even run tests. Is that better or worse?
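Smith's arithmetic is easy to check with a quick simulation. When a hypothesis is totally wrong (the null is true), its p-value is uniformly distributed between 0 and 1, so a sketch in plain Python can model 1,000 such tests just by drawing 1,000 uniform p-values and counting how many sneak under the 0.05 bar:

```python
import random

random.seed(42)

# Under a true null hypothesis, the p-value of a valid test is
# uniformly distributed on [0, 1]. So 1,000 tests of 1,000 totally
# wrong hypotheses can be modeled as 1,000 uniform draws.
p_values = [random.random() for _ in range(1000)]

# The traditional cutoff: anything below 0.05 "looks significant."
false_positives = [p for p in p_values if p < 0.05]

print(len(false_positives))  # about 50 of the 1,000, give or take
```

Roughly 5 percent of the draws clear the cutoff, matching the "about 50" in the quote - and under publication bias, those ~50 are the only results anyone would ever see.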
This is data-mining, and there's essentially no way to measure how much of it is really being done, for the very reason that researchers don't report most of their negative results. It isn't an ethics question - most researchers probably don't even realize that they're doing this. After all, it's a very intuitive thing to do - look around until you see something interesting, and report the interesting thing as soon as you see it.

You see? I told you: "Everything you know is wrong" - and I've got proof!