Wednesday, August 04, 2010

and every man a liar

kw: book reviews, nonfiction, experts, polemics

On occasion my grandmother said, "Everybody's pixilated except the two of us; but sometimes I worry about you." For the X, Y and Z generations: "pixilated" is related to "pixie", not "pixel", and means "somewhat removed from reality". Even the Bible tells us, "Let God be true, and every man a liar," hence my title. Why? I've just read a book that presents a thesis very much in accord with my beliefs, Wrong: Why Experts Keep Failing Us—And How to Know When Not to Trust Them by David H Freedman. Since I agree with the author so much, is it safe for me to trust him? We'll get to that.

Here's a good example of a phenomenon that is part of the problem. Mark Van Stone, a Mayan archaeologist, has written 2012: Science and Prophecy of the Ancient Maya. In the face of literally millions of predictions of disaster, he demonstrates that December 21, 2012 is about as important as January 1, 2000 was to those of us in the Western world: A significant New Year to celebrate, but not a whole lot different from December 31, 1999. He's having a hard time finding bookstores to carry the book, and venues to lecture about it. People may seek out experts' help when they don't know what to think, but they are pretty careful to select experts whose pronouncements confirm what they already believe or wish to be true. Disaster sells. "Tomorrow will be like today" doesn't sell. But Dr. Van Stone is much more likely to be right than the millions of doom-and-gloomers.

In Wrong, the author first establishes that the "expertise" of informal experts (such as some actor telling you what brand of mouthwash he prefers) is quite suspect, and most people understand this. The standard of established scientists is higher, or so we hope, but how much higher? This is the theme of most of the book. There are several kinds of studies, which are accorded different levels of trust:
  • Observational: data-gathering. Interesting, not trusted much.
  • Epidemiological: case studies, hopefully of many similar cases. More trustworthy, and the larger the better.
  • Meta-analysis: review of many studies. Considered quite trustworthy.
  • Randomized controlled trial (RCT): The gold standard, particularly if large.
Quick question: suppose one wants to study how effective anabolic steroids are at enhancing athletic performance. Is an RCT even possible? Or do you do an animal study, hoping that rats and humans are similar enough?

In fact, this and many, many other well-understood phenomena are known only from case studies. The athletes and coaches of the world have done their own informal epidemiological study of steroids and concluded they have a huge effect. We all know what has followed: laws, scandals, asterisks in the record books, and a number of untimely deaths.

Since the RCT enjoys ultimate trust, is it worthy of it? It can be. But we always have to look beneath the covers. Consider this: the results of most RCTs are reported statistically, with a statement that the conclusion "is significant at the 95% confidence level." The dark side of 95% confidence is that other 5%. At the very least, it means that statistical flukes could invalidate one-twentieth of the RCTs that have been published. But there's more.

Publication Bias is the tendency of "positive" results to be published, and the tendency of "negative" results to be either trashed by the researcher (or at least filed away) or rejected by the journal to which they are submitted. Freedman quotes researchers who estimate that at most one in ten "negative" experiments gets published. To be clear, a "negative" result means that a researcher started out with an idea, but the statistical analysis of the experiment indicates that the idea is false. Thus, the result means, "Nothing new, folks. As you were." This is not exciting, so why publish it? The researcher will only try to publish it if it contradicts something already published, particularly if she and the other author aren't friends.

So, let's consider an ideal case: twenty research teams each gather data from 100 subjects, winding up with 100 numbers apiece. Suppose the numbers are the durations of episodes of the common cold. A medication is being tested, to see whether the duration differs from the "normal" seven days. The teams apply standard statistical tests, and 19 of them can state, "The mean [average] is not significantly different from seven days." The twentieth team finds not only that its average value is six days, but that the "95% confidence interval" for cold duration in its test subjects is 5.3 to 6.8 days. Thus they conclude, incorrectly, that the medication is effective. They publish.

Of the other 19 teams, two manage to get their findings published (in my experience, it is much more likely that none of the 19 negative findings will see print). The result is that the published record contains one positive finding and two negative findings: one-third of the published record is in error!
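To see how easily one team out of twenty can stumble into a "significant" result when the medication does nothing at all, here is a minimal Python sketch of that scenario. The 2-day standard deviation for cold durations, the sample size per team, and the random seed are my own illustrative choices, not figures from the book.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

TRUE_MEAN = 7.0   # the medication does nothing: colds really do last 7 days on average
SD = 2.0          # assumed spread of cold durations, in days (my guess)
N_TEAMS = 20
N_SUBJECTS = 100  # subjects per team

false_positives = 0
for _ in range(N_TEAMS):
    durations = rng.normal(TRUE_MEAN, SD, N_SUBJECTS)
    t_stat, p_value = stats.ttest_1samp(durations, TRUE_MEAN)
    if p_value < 0.05:        # "significant at the 95% confidence level"
        false_positives += 1

print(f"{false_positives} of {N_TEAMS} teams found a 'significant' effect by pure chance")
```

Run it repeatedly and you will usually see zero, one, or two false positives; if only the "positives" get published, the printed record looks very different from what actually happened.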

I spent fourteen years in academia. The pressure to publish is terrific, and you'd better publish something "interesting" and "significant". Tenure is typically based on this. Before attending graduate school, I worked as a machinist at Cal Tech. I happened upon a cabinet full of research documents recording work done on a synchrotron during the 1960s. Nearly all of it reported negative results: they had tried to find "particle X", say, and failed. Fortunately for graduate students everywhere, proving yourself wrong is OK at this level, and numerous PhDs were conferred anyway. But very few of those results were ever submitted for publication.

As a fun exercise, I generated 10,000 "random normal" numbers and grouped them by hundreds. Each number was the sum of twelve RAND() values, which is a very good proxy for a normally distributed variable with mean 6 and standard deviation 1 (subtract 6 and you have an approximately standard normal variable). I applied statistical tests, and the one we'll focus on here is the t-test, to determine whether each group's mean (numerical average) is significantly different from the expected value of 6.00, using 1.96, the critical value for 95% confidence, as the cutoff.
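For anyone who would rather script this than build it in a spreadsheet, here is a rough Python equivalent of the exercise as described; the sum-of-twelve-uniforms trick, the 100 sets of 100, and the 1.96 cutoff follow the text above, while the library and variable names are my own.

```python
import numpy as np

rng = np.random.default_rng()

# Each "observation" is the sum of twelve uniform(0,1) draws,
# giving mean 6, standard deviation 1, and a nearly normal shape.
data = rng.random((10_000, 12)).sum(axis=1)

# Group the 10,000 values into 100 sets of 100, as in the spreadsheet.
sets = data.reshape(100, 100)

means = sets.mean(axis=1)
sems = sets.std(axis=1, ddof=1) / np.sqrt(100)   # standard error of each mean

# Test each set's mean against the expected 6.00, using 1.96 as the cutoff.
t_values = (means - 6.0) / sems
passed = np.abs(t_values) < 1.96

print(f"{passed.sum()} of 100 sets pass the test; {100 - passed.sum()} fail")
```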

95 of the 100 sets passed the test, and five failed it. I was suspicious, so I ran the spreadsheet's "recalculate" a few times to check. Mostly 95 passed, but in one case 98 did, and in another 93. This is not unusual. I picked one run in which 95 sets passed the t-test to analyze a little further, grouping its 100 sets into five groups of twenty. Two groups had 19 passes and one failure each; one had 17 passes; in the other two, all 20 passed. I selected four of the sets of 100 values to plot on probability coordinates, as shown here:


Var028 has the highest mean value, and is one of the five "failures". Var064 has the lowest mean value, and is also a "failure". In the world of "significantly different is good", they would be "positive" results. Var021 has a mean value (5.9971) closest to the mean for the total population of 10,000 (6.0015). Var078 has the greatest scatter (SD = 1.165; 1.0 is expected). Note the apparent outlier in Var078, down and to the left. Many researchers would throw this out, not including it in the statistical tests. Yet this is a valid member of the original data set from which all these variables were drawn. It happens to be the lowest of the 10,000.

Removing that seeming outlier changes the mean by 0.0372 and the SD by 0.055. This would be enough to make a "positive" into a "negative", except that Var078 was negative already, just barely.
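I don't have Var078's actual values here, so the following is only a generic sketch of the same effect: drop the single lowest value from a stand-in set of 100 such numbers and watch the mean and SD shift by a few hundredths, the same order as the figures above.

```python
import numpy as np

rng = np.random.default_rng()

# One stand-in set of 100 "sum of twelve RAND()" values (not the real Var078).
var = rng.random((100, 12)).sum(axis=1)

trimmed = np.delete(var, var.argmin())   # discard the apparent low outlier

print(f"mean shift: {abs(var.mean() - trimmed.mean()):.4f}")
print(f"SD shift:   {abs(var.std(ddof=1) - trimmed.std(ddof=1)):.4f}")
```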

OK, so even RCTs need to be taken with a grain of salt. Add to this that not everyone is totally honest. In fact, the system seems to be primed to reward dishonesty, so, as in politics, at least some of the scum rises to the top.

What is a fellow to do? The last chapter is titled "Eleven Simple Never-Fail Rules for Not Being Misled by Experts". As if it were that simple. The take-away message is, be suspicious of simple answers to complex questions, look for experts who have nothing to gain (this can be devilish hard), and give yourself time to think before making decisions.

Although there have been four major instances in my life where a doctor was wrong, and in one case I had to save my own life, I still go to doctors. I go not as a "patient" but as a customer paying a consultant to render advice, which I weigh carefully, and sometimes to do what I can't (self-surgery is not advised). When I buy shares of a mutual fund, I am trusting an expert, the fund manager, to make better stock-purchase decisions than I would (or have the time for). Even when we select a grocery store to patronize, we're trusting the store's purchasing agents to get quality goods at fair prices from producers (though we do go to farmers' markets when we can).

Experts are like parents. Some are good, some bad. We learn as we grow up that our parents are not perfect, though most of us (sadly, not all) learn that they are at least well-intentioned. The author of Wrong has added four Appendices, and the fourth is an essay on the factors that might make the book worthwhile. Considering that "…and every man a liar" is an exaggeration, one can hope that David Freedman has at least made a valiant attempt to present the truth about expertise. He had to rely on experts for much of his own material! So this book is a meta-analysis, and is likely to be slightly more reliable than the individual analyses. Finally, consider these four pronouncements (page 39) from medical studies, answering the question, "Can vitamin D help fend off cancer?":
  • No, said a 1999 study.
  • Yes, said a 2006 study: it cuts risk by 50%.
  • Yes, said a 2007 study: it cuts risk by 77%.
  • No, said a 2008 study.
