Friday, November 23, 2018

Removing the straitjacket of non-causation in statistics

kw: book reviews, nonfiction, statistics, probability, causation, mathematics

In 1926, during the height of the eugenics movement in the U.S., a researcher who has been nearly forgotten studied the relationship between the intelligence of children and that of parents. This is the core debate, even today, regarding the "nature-nurture" dichotomy. Which is more important, upbringing or inheritance?

Step back a minute, and consider, with the current popularity of "big data", how this might be tackled. It is no longer difficult to gather enormous amounts of data regarding the IQ of numerous children, adults, and societal indicators such as neighborhood of residence. Do all the math you might wish, with regressions and correlation diagrams, and what might you find? No doubt some kind of correlation will show up, perhaps very obviously. But what does it mean? What has "caused" the greater intelligence of some children, and the lesser intelligence of others?

The word "cause" was forbidden in statistical monographs for decades. For many researchers even today, the mantra (I chose that word with malice aforethought) is, "Correlation does not imply causation." While this is indeed true, even a tautology, it is not all there is to it. We naturally think of nearly everything in cause-and-effect terms, and work done in the past couple of generations now makes it possible for researchers to discuss causes without losing tenure, grants, etc.

For the young researcher, Barbara Burks, the mantra was nonsense. She sought causes. To this end, she gave IQ tests to every member of 204 households that included foster children, and 105 households without foster children. For 1926, this was pretty big data. The choice of studying both foster children and natural children along with the adults was clever. Even more clever was the little diagram she used to analyze her results:

The arrows imply causation. Here, the "X" factor that might influence both the level of intelligence of the child, and the social status of the household, was thought to be the "heritage", including genetic inheritance, of the family. The parents, in whatever measure they benefit (or not) from "heritage", will have their own X factor, which could have been added as Y, off to the left perhaps.

Note that two of the arrows have heads at both ends. This indicates feedback effects between the social status and the intelligence of all members of the family (I imagine a family of "ordinary" intelligence having a very, very bright kid, and this leading to an improvement in social standing, for example).

Such a diagram embodies a "causal model", in the terminology of Judea Pearl, in The Book of Why: The New Science of Cause and Effect. Such a diagram, and mathematical processes invented by Pearl and his students, provide what is missing in non-causal statistics: the understanding that some things really do cause other things. By the way, Ms Burks's conclusion: genetics provides 35% of the observed differences in the intelligence of children. This was a disappointment to eugenicists, including Ms Burks. In particular, Louis Terman, an inventor of the Stanford-Binet IQ test, also famous for his "genius" studies, rejected it outright. He was quite certain that genetics were behind "nearly all" the differences in IQ. One might imagine him, upon seeing her results and conclusions, huffing, "Impossible!"

This reminds me of the first of Arthur Clarke's laws: "When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong." Dr. Pearl writes of his own Odyssey of discovery. He did not come to causal reasoning easily. But now he and his students have developed "causal calculus", which is introduced in The Book of Why, by Judea Pearl and Dana MacKenzie.

I must confess, though my long career as a scientific programmer led me into statistical work again and again, I never became comfortable with the formulas of probability. When I see the term P(Y|X), I have to think a moment to get my head around, "The Probability of Y occurring (or existing), given the occurrence (or existence) of X". In non-causal terms, you can freely substitute "daybreak" and "rooster crowing" for X and Y, either way: "The probability of daybreak, given that the rooster crowed" and "The probability of a rooster crowing, given that day is breaking." Hold that thought.

Dr. Pearl has added the "do" operator, which denotes an intervention, so that P(Y|do(X)) means "The probability of Y occurring, given that we intervene to make X happen". Comparing this with the probability of Y when X is not done gives the causal effect of X. This is the reasoning behind the randomized controlled trial (RCT) in medicine, but it was not stated in a formula before. Indeed, in older medical journals the authors use all kinds of locutions and verbal gymnastics to avoid saying, "Medicine X caused a Z% reduction in death rate due to disease Y". Many still do so.
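To make the difference between seeing and doing concrete, here is a minimal sketch of the rooster example in Python. The numbers are hypothetical, invented only for illustration, and the function names are my own. The one causal assumption is that daybreak influences crowing and not the other way around.

```python
from fractions import Fraction

# Hypothetical numbers, invented for illustration only.
P_DAWN = Fraction(1, 2)                    # P(daybreak) at a randomly chosen moment
P_CROW_GIVEN = {True: Fraction(9, 10),     # P(crow | daybreak)
                False: Fraction(1, 10)}    # P(crow | no daybreak)


def joint(dawn, crow):
    """P(dawn, crow) under the assumed model: daybreak -> crowing."""
    p_dawn = P_DAWN if dawn else 1 - P_DAWN
    p_crow = P_CROW_GIVEN[dawn] if crow else 1 - P_CROW_GIVEN[dawn]
    return p_dawn * p_crow


def p_dawn_given_crow():
    """Seeing: P(daybreak | rooster crowed), by ordinary conditioning."""
    return joint(True, True) / (joint(True, True) + joint(False, True))


def p_crow_given_dawn():
    """Seeing: P(rooster crows | daybreak), by ordinary conditioning."""
    return joint(True, True) / (joint(True, True) + joint(True, False))


def p_dawn_do_crow():
    """Doing: P(daybreak | do(crow)). Forcing the rooster to crow cuts the
    arrow into 'crow', so daybreak keeps its own probability."""
    return P_DAWN


def p_crow_do_dawn():
    """Doing: P(crow | do(daybreak)). The arrow daybreak -> crow is intact,
    so the rooster responds exactly as it does in the observed data."""
    return P_CROW_GIVEN[True]


if __name__ == "__main__":
    print("P(dawn | crow)     =", p_dawn_given_crow())
    print("P(dawn | do(crow)) =", p_dawn_do_crow())
    print("P(crow | dawn)     =", p_crow_given_dawn())
    print("P(crow | do(dawn)) =", p_crow_do_dawn())
```

Under these assumptions, hearing the rooster makes daybreak very likely, yet making the rooster crow leaves the chance of daybreak where it was. In the other direction, seeing and doing agree, which is what we would expect if daybreak really is the cause.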

Thus far, I can follow along. Dr. Pearl freely ignores the folklore that each mathematical expression used in a book reduces its audience by half. Now, I like math, but it would take a great deal of study for me to become conversant with causal calculus. In an example called "DO-CALCULUS AT WORK" on page 236, we find expressions such as
Σ_t P(c|do(s),do(t)) P(t|do(s))
This is the second of seven formulas in a derivation. At that point I realized I probably ought to devote my few remaining years to something besides learning not only how to parse such statements, but also how to create and manipulate them!
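For readers who would rather see numbers than symbols, here is a rough sketch in Python of what such a derivation buys. It uses the letters from the expression quoted above: s for smoking, t for tar, c for cancer. As I understand it, the derivation arrives at the "front-door" formula, P(c|do(s)) = Σ_t P(t|s) Σ_s' P(c|t,s') P(s'), which contains no do-operators at all. The hidden factor u, every probability in the tables, and the function names are my own assumptions, invented for illustration; nothing here is taken from the book's data.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy model; all numbers are invented for illustration.
# u: hidden factor that affects both smoking and cancer (never observed)
# s: smoking, t: tar deposits, c: cancer.  Arrows: u->s, u->c, s->t, t->c.
P_U = F(3, 10)                                   # P(u=1)
P_S_GIVEN_U = {0: F(1, 5), 1: F(4, 5)}           # P(s=1 | u)
P_T_GIVEN_S = {0: F(1, 20), 1: F(19, 20)}        # P(t=1 | s)
P_C_GIVEN_TU = {(0, 0): F(1, 10), (0, 1): F(3, 10),
                (1, 0): F(2, 5), (1, 1): F(7, 10)}  # P(c=1 | t, u)


def bern(p, value):
    """Probability that a binary variable with P(1)=p takes `value`."""
    return p if value == 1 else 1 - p


def observed_joint():
    """P(s, t, c) with the hidden u summed out: all a researcher can measure."""
    return {
        (s, t, c): sum(
            bern(P_U, u) * bern(P_S_GIVEN_U[u], s)
            * bern(P_T_GIVEN_S[s], t) * bern(P_C_GIVEN_TU[(t, u)], c)
            for u in (0, 1))
        for s, t, c in product((0, 1), repeat=3)
    }


def prob(joint, s=None, t=None, c=None):
    """Marginal probability of the fixed values; None means 'summed over'."""
    return sum(p for (s_, t_, c_), p in joint.items()
               if s in (None, s_) and t in (None, t_) and c in (None, c_))


def naive(joint, s, c=1):
    """Plain conditioning, P(c | s), which is confounded by the hidden u."""
    return prob(joint, s=s, c=c) / prob(joint, s=s)


def front_door(joint, s, c=1):
    """Front-door estimate of P(c | do(s)) from observed data only."""
    total = F(0)
    for t in (0, 1):
        p_t_given_s = prob(joint, s=s, t=t) / prob(joint, s=s)
        inner = sum(
            prob(joint, s=s2, t=t, c=c) / prob(joint, s=s2, t=t)
            * prob(joint, s=s2)
            for s2 in (0, 1))
        total += p_t_given_s * inner
    return total


def truth_do(s, c=1):
    """True P(c | do(s)), computed with access to the hidden u."""
    return sum(
        bern(P_U, u) * bern(P_T_GIVEN_S[s], t) * bern(P_C_GIVEN_TU[(t, u)], c)
        for u in (0, 1) for t in (0, 1))


if __name__ == "__main__":
    data = observed_joint()
    for s in (0, 1):
        print(f"s={s}: naive={naive(data, s)}, "
              f"front-door={front_door(data, s)}, truth={truth_do(s)}")
```

In this toy world the front-door estimate, built only from quantities one could observe, matches the true effect of the intervention exactly, while plain conditioning on smoking does not. That holds only because the toy model satisfies the front-door assumptions (tar is the sole route from smoking to cancer, and the hidden factor does not touch tar directly); it is a sketch of the idea, not a claim about real smoking data.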

Dr. Pearl's work has great benefits for those researchers who can wrap their minds around these concepts and formalisms. For example, the decades-long struggle to determine to what extent smoking causes lung cancer, the subject of a major chapter, was undertaken in the face of determined and well-funded opposition to the concept, but might have been shortened to a year or a few years if causal language had been allowed. This stricture was as if the scientists studying smoking and cancer, those who were not in the pay of the tobacco companies, tied both hands behind their backs and had to perform their work with their toes and tongues. Now causal language is out of the closet.

An early chapter discusses the Ladder of Causation, from Association (what we observe), to Intervention (what we do to see what happens), to Counterfactuals (what we imagine might happen if X were not so). It appears that only humans can perform counterfactual reasoning, such as, "Will the day break if we get rid of all the roosters?" or, as a song says, "What if we gave a War and nobody came?"

We can't always figure out what is a cause and what is an effect. But where we can, the language of causation helps us model an event, for instance with a diagram like the one above. Also, Do-Calculus now provides a mathematical way to treat cause and effect in a meaningful and quantitative way. It adds power to Design of Experiments logic, so that a researcher is more likely to identify the appropriate set of causative factors and winkle out just how important each is in producing the effect being studied.

Difficult as the reading was, due only to my unfamiliarity with the jargon and formulas, the book was very enjoyable. The winding path Dr. Pearl took to get past the hamstrung statistical reasoning of half a century ago, on through Bayesian analysis, and on to a formal treatment of causal reasoning in the mathematical language of Do-Calculus, makes for a quest saga every bit as gripping as the search for a hidden city.
