Tuesday, July 23, 2013

You don't have to hate statistics

kw: book reviews, nonfiction, mathematics, statistics, popular treatments

Measure something. Say, take a yardstick and measure the width of the kitchen counter. In my kitchen, I get 24 inches. That is an observation. Guess what? You can't do statistics using one observation. Not because you are somehow incompetent, but because of the way statistics is defined. A common definition is:
Statistics is the practice or science of collecting and analyzing numerical data in large quantities.
Note the final qualifier: "in large quantities". It is possible to do a certain amount of statistical inference using just a few items—and we'll do some momentarily—but you typically need lots of data to produce a robust inference. However, a few principles can be seen by analyzing just a few observations. I measured my counter in five more locations. Here are all my observations:

24
24 1/8
23 7/8
23 7/8
23 3/4 (= 23 6/8)
23 5/8

We can do a few things with these six numbers. First, comparing the largest (24 1/8) with the smallest (23 5/8), we see that the range is 1/2 inch (about 1.3 cm). I can take the average, which comes to 23 7/8. Hmm; if the building plans specified a 24 inch counter top, this one averages an eighth inch too narrow. Then there is a trend. These are in order, from one end of the counter to the other. The largest measurement is the second one, the smallest is the last, and the rest of the measurements follow a decreasing trend. In angular terms, the "tilt" from the first measurement to the last, 3/8" in about 10 feet, is only about a sixth of a degree, but I'd expect a builder to do better than have a half inch of variation over ten feet. Oh, well. One of my projects for later this year is to replace the counter tops anyway. I hope quality control has improved since these were installed in the 1970s!
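For anyone who would rather let a computer do the eighths, here is a minimal Python sketch of the same arithmetic. It keeps two quantities separate: the range (largest minus smallest) and the end-to-end drop (first reading minus last), which is what the tilt is figured from. The 120 inch counter length is my "about 10 feet" written as a number, and the variable names are only for illustration.

```python
from fractions import Fraction
import math

# The six counter widths in inches, in order along the counter.
widths = [
    Fraction(24),
    Fraction(24) + Fraction(1, 8),
    Fraction(23) + Fraction(7, 8),
    Fraction(23) + Fraction(7, 8),
    Fraction(23) + Fraction(3, 4),
    Fraction(23) + Fraction(5, 8),
]

spread = max(widths) - min(widths)   # range: largest minus smallest
mean = sum(widths) / len(widths)     # arithmetic average
drop = widths[0] - widths[-1]        # first reading minus last reading

# Assumption: the counter is about 10 feet (120 inches) long.
counter_length_in = 120
tilt_deg = math.degrees(math.atan(float(drop) / counter_length_in))

print("range:", spread, "inch")
print("mean:", mean, "inches")
print("end-to-end drop:", drop, "inch")
print(f"tilt: {tilt_deg:.2f} degrees")
```

Using Fraction instead of decimals keeps everything in exact eighths, so the mean prints as 191/8, which is 23 7/8.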

Now for just a little terminology. The "average" I figured is known as the "mean". It is not the only way to determine "central tendency". Another is the "median", which means the one in the middle (or the average of the central two if the sample has an even number of observations). For example, if I sort these six numbers (in this case, just move the 24 1/8 above the 24), the two in the middle are both 23 7/8. So the median is 23 7/8, the same as the mean. The two do not always agree, and perhaps they do not even usually agree. For example, if I have the seven numbers 1, 2, 3, 5, 8, 14, 30, the mean is 9 but the median is 5. Note that only 2 of these numbers are greater than 9.

Another such measure is the "mode", which means the value that occurs most often. Mode is really not too meaningful when there are only six observations, but for these data, the mode is also 23 7/8. Suppose instead that I had measured that fourth width as 23 3/4. This would make very little difference to the mean (23 6.8/8) or the median (23 13/16 or 23 6.5/8), but the mode would now be 23 3/4, because that number arose the most frequently (twice).
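The three measures are easy to check with Python's standard statistics module. This is only a sketch: the inches helper is a made-up convenience for writing "whole inches plus so many eighths", not anything from the book.

```python
from fractions import Fraction
from statistics import mean, median, mode

def inches(whole, eighths=0):
    """Hypothetical helper: whole inches plus a number of eighths."""
    return Fraction(whole) + Fraction(eighths, 8)

# The six counter widths, in the order measured.
widths = [inches(24), inches(24, 1), inches(23, 7),
          inches(23, 7), inches(23, 6), inches(23, 5)]
print(mean(widths), median(widths), mode(widths))

# The seven-number example, where mean and median disagree.
skewed = [1, 2, 3, 5, 8, 14, 30]
print(mean(skewed), median(skewed))

# The "what if" case: the fourth width measured as 23 3/4 instead.
altered = widths.copy()
altered[3] = inches(23, 6)
print(float(mean(altered)), median(altered), mode(altered))
```

For the original six widths all three come out to 191/8 (23 7/8). For the altered set the mean barely moves, the median lands at 381/16 (23 13/16), and the mode jumps to 95/4 (23 3/4).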

This illustration shows how these are related (Image from The Daily Dongle). A frequency plot of a very regular set of measurements such as shown in (a) will have mean, median and mode that are equal or nearly equal. Sometimes we make measurements that have more than one "hump" (their distribution is called bimodal) as in (b). But (c) and (d) show two ways that a series of measurements may reveal a skewness, in which case the three measures will be quite different.

Each has its uses. Average height of Euro-American males is best described as the mean, the numerical average of all measurements. We might also surmise that the median and mode will be very similar to the mean. If you include Euro-American women, the bimodality may not be too evident, but it is there. At the very least a frequency plot will be flatter on top and have a wider total range. If the average male is 70" tall and the average woman is 64" tall (for Euro-A's, anyway), and there are about as many of one as the other, the grand average will be 67", but that single number tells you less than the two numbers, segregated by sex.

What about yearly income, or prices of homes in a city or county, or the whole country? When you hear a real estate report on the radio, you will hear, for example, "Median home price has risen by $5,000 in the past month". Why not use the mean? Because the distribution is skewed. There might be a few homes with very small values, and a few with very high values, but where do you put the "middle"?

Example: Broken Arrow, OK (I know someone there). The least expensive houses on the market, as I find from Realtor.com, are in the $25,000-$50,000 range. The most expensive are in the range between $1.2 million and $1.4 million. Do you think it likely that home prices are evenly distributed between these limits, producing a "middle" value of about $700,000? Not likely! In this market, at this moment, 684 homes are for sale. Houses # 341 and 342 on the sorted list the web site provides are both priced at $170,000. That is our median for this market (today). Quite a bit different from 700k, isn't it? Half the houses' owners are asking $170,000 or less, and the other half are asking more. If you can afford a $200,000 house, at the most, you have a lot to choose from. Wherever the larger values in a distribution are a big multiple of the smaller values, the median is usually the best measure of "average".
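To see how a handful of expensive houses drags the mean around, here is a small Python sketch. The twelve prices are invented for illustration; they are not the Broken Arrow listings, only a list with the same general shape (many modest prices, a few very large ones).

```python
from statistics import mean, median

# Invented asking prices in dollars, NOT the actual Broken Arrow listings:
# most are modest and a few are very expensive.
prices = [25_000, 60_000, 95_000, 120_000, 150_000, 170_000,
          170_000, 190_000, 230_000, 320_000, 650_000, 1_400_000]

midpoint = (min(prices) + max(prices)) / 2   # halfway between the extremes
print(f"midpoint of extremes: ${midpoint:,.0f}")
print(f"mean:                 ${mean(prices):,.0f}")
print(f"median:               ${median(prices):,.0f}")

# How many houses are priced at or below the mean?
at_or_below_mean = sum(p <= mean(prices) for p in prices)
print(f"{at_or_below_mean} of {len(prices)} houses cost no more than the mean")
```

In this made-up list the two most expensive houses pull the mean well above the median, so most of the houses cost less than the "average" one. The median is not fooled.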

This is my simple attempt to explain a few statistical principles. Charles Wheelan does a superb job of explaining these and a goodly number of others in Naked Statistics: Stripping the Dread From the Data. In the middle of the book, for example, he dwells quite a bit on the Central Limit Theorem. This has to do with sampling.

Above, I took six measurements of my kitchen counter. I could have taken a lot more, perhaps spaced every inch, or even closer. Suppose I sent my wife into the kitchen with a yardstick and asked her to make six measurements, with the same yardstick, in locations of her choosing. Then perhaps we could grab some of our neighbors and have them repeat the experiment. Now I will have several sets of numbers, and each set will have its own average. Do you think any of the averages will be close to, say, 22 or 27? Not unless there are some BIG wiggles in the counter's shape that I avoided with my measurements. If I could get a lot of my neighbors to make sets of measurements, the Central Limit Theorem (CLT) predicts that the averages of those sets will be distributed a lot like section (a) of the illustration above, clustering about some average value that is close to the "real" mean for all possible measurements of my counter.
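The neighbor experiment is easy to fake in Python. The sketch below is a simulation under loud assumptions: the counter is a made-up one whose width drifts evenly from 24 1/8 inches at one end to 23 5/8 at the other, it is "measured" every inch over 120 inches, and 2000 imaginary neighbors each average six widths at random locations. None of these numbers come from the book.

```python
import random
from statistics import mean, stdev

random.seed(2013)  # fixed seed so the run is repeatable

# Assumption: a made-up counter whose width drifts evenly from 24 1/8 in.
# at one end to 23 5/8 in. at the other, measured every inch for 120 inches.
all_widths = [24.125 - 0.5 * i / 119 for i in range(120)]
true_mean = mean(all_widths)

def one_persons_average(n=6):
    """Average of n widths measured at randomly chosen locations."""
    return mean(random.sample(all_widths, n))

# 2000 imaginary neighbors each take six measurements.
averages = [one_persons_average() for _ in range(2000)]

print(f"true mean of all widths:    {true_mean:.3f}")
print(f"mean of the 2000 averages:  {mean(averages):.3f}")
print(f"spread of single widths:    {stdev(all_widths):.3f}")
print(f"spread of the averages:     {stdev(averages):.3f}")
print(f"lowest and highest average: {min(averages):.3f}, {max(averages):.3f}")
```

The individual widths in this pretend counter are spread flat across the whole range, nothing like a bell shape, yet the averages pile up near the true mean and are much less spread out than single measurements. No average can come anywhere near 22 or 27.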

As the author goes on to show, with marvelous examples, this is the source of the power of polling. Not only can a poll yield very useful results about all 240 million or so American adults by polling 1,000 or 2,000 people (properly chosen!), but marketers (who pay the most for such data) can also predict some of our preferences based on what we have already bought or even searched for (Google sells its search results, don'tcha know). My wife and I have "loyalty cards" from a few local grocers and other stores. We get discounts on certain items for scanning the card when checking out. In a sense, the store is paying us for the right to keep track of our purchases. Something else we get during checkout is a series of spot-printed coupons (the more we buy the more coupons they print). Some coupons are for more of the things we often buy. Others are for similar items of competing brands (the brands' owners are in on this also). And there will usually be a few "wild card" coupons that show up over time, for things we might not usually buy. Why? Because other people whose purchasing habits are similar to ours buy those things, and the store is betting that we are more likely to try those items if we get a coupon to prod us, compared to giving the same coupon to random shoppers. They have also figured out that we are "on the edge of elderly", so some of the coupons are for things like Ensure (an energy drink for old folks) or Depends (adult diapers).
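Why are 1,000 or 2,000 people enough? A rough answer comes from the usual textbook approximation for a poll's margin of error. The Python sketch below uses that standard formula, not anything quoted from the book; it assumes a simple random sample, a population far larger than the sample, and the conventional 1.96 multiplier for 95% confidence. The function name is my own.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a polled proportion.

    Assumes a simple random sample from a population much larger than
    the sample; z=1.96 is the usual multiplier for 95% confidence.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (100, 1000, 2000):
    points = 100 * margin_of_error(n)
    print(f"n = {n:5d}: about +/- {points:.1f} percentage points")
```

Notice that the size of the whole population never appears in the formula. Under these assumptions, the precision depends on how many people were polled and how well they were chosen, not on how many millions they stand in for.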

Think about it. A typical supermarket has tens of thousands of customers that visit regularly. If 25% have the store card, they can slice and dice that population a dozen or a hundred ways, to target their coupon campaign. And, since coupons cost almost nothing to print, they can throw in 30%-50% off-the-wall coupons so we don't realize how precisely we have been targeted!

If you get nothing else out of this book, read it carefully for the author's explanation of the CLT and his stories of how it is used (such as how Target "helped" a father learn that his teen daughter was pregnant). He reveals all sorts of tricks of the trade, such as the numerical way to handle binary differences such as male/female. I am a math junkie, so of course I love a book like this. But I think the math-averse will also find it very entertaining and informative.
