Tuesday, March 06, 2018

We can't lie to Google

kw: book reviews, nonfiction, data mining, data analysis, big data

There is a very interesting and useful tool available from Google, Google Trends. It provides a taste of Big Data (or Data Mining) to all of us. It tracks search terms (all of them) that have been entered in the Google search box for the last 14 years. It charts their relative popularity. Here is an example:


This is just the top of a page of charts related to these searches. There are breakouts by region and certain subtopics or related searches. Just from this chart we can see three things:
  1. While on average the Grand Canyon is the most popular of the three, the relative popularity of these three National Parks varies over time (this chart shows 5 years).
  2. All three parks' popularity varies with the season. I presume that in springtime and early summer, people are planning vacations.
  3. Each park had a spike of interest in a narrow time frame. I'll leave it to you to figure out which event triggered which spike. (Hint: high wire; seismicity; sesquicentennial)
Here is a clue to the value of meta-analysis made available by a tool like this: people search for what interests or concerns them, and there is nothing to be gained by lying about it. They really want to know. Thus, a result like this is more telling:


Here, seasonality of weight loss searches is quite evident. My impulsive analysis: a big spike in January is related to this most common New Year's Resolution, followed by sustained interest until mid-summer, then a slacking off as people enter first vacation season and then the holiday season, and have other things on their minds. Also, it is clear that many more people are worried about their weight than about smoking (and I may not have used the most common search term about this; GTrends is very literal and there are lots of ways people ask questions. But I could have searched for Topics instead).

Just out of curiosity I removed the weight loss portion. This amplified the remaining "stop smoking" result, and it had no seasonality.

The little arrow at center-right in the diagrams shows that you can download the data (I haven't tried it) for analysis using Excel. This allows you to compare a great many search terms or topics, to winkle out all the variations, and sum them up for a general picture, or parse them by country or region or whatever. Note that if you mix topics and search terms, they are measured on different scales, because a topic gathers many search terms.
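For anyone who would rather skip Excel, here is a minimal Python sketch of the "sum up the variations" idea. It rests on assumptions: the sample numbers are made up, the column layout (a "Week" column plus one column per search term) is hypothetical, and a real download may be arranged differently. It also assumes all the terms came from a single comparison, so they share one relative scale.

```python
import csv
import io

# Made-up sample in a hypothetical layout: a "Week" column plus one column per
# search term. A real Google Trends export may be laid out differently.
SAMPLE_CSV = """Week,stop smoking,quit smoking,how to quit smoking
2018-01-07,40,55,20
2018-01-14,38,50,18
2018-01-21,35,48,17
"""


def sum_variants(csv_text, date_column="Week"):
    """Return {date: total relative interest across all term columns}."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        date = row.pop(date_column)
        totals[date] = sum(int(value) for value in row.values())
    return totals


def rescale_to_100(totals):
    """Rescale the summed series so its busiest period reads 100."""
    peak = max(totals.values())
    return {date: round(100 * value / peak) for date, value in totals.items()}


combined = sum_variants(SAMPLE_CSV)
print(rescale_to_100(combined))
```

The rescaling step simply puts the combined series back on a relative 0-to-100 scale, so it can be read the same way as a single-term chart.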

I'll refrain from going on and on, for this is a book review, after all. Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, by Seth Stephens-Davidowitz, introduces the new things that are being done because we have enormous data sets available, and even more so because certain of them (such as Google searches) give more truthful information than "traditional" methods of social study such as surveys.

Have you ever lied to a survey taker, or on an online survey? Many folks do. Voter polls are notorious for inaccuracies, particularly during hotly-contested elections, such as the recent circus featuring Mrs. Clinton and Mr. Trump. Nearly every poll was dramatically wrong, nearly every time. Why? This is more than just bad poll question design (which is all too frequently done on purpose); many of those who were going to vote one way either said, "I don't know for sure" or flat-out lied to the pollster. Consider someone who answers the telephone at work, and it is a poll taker. They aren't supposed to call workplaces, but they frequently do anyway (where else you gonna find people at midday? And with so many folks declining to even answer a phone at dinner time...). The office worker agrees to a "quick poll", and soon enough the big question comes, "Who do you prefer in the Presidential election?" This person is a closet Republican in New York City or San Francisco, and knows she'll be overheard by co-workers, with whom she never, ever talks politics. How do you expect her to answer? In those cities, you can lose your job over your Presidential preference.

The author of Everybody Lies shows a number of ways that he and others gather data from Google searches (he calls them his favorite data set), from Facebook likes, and from a number of other data collections about our behavior online. He also places emphasis on what he calls New Data, data we didn't have available before. Online behavior traces are one kind of New Data, but the concept is not new. For example, in 1854 Doctor John Snow mapped every death from cholera for the year in the Soho area of London, hundreds of them. That counts as Big Data for the 1850s! His dot map, shown here, helped him pinpoint a specific water pump as the source of the cholera epidemic; it was a pump on Broad Street, just above and right of center in this map (the dark bar there shows nearly 20 deaths in a single multi-family house). This was both Big Data and New Data for the time.

Getting back to November 2016: People who didn't get nearly as much press as the daily pollsters, but who were looking at "sentiment analysis" of internet searches, were saying, "Not so fast..." but were generally ignored. But it was they who were right. That is because, while we can lie to Google, we don't, because what is the point? We search for what we really want to know. And if someone wanted to mess with Google's search results database, and skew what it reports, it would take a huge effort, even by spambot standards. Even bot-masters have better things to do with their time!

There are some arenas in which we do lie, even to our computers. Facebook is an example. We post mostly positive stuff. We also primarily like to see positive stuff. In fact, when we see too much negative material in our newsfeeds, many of us either block or unfriend the purveyor thereof. So we tell our Facebook "friends" about what is going well with us, and we ask Google about what bothers us.

A number of fascinating and unexpected results show up. For example, would you expect a larger proportion of sports stars to come from comfortable, middle-class backgrounds, or from poor or working-poor backgrounds? Many people think that desperation to escape the ghetto drives talented kids to excel so they can actually get out. Analysis of the history of every professional sports figure counters that expectation. Most successful professionals in baseball, basketball, etc. come from middle-class backgrounds. One major factor: There are a whole lot more middle-class people than poor people in America. But even in proportion, the percentage of middle-class youngsters who excel in sports is greater than the percentage of poorer kids. We could go into a lot of possible reasons why, but it's a side point here. Truly great players like LeBron James, who was raised by a poor single mother, are the exception.

Late in the book the ethics of data mining arise. Many people are a little creeped out to learn that a trend analysis can reveal so much about them. Meanwhile, CEOs of retail companies are drooling over the same data, trying to figure out how best to induce us to part with our cash. One aspect is A/B testing. It isn't hard for the programmers at Google or Facebook to write code that, for example, puts a bluer "click here" button on an ad sent to everyone from a certain set of web servers, and a purpler button to everyone else. Which one gets a greater proportion of clicks? Do the words "click here" or "try it" or "join up" work best? When billions of ads are being shown, a 1% advantage comes to thousands or millions of potential sales. Ethical or not, that cat is out of the bag. There is no putting it back. We simply need to stiffen our own backbone of ad resistance if we wish to avoid being manipulated.
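To make the "which button gets a greater proportion of clicks?" question concrete, here is a minimal Python sketch of one common way to compare two click rates, a two-proportion z-test. This is my own illustration, not a method taken from the book or from any company's actual code, and the click counts are invented for the example.

```python
import math


def two_proportion_z_test(clicks_a, shown_a, clicks_b, shown_b):
    """Compare two click-through rates.

    Returns (rate_a, rate_b, z, two_sided_p), where p is the chance of a
    gap at least this large if both buttons truly performed the same.
    """
    rate_a = clicks_a / shown_a
    rate_b = clicks_b / shown_b
    # Pooled rate: the best single estimate if there were no real difference.
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Invented counts: the bluer button (A) versus the purpler button (B).
rate_a, rate_b, z, p = two_proportion_z_test(1000, 100_000, 1100, 100_000)
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  z = {z:.2f}  p = {p:.3f}")
```

A small p-value suggests the gap between the two buttons is unlikely to be luck alone. With audiences in the billions, even tiny differences can clear that bar, which is part of why this kind of testing is so attractive to advertisers.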

A fun side note. I am learning to use Incognito Mode and DuckDuckGo searching when I want to research something to buy. Otherwise Google and Facebook and just about everyone else spend the next month or three sending me ads related to something I already bought (or decided not to buy). When they get better connected to the retailers, so they know I bought it and probably won't buy another, and when they get better connected to Amazon's database of "people who bought this also bought these", it is likely that my ad resistance will become a lot harder to maintain! They'll have a pretty good idea what I want next, before I know I want it.

That is the world we live in. Teach your children well. Maybe they'll become more canny and cautious Web users than we have been.
