Monday, July 29, 2013

Book Review: The Signal and the Noise : Why So Many Predictions Fail – but Some Don't. By Nate Silver

How Politics, Sports, and Microbial Ecology are very much alike:


I thought this might be a topical post in light of Nate Silver's announcement that he will be moving his operation to ESPN. Nate is one of my favorite people, as his interests (sports, politics, big data) match my own in many ways.





In the field of microbial ecology we are increasingly dealing with mounds upon mounds of data. This is due to the advent of DNA sequencing technologies that can count millions of pieces of DNA and try to match them to databases that tell us which microbe the DNA came from. Sure, there is signal in these mounds, but there can also be lots of noise. When you make so many observations, there are bound to be some that happen by chance. Even if you are 95% certain that each individual observation isn't a coincidence, it only takes 20 observations before you would expect one of them to be spurious (19/20 = 95%, which leaves a 1-in-20 chance of a fluke each time).
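To make that 1-in-20 arithmetic concrete, here is a minimal Python sketch. It assumes every observation is an independent test and that there is no real effect anywhere, which is a simplification. The function names are made up for illustration, and the last function uses the Bonferroni correction, which is just one common (and conservative) way to tighten the threshold, not something taken from the book.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of chance 'hits' if none of the n_tests has a real effect."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more spurious hits, assuming the tests are independent."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false-positive chance near alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # 20 matches the example in the text; the larger counts are arbitrary.
    for n in (1, 20, 1000, 1_000_000):
        print(
            n,
            expected_false_positives(n),
            prob_at_least_one_false_positive(n),
            bonferroni_threshold(n),
        )
```

With 20 tests the expected number of flukes is one, and by the time you reach sequencing-sized numbers of comparisons, at least one spurious result is all but guaranteed unless the threshold is adjusted.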


When I started analyzing my own sequencing data, I realized I needed a much better understanding of statistics to be able to grok what my results meant. I had very little formal statistical training, which is a sad reflection on my high school (where all the smart kids should take calculus, I was assured) and undergrad (required a "calculus for business majors" class, but no stats) programs. Ask me when was the last time I formally took a derivative or integral (my undergrad calculus class-- 10 years ago). Ask me when was the last time I used any statistics (yesterday). I think there is a fundamental disconnect between which math skills are actually needed by the majority of people, and which are taught in schools.

Because I didn't have a good foundation in things like Bayes' Theorem, I started looking around for a book that would teach me some fundamentals so I could develop a good feeling for what type of statistical tests would be most appropriate for my data. I didn't want to read a dry textbook. I heard about this book in an interview that Nate did on some TV show and thought it might be an interesting way to learn some statistics. I knew about Nate from his work in predicting elections (the 2012 US elections in particular) and some of his work in sports as well.

This book talks about the advances in predictions in fields ranging from earthquakes and weather to sports, gambling, and politics. Many of these fields have large data sets to draw from, just like microbial ecology. If you think about it, we have been keeping records in baseball for a very long time. If you wanted to ask how left-handed pitchers do against left-handed batters in the 9th inning of tied games, there is probably a decent sample size to look at. 

As a longtime fantasy football and basketball player (one of my hobbies), I have played around with sports statistics for a while to try to make better decisions about who to draft when and what trades to make. (Gotta fill that all-important virtual trophy case!) There is a similar problem in fantasy sports: lots of data, lots of noise. Some people swear that 3rd-year wide receivers are the most likely to break out, since it takes players that long to learn an NFL offense. People said similar things about quarterbacks for a long time, but then Cam Newton, Andrew Luck, and Robert Griffin III came along and blew away the avoid-rookie-quarterbacks meme. When making sit-start decisions in fantasy basketball, "experts" say that all else being equal, you should always start the player who is playing in a game where the teams are worst at defense, since you get more possessions per game to pile up stats. In actual sports games (not fantasy) there is some debate on whether things like "momentum" are real (is a team/player on a winning streak or a hot scoring streak within a game more likely to perform better than they otherwise would?). I assume one of the reasons ESPN wanted Nate was to help viewers/readers figure out which of these "mechanisms" are real and which are noise. The data is there; it just takes a trained person to analyze it.

Politics also has large datasets going back many years. With this data people try to answer questions such as: Are local elections predictive of national trends? When is the state of the economy a predictor of presidential elections? Will a candidate's race play into the outcome of an election? It takes careful analysis to separate signal from noise. (See the Redskins Rule -- when the Washington Redskins of the NFL win their last home football game prior to the U.S. Presidential Election the incumbent party wins the electoral vote for the White House; when the Redskins lose, the non-incumbent party wins). 
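A rough way to see why something like the Redskins Rule can turn up purely by chance: if enough unrelated "rules" get checked against a short run of elections, one of them is likely to match perfectly. The Python sketch below assumes each candidate rule is no better than a fair coin flip and that the rules are independent of each other. The function name and the counts (1,000 rules, 10 elections) are hypothetical numbers chosen for illustration, not figures from the book.

```python
def prob_some_rule_is_perfect(n_candidate_rules, n_elections):
    """Chance that at least one coin-flip 'rule' matches every election outcome."""
    # A single meaningless rule calls all n_elections correctly with
    # probability (1/2) ** n_elections.
    p_single = 0.5 ** n_elections
    # Probability that none of the rules is perfect, subtracted from 1.
    return 1 - (1 - p_single) ** n_candidate_rules


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 unrelated sports results checked
    # against 10 past elections.
    print(prob_some_rule_is_perfect(1000, 10))
```

Under those assumptions, a flawless-looking rule is more likely than not to exist somewhere in the pile, even though none of the rules has anything to do with elections.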

Microbial ecology is similar in that we can get large datasets around which to make hypotheses about the way communities work. We can try to see if those hypotheses hold up by breaking down the numbers and testing our theories about how mechanisms work. Instead of altered run/pass ratios in games with inclement weather, we look at altered Bacteroidetes/Firmicutes ratios (different bacterial groups) in obese people. Some correlations end up being real (the proposed mechanism actually influences the outcome) and some end up being the microbial version of the Redskins Rule (no plausible way for the outcome of a football game to affect the outcome of the election). The real mental work comes in proposing likely mechanisms for the correlations you observe and designing further tests to see if those mechanisms hold true. This takes "subject-matter expertise." Instead of proposing that 3rd-year wide receivers break out due to learning an offense, we propose that the physiological effects of pH cause shifts in soil communities.

Anyway, I really enjoyed this book. It keeps a light tone and is a pretty easy read, even for the statistically uninitiated like me. I recommend it for anyone who may want to work with "big data." I give it 5/5 Petri dishes!
