Monday, July 29, 2013

Book Review: The Signal and the Noise: Why So Many Predictions Fail – but Some Don't. By Nate Silver

How Politics, Sports, and Microbial Ecology are very much alike:


I thought this might be a topical post in light of Nate Silver's announcement that he will be moving his operation to ESPN. Nate is one of my favorite people, as his interests (Sports, Politics, Big Data) match my own in many ways.





In the field of microbial ecology we are increasingly dealing with mounds upon mounds of data. This is due to the advent of DNA sequencing technologies that can count millions of pieces of DNA and try to match them to databases that tell us which microbe the DNA came from. Sure, there is signal in these mounds, but there can also be lots of noise. When you make so many observations, there are bound to be some that happen by chance. Even if you are 95% certain that each of your observations isn't a coincidence, it only takes 20 observations before you would expect one to be spurious (19/20 = 95%).
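To put rough numbers on that, here is a minimal Python sketch of the arithmetic. It assumes every test is independent and that there is no real effect anywhere, which real sequencing data rarely satisfies, so treat it as an illustration rather than an analysis recipe. The function names are my own, and the Bonferroni correction at the end is just one common, conservative way of tightening the threshold, not something taken from the book.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of chance 'hits' when no real effect exists."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of seeing one or more spurious hits across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n = 20
    # 20 tests at 95% confidence: one spurious result expected on average
    print(f"expected false positives: {expected_false_positives(n):.2f}")
    # ...and roughly a 64% chance of getting at least one
    print(f"P(at least one): {prob_at_least_one_false_positive(n):.2f}")
    # stricter per-test cutoff under a Bonferroni correction
    print(f"Bonferroni cutoff: {bonferroni_threshold(n):.4f}")
```

With millions of sequences instead of 20 observations, the expected number of chance hits grows in proportion, which is why some kind of correction for multiple comparisons matters so much for this sort of data.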


When I started analyzing my own sequencing data, I realized I needed a much better understanding of statistics to be able to grok what my results meant. I had very little formal statistical training, which is a sad reflection on my high school (where all the smart kids should take calculus, I was assured) and undergrad (required a "calculus for business majors" class, but no stats) programs. Ask me when was the last time I formally took a derivative or integral (my undergrad calculus class-- 10 years ago). Ask me when was the last time I used any statistics (yesterday). I think there is a fundamental disconnect between which math skills are actually needed by the majority of people, and which are taught in schools.

Because I didn't have a good foundation in things like Bayes' Theorem, I started looking around for a book that would teach me some fundamentals so I could develop a good feeling for what type of statistical tests would be most appropriate for my data. I didn't want to read a dry textbook. I heard about this book in an interview that Nate did on some TV show and thought it might be an interesting way to learn some statistics. I knew about Nate from his work in predicting elections (the 2012 US elections in particular) and some of his work in sports as well.

This book talks about the advances in predictions in fields ranging from earthquakes and weather to sports, gambling, and politics. Many of these fields have large data sets to draw from, just like microbial ecology. If you think about it, we have been keeping records in baseball for a very long time. If you wanted to ask how left-handed pitchers do against left-handed batters in the 9th inning of tied games, there is probably a decent sample size to look at. 

As a longtime fantasy football and basketball player (one of my hobbies) I have played around with sports statistics for a while to try to make better decisions about who to draft when and what trades to make. (Gotta fill that all-important virtual trophy case!) There is a similar problem in fantasy sports: lots of data, lots of noise. Some people swear that 3rd-year wide receivers are the most likely to break out, since it takes players that long to learn an NFL offense. People said similar things about quarterbacks for a long time, but then Cam Newton, Andrew Luck, and Robert Griffin III came along and blew away the avoid-rookie-quarterbacks meme. When making sit-start decisions in fantasy basketball, "experts" say that all else being equal, you should always start the player who is playing in a game where the teams are worst at defense, since you get more possessions per game to pile up stats. In actual sports games (not fantasy) there is some debate on whether things like "momentum" are real (is a team/player on a winning streak or a hot scoring streak within a game more likely to perform better than they otherwise would?). I assume one of the reasons ESPN wanted Nate was to help viewers/readers figure out which of these "mechanisms" are real and which are noise. The data is there; it just takes a trained person to analyze it.

Politics also has large datasets going back many years. With this data people try to answer questions such as: Are local elections predictive of national trends? When is the state of the economy a predictor of presidential elections? Will a candidate's race play into the outcome of an election? It takes careful analysis to separate signal from noise. (See the Redskins Rule: when the Washington Redskins of the NFL win their last home football game prior to the U.S. Presidential Election, the incumbent party wins the electoral vote for the White House; when the Redskins lose, the non-incumbent party wins.)

Microbial ecology is similar in that we can get large datasets around which to make hypotheses about the way communities work. We can try to see if they are real by breaking down the numbers and testing our theories about how mechanisms work. Instead of altered run/pass ratios in games with inclement weather, we look at altered Bacteroidetes/Firmicutes ratios (different bacterial groups) in obese people. Some correlations end up being real (the proposed mechanism actually influences the outcome) and some end up being the microbial version of the Redskins Rule (no plausible way for the outcome of a football game to affect the outcome of the election). The real mental work comes in proposing likely mechanisms for the correlations you observe and designing further tests to see if those mechanisms hold true. This takes "subject-matter expertise." Instead of proposing that 3rd-year wide receivers break out due to learning an offense, we propose that the physiological effects of pH cause shifts in soil communities.

Anyway, I really enjoyed this book. It keeps a light tone and is a pretty easy read, even for the statistically uninitiated like me. I recommend it for anyone who may want to work with "big data." I give it 5/5 Petri dishes!

Tuesday, July 2, 2013

RIP Google Reader

Google Reader was shut down today. I am just one of many who have written about this topic, but I still want to put in my 2 cents.

As someone who likes to think of himself as a "high-information" person, Google Reader absolutely changed the way I use the internet. The ability to aggregate all my favorite blogs and news sources into one page was a game changer; it saved me soooo much time.

I can still remember when I discovered RSS feeds 5 years ago and figured out how to use them. I was waiting for an experiment to finish and obsessively refreshing some fantasy football advice site that I knew was going to post updated rankings at any moment. I thought to myself "If only there was some way for me to be notified when they updated their site." Clogging up my inbox with subscriptions was a non-starter for me... I wanted a way to keep them separate. I remembered seeing those little orange RSS buttons all over the place and decided to look into what they did. The rest is history. Google Reader quickly became my RSS app of choice due to its simplicity and ease-of-use. I didn't need bells and whistles, I just wanted something efficient. Closing down that last functioning tab of my reader feed was like closing the casket of a loved one (in nature, if not in magnitude, of course).

Google's decision to discontinue Reader still puzzles me a little. The best explanation of what happened that I have found is here. For my purposes, however, there still isn't a better tool out there than RSS feeds to keep me up to date with my list of sites that keep me informed about the things I am interested in. I don't want to miss posts and I don't want to manually check all of the sites constantly. For now I have found comfort in the arms of Feedly, and it has been... okay. I still miss the Platonic Ideal of simple, efficient, clean layouts that was Google Reader.