On Twitter, we put out a prediction last night.
And then it happened.
We’ve been working on models to predict games for about a decade now, and this current iteration is really exciting. Here we explain what makes it different from the typical prediction model.
If you want a prediction machine out of the box, it’s hard to beat xgboost. In a wide variety of tasks (including NBA game prediction), xgboost has the best predictive accuracy without needing to spend much time fiddling with the model. In fact, training the model is usually as simple as doing something like xgboost.fit(). Literally a single line. Most of your time is spent deciding what features to throw into the xgboost model and then, downstream of that, optimizing the parameters of the model by brute force.
A completely hand crafted model
But now we’re taking a Bayesian approach. This offers a huge degree of flexibility, but comes at the expense of needing to hand-craft every aspect of our model (we don’t get anything out of the box). Specifically, we settled on Stan, a state-of-the-art platform for statistical modeling and high-performance statistical computation. Our work-in-progress model is posted at the bottom, but we’ll dive into the details first.
The best part about Stan is you can inspect every single aspect of your model to understand how it’s working. And for free you get to see the uncertainty in every feature of your model. As an illustrative example, our model incorporates home court advantage. And we can directly inspect what the model has learned about home court advantage.
We see that the model thinks home court advantage is currently worth about 2.5 points. The model isn’t completely sure, though: home court might be worth as little as 1 point or as much as 4 points. And this uncertainty gets propagated all the way through to the model’s final prediction. Every aspect of our model does this.
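To make "inspecting a parameter" a bit more concrete, here is a minimal Python sketch that summarizes a set of posterior draws with a mean and a central 95% interval. The draws are simulated placeholders (generated from a normal distribution purely for illustration), not output from our fitted model, and the helper name summarize_draws is hypothetical. With a real fit you would pass in the sampled values of home_advantage instead.

```python
import random
import statistics


def summarize_draws(draws, interval=0.95):
    """Return (mean, lower, upper) for a list of posterior draws.

    The bounds are empirical quantiles, so interval=0.95 gives roughly
    the 2.5th and 97.5th percentiles of the draws.
    """
    ordered = sorted(draws)
    tail = (1.0 - interval) / 2.0
    lower_index = int(tail * (len(ordered) - 1))
    upper_index = int((1.0 - tail) * (len(ordered) - 1))
    return statistics.fmean(ordered), ordered[lower_index], ordered[upper_index]


# Placeholder draws: NOT real model output. They are simulated so the
# example runs on its own; the center and spread are assumptions.
rng = random.Random(0)
fake_home_advantage_draws = [rng.gauss(2.5, 0.75) for _ in range(4000)]

post_mean, lower, upper = summarize_draws(fake_home_advantage_draws)
print(f"home advantage: {post_mean:.1f} points "
      f"(95% interval {lower:.1f} to {upper:.1f})")
```

The same summary works for any parameter in the model, which is what we mean by getting the uncertainty "for free".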
So when we make final predictions, we get huge uncertainties. People would be surprised to see that our model says it’s possible that the Lakers might win by 20 points or they might lose by 20 points, but it makes sense if you think about it. It happens all the time. If our model thought it wasn’t possible for the Lakers to lose by 20 points, there would be a problem. The key insight, though, is that the model thinks it’s much more likely that the Lakers win by a few points than that they lose by 20 points.
So when we say that Detroit will cover the spread with a 71% chance, that number already accounts for all of this uncertainty. Maybe the Lakers blow them out, which is entirely feasible (and the model is well aware of this possibility). But more likely it’s a close game with the Lakers winning, which in this case is exactly what happened.
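A cover probability like this is just a count over simulated outcomes. The sketch below shows the idea in Python, assuming you already have draws of the final margin (underdog score minus favorite score). The draws, the spread of 8.5, and the helper name probability are all made up for illustration; they are not the numbers behind our Detroit prediction.

```python
import random


def probability(draws, condition):
    """Fraction of draws for which condition(draw) is true."""
    return sum(1 for d in draws if condition(d)) / len(draws)


# Placeholder draws of (underdog score - favorite score). NOT real model
# output: the center (-3) and spread (12) are assumptions for the example.
rng = random.Random(1)
margin_draws = [rng.gauss(-3.0, 12.0) for _ in range(20000)]

spread = 8.5  # hypothetical number of points given to the underdog

# The underdog covers when its margin plus the spread is positive.
p_cover = probability(margin_draws, lambda m: m + spread > 0)
p_favorite_wins = probability(margin_draws, lambda m: m < 0)
p_favorite_by_20 = probability(margin_draws, lambda m: m <= -20)
p_underdog_by_20 = probability(margin_draws, lambda m: m >= 20)

print(f"underdog covers:       {p_cover:.2f}")
print(f"favorite wins:         {p_favorite_wins:.2f}")
print(f"favorite wins by 20+:  {p_favorite_by_20:.2f}")
print(f"underdog wins by 20+:  {p_underdog_by_20:.2f}")
```

Both blowouts get a nonzero probability, but most of the draws land in the close-game range, which is why the favorite can be likely to win while the underdog is still likely to cover.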
The Underbelly
Here’s the full Stan model. It’s a work in progress, but it’s already over 150 lines of code. Maybe that doesn’t seem like much, but remember, most models are literally a single line of code (remember xgboost.fit()?). And that’s not counting the 1000s of lines of code needed to scrape and clean the historical training data required for this and any other model.
As we expand on the model, we’ll post periodic updates. Subscribe if you want to get those updates in your inbox.
If you are satisfied with just a high level understanding of the model, you can stop reading here. Otherwise, if you want to dig into the code, here it is:
// Omnibus model, iteratively throwing in the kitchen sink
data {
  //----------------------------------------//
  // Historical game data                   //
  //----------------------------------------//
  int<lower=0> n_games;                  // Number of games
  int<lower=0> n_teams;                  // Number of teams
  int<lower=0> away_teams[n_games];      // Away team for each game
  int<lower=0> home_teams[n_games];      // Home team for each game
  real<lower=0> away_score[n_games];     // Away score for each game
  real<lower=0> home_score[n_games];     // Home score for each game
  real home_offensive_inactive[n_games]; // Home offensive inactive score for each game
  real home_defensive_inactive[n_games]; // Home defensive inactive score for each game
  real away_offensive_inactive[n_games]; // Away offensive inactive score for each game
  real away_defensive_inactive[n_games]; // Away defensive inactive score for each game
  //----------------------------------------//
  // Data only used to generate predictions //
  //----------------------------------------//
  int n_games_prediction; // Number of games
  // Scores for players confirmed out
  real home_offensive_inactive_prediction[n_games_prediction];
  real home_defensive_inactive_prediction[n_games_prediction];
  real away_offensive_inactive_prediction[n_games_prediction];
  real away_defensive_inactive_prediction[n_games_prediction];
  // Scores for players confirmed out and Day-by-day
  real home_offensive_inactive_prediction_questionable[n_games_prediction];
  real home_defensive_inactive_prediction_questionable[n_games_prediction];
  real away_offensive_inactive_prediction_questionable[n_games_prediction];
  real away_defensive_inactive_prediction_questionable[n_games_prediction];
}
parameters {
  //----------------------------------------//
  // Team performance parameters            //
  //----------------------------------------//
  real home_advantage;                  // Home team advantage
  real<lower=0> sigma;                  // Model error (std. dev. of scores)
  real<lower=0> alpha;                  // Model intercept
  vector[n_teams] team_offense;         // Team offense
  real<lower=0> team_offense_sigma_bar; // Pooled std. dev. of team offense
  vector[n_teams] team_defense;         // Team defense
  real<lower=0> team_defense_sigma_bar; // Pooled std. dev. of team defense
  //----------------------------------------//
  // Inactive Player parameters             //
  //----------------------------------------//
  real beta_offensive_inactive;
  real beta_defensive_inactive;
}
model {
  //----------------------------------------//
  // Priors                                 //
  //----------------------------------------//
  team_offense_sigma_bar ~ cauchy(0, 5);
  team_offense ~ normal(0, team_offense_sigma_bar);
  team_defense_sigma_bar ~ cauchy(0, 5);
  team_defense ~ normal(0, team_defense_sigma_bar);
  sigma ~ cauchy(0, 5);
  alpha ~ normal(100, 5);
  home_advantage ~ normal(0, 5);
  beta_offensive_inactive ~ normal(0, 5);
  beta_defensive_inactive ~ normal(0, 5);
  //----------------------------------------//
  // Model                                  //
  //----------------------------------------//
  for (game in 1:n_games) {
    home_score[game] ~ normal(alpha +
        home_advantage +
        beta_offensive_inactive * home_offensive_inactive[game] +
        beta_defensive_inactive * away_defensive_inactive[game] +
        team_offense[home_teams[game]] -
        team_defense[away_teams[game]],
        sigma);
    away_score[game] ~ normal(alpha +
        beta_offensive_inactive * away_offensive_inactive[game] +
        beta_defensive_inactive * home_defensive_inactive[game] +
        team_offense[away_teams[game]] -
        team_defense[home_teams[game]],
        sigma);
  }
}
generated quantities {
  vector[n_games_prediction] home_predictions;
  vector[n_games_prediction] away_predictions;
  vector[n_games_prediction] total_score_predictions;
  vector[n_games_prediction] home_team_diff_predictions;
  // Same quantities, but considering 'day-by-day' players to be out for the game
  vector[n_games_prediction] home_predictions_questionable;
  vector[n_games_prediction] away_predictions_questionable;
  vector[n_games_prediction] total_score_predictions_questionable;
  vector[n_games_prediction] home_team_diff_predictions_questionable;
  for (game in 1:n_games_prediction) {
    home_predictions[game] = normal_rng(alpha +
        home_advantage +
        beta_offensive_inactive * home_offensive_inactive_prediction[game] +
        beta_defensive_inactive * away_defensive_inactive_prediction[game] +
        team_offense[home_teams[game]] -
        team_defense[away_teams[game]],
        sigma);
    away_predictions[game] = normal_rng(alpha +
        beta_offensive_inactive * away_offensive_inactive_prediction[game] +
        beta_defensive_inactive * home_defensive_inactive_prediction[game] +
        team_offense[away_teams[game]] -
        team_defense[home_teams[game]],
        sigma);
    total_score_predictions[game] = home_predictions[game] + away_predictions[game];
    home_team_diff_predictions[game] = home_predictions[game] - away_predictions[game];
  }
  // Repeat above but consider 'day-by-day' players to be out for the game
  for (game in 1:n_games_prediction) {
    home_predictions_questionable[game] = normal_rng(alpha +
        home_advantage +
        beta_offensive_inactive * home_offensive_inactive_prediction_questionable[game] +
        beta_defensive_inactive * away_defensive_inactive_prediction_questionable[game] +
        team_offense[home_teams[game]] -
        team_defense[away_teams[game]],
        sigma);
    away_predictions_questionable[game] = normal_rng(alpha +
        beta_offensive_inactive * away_offensive_inactive_prediction_questionable[game] +
        beta_defensive_inactive * home_defensive_inactive_prediction_questionable[game] +
        team_offense[away_teams[game]] -
        team_defense[home_teams[game]],
        sigma);
    total_score_predictions_questionable[game] = home_predictions_questionable[game] + away_predictions_questionable[game];
    home_team_diff_predictions_questionable[game] = home_predictions_questionable[game] - away_predictions_questionable[game];
  }
}
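If the Stan syntax is hard to read, the core of the model is a pair of linear formulas for the expected home and away scores. The Python sketch below mirrors those two formulas for a single set of parameter values. The function name expected_scores, the dictionary layout, and every number except the roughly 2.5-point home advantage mentioned earlier are assumptions for illustration, not fitted values.

```python
def expected_scores(params, home, away,
                    home_inactive=(0.0, 0.0), away_inactive=(0.0, 0.0)):
    """Expected (home, away) scores for one game under one parameter draw.

    Each *_inactive argument is an (offensive, defensive) pair of inactive
    scores, matching the *_offensive_inactive and *_defensive_inactive
    inputs of the Stan model.
    """
    home_off_out, home_def_out = home_inactive
    away_off_out, away_def_out = away_inactive
    home_mean = (params["alpha"]
                 + params["home_advantage"]
                 + params["beta_offensive_inactive"] * home_off_out
                 + params["beta_defensive_inactive"] * away_def_out
                 + params["team_offense"][home]
                 - params["team_defense"][away])
    away_mean = (params["alpha"]
                 + params["beta_offensive_inactive"] * away_off_out
                 + params["beta_defensive_inactive"] * home_def_out
                 + params["team_offense"][away]
                 - params["team_defense"][home])
    return home_mean, away_mean


# Hypothetical parameter values for two made-up teams "A" and "B".
point = {
    "alpha": 100.0,
    "home_advantage": 2.5,
    "beta_offensive_inactive": -1.0,
    "beta_defensive_inactive": 1.0,
    "team_offense": {"A": 3.0, "B": -1.0},
    "team_defense": {"A": 2.0, "B": 0.0},
}

home_mean, away_mean = expected_scores(point, home="A", away="B")
print(f"expected score: home {home_mean:.1f}, away {away_mean:.1f}")
```

In the actual model these means are not used once. Stan evaluates them for every posterior draw and adds normal noise with standard deviation sigma, which is where the spread in the final predictions comes from.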
So glad to see you with new posts again! Huge fan of your work and especially of your transparency with posting your Stan code.
Can I ask how you construct your inactive scores for using home_offensive_inactive, home_defensive_inactive, etc. in the model? Do you use some estimate of individual player offensive/defensive skill and add them up for each team for each game?