I’m taking a break from the usual modeling. If Bayesian models of basketball performance are what you signed up for, I’ll be back to those shortly.
My workflow for all my NBA models used to look like the following:
Write a Stan model
Save it in a random folder
Run a generic python script that trains the model
Save the model to disk, somewhere
Open up a new jupyter notebook, inspect the model
Make plots
Git commit everything, because why not?
This was fine for my first two or three models, but it quickly became a mess. There were a few issues - mainly, organization and reproducibility were lacking. What model did I run in this post predicting player streaks? What training data did I use? I can't even find the directory I did this all in. If I behaved like this at work, people would assume I was drinking on the job.
So last weekend, I reworked everything from the ground up. Here, I’m going to walk through everything and include code snippets where appropriate.
My key requirements were a workflow that was:
Reproducible
Not a mess
Low overhead for each incremental model
MLFlow
For organizing my models, experiments, and training runs, I'm using MLFlow. This is a great tool for doing data science in a reproducible, organized manner. It's so lightweight that I'll use it at work even if I'm just doing a one-off experiment.
Server
It starts by running an MLFlow server. You don't need to go crazy on the storage backend if it's just for personal use. Ideally, I would use the local file system as my storage backend, but for some reason, MLFlow won't let you run a model registry on top of file system storage (???). So the next simplest storage backend is a sqlite db.
#!/bin/bash
mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./artifacts \
    --host 0.0.0.0
Conda Environment
Then, I write a yaml file containing my conda environment. This allows for reproducible runs. When you run an MLFlow experiment, you tell it which conda environment to use. It uses and logs that environment, so you can run it in the exact same manner at a later date.
Here's a bare-bones environment file:
name: stanflow_env
dependencies:
  - python=3.8
  - matplotlib=3.5.0
  - pandas=1.3.5
  - bokeh=2.4.2
  - pip
  - pip:
      - pystan==3.3.0
      - mlflow==1.23.1
      - arviz==0.11.4
Projects
Next, I define an MLProject file. This is really convenient, because you can see at a quick glance all of the entry points to your models and what parameters you can use. It also interacts well with logging your experiments (which I'll come to in a bit).
name: stan_modeling
conda_env: stanflow_env.yaml
entry_points:
  shooting_distance:
    parameters:
      model_name: {type: str, default: 'shooting_distance_distribution'}
      year: {type: int}
    command: "python shooting_distance.py --model_name {model_name} --year {year}"
In the MLProject file above, I'm showing the entry point shooting_distance, which was the entry point for my last article. Notice how it takes two parameters: model_name tells it which model file to use, and year tells it which year of data to analyze. Having model_name as a parameter allows me to easily iterate on different Stan models (which before was an organizational nightmare).
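For completeness, here's roughly what the receiving end of those parameters looks like. This is a sketch rather than my exact script - the flags just mirror the MLProject command above.

# Top of shooting_distance.py: the MLProject parameters arrive as plain CLI flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, default="shooting_distance_distribution")
parser.add_argument("--year", type=int, required=True)
args = parser.parse_args()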
Models
The Stan model definitions go in a giant directory of models. Maybe this needs some reworking, or maybe not; I haven't decided. All of the real organization happens in models.py. models.py contains a base class, GenericStanModel, which has all of the methods that I would want to run on any Stan model - namely fitting, predicting, diagnostics, and logging. Then, for each model I want to run, I inherit from GenericStanModel and add the few specifics for that model - usually just organizing the training data the specific model requires.
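To make that concrete, here's an illustrative skeleton of models.py. The method names and details are a sketch (assuming pystan 3 and arviz), not the exact code, but the division of labor is the point: everything generic lives in the base class.

# models.py - illustrative skeleton of the base class.
import arviz as az
import matplotlib.pyplot as plt
import mlflow
import stan  # pystan 3


class GenericStanModel:
    """Everything common to every Stan model: fitting, diagnostics, logging."""

    def __init__(self, model_name, model_dir="stan_models"):
        self.model_name = model_name
        with open(f"{model_dir}/{model_name}.stan") as f:
            self.model_code = f.read()

    def prepare_data(self, df):
        # Subclasses turn a DataFrame into the dict the Stan program expects.
        raise NotImplementedError

    def fit(self, df, num_chains=4, num_samples=1000):
        posterior = stan.build(self.model_code, data=self.prepare_data(df))
        self.fit_ = posterior.sample(num_chains=num_chains, num_samples=num_samples)
        return self.fit_

    def log_diagnostics(self):
        # Generate the standard arviz diagnostic plots and attach each one
        # to the active MLflow run as an artifact.
        idata = az.from_pystan(posterior=self.fit_)
        for name, plot in [("trace", az.plot_trace),
                           ("autocorr", az.plot_autocorr),
                           ("ess", az.plot_ess)]:
            plot(idata)
            plt.savefig(f"{name}.png")
            plt.close("all")
            mlflow.log_artifact(f"{name}.png")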
MLFlow run
The nice part is that all of this infrastructure is built only once; each additional model is just a Stan model (which you would always need to write anyway) plus ~20 lines of python to hook it in.
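Here's a sketch of that per-model glue, continuing the argparse stub from the Projects section. The class body, column names, and data path are illustrative, not my exact code.

# Rest of shooting_distance.py: the ~20 lines of per-model glue.
import pandas as pd
from models import GenericStanModel


class ShootingDistanceModel(GenericStanModel):
    # The only model-specific code: shaping this model's training data.
    def prepare_data(self, df):
        return {"N": len(df), "distance": df["shot_distance"].values}


df = pd.read_csv(f"data/shots_{args.year}.csv")  # hypothetical data path
model = ShootingDistanceModel(args.model_name)   # args from the argparse stub above
model.fit(df)
model.log_diagnostics()  # attaches the diagnostic plots to the active MLflow run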
Once you get all this set up, you basically just run it.
#!/bin/bash -ex
export MLFLOW_EXPERIMENT_NAME=shooting_distribution
export MLFLOW_TRACKING_URI='http://0.0.0.0:5000'
for y in {2000..2022}
do
    mlflow run -e shooting_distance -P year=${y} .
done
So what does that run give you? Because of all the infrastructure, it organizes and logs everything.
Tracking
Here, I’m going to walk through the Tracking UI that everything gets logged to.
First, you get a nice dashboard of every run (in the example above, one run for each year).
Everything is pretty self-explanatory, but a very useful column is "Version". This is the git hash of the code you ran, so if you want to go back to the exact code you used for a previous run, you can check out that commit.
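You can also get at runs programmatically. Here's a quick sketch (the tracking URI and experiment name match the run script above); MLflow stores the git hash under the mlflow.source.git.commit tag:

import mlflow

mlflow.set_tracking_uri("http://0.0.0.0:5000")
exp = mlflow.get_experiment_by_name("shooting_distribution")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])  # pandas DataFrame
print(runs["tags.mlflow.source.git.commit"].head())

From there, a git checkout of that hash gets you back to the exact code.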
If you click into a run, you get a detailed view. There's a lot of useful information here, but the most important piece is the exact "Run Command" that was used to produce the model. If you scroll down, the "Artifacts" section is the most useful part.
Here I auto-generate a variety of diagnostic plots to check whether my model had issues fitting: autocorrelation, divergences, ESS, traces, PPC. These are plots you should make for every model anyway, so any time I fit a model I spit them out for easy inspection. I used to do this manually for every model; now it's nice that I get them for free every time.
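As an aside, if you'd rather check numbers than eyeball plots, the same arviz objects give you those too. A quick sketch, using the fit object from the base class above:

# Numeric companions to the plots: divergent transitions, ESS, and R-hat.
import arviz as az

idata = az.from_pystan(posterior=model.fit_)
print("divergences:", int(idata.sample_stats["diverging"].sum()))
print(az.summary(idata)[["ess_bulk", "r_hat"]])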
Next I have the model definition, so I can quickly remember what model was fit. And below that, I store the training data.
Model Registry
All models I trained are automatically put in a Model Registry.
This has a few benefits. The API for retrieving models is very convenient. To retrieve the latest version of a given model:
import mlflow

model = (mlflow
         .pyfunc
         .load_model(model_uri=f"models:/{model_name}/latest"))
So whenever I want to analyze a model in depth, it’s a consistent API for pulling the model. And I can easily pull different versions of the same model, all while being organized.
The second key part is that I can put models into “Staging” and “Production”. So for persistent models like my game prediction model, I can have a production model, and many in-progress models that I can promote to production as needed. This allows me to swap out the production model with ease, and still make seamless predictions on demand.
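Here's a sketch of what that promotion looks like through the client API; the model name and version number are hypothetical.

# Promote an in-progress version to Production, then load whatever is
# currently in Production - the downstream prediction code never changes.
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="game_prediction",  # hypothetical registered model name
    version=7,               # the in-progress version being promoted
    stage="Production",
)

model = mlflow.pyfunc.load_model("models:/game_prediction/Production")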
Summary
I wanted an organizational system that was:
Reproducible
Not a mess
Low overhead for each incremental model
This system achieves those goals:
Every run and model is stored with its git hash, run command, training data, and all model metadata (reproducible)
Everything is displayed in a convenient UI (not a mess)
Each new model takes ~20 lines of python code to take advantage of the whole infrastructure and auto-logging (low overhead)
Let me know if you have any feedback or areas this could be improved. What do you do with all of your models?