What’s the value of a model like SALO compared to Corsi-type stats? Where does it come from?

A count or percentage stat is a *transformation of data* to state *what events happened* in the presence (or absence) of a given player, possibly adjusted for comparison of those outcomes to whole-league outcomes in the same game states (trailing, or on the power play, etc.).

SALO, by contrast, supposes each player has some (latent) *ability to make events happen*, and the data are used to estimate that ability.

Intuitively, the latter is what we’d really like to talk about, at least much of the time. We are a community that skates to where the puck is going, not where it has been. We don’t just want to recount the game, but understand it. What did happen before isn’t necessarily what we expect to happen again – even in identical circumstances. Transformed data can only go so far in telling us about the latter topic.

What follows is an in-depth walkthrough of one method (SALO) for estimating ability in one aspect of the game (shots on goal). For technically inclined readers, the first and last sections may suffice.

SALO is a hierarchical Bayesian model with three parts:

- an ordered logit likelihood for net shots and net penalties each half-second due to on-ice skater ability and game state,
- a Gaussian prior for skater ability on the logit scale,
- and a beta-binomial relationship, with a logit link, between shot ability and games played.

The model is fit to multiple years of regular-season data with the No-U-Turn Sampler, a Hamiltonian Monte Carlo (HMC) algorithm.

Error measures are computed from the empirical distribution of the HMC draws. The draws are also used to construct the empirical posterior predictive distribution for propagating error estimates into model extensions.

Each skater’s ability estimate is translated to a shots-for percentage by taking the ratio of shots for to total shots predicted by the model at even strength with that player on ice alongside nine average icemates.

Projections are drawn from an aging curve created by regressing year-to-year ability change on age, using the fitted prior for a skater with 0 games played as the next-season ability for those who drop out each offseason.
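The aging-curve regression described above can be sketched roughly as follows. This is a minimal illustration, not SALO's actual fit: the ages and year-over-year ability deltas below are made up, and a simple least-squares line stands in for whatever functional form the real model uses.

```python
import numpy as np

# Hypothetical example: regress year-over-year ability change on age
# to form a simple aging curve (ages and deltas below are made up).
ages = np.array([22, 24, 26, 28, 30, 32, 34], dtype=float)
deltas = np.array([0.010, 0.006, 0.002, -0.001, -0.004, -0.008, -0.012])

# Ordinary least squares: delta ~ intercept + slope * age.
slope, intercept = np.polyfit(ages, deltas, 1)

def project(ability, age):
    """Next-season ability = current ability + expected age-related change."""
    return ability + intercept + slope * age

print(round(project(0.05, 30), 4))  # 0.0454
```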

Hyperparameters are estimated. Hyperpriors are not imposed.

Data transformations for player description (like adjusted CF%) are valuable, but don’t come with:

- Correction for icemates (teammates and opponents);
- Correction for sample size;
- Error bars (statements of the expected precision of each value).

Each of the above seems pretty desirable for quantitatively representing what we actually think about players’ abilities given the game data we have.

It’s clear, although non-trivial, how a player’s count stats might be adjusted for, say, the lead their team had when each shot event occurred: divide the count of events in each state (-1, +2, etc.) by the league-wide increase in frequency of the same event in that state compared to tied.
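The division just described might be sketched like this. All numbers here are invented for illustration; the real adjustment would use league-wide rates computed from play-by-play data.

```python
# Sketch of the score-state adjustment described above, with made-up numbers.
# league_rate_ratio[state] = league-wide shot frequency in that score state
# divided by the frequency when tied.
league_rate_ratio = {"-1": 1.08, "0": 1.00, "+1": 0.93, "+2": 0.88}

# A hypothetical player's raw shot-attempt counts by score state.
raw_counts = {"-1": 120, "0": 300, "+1": 90, "+2": 40}

# Deflate each state's count by the league-wide inflation for that state.
adjusted = {s: raw_counts[s] / league_rate_ratio[s] for s in raw_counts}
adjusted_total = sum(adjusted.values())
```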

This technique cannot readily be extended, though, to take the effect of *icemates* out of the stats for any given player. The required corrective factor – the frequency of any event with a given icemate – is the thing being calculated in the first place.

Instead of creating an adjusted count via weighting by multiple frequencies that are themselves unadjusted, we’d like to have a method that tries to figure out all players’ shot (penalty) contributions *simultaneously*.

We can do so via an explicit expression for the *likelihood* that a shot for or against (penalty drawn or taken) occurs at any time, given the ability levels of all the players on ice, which we can plug the actual data into.

A 60% Corsi For number for a player who’s seen 1000 5v5 Corsi events gives us more confidence they’ll prove above average in the future at driving net shot generation than one derived from just 10 events. Regression to the mean is a qualitatively well-accepted part of player assessment.

Quantitatively, however, is another story. Just how far toward the mean do we expect future results to regress for that thousand-event player? What about for the ten-event one?

This second feature becomes part of the model when we add to the likelihood a *prior*, an expression for what we believe about the overall probability of a parameter taking a given value, independent of the data.

The prior, the defining feature of a Bayesian model, lets the model responsibly *share information across players*, balancing league-wide and individual data to appropriately shrink each estimate away from extremes.

It’s not sufficient to put a best guess at a player’s ability out there by itself. It’s important to state as well how precise we think that guess is.

A transformation of the data from a whole population – such as from all the gameplay that happened – doesn’t have an associated variance: the data tell us exactly what happened in the games we observed. Yet we still have only a finite amount of information about each player’s capacity to cause events on ice. The task is to understand abilities; how do we quantify the uncertainty remaining in our understanding given the data?

Fitting a statistical model means applying an *algorithm* to choose the best estimates of its parameters – those that yield the highest joint probability for the observed data (the likelihood) and for the parameters themselves (the prior).

Most choices of a model and an algorithm also provide handy error bars for each estimate, simply by examining how that joint probability decreases as each parameter value is varied from the optimal estimate. Monte Carlo algorithms, including the one used here, go further and provide that estimated error in a form that is easy to reuse and incorporate into subsequent calculations done with the estimates found.

The first part of the model, the *likelihood*, is the equation mapping the ability levels (to be estimated by plugging in the data) for each skater on ice at a given time to the probability of a shot (penalty) for either team.

SALO uses an ordered logit likelihood. In this likelihood, the sum of the current home skaters’ ability levels, minus the sum for the away team, minus an estimated baseline coefficient, maps to a probability of a shot for either team in each half-second of gameplay.

A link function translates a sum of coefficients, each anywhere from negative to positive infinity, into a sensible outcome value for a model. SALO uses an (inverse) logit link to turn skater ability levels into shot probabilities.

Probabilities live between 0 and 1, but it’s a very messy business to put any bounds on coefficients. To prevent absurd predictions (like a shot probability less than zero), the link function needs to translate any real input into an output between zero and one.

The canonical link function for this kind of model is the logit link. Using this link means assuming that the net sum of coefficients defined above represents not the probability, but the *log odds* of a (home team) shot:

- For a probability \(p\) between 0 and 1, the equivalent odds ratio equals \(p / (1-p)\). The odds ratio can be between zero and infinity.
- Taking the natural logarithm of the odds ratio \(\log(p / (1-p))\) gives a number that may be anywhere from negative infinity (as \(p\) approaches 0) to positive infinity (as \(p\) approaches 1).
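The two-step mapping above, and its inverse, can be written in a few lines. This is just the standard logit/inverse-logit pair, not anything SALO-specific:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log odds on the real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map any real number back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))

# Round trip: the log odds of p = 0.25 are about -1.1,
# and inverting recovers 0.25.
print(round(logit(0.25), 2))             # -1.1
print(round(inv_logit(logit(0.25)), 2))  # 0.25
```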

SALO treats the possible outcomes as *ordered*. The net sum of abilities is assumed to have exactly the same effect on the log odds of a home team shot (penalty) at a given time as on the log odds of *not* observing a shot (penalty) for the visitors.

Only the baselines – the log odds of each outcome given equal numbers of average players on ice – differ: most half-seconds we do observe no shots (penalties) taken by the visitors, but don’t observe any shots (penalties) taken by the home team.

SALO estimates the two baselines separately – meaning the model accounts for home-ice advantage by allowing the baseline log odds of a shot (penalty drawn) for the home team to be higher than those of a shot against (penalty taken).
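The ordered three-outcome structure (away shot / no shot / home shot) can be sketched as below. The cutpoint values are illustrative stand-ins for SALO's fitted baselines, chosen only so that "no shot" dominates any given half-second, as in real play.

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def shot_probs(eta, c_away=-6.0, c_home=6.0):
    """Ordered-logit probabilities for one half-second of gameplay.

    eta: net on-ice ability (home sum minus away sum, plus game-state terms).
    c_away, c_home: illustrative cutpoints; SALO's fitted baselines differ.
    Outcomes are ordered: away shot < no shot < home shot.
    """
    p_away = inv_logit(c_away - eta)      # P(shot by the visitors)
    p_home = 1 - inv_logit(c_home - eta)  # P(shot by the home team)
    p_none = 1 - p_away - p_home          # P(no shot either way)
    return p_away, p_none, p_home

# Raising eta shifts probability mass from away shots toward home shots,
# with a single shared coefficient -- the "exactly the same effect" above.
```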

In addition to the players on the ice, the likelihood has parameters representing the effects of the state of the game on the log odds of a shot for either team.

The game state components with modeled effects include:

- Skater advantage (extent of shorthandedness or the opposite);
- Lead and period (being up by 1, 2, or 3+ goals in each frame).

The baselines above thus represent the baselines specifically at even strength when tied; both of the above effects are defined as zero in this case.

A model consisting of only the likelihood described above could be estimated by itself. However, the reliability of each estimate from such a model would depend on how many data points we observed for the corresponding skater. For those with relatively little ice time, small sample size would mean potentially wild ability estimates.

In SALO, a two-part *prior* is joined to the likelihood to help stabilize such estimates. A prior, as noted above, is an equation for the probability that a given coefficient takes a given value in the first place, based on things we already understand before looking at the data.

Each part of the prior includes parameters of its own (known as hyperparameters) that describe *how strongly* the prior should act on the model’s estimates. In SALO, these hyperparameters are themselves estimated together with the individual coefficients. The strength of the prior relative to the data is set using information ultimately donated by the data themselves.

This means that:

- the more a given skater plays, the more they shape the overall picture of skaters given by the prior;
- the less they are observed, the more the overall picture given by the prior shapes the estimate of their ability.

The prior in SALO uses two equations to represent two hopefully uncontroversial ideas:

- Players are average on average.
- Better players get more games.

SALO supposes that the overall distribution of player ability levels is Gaussian (or normal). The standard deviation of the overall Gaussian distribution is left to be estimated.

The idea of regression to the mean is familiar to quantitative hockey readers. A player who puts up extreme numbers is expected in the future to return to more pedestrian ones. The more data we have on a given player, the less strongly we expect to see them regress. Quantifying just how severe that regression will most likely be is often left as an exercise, however.

Incorporating a Gaussian prior means that SALO builds in the idea of regression to the mean. Estimating the standard deviation of the Gaussian means allowing skaters for whom larger samples of data are available to determine how strong that regression is expected to be.
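A stripped-down version of that shrinkage behavior is the normal-normal case. This is a minimal sketch, not SALO's full hierarchical model: it assumes a Gaussian prior centered at zero and independent noisy observations per player, with illustrative standard deviations.

```python
# A minimal normal-normal shrinkage sketch (not SALO's full model):
# with a Gaussian prior N(0, prior_sd^2) on ability and n noisy
# observations per player, the posterior mean pulls the raw average
# toward zero -- more strongly for small n. All numbers illustrative.

def shrunk_estimate(raw_mean, n, obs_sd=1.0, prior_sd=0.1):
    """Precision-weighted average of the player's data and the prior mean (0)."""
    data_precision = n / obs_sd**2
    prior_precision = 1 / prior_sd**2
    return (data_precision * raw_mean) / (data_precision + prior_precision)

# Same raw mean, very different sample sizes:
print(round(shrunk_estimate(0.5, n=10), 3))    # heavily shrunk toward 0
print(round(shrunk_estimate(0.5, n=1000), 3))  # barely shrunk
```

Estimating `prior_sd` from the data, as SALO does with the standard deviation of its Gaussian prior, is exactly what lets the large-sample players calibrate how hard the small-sample players get regressed.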

A player who comes up to the big leagues for a cup of coffee isn’t generally even an average player, of course. It’s a better assumption that they’re somewhat below average. Someone average would play more often. Is there a good way to regress players with the smallest samples toward something other than exactly average?

SALO modifies the Gaussian prior somewhat by including a beta-binomial regression for games played in terms of player ability.

The beta-binomial distribution is a flexible way to model the probability of getting any given number of successes in a given number of chances. For regression purposes, the parameters of the beta-binomial consist of a mean \(\eta\), broken down in terms of the effects of each relevant variable, and a precision \(\phi\).

The mean is a probability – the expected *fraction* of successes, given the input variables. This means using the logit link again! SALO supposes that the *log odds* of a player getting a jersey equals:

- a baseline \(\alpha\) (the log odds for an average player)
- plus a slope \(\gamma\) times their ability.

The baseline and the slope are estimated, meaning the relationship between ability and games played need not come out particularly strong (and it doesn’t).
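The mean part of that relationship can be sketched in a few lines. Only the logit-linked mean is shown here, not the full beta-binomial distribution, and the values of \(\alpha\) and \(\gamma\) are invented for illustration rather than SALO's fitted hyperparameters.

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def expected_games(ability, total_games=82, alpha=1.0, gamma=5.0):
    """Expected games played under the beta-binomial *mean* relationship.

    alpha (baseline log odds) and gamma (slope on ability) are
    illustrative values, not SALO's fitted hyperparameters.
    """
    p = inv_logit(alpha + gamma * ability)  # expected fraction of games
    return total_games * p

# A better-than-average player is expected to dress more often:
print(round(expected_games(0.0), 1))  # average player
print(round(expected_games(0.2), 1))  # above-average player
```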

The precision, between 0 and infinity, adds further flexibility by letting the distribution’s shape vary:

- As \(\phi\) goes to infinity, the distribution of successes looks increasingly like a series of weighted coin flips, one for each chance at success. This doesn’t realistically describe games played; lineups aren’t drawn from a hat each night.
- As \(\phi\) approaches zero, the distribution instead looks more like a *single* weighted coin flip awarding all the possible successes or failures at once. This also doesn’t seem realistic.

In SALO, the precision hyperparameter is estimated as well, letting the beta-binomial shape land somewhere between the two extremes as encouraged by the data.

Modifying the prior to capture the information about ability contained directly in each player’s sample size is an idea from David Robinson at Variance Explained, although Robinson’s *input* variable is appearances and his *output* is ability rather than the other way around as here.

Fitting the model means plugging in the observed data – which team, if any, took a shot each second and who was on the ice at the time – and searching over all possible values of the model’s parameters for ones that yield high values for the model’s joint probability (the product of the likelihood and the prior).

This search is performed with a Monte Carlo algorithm, a technique that boils down to the following steps:

- Propose a set of *any old values* for the parameters.
- *Store* the proposed set of values with probability equal to the model’s probability given those values. Otherwise, drop them.
- Repeat until some desired number of sets of values have been stored.

This results in a set of random draws from the actual probability distribution of the parameters, given the data (the *posterior*). Estimates, including estimates of error, can then be computed from the stored draws:

- Average the stored values of each parameter to get an estimate.
- Take each one’s standard deviation to get a standard error.
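The propose-store-summarize loop can be made concrete with a toy one-parameter sampler. This is a basic Metropolis algorithm with the standard accept/reject ratio, not the NUTS sampler SALO actually uses, and the model (a single mean with a Gaussian prior and four made-up observations) is purely illustrative; the point is the workflow of storing draws and then averaging them.

```python
import math
import random

random.seed(1)

# Toy posterior: a mean theta with a N(0, 1) prior and a handful of
# observations with known noise. Data values are made up.
data = [0.3, 0.8, 0.5, 0.6]

def log_post(theta, obs_sd=0.5):
    log_prior = -0.5 * theta**2  # N(0, 1) prior, up to a constant
    log_lik = sum(-0.5 * ((y - theta) / obs_sd) ** 2 for y in data)
    return log_prior + log_lik

draws = []
theta = 0.0
for _ in range(20000):
    proposal = theta + random.gauss(0, 0.5)  # propose a nearby value
    # Accept with probability min(1, posterior ratio); else keep current.
    if random.random() < math.exp(min(0.0, log_post(proposal) - log_post(theta))):
        theta = proposal
    draws.append(theta)  # store the current value either way

draws = draws[2000:]  # drop warm-up draws
estimate = sum(draws) / len(draws)  # posterior mean = point estimate
sd = (sum((d - estimate) ** 2 for d in draws) / len(draws)) ** 0.5  # std. error
```

For this conjugate toy model the exact posterior mean is about 0.52 with standard deviation about 0.24, so the stored draws recover both the estimate and its error bar.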

There are different versions of this kind of algorithm. One that’s especially efficient for models with very many coefficients is the No-U-Turn Sampler. SALO is fitted by writing the likelihood and prior (and their *gradients*) as functions of the data in `R`, then plugging both the data and the functions into an R implementation of NUTS.^{1}

Model parameters live on somewhat arbitrary scales that aren’t intuitive to read. What does it mean in practice that a player boosts the log odds of a shot on goal for by 0.02?

Once estimates are obtained, further calculations can be done with them to answer questions of more substantive interest. Such *extensions* range from simply transforming the estimates back to meaningful scales to doing additional modeling with the estimates as input data.

The first task is to report ability levels on a clearer scale, like a percentage.

It’s not hard to simply plug a set of chosen data into the likelihood and get the actual probability of a shot for each side. To transform each player’s ability estimate to a comparable percentage of shots for, then, means plugging in a data point in which the player is on the ice and all other conditions are equal.

The SF% reported is the percentage expected:

- with the chosen player on the ice
- along with nine other exactly average skaters
- at even strength, when tied
- averaged between home and away games.
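The four conditions above can be sketched with the same ordered-logit machinery. The cutpoints here are illustrative stand-ins for SALO's fitted baselines (with a small gap playing the role of home-ice advantage), and `sf_pct` simply plugs the player's ability in on each side and averages:

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Illustrative cutpoints (SALO's fitted baselines would differ). The home
# cutpoint is slightly smaller in magnitude than the away one, giving the
# home team a small baseline edge -- the home-ice advantage.
C_AWAY, C_HOME = -6.1, 5.9

def probs(eta):
    """P(away shot), P(home shot) for net ability eta = home sum - away sum."""
    p_away = inv_logit(C_AWAY - eta)
    p_home = 1 - inv_logit(C_HOME - eta)
    return p_away, p_home

def sf_pct(ability):
    """SF% for a player alongside nine average (ability-zero) skaters,
    at even strength when tied, averaged over home and away games."""
    # Player on the home team: eta = ability.
    pa, ph = probs(ability)
    home_sf = ph / (ph + pa)
    # Player on the away team: eta = -ability; their shots-for are p_away.
    pa, ph = probs(-ability)
    away_sf = pa / (pa + ph)
    return 100 * (home_sf + away_sf) / 2
```

By construction an exactly average player (`ability = 0`) comes out at 50%, since the home and away cases mirror each other, while a positive ability raises the expected shots-for share on both sides of the split.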

The same can be done for net-penalties ability. This is not yet reported – there isn’t as much interesting variation in those estimates as in the analogous ones for net shots.

Beyond just scaling separate ability estimates more intuitively, however, it would be desirable to *integrate* those estimates into a single number representing overall value.

SALO 2.0 is paired with a system named MARKOV for generating value numbers on a WAR / 82 scale. MARKOV derives a Markov chain to represent the moment-to-moment evolution of a hockey game from the SALO model, and plugs in parameter estimates including player abilities to yield expected win totals. See [blog/_site/MARKOV.html] for some elaboration on MARKOV.

Versions of SALO before 1.2, released at the end of the 2017-18 regular season, used the implementation of NUTS in the software package Stan. Programming the SALO model into Stan was an easier task than coding it in pure R. However, Stan required three weeks and 40 GB of RAM (you read that right) to produce as many samples as can be obtained from the current implementation in a few hours.↩