When I created eGF the goal was for it to be a descriptive stat – something that helps to explain what’s going on (in terms of how goals are scored) better than currently available stats. In order to prove its value over something like Corsi what I need to do is prove that a) it is a better descriptive stat than Corsi (i.e. correlates better with current GF) and b) it is at least as useful as a predictive stat.
Before I get into any charts or data, I want to explain my rationale on predictive stats. I think you should always put yourself in the role of a decision maker, whether that’s a fantasy sports GM or an actual GM. The role of a predictive stat is to try to determine the kind of year a player will have. In other words, will a specific player help you win in the future? Because I believe there is a strong temporal element to hockey (meaning players or teams can go hot/cold), and I feel that random sampling to determine the predictive nature of a stat (i.e. taking 41 random games of data to predict the other 41) removes that element, I always gauge predictive stats on a season-to-season basis. Basically, if a guy has good numbers in a stat in one year, how likely is it that this leads to success (as measured by goals for) in the next year?
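To make the season-to-season framing concrete, here is a minimal sketch of how such a check could be set up. The record layout, the player labels, and every number are made up for illustration, and `pair_consecutive_seasons` and `pearson_r` are hypothetical helper names rather than anything taken from the eGF model itself.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pair_consecutive_seasons(rows, stat, target, prev_season, next_season):
    """Pair each player's previous-season stat with his next-season target.

    Only players who appear in both seasons are kept.
    """
    prev = {r["player"]: r[stat] for r in rows if r["season"] == prev_season}
    nxt = {r["player"]: r[target] for r in rows if r["season"] == next_season}
    players = sorted(prev.keys() & nxt.keys())
    return [prev[p] for p in players], [nxt[p] for p in players]


# Made-up rows purely for illustration (not real player data).
rows = [
    {"player": "A", "season": "2013/2014", "eGF20": 0.90, "GF20": 0.95},
    {"player": "A", "season": "2014/2015", "eGF20": 0.88, "GF20": 0.85},
    {"player": "B", "season": "2013/2014", "eGF20": 0.70, "GF20": 0.60},
    {"player": "B", "season": "2014/2015", "eGF20": 0.75, "GF20": 0.72},
    {"player": "C", "season": "2013/2014", "eGF20": 0.55, "GF20": 0.58},
    {"player": "C", "season": "2014/2015", "eGF20": 0.50, "GF20": 0.49},
    {"player": "D", "season": "2013/2014", "eGF20": 0.80, "GF20": 0.90},
    {"player": "D", "season": "2014/2015", "eGF20": 0.78, "GF20": 0.70},
    # Player E has no 2014/2015 row, so he drops out of the pairing.
    {"player": "E", "season": "2013/2014", "eGF20": 0.65, "GF20": 0.66},
]

prev_egf20, next_gf20 = pair_consecutive_seasons(
    rows, "eGF20", "GF20", "2013/2014", "2014/2015"
)
r = pearson_r(prev_egf20, next_gf20)
print(f"players in both seasons: {len(prev_egf20)}, r = {r:.3f}")
```

The point of pairing whole seasons, rather than shuffling games, is that any hot or cold stretch stays inside the season it happened in.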
This post will deal entirely with forward data; I will do a follow-up with the defense data at some point in the future. Also, so as not to cloud the picture I am using only data (for descriptive stats) from 2014/2015 and from players that also played in 2013/2014. This allows us to set a baseline for the predictive part. The reason why I limit the data to 2014/2015 is because 2010/2011 to 2013/2014 was all in my training data set for eGF, and if you build a model and then compare it to the training data you better hope that it does well. If it’s a regression, then it will almost automatically do well (provided your independent variables were chosen well). So let’s just cut straight to the chase and see if eGF is a better descriptive stat for forwards than CF.
As you can see eGF20 outperformed CF20 by a pretty wide margin (statistically significant to 99.99%) and was in turn outperformed by PTS20 by a similar margin (statistically significant to 100%). A couple more points of explanation here: GF20 is a team-level stat, meaning it is the observed number of goals that a team scored while that player was on the ice. eGF20 is also a team stat in that it is the expected goals for the team while that player was on the ice. PTS20 is an individual stat, and represents the points that individual received (per 20 minutes) on the GF that his team got. Now, it is obviously problematic to use points because points can only exist if goals are scored, so by default the correlation with points will be reasonably high. The other problem with points is that it has no value as a descriptive stat for defensive measures.
This chart illustrates that point perfectly. PTS20 has almost no correlation with GA20, and as a descriptive measure is useless. eGA20, on the other hand, has a reasonable correlation (it’s low because I believe forwards have less control over GA than defense does) that is significantly higher than that of CA20 (statistically significant to 99.8%). So eGA is a better descriptive measure than CA and eGF is a better descriptive measure than CF for forwards in this data set.
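The per-20 rates and the descriptive comparison discussed above can be sketched in a few lines. This is an illustration only: `per20` and `r_squared` are hypothetical helpers, the on-ice totals are invented, and using the squared Pearson correlation as the yardstick is an assumption of this sketch, not necessarily how the charts were produced.

```python
from math import sqrt


def per20(count, toi_minutes):
    """Scale a raw on-ice count to a rate per 20 minutes of ice time."""
    if toi_minutes <= 0:
        raise ValueError("time on ice must be positive")
    return count * 20.0 / toi_minutes


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up on-ice totals for five forwards (not real data).
forwards = [
    {"toi": 1100.0, "GF": 52, "eGF": 49.5, "CF": 1010},
    {"toi": 950.0, "GF": 38, "eGF": 40.1, "CF": 900},
    {"toi": 1200.0, "GF": 47, "eGF": 50.2, "CF": 1150},
    {"toi": 800.0, "GF": 25, "eGF": 28.3, "CF": 690},
    {"toi": 1020.0, "GF": 44, "eGF": 41.0, "CF": 940},
]

gf20 = [per20(f["GF"], f["toi"]) for f in forwards]
egf20 = [per20(f["eGF"], f["toi"]) for f in forwards]
cf20 = [per20(f["CF"], f["toi"]) for f in forwards]

# Descriptive check: same-season stat against same-season observed GF20.
print(f"R^2 of eGF20 with GF20: {r_squared(egf20, gf20):.3f}")
print(f"R^2 of CF20 with GF20:  {r_squared(cf20, gf20):.3f}")
```

The same two helpers would apply to the defensive side by swapping in GA, eGA, and CA totals.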
For descriptive purposes there is no reason to use Corsi when you could use eGF for forwards. Now the question is – is there a reason to use it for predictive purposes? I’m going to channel @IneffectiveMath here and let you know the short answer is no, because there are no particularly great ways to predict next season’s performance (as measured by GF20/GA20/GF%).
To establish a baseline let’s use the same data set for the descriptive stats to show their predictive abilities. That means using 2013/2014 statistics to predict 2014/2015 performance. Here is the chart:
We cannot say that there is a significant difference (>95% confidence) between any of the measures in predicting future GF20. That being said, previous GF20 (pGF20) was the best predictor in that data set followed by eGF20, then by CF20. There appears to be some ability to guess a forward’s ability to generate goals for based on his previous year’s stats (no matter which one you use).
When you look at the above graph, for all forward data from 2011/2012 to 2014/2015, you can see two things: 1) Excel is a terrible program that wouldn’t let me change the style of this chart for some reason and 2) the relationship holds true with more data. None of the correlations are different from each other to a statistically significant degree (95% confidence), but they are all significantly different from zero and show some ability to predict GF20 based entirely on the previous season’s numbers. Once again pGF20 outperformed eGF20, which outperformed CF20 (for this data set), but again not to a statistically significant level.
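The post does not say which significance tests were used, so the following is only one possible way to run both kinds of check mentioned above. It assumes a Fisher z-transform interval to ask whether a single correlation differs from zero, and a percentile bootstrap over players to ask whether two predictors of the same outcome differ from each other. All data is synthetic, and `fisher_ci` and `bootstrap_diff_ci` are hypothetical names.

```python
import random
from math import atanh, sqrt, tanh


def pearson_r(xs, ys):
    """Pearson correlation, or None if either sequence has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% interval for a correlation using Fisher's z-transform."""
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z, se = atanh(r), 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


def bootstrap_diff_ci(a, b, y, n_boot=1000, seed=0):
    """Percentile 95% interval for r(a, y) - r(b, y), resampling players."""
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        ra = pearson_r([a[i] for i in idx], ys)
        rb = pearson_r([b[i] for i in idx], ys)
        if ra is not None and rb is not None:  # skip degenerate resamples
            diffs.append(ra - rb)
    diffs.sort()
    m = len(diffs)
    return diffs[int(0.025 * m)], diffs[min(m - 1, int(0.975 * m))]


# Synthetic stand-ins for previous-season stats and next-season GF20.
rng = random.Random(42)
skill = [rng.gauss(0, 1) for _ in range(200)]
next_gf20 = [s + rng.gauss(0, 2) for s in skill]
prev_gf20 = [s + rng.gauss(0, 1) for s in skill]
prev_cf20 = [s + rng.gauss(0, 1.5) for s in skill]

r_gf = pearson_r(prev_gf20, next_gf20)
zero_lo, zero_hi = fisher_ci(r_gf, len(next_gf20))
diff_lo, diff_hi = bootstrap_diff_ci(prev_gf20, prev_cf20, next_gf20)
print(f"r = {r_gf:.3f}, 95% interval ({zero_lo:.3f}, {zero_hi:.3f})")
print(f"difference in r, 95% interval ({diff_lo:.3f}, {diff_hi:.3f})")
```

If the first interval excludes zero, the correlation is distinguishable from zero at roughly the 95% level. If the second interval contains zero, the two predictors cannot be separated at that level. The bootstrap is used for the comparison because both correlations share the same outcome column, so they are not independent of each other.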
This is more or less in line with what I’d expect. A forward has a significant degree of control over the offensive zone, but he is just one of 10 skaters on the ice. Offensive pressure might be primarily controlled by 5 (or 6) players (being the 3 attacking forwards, 2 defense and possibly a center), so a player’s offensive skill (or lack thereof) should be somewhat consistent season to season, though masked by other factors (linemates, opposition, system, etc). For defensive play I would expect that the average forward has much less to do with defensive success, and so I would expect to see less correlation with future GA. Which is exactly what you see in the next chart:
These correlations are not statistically different from each other (to 95% confidence), but eGA20 outperformed CF20 which outperformed GA20. This chart tells me that systems, defense, and goaltending are likely bigger parts of predicting GA20 than forwards are. Things don’t change much if you add in more data (2011/2012 to 2014/2015): all the R^2s stay below 0.1 and none of the differences are statistically significant (to 95% confidence). The reason I included GA20 in the prediction stats is the predilection people have for using CF% or GF% as a measure to evaluate a player. As we can see, for forwards what’s most in their control (as it relates to goals) is goals for; they have much less control over goals against.
That’s why charts like the one above don’t necessarily make a lot of sense to me. Well, they make sense; it just doesn’t make sense to look at them and think you’re looking at something useful. Predicting a GF% for a player relies on so many things going right. On the offensive side he needs to maintain his SH%, maintain the quality of his chances, and likely keep the same system and his regular linemates. On the defensive side he has less control (as mentioned previously) and has to rely on his goalie saving the same percentage of chances and on the quality of chances against remaining the same. With system changes, player changes, and injuries, it is very unlikely that the prediction endeavour (based on predicting his GF%) will be successful.
All of that being said, when using previous-year information to predict current-year information, CF% had a higher correlation than eGF%, which had a higher correlation than GF%. None of the differences were even a little bit significant from a statistical perspective (p-values around 0.5).
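To see why GF% is such a fragile target, it can help to write it out in terms of its moving parts. The sketch below assumes the usual identities that goals for equal shots for times on-ice SH%, and goals against equal shots against times one minus on-ice SV%. The shot totals and percentages are invented for illustration, and `gf_pct` is a hypothetical helper.

```python
def gf_pct(shots_for, sh_pct, shots_against, sv_pct):
    """On-ice GF% from shot volumes and shooting/save percentages (as fractions)."""
    gf = shots_for * sh_pct
    ga = shots_against * (1.0 - sv_pct)
    if gf + ga == 0:
        raise ValueError("no goals either way; GF% is undefined")
    return gf / (gf + ga)


# Same shot volume both ways; only the percentages move.
baseline = gf_pct(500, 0.08, 500, 0.92)
cold_shooting = gf_pct(500, 0.07, 500, 0.92)     # on-ice SH% drops one point
weak_goaltending = gf_pct(500, 0.08, 500, 0.91)  # on-ice SV% drops one point

print(f"baseline GF%:         {baseline:.1%}")
print(f"SH% down one point:   {cold_shooting:.1%}")
print(f"SV% down one point:   {weak_goaltending:.1%}")
```

In this toy case a one-point swing in either percentage moves GF% from 50% to roughly 47% with no change at all in shot volume, which is the kind of noise a year-over-year GF% prediction has to survive.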
eGF has been shown to be a superior descriptive statistic to Corsi in measuring both goals for and against, for forwards, while simultaneously being statistically similar as a predictive measure for goals for, against, and GF%. In my (very biased) opinion, this removes the need to look at Corsi metrics to quantify the play of individuals, as eGF is equal or superior in every (statistical) regard.