I set out to find the statistics that correlate best with winning and to answer some questions I had about the game. I took every team from the last 5 full seasons (2017-2019 and 2021-2022; 2020 is excluded), 150 teams in all, and picked a group of variables to look at. These are the variables:
Main Variable of Focus: Wins
Hitting:
Average
On-Base Percentage
OPS
wOBA
wRC+
Hard Hit Percentage
Average Exit Velocity
Launch Angle
Pitching:
K/9
BABIP Against
ERA
Hard Hit Percentage Against
Line Drive Percentage
Groundball Percentage
Flyball Percentage
Strike Percentage
Left On Base Percentage
Defense:
Framing
Defensive Runs Saved
Ultimate Zone Rating
Outs Above Average
Runs Above Average
Defensive Rating (Fangraphs)
1. AVG vs. OBP with Wins - How much greater is the correlation to OBP?
Year | AVG with Wins | OBP with Wins |
2017 | 0.420 | 0.710 |
2018 | 0.688 | 0.799 |
2019 | 0.572 | 0.847 |
2021 | 0.324 | 0.586 |
2022 | 0.527 | 0.776 |
All Years | 0.459 | 0.704 |
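With a quick comparison, my first question was answered. OBP correlated with wins more strongly than AVG in every season, and by about 0.25 over the full period (0.704 vs. 0.459). Getting on base by any means necessary, rather than focusing only on hits, can make a big difference in a team's win total.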
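For readers who want to see what sits behind each r value, here is a minimal sketch of a Pearson correlation coefficient. It assumes the r values in this post are Pearson coefficients (the default of R's cor function; the post does not say which method was used), and it is written in Python purely as an illustration, since the analysis itself was done in R. The per-team AVG and OBP figures are not listed here, so the example uses the Wins and ERA values of the ten teams from the best and worst table under question 5. The function name pearson_r is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Wins and ERA for the ten teams in the question 5 table (five best, five worst).
# This is NOT the full 150-team data set, only the rows printed in this post.
wins = [111, 108, 107, 107, 106, 47, 47, 52, 52, 54]
era = [2.80, 3.75, 3.66, 3.25, 2.90, 5.26, 5.19, 5.85, 5.15, 5.67]

r_extremes = pearson_r(era, wins)
print(f"ERA vs. wins, ten extreme teams only: r = {r_extremes:.3f}")
```

Because this sample contains only the very best and very worst teams, the result should come out stronger than the full-period ERA correlation of -0.817. It demonstrates the calculation and is not a replacement for the 150-team figure.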
2. DRS vs. OAA/RAA with ERA - Which defensive metric holds more weight?
For the full 5-year data, in correlation with ERA, r = -0.558 for Defensive Runs Saved and r = -0.359 for Runs Above Average, so DRS tracks ERA more closely.
The highest totals during this period were the 2022 New York Yankees with 129 Defensive Runs Saved (99 Wins) and the 2017 Minnesota Twins with 50 Runs Above Average (85 Wins).
The lowest totals during this period were the 2019 Detroit Tigers with -116 Defensive Runs Saved (47 Wins) and the 2017 New York Mets with -43 Runs Above Average (70 Wins).
3. Are wOBA and wRC+ as good of statistics as they are made out to be?
Across the full period, the two statistics correlated with wins quite well:
wOBA r = 0.737 and wRC+ r = 0.773.
In 2022, the correlation was even higher, as wOBA had an r value of 0.808 and wRC+ had an r value of 0.890. This wRC+ correlation coefficient was actually slightly stronger than the correlation that ERA had with wins. These run-estimating statistics are accurate enough that, in that season, they tracked wins better than actual earned runs allowed. They live up to the hype.
4. Do Hard Hit % and Exit Velocity correlate with wOBA and wRC+?
I thought for sure these would be among the most highly correlated statistics. Boy was I wrong. Both X variables in question correlated with wOBA at around 0.3 and with wRC+ at around 0.5.
The plot is clearly scattered with no pattern whatsoever. It is as if we could pick an average exit velocity and a wOBA out of a hat of 150 cards and it would be a plausible combination no matter the outcome. The high point for wOBA is the 2019 Houston Astros with a 0.355 wOBA. That same team was tied for 74th in average exit velocity and tied for 75th in hard hit percentage among the teams on our list. An interesting discovery, but not one I expected to find.
5. What are the best and worst teams of the last 5 full seasons?
Rankings out of 150 total teams
2022 Dodgers | 111 Wins (1st) | 119 wRC+ (3rd) | 2.80 ERA (1st) | 86 DRS (T-7th) |
2018 Red Sox | 108 Wins (2nd) | .340 wOBA (T-5th) | 3.75 ERA (27th) | 10 DRS (T-80th) |
2019 Astros | 107 Wins (T-3rd) | 124 wRC+ (1st) | 3.66 ERA (19th) | 97 DRS (5th) |
2021 Giants | 107 Wins (T-3rd) | .329 wOBA (T-27th) | 3.25 ERA (5th) | 32 DRS (T-48th) |
2022 Astros | 106 Wins (T-5th) | 112 wRC+ (T-12th) | 2.90 ERA (2nd) | 67 DRS (T-18th) |
2019 Tigers | 47 Wins (T-149th) | 77 wRC+ (150th) | 5.26 ERA (146th) | -116 DRS (150th) |
2018 Orioles | 47 Wins (T-149th) | .299 wOBA (T-134th) | 5.19 ERA (T-144th) | -45 DRS (139th) |
2021 Orioles | 52 Wins (T-147th) | 91 wRC+ (T-110th) | 5.85 ERA (150th) | -30 DRS (T-120th) |
2021 DBacks | 52 Wins (T-147th) | 85 wRC+ (T-131st) | 5.15 ERA (141st) | -37 DRS (132nd) |
2019 Orioles | 54 Wins (146th) | .308 wOBA (T-105th) | 5.67 ERA (149th) | -53 DRS (T-141st) |
6. How drastic are changes from year to year?
2017 Data
This year was an odd case in the data: OBP actually correlated with wins better than OPS and wOBA by a slight margin, and better than wRC+ by a shockingly wide margin (0.710 to 0.554). Another oddity is that K/9 was much more strongly associated with a lower ERA in 2017 than in the full-period data.
2018 Data
Referring back to our first question, the correlation between average and wins was the greatest in 2018 by a fair margin, and it trailed the on-base percentage correlation by far less than in the other seasons (0.688 vs. 0.799). Here once again, K/9 correlates well with a lower ERA. That trend doesn't seem to drop off until the middle of our five-season period. ERA also correlated with wins even more strongly in this particular season than over the full period, at 0.876.
2019 Data
In 2019, OBP, OPS, wOBA, and wRC+ were all highly correlated with wins in comparison with other years. Both OBP and wRC+ were actually even more highly correlated with wins than ERA. Hard hit percentage against correlated much more with ERA than in other years. For the full dataset, r = 0.304, while the correlation in 2019 was up to 0.746. Baseball is an interesting game.
2021 Data
There is quite the disparity between pre-2020 and post-2020. The four statistics I mentioned in 2019 fell to their lowest correlation levels with wins, while ERA shot up to 0.889. Hard hit percentage in correlation with ERA took a big step back, while left on base percentage took a slight step forward in this subset. There were 1,465 fewer runs scored in 2021 than in 2019, and even fewer in 2022, which we will look at now.
2022 Data
The statistic that held the most weight this past season was wRC+, correlating with wins at an r value of 0.890. Furthermore, Defensive Runs Saved correlated with wins much more than in other years as well. Looking back at the top teams list from question 5, both the Dodgers and Astros from 2022 appeared on the list and posted the lowest ERAs on our list. In fact, there have not been lower team ERAs since 1972. The league ERA was the lowest since 2014, the home run total was the lowest since 2015, and the league OBP was the lowest since 1972.
7. Can conclusions be drawn from a multiple regression model?
In 2022, the average wins for playoff teams was between 95 and 96, with the lowest win total for a playoff team being the Rays with 86. Let's use 90 wins as our playoff benchmark for good measure.
Using our full data set with model Wins ~ (ERA) + (wRC+) + (DRS)
Model: Wins = 67.63846 - [11.98459 * (ERA)] + [0.65682 * (wRC+)] + [0.03257 * (DRS)]
Isolating ERA: Assuming a 0 DRS and an average 100 wRC+, how good does ERA need to be for a 90 win team?
ERA must be 3.61 or lower to win 90 games - 8 teams did this in 2022, 3 in 2021, 4 in 2020, and only 1 in 2019.
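The 3.61 figure comes from rearranging the model to solve for ERA. A minimal Python sketch of that algebra follows, with the coefficients copied from the model line above. The function names predicted_wins and era_needed are my own, and the original fit was done in R, not with this code.

```python
# Coefficients copied from the fitted model in the text.
INTERCEPT = 67.63846
B_ERA = -11.98459
B_WRC_PLUS = 0.65682
B_DRS = 0.03257

def predicted_wins(era, wrc_plus, drs):
    """Wins estimated by the three-variable linear model."""
    return INTERCEPT + B_ERA * era + B_WRC_PLUS * wrc_plus + B_DRS * drs

def era_needed(target_wins, wrc_plus, drs):
    """Rearrange the model to find the ERA that yields target_wins."""
    rest = INTERCEPT + B_WRC_PLUS * wrc_plus + B_DRS * drs
    return (target_wins - rest) / B_ERA

# League-average offense (100 wRC+) and neutral defense (0 DRS).
required_era = era_needed(90, wrc_plus=100, drs=0)
print(f"ERA needed for 90 wins: {required_era:.2f}")
```

Plugging the result back into predicted_wins returns 90, which is a quick way to confirm the rearrangement.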
Valuing DRS: Assuming a 4.00 ERA and a very high total of 90 DRS, is a high wRC+ still required for 90 wins?
wRC+ of 103 would be needed to reach the 90 win mark as estimated by the model - 11 teams achieved that in 2022, 8 in 2021, 13 in 2020, and 9 in 2019.
What about a 4.00 ERA with a 15 DRS, what is the wRC+ estimate now?
wRC+ of 106 would be needed now. 10 teams achieved that in 2022, 6 in 2021, 10 in 2020, and 6 in 2019.
Big offense: Assuming a 5.00 ERA and a 0 DRS, how good does wRC+ need to be for a playoff chance?
This would be a monumental task with a wRC+ of 125 getting the job done. This would just beat out the total of 124 posted by the 2019 Houston Astros, the highest modern era total for a roster not containing Babe Ruth.
A 5.00 ERA is quite a stretch, what about 4.50 ERA?
A slightly more reasonable wRC+ total of 116, but still no easy task. Of the 150 teams on our list, only 7 crossed that mark. It is hard to overcome bad pitching.
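The four wRC+ targets in this part (103, 106, 125, and 116) can be reproduced by solving the same model for wRC+ instead of ERA. This is a minimal Python sketch using the coefficients from the model line. The function name wrc_plus_needed and the scenarios list are my own.

```python
# Coefficients copied from the fitted model in the text.
INTERCEPT = 67.63846
B_ERA = -11.98459
B_WRC_PLUS = 0.65682
B_DRS = 0.03257

def wrc_plus_needed(target_wins, era, drs):
    """Rearrange the model to find the wRC+ that yields target_wins."""
    rest = INTERCEPT + B_ERA * era + B_DRS * drs
    return (target_wins - rest) / B_WRC_PLUS

# (ERA, DRS) pairs for the four scenarios discussed in the text.
scenarios = [(4.00, 90), (4.00, 15), (5.00, 0), (4.50, 0)]

targets = []
for era, drs in scenarios:
    needed = wrc_plus_needed(90, era, drs)
    targets.append(round(needed))
    print(f"ERA {era:.2f}, DRS {drs:>2}: wRC+ needed = {needed:.2f}")
```

The text reports each target rounded to the nearest whole number. The unrounded values sit slightly above or below those figures, so a team exactly at the rounded number may land a fraction of a win on either side of 90.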
2022 Projected win totals by the model: Would the playoff picture have changed?
Playoff seedings with our model using 2022 statistics:
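Seed | AL Team | Projected Wins | NL Team | Projected Wins |
1 | Astros | 108.63 | Dodgers | 115.04 |
2 | Yankees | 107.83 | Mets | 101.35 |
3 | Guardians | 93.65 | Cardinals | 99.28 |
4 | Blue Jays | 99.96 | Braves | 100.09 |
5 | Mariners | 96.07 | Brewers | 91.77 |
6 | Rays | 93.60 | Padres | 89.23 |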
The AL is 6 for 6 for the exact seeding as seen last October. The NL, however, is 2 for 6 as the NL East teams were flip flopped, the Padres were the 5 seed, and the Brewers were bested by the Phillies. The NL World Series team wasn't even in it! This is why we play the games.
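Two of the projected totals above can be checked by hand, because the 2022 Dodgers and 2022 Astros both appear in the question 5 table with their ERA, wRC+, and DRS. This is a minimal Python sketch that plugs those published values into the model. The dictionary layout and names are my own, and the other ten teams are left out because their inputs are not listed in this post.

```python
# Coefficients copied from the fitted model in the text.
INTERCEPT = 67.63846
B_ERA = -11.98459
B_WRC_PLUS = 0.65682
B_DRS = 0.03257

def predicted_wins(era, wrc_plus, drs):
    """Wins estimated by the three-variable linear model."""
    return INTERCEPT + B_ERA * era + B_WRC_PLUS * wrc_plus + B_DRS * drs

# Inputs taken from the question 5 table.
teams = {
    "2022 Dodgers": {"wins": 111, "era": 2.80, "wrc_plus": 119, "drs": 86},
    "2022 Astros": {"wins": 106, "era": 2.90, "wrc_plus": 112, "drs": 67},
}

projections = {}
for name, t in teams.items():
    projections[name] = predicted_wins(t["era"], t["wrc_plus"], t["drs"])
    diff = projections[name] - t["wins"]
    print(f"{name}: projected {projections[name]:.2f}, actual {t['wins']} ({diff:+.2f})")
```

Both results match the projected totals listed for those teams, and in both cases the model projects a few more wins than the team actually recorded.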
8. Were there any surprising findings?
I wouldn't say there were many, but a few jumped off the page for me:
Left on base percentage (LOB) had quite a high correlation with ERA and with wins through the whole period (r = 0.790 with wins, second only to ERA). Of course keeping runners from scoring is important, but my focus was more on keeping runners off the bases altogether. This statistic proved to be quite significant.
Defensive statistics DRS, OAA, RAA, and DEF were quite insignificant as a whole. Outs Above Average has seemed to grow in popularity, but I don't see its value when looking at our findings in this study. DRS proved to show the most validity and it has been around the longest. Were these new statistics worth examining?
The correlation between hard hit percentage/exit velocity and wOBA was not nearly as high as I thought it would be. Much more emphasis has been put on exit velocity in particular over time, as well as on advanced metrics such as runs created and wOBA. By the sheer fact of them coming onto the scene at a similar time, I assumed there was some connection between the two, but there is nothing significant in actuality.
I wouldn't say the regression model breakdown is "surprising", but definitely intriguing and one that can be useful if expanded.
Variables and their correlation with our main variable of focus
Wins
AVG 0.4595633
OBP 0.7035541
OPS 0.6800451
wOBA 0.7370114
wRC+ 0.7725399
HardHitBat 0.4033793
EV 0.4246638
LA 0.3015418
K9 0.6196141
BABIP -0.4817225
ERA -0.8172230
HardHitPitch -0.3637971
LD -0.1299589
GB 0.2619175
FB -0.1571368
Strike 0.4578266
LOB 0.7903487
Framing 0.3903707
DRS 0.5027607
UZR 0.2719224
OAA 0.3363381
RAA 0.3431678
DEF 0.3501773
Wins 1.0000000
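The list above is easier to read when sorted by strength rather than grouped by category. This is a minimal Python sketch that ranks the full-period values by absolute correlation and adds r squared, which for a single-variable linear fit is the share of the variation in wins that the variable would account for. The values are copied from the list above, and the dictionary and variable names are my own.

```python
# Full-period correlations with wins, copied from the list above.
full_period_r = {
    "AVG": 0.4595633, "OBP": 0.7035541, "OPS": 0.6800451, "wOBA": 0.7370114,
    "wRC+": 0.7725399, "HardHitBat": 0.4033793, "EV": 0.4246638, "LA": 0.3015418,
    "K9": 0.6196141, "BABIP": -0.4817225, "ERA": -0.8172230,
    "HardHitPitch": -0.3637971, "LD": -0.1299589, "GB": 0.2619175,
    "FB": -0.1571368, "Strike": 0.4578266, "LOB": 0.7903487,
    "Framing": 0.3903707, "DRS": 0.5027607, "UZR": 0.2719224,
    "OAA": 0.3363381, "RAA": 0.3431678, "DEF": 0.3501773,
}

# Sort by magnitude so strong negative correlations (like ERA) rank
# alongside strong positive ones.
ranked = sorted(full_period_r.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    print(f"{name:<13} r = {r:+.3f}   r^2 = {r * r:.3f}")
```

Read this way, ERA alone would account for roughly two thirds of the variation in wins, while DRS, the strongest defensive metric, would account for about a quarter.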
All data was pulled from BaseballReference.com and FanGraphs.com. All analysis was done in R using the aforementioned data sources.