Kevin Busuttil

MLB Stats and How They Correlate

Updated: May 21

I set out to find which statistics correlate best with winning and to answer questions I had about the game. I took every team from the last 5 full seasons (2017, 2018, 2019, 2021, and 2022; the shortened 2020 season is excluded), 150 teams in total, and picked a group of variables to look at. These are the variables:


Main Variable of Focus: Wins


Hitting:

Average

On-Base Percentage

OPS

wOBA

wRC+

Hard Hit Percentage

Average Exit Velocity

Launch Angle


Pitching:

K/9

BABIP Against

ERA

Hard Hit Percentage Against

Line Drive Percentage

Groundball Percentage

Flyball Percentage

Strike Percentage

Left On Base Percentage


Defense:

Framing

Defensive Runs Saved

Ultimate Zone Rating

Outs Above Average

Runs Above Average

Defensive Rating (Fangraphs)


1. AVG vs. OBP with Wins - How much greater is the correlation to OBP?

Year         AVG with Wins    OBP with Wins
2017         0.420            0.710
2018         0.688            0.799
2019         0.572            0.847
2021         0.324            0.586
2022         0.527            0.776
All Years    0.459            0.704

With a quick comparison, my first question was answered: OBP correlated with wins more strongly than AVG in every season, and by 0.704 to 0.459 over the full period. Getting on base by any means necessary, rather than focusing only on hits, can make a big difference in a team's win total.
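
To put a number on "how much greater", here is a minimal Python sketch (the original analysis was done in R) that takes the r values from the table above and computes the OBP-minus-AVG gap along with r squared, which in a simple linear fit is the share of the variation in wins associated with each statistic. The names R_WITH_WINS, compare, and SUMMARY are my own labels for illustration.

```python
# Correlations with wins, copied from the table above: (AVG r, OBP r).
R_WITH_WINS = {
    "2017": (0.420, 0.710),
    "2018": (0.688, 0.799),
    "2019": (0.572, 0.847),
    "2021": (0.324, 0.586),
    "2022": (0.527, 0.776),
    "All Years": (0.459, 0.704),
}


def compare(avg_r, obp_r):
    """Return the OBP-minus-AVG gap in r, plus r squared for each statistic."""
    return {
        "gap": round(obp_r - avg_r, 3),
        "avg_r2": round(avg_r ** 2, 3),
        "obp_r2": round(obp_r ** 2, 3),
    }


SUMMARY = {label: compare(*pair) for label, pair in R_WITH_WINS.items()}

for label, row in SUMMARY.items():
    print(
        f"{label:>9}  gap={row['gap']:.3f}  "
        f"AVG r^2={row['avg_r2']:.3f}  OBP r^2={row['obp_r2']:.3f}"
    )
```

Over the full period this works out to roughly 21% of the variation in wins for AVG against roughly 50% for OBP.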


2. DRS vs. OAA/RAA with ERA - Which defensive metric holds more weight?


For the full 5-year data, in correlation with ERA, r = -0.558 for Defensive Runs Saved and r = -0.359 for Runs Above Average, so DRS tracked ERA more closely.


The highest totals during this period were the 2022 New York Yankees with 129 Defensive Runs Saved (99 Wins) and the 2017 Minnesota Twins with 50 Runs Above Average (85 Wins).


The lowest totals during this period were the 2019 Detroit Tigers with -116 Defensive Runs Saved (47 Wins) and the 2017 New York Mets with -43 Runs Above Average (70 Wins).


3. Are wOBA and wRC+ as good as they are made out to be?


Across the full period, the two statistics correlated with wins quite well:

wOBA: r = 0.737
wRC+: r = 0.773


In 2022, the correlation was even higher: wOBA had an r value of 0.808 and wRC+ had an r value of 0.890. This wRC+ correlation coefficient was actually stronger, by a slight margin, than the correlation that ERA had with wins. These run-estimation statistics are accurate enough that they can track wins better than the actual earned runs a team allows. They live up to the hype.


4. Do Hard Hit % and Exit Velocity correlate with wOBA and wRC+?


I thought for sure these would be among the most highly correlated statistics. Boy, was I wrong. Both X variables in question correlated with wOBA at around 0.3 and with wRC+ at around 0.5.


The plot is clearly scattered with no pattern whatsoever. It is as if we could draw an average exit velocity and a wOBA at random from a hat of 150 cards and get a plausible combination no matter the outcome. The high point for wOBA is the 2019 Houston Astros at 0.355. That same team was tied for 74th in average exit velocity and tied for 75th in hard hit percentage among the 150 teams on our list. An interesting discovery, but not one I expected to find.


5. What are the best and worst teams of the last 5 full seasons?


Rankings out of 150 total teams

Team            Wins             Offense               ERA              DRS
2022 Dodgers    111 (1st)        119 wRC+ (3rd)        2.80 (1st)       86 (T-7th)
2018 Red Sox    108 (2nd)        .340 wOBA (T-5th)     3.75 (27th)      10 (T-80th)
2019 Astros     107 (T-3rd)      124 wRC+ (1st)        3.66 (19th)      97 (5th)
2021 Giants     107 (T-3rd)      .329 wOBA (T-27th)    3.25 (5th)       32 (T-48th)
2022 Astros     106 (T-4th)      112 wRC+ (T-12th)     2.90 (2nd)       67 (T-18th)
2019 Tigers     47 (T-149th)     77 wRC+ (150th)       5.26 (146th)     -116 (150th)
2018 Orioles    47 (T-149th)     .299 wOBA (T-134th)   5.19 (T-144th)   -45 (139th)
2021 Orioles    52 (T-147th)     91 wRC+ (T-110th)     5.85 (150th)     -30 (T-120th)
2021 DBacks     52 (T-147th)     85 wRC+ (T-131st)     5.15 (141st)     -37 (132nd)
2019 Orioles    54 (146th)       .308 wOBA (T-105th)   5.67 (149th)     -53 (T-141st)
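
For readers who want to see how an r value is actually computed, here is a minimal Python sketch of the Pearson correlation, applied to the wins and ERA of the ten teams listed above. These ten are the extremes of the sample, so the result will be far stronger than the full 150-team figure and should not be compared with it; the point is only the mechanics. The function name pearson_r and the TEAMS list are my own for illustration.

```python
import math

# (team, wins, ERA) for the ten teams listed above.
TEAMS = [
    ("2022 Dodgers", 111, 2.80),
    ("2018 Red Sox", 108, 3.75),
    ("2019 Astros", 107, 3.66),
    ("2021 Giants", 107, 3.25),
    ("2022 Astros", 106, 2.90),
    ("2019 Tigers", 47, 5.26),
    ("2018 Orioles", 47, 5.19),
    ("2021 Orioles", 52, 5.85),
    ("2021 DBacks", 52, 5.15),
    ("2019 Orioles", 54, 5.67),
]


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by each one's spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(spread_x * spread_y)


wins = [team[1] for team in TEAMS]
eras = [team[2] for team in TEAMS]
r_wins_era = pearson_r(wins, eras)
print(f"r(wins, ERA) over these ten teams: {r_wins_era:.3f}")
```

The sign comes out negative, as expected: more wins go with a lower ERA.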

6. How drastic are changes from year to year?


2017 Data


This year was an odd case in the data: OBP actually correlated with wins better than OPS and wOBA by a slight margin, and better than wRC+ by a shockingly wide margin (0.710 to 0.554). Another oddity is that K/9 was much more strongly associated with a lower ERA in 2017 than in the full-period data.


2018 Data


Referring back to our first question, the correlation between average and wins was the greatest in 2018 by a fair margin. Unlike in the other seasons, it was not drastically far behind the correlation between on-base percentage and wins. Here once again, K/9 correlates well with a lower ERA; that trend doesn't seem to drop off until the middle of our five-season period. ERA correlated with wins in this particular season even better than it did over the full period, with an r magnitude of 0.876.


2019 Data


In 2019, OBP, OPS, wOBA, and wRC+ were all highly correlated with wins in comparison with other years. Both OBP and wRC+ were actually even more highly correlated with wins than ERA. Hard hit percentage against correlated much more with ERA than in other years. For the full dataset, r = 0.304, while the correlation in 2019 was up to 0.746. Baseball is an interesting game.


2021 Data


There is quite the disparity between pre-2020 and post-2020. The four statistics I mentioned in 2019 (OBP, OPS, wOBA, and wRC+) dropped to their lowest correlation levels with wins, while ERA's correlation with wins shot up to 0.889. Hard hit percentage in correlation with ERA took a big step back, while left on base percentage took a slight step forward in this subset. There were 1,465 fewer runs scored in 2021 than in 2019, and even fewer in 2022, which we will look at now.


2022 Data


The statistic that held the most weight this past season was wRC+, correlating with wins at an r value of 0.890. Furthermore, Defensive Runs Saved correlated with wins much more than in other years. Looking back at the top teams list from question 5, both the Dodgers and Astros from 2022 appeared on the list and posted the lowest ERAs on our list. In fact, there have not been lower team ERAs since 1972. The league ERA was the lowest since 2014, the home run total was the lowest since 2015, and the league OBP was the lowest since 1972.


7. Can conclusions be drawn from a multiple regression model?


In 2022, the average wins for playoff teams was between 95 and 96, with the lowest win total for a playoff team being the Rays with 86. Let's use 90 wins as our playoff benchmark for good measure.


Using our full data set with model Wins ~ (ERA) + (wRC+) + (DRS)


Model: Wins = 67.63846 - [11.98459 * (ERA)] + [0.65682 * (wRC+)] + [0.03257 * (DRS)]
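
To get a feel for what each coefficient means, here is a minimal Python sketch of the fitted equation above (the fit itself was done in R). It compares a few one-variable changes against a reference team with a 4.00 ERA, 100 wRC+, and 0 DRS. That reference point is an arbitrary choice of mine, and the names predicted_wins, BASELINE, and EFFECTS are illustrative only.

```python
# Coefficients copied from the fitted model above.
INTERCEPT = 67.63846
COEF_ERA = -11.98459
COEF_WRC_PLUS = 0.65682
COEF_DRS = 0.03257


def predicted_wins(era, wrc_plus, drs):
    """Wins estimated by the model for a given team ERA, wRC+, and DRS."""
    return INTERCEPT + COEF_ERA * era + COEF_WRC_PLUS * wrc_plus + COEF_DRS * drs


# Arbitrary reference team (an assumption for illustration, not from the fit).
BASELINE = predicted_wins(4.00, 100, 0)

# Change in estimated wins when one input moves and the others stay fixed.
EFFECTS = {
    "ERA lowered by 0.50": predicted_wins(3.50, 100, 0) - BASELINE,
    "wRC+ raised by 10": predicted_wins(4.00, 110, 0) - BASELINE,
    "DRS raised by 10": predicted_wins(4.00, 100, 10) - BASELINE,
}

print(f"Reference team: {BASELINE:.2f} wins")
for change, delta in EFFECTS.items():
    print(f"{change}: {delta:+.2f} wins")
```

By this model, half a run of ERA or ten points of wRC+ is worth about six wins, while ten Defensive Runs Saved is worth about a third of a win.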


  • Isolating ERA: Assuming a 0 DRS and an average 100 wRC+, how good does ERA need to be for a 90 win team?


ERA must be 3.61 to win 90 games - 8 teams did this in 2022, 3 in 2021, 4 in 2020, and only 1 in 2019.


  • Valuing DRS: Assuming a 4.00 ERA and a very high total of 90 DRS, is a high wRC+ still required for 90 wins?


A wRC+ of 103 would be needed to reach the 90 win mark as estimated by the model - 11 teams achieved that in 2022, 8 in 2021, 13 in 2020, and 9 in 2019.


  • What about a 4.00 ERA with a 15 DRS, what is the wRC+ estimate now?


A wRC+ of 106 would be needed now. 10 teams achieved that in 2022, 6 in 2021, 10 in 2020, and 6 in 2019.


  • Big offense: Assuming a 5.00 ERA and a 0 DRS, how good does wRC+ need to be for a playoff chance?


This would be a monumental task, with a wRC+ of 125 getting the job done. This would just beat out the total of 124 posted by the 2019 Houston Astros, the highest modern era total for a roster not containing Babe Ruth.


  • A 5.00 ERA is quite a stretch, what about a 4.50 ERA?


A slightly more reasonable wRC+ total of 116, but still no easy task. Of the 150 teams on our list, only 7 crossed that mark. It is hard to overcome bad pitching.
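
The scenario answers above come from rearranging the model equation to solve for one input at a time. Here is a minimal Python sketch of that arithmetic, using the coefficients from the fitted model; the function names required_era and required_wrc_plus are my own for illustration.

```python
# Coefficients copied from the fitted model above.
INTERCEPT = 67.63846
COEF_ERA = -11.98459
COEF_WRC_PLUS = 0.65682
COEF_DRS = 0.03257


def required_era(target_wins, wrc_plus, drs):
    """ERA at which the model estimates exactly target_wins."""
    rest = INTERCEPT + COEF_WRC_PLUS * wrc_plus + COEF_DRS * drs
    return (target_wins - rest) / COEF_ERA


def required_wrc_plus(target_wins, era, drs):
    """wRC+ at which the model estimates exactly target_wins."""
    rest = INTERCEPT + COEF_ERA * era + COEF_DRS * drs
    return (target_wins - rest) / COEF_WRC_PLUS


print(f"0 DRS, 100 wRC+  -> ERA {required_era(90, 100, 0):.2f}")
print(f"4.00 ERA, 90 DRS -> wRC+ {required_wrc_plus(90, 4.00, 90):.1f}")
print(f"4.00 ERA, 15 DRS -> wRC+ {required_wrc_plus(90, 4.00, 15):.1f}")
print(f"5.00 ERA, 0 DRS  -> wRC+ {required_wrc_plus(90, 5.00, 0):.1f}")
print(f"4.50 ERA, 0 DRS  -> wRC+ {required_wrc_plus(90, 4.50, 0):.1f}")
```

Rounded to the nearest whole number, the wRC+ results match the 103, 106, 125, and 116 quoted above.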


  • 2022 Projected win totals by the model: Would the playoff picture have changed?


Playoff seedings with our model using 2022 statistics:

Seed    AL                   NL
1.      Astros 108.63        Dodgers 115.04
2.      Yankees 107.83       Mets 101.35
3.      Guardians 93.65      Cardinals 99.28
4.      Blue Jays 99.96      Braves 100.09
5.      Mariners 96.07       Brewers 91.77
6.      Rays 93.60           Padres 89.23

The AL is 6 for 6 on the exact seeding seen last October. The NL, however, is 2 for 6: the NL East teams were flip-flopped, the Padres were actually the 5 seed, and the Brewers were bested by the Phillies for the final spot. The NL World Series team wasn't even in it! This is why we play the games.
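
As a check on how the projected totals are produced, here is a minimal Python sketch that plugs the 2022 Dodgers and 2022 Astros into the model. Those are the only two 2022 teams whose ERA, wRC+, and DRS all appear in this post (in the question 5 table), so the other ten teams are left out rather than guessed. The names predicted_wins, INPUTS_2022, and PROJECTED are illustrative.

```python
# Coefficients copied from the fitted model above.
INTERCEPT = 67.63846
COEF_ERA = -11.98459
COEF_WRC_PLUS = 0.65682
COEF_DRS = 0.03257


def predicted_wins(era, wrc_plus, drs):
    """Wins estimated by the model for a given team ERA, wRC+, and DRS."""
    return INTERCEPT + COEF_ERA * era + COEF_WRC_PLUS * wrc_plus + COEF_DRS * drs


# (ERA, wRC+, DRS, actual wins) from the best-and-worst teams table.
INPUTS_2022 = {
    "Dodgers": (2.80, 119, 86, 111),
    "Astros": (2.90, 112, 67, 106),
}

PROJECTED = {
    team: round(predicted_wins(era, wrc_plus, drs), 2)
    for team, (era, wrc_plus, drs, _actual) in INPUTS_2022.items()
}

for team, (_era, _wrc, _drs, actual) in INPUTS_2022.items():
    print(f"{team}: projected {PROJECTED[team]:.2f}, actual {actual}")
```

Both results agree with the seeding table, and in both cases the model projects a few more wins than the team actually had.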


8. Were there any surprising findings?


I wouldn't say there were many, but a few jumped off the page for me:


  • Left on base percentage (LOB) had quite a high correlation with ERA and with wins through the whole period. Of course keeping runners from scoring is important, but my focus was more on keeping runners off the bases altogether. This statistic proved quite significant.


  • Defensive statistics DRS, OAA, RAA, and DEF were quite insignificant as a whole. Outs Above Average seems to have grown in popularity, but I don't see its value in the findings of this study. DRS showed the most validity, and it has been around the longest. Were these new statistics worth examining?


  • The correlation between hard hit percentage/exit velocity and wOBA was not nearly as high as I thought it would be. There has been much more emphasis put on exit velocity in particular over time, as well as on advanced metrics such as runs created and wOBA. By the sheer fact that they came onto the scene at a similar time, I assumed there was some connection between the two, but there is nothing significant in actuality.


  • I wouldn't say the regression model breakdown is "surprising", but definitely intriguing and one that can be useful if expanded.


Variables and their correlation with our main variable of focus


Variable        Correlation with Wins
AVG              0.4595633
OBP              0.7035541
OPS              0.6800451
wOBA             0.7370114
wRC+             0.7725399
HardHitBat       0.4033793
EV               0.4246638
LA               0.3015418
K9               0.6196141
BABIP           -0.4817225
ERA             -0.8172230
HardHitPitch    -0.3637971
LD              -0.1299589
GB               0.2619175
FB              -0.1571368
Strike           0.4578266
LOB              0.7903487
Framing          0.3903707
DRS              0.5027607
UZR              0.2719224
OAA              0.3363381
RAA              0.3431678
DEF              0.3501773
Wins             1.0000000
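
Since the sign only tells us the direction of a relationship, the variables are easier to compare when ranked by the absolute value of r. Here is a minimal Python sketch that does so with the values listed above; the names CORR_WITH_WINS and RANKED are my own, and wRC+ is written out in full.

```python
# Correlation of each variable with wins, copied from the list above.
CORR_WITH_WINS = {
    "AVG": 0.4595633, "OBP": 0.7035541, "OPS": 0.6800451,
    "wOBA": 0.7370114, "wRC+": 0.7725399, "HardHitBat": 0.4033793,
    "EV": 0.4246638, "LA": 0.3015418, "K9": 0.6196141,
    "BABIP": -0.4817225, "ERA": -0.8172230, "HardHitPitch": -0.3637971,
    "LD": -0.1299589, "GB": 0.2619175, "FB": -0.1571368,
    "Strike": 0.4578266, "LOB": 0.7903487, "Framing": 0.3903707,
    "DRS": 0.5027607, "UZR": 0.2719224, "OAA": 0.3363381,
    "RAA": 0.3431678, "DEF": 0.3501773,
}

# Strongest relationship first, regardless of direction.
RANKED = sorted(CORR_WITH_WINS.items(), key=lambda item: abs(item[1]), reverse=True)

for rank, (name, r) in enumerate(RANKED, start=1):
    print(f"{rank:>2}. {name:<13} r = {r:+.3f}")
```

ERA, LOB, wRC+, wOBA, and OBP come out on top, in that order, with line drive percentage at the bottom.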




All data was pulled from BaseballReference.com and FanGraphs.com. All analysis was done with R programming using aforementioned data sources.



