For our second project at the Metis Data Science Bootcamp, we were tasked with creating a Linear Regression model to make a prediction based on existing data.
The Topic
With free rein in choosing a topic, I chose something I’m already passionate and knowledgeable about: baseball. My initial idea was to see how college stats translate to major league performance. While scraping data, however, I discovered that there wasn’t enough publicly available college data to build a model. I shifted gears and used the much more robust dataset of MLB players’ batted-ball profiles, building a model that predicts wRC+ from them.
Glossary
Before I talk about my process, a quick glossary of some of the more technical baseball terms that I will be talking about:
wRC+ (Weighted Runs Created Plus): Measures the number of runs a player creates compared to all other players (weighted), normalized so that the average player has a wRC+ of 100. That normalization is what the “plus” in the stat stands for.
Exit Velocity: The speed at which the ball leaves the bat.
Launch Angle: The angle at which the ball leaves the bat.
Barrels: The number of times a player hit the ball with the optimal exit velocity and launch angle. We’ll primarily be looking at this as a percentage: barrels per batted ball event (any ball put into play, regardless of whether it resulted in a hit).
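To make the barrel percentage concrete, here is a minimal sketch of the calculation as described in the glossary. The function name `barrel_rate` and the numbers are made up for illustration and are not taken from any real player or from the data sources used in this project.

```python
def barrel_rate(barrels: int, batted_ball_events: int) -> float:
    """Return barrels as a percentage of batted ball events."""
    if batted_ball_events <= 0:
        raise ValueError("batted_ball_events must be positive")
    if not 0 <= barrels <= batted_ball_events:
        raise ValueError("barrels must be between 0 and batted_ball_events")
    return 100.0 * barrels / batted_ball_events


# Hypothetical hitter: 45 barrels on 400 balls put into play.
example = barrel_rate(barrels=45, batted_ball_events=400)
print(f"{example:.2f}%")  # 11.25%
```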
The Process
Since I was using linear regression, I looked for continuous data to use in my model. I pulled data from FanGraphs and Statcast data from MLB’s own Baseball Savant.
First, I looked at contact, which was split three ways in each of three categories:
Direction | Strength | Type |
Pull | Soft | Groundball |
Center | Medium | Line Drive |
Opposite | Hard | Flyball |
I also looked at non-contact outcomes: Walks and Strikeouts. I found that the variables with the highest correlation to wRC+ were Hard-Hit Balls, Home Runs Per Flyball, and Walk Rate. Since Hard-Hit Percentage is highly influenced by Exit Velocity, Launch Angle, and Barrels, I tested those factors in my model as well. I also included Line Drive Rate, since line drives are usually considered the most desirable type of hit in baseball. Additionally, with some feature engineering, I found that strikeouts had a big impact on my model.
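The correlation screening described above can be sketched roughly as follows. This is only an illustration, assuming a plain Pearson correlation. The helper names (`pearson`, `rank_features`), the feature names, and every number in the toy data are hypothetical and do not come from the FanGraphs or Statcast data used in the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort features by the absolute value of their correlation with target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up values for five imaginary hitters, for illustration only.
toy_wrc_plus = [80, 95, 100, 110, 130]
toy_features = {
    "hard_hit_pct": [30.0, 34.0, 36.0, 40.0, 46.0],
    "k_pct": [28.0, 24.0, 22.0, 23.0, 18.0],
    "pull_pct": [40.0, 38.0, 41.0, 39.0, 40.0],
}

for name, r in rank_features(toy_features, toy_wrc_plus):
    print(f"{name}: {r:+.3f}")
```

Ranking by absolute value matters here: a strongly negative correlation (as strikeout rate has in this toy data) is just as useful to a linear model as a strongly positive one.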
With these variables, the model was fairly good at predicting a player’s wRC+, with an R² of 0.703 and an average error of 9.89 wRC+ points. Since wRC+ is normalized to 100, that error is roughly 10% of a league-average score.
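For readers less familiar with these two metrics, here is a small sketch of how they are computed from actual and predicted values. It assumes that “average error” means mean absolute error, which the write-up does not state explicitly. The toy numbers are invented and will not reproduce the 0.703 and 9.89 figures.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the prediction miss, in the target's own units."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up wRC+ values and predictions for five imaginary hitters.
toy_actual = [80, 95, 100, 110, 130]
toy_predicted = [88, 90, 104, 105, 125]

r2 = r_squared(toy_actual, toy_predicted)
mae = mean_absolute_error(toy_actual, toy_predicted)

# wRC+ is normalized so that league average is 100, so dividing the error
# by 100 expresses it as a fraction of a league-average score.
print(f"R^2 = {r2:.3f}")
print(f"MAE = {mae:.2f} wRC+ points ({mae / 100:.1%} of league average)")
```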
The Outcome
Unsurprisingly, my research showed me that the most important factors in determining a hitter’s ability to create runs are:
- Exit Velocity
- Launch Angle
- Home Runs Per Flyball
- Line Drive Percentage
- Walk Percentage
- Strikeout Percentage
The only mild surprise is that Strikeout Percentage turned out to be so important in my model. Modern wisdom says that strikeouts don’t matter much for hitters as long as they hit the ball hard and walk a lot.
Further Study
With more time and resources, I’d look at the data at a more granular level, possibly down to individual pitches, and try more advanced regression techniques.