Regression Analysis of Salary for Major League Baseball Players

Objective

The San Francisco Giants were a very strong team for most of the last decade, winning World Series titles in 2010, 2012, and 2014. Since 2016, however, the team's performance has declined significantly. They haven't made the playoffs since then, and most of the team's star players are no longer there, most recently with the release of star third baseman Pablo "Panda" Sandoval just this week. In light of this, we have been tasked by the Giants with helping improve their roster.

While the strategy in "Moneyball" involved looking for players with decent stats at minimal salary, our approach examines stats versus salary for players over the last decade, along with some player traits outside of performance statistics, such as age. We have narrowed our investigation to studying the significance that the following batting statistics have on the salary an MLB player is paid:

Batting Statistics

Games Played (G)          Caught Stealing (CS)
Plate Appearances (PA)    Bases on Balls (BB)
At-bats (AB)              Strikeouts (SO)
Runs (R)                  Ground-into-Double-Play (GDP)
Hits (H)                  Hit by Pitch (HBP)
Singles (X1B)             Sacrifice Hits (SH)
Doubles (X2B)             Sacrifice Flies (SF)
Triples (X3B)             Intentional Walks (IBB)
Home Runs (HR)            Batting Average (AVG)
Runs Batted In (RBI)      On-base Percentage (OBP)
Stolen Bases (SB)         Slugging Percentage (SLG)

Additionally, we are interested in testing the significance of the following statistics, which we intend to calculate:

  1. Total Bases (TB) calculated from
    TB = X1B + 2(X2B) + 3(X3B) + 4(HR)
  2. Home Run Ratio (HRR) calculated from
    HRR = HR/AB

We also want to look at variables aside from performance statistics such as:

  1. Position (POS)
  2. Year (Season)
  3. Age

Data Exploration

The first step to our exploratory data analysis is to load the necessary packages.

# install.packages('corrplot')
# install.packages('Hmisc')
library(ggplot2)
library(corrplot)
# library(Hmisc)
library(plyr)

We are looking to merge two CSV files that together contain the data necessary for our regression modeling. We first read the two CSV files into the data frames baseball and salary. Using head(), we printed the first 3 rows of each to verify that our files were imported correctly.

baseball <- read.csv("baseball.csv")
head(baseball, n = 3)

baseball dataset

salary <- read.csv("salary.csv")
head(salary, n = 3)

salary dataset

In our next step, we merged the salary and baseball data frames by Name and Season. Each record in the merged data frame lists one athlete's performance over one season; note, for example, that A.J. Pierzynski has multiple records (2010, 2012, 2013, and 2015 seasons).

full_stats <- merge(baseball, salary, by = c("Name", "Season"))
head(full_stats)

full_stats dataset

Because we are unable to view the full data frame, we decided to check the dimensions. We now know that we are working with a 1,336 by 344 data frame.

cat("There are", nrow(full_stats), "rows and", ncol(full_stats), "columns in this table.")

Dataset dimensions

Additionally, we used colnames() to retrieve the column names. We currently have 344 variables – many of these are unnecessary for analysis, and will be removed.

colnames(full_stats)

Column names

For the purpose of our study, we removed the irrelevant columns.

full_stats <- full_stats[,c("Season", "Salary", "Age", "G", "PA", "AB", "R", "H", "X1B", "X2B", "X3B", "HR", 
                            "RBI", "SB", "CS", "BB", "SO", "GDP", "HBP", "SH", "SF", "IBB", "AVG", "OBP", "SLG", "POS")]

We also wanted to verify the types of the variables we were working with, so we used str() to check them. Here, we can see that our response variable, Salary, is currently a factor variable. In order to use the column, we need to convert it to a numeric variable. We can also see that the 344 variables from our original data frame have now been trimmed down to 26.

str(full_stats)

Data types

Data cleaning

The next step in our analysis was to check for NULL, NA, and empty-string values within the data frame. To do so, we summed a logical expression that checks whether each value in a particular column is NULL, NA, or an empty string; the resulting sum is the number of such values in that column.

# Example check on a single column: count values in Season that are
# empty strings, NA, or NULL
sum(full_stats[["Season"]] == "" | is.na(full_stats[["Season"]]) | is.null(full_stats[["Season"]]))

Missing values

Repeating this check over every column, none of the column names are printed, which tells us that our data frame does not contain any NULL, NA, or empty strings.

# Check every column for empty strings, NAs, or NULLs;
# print the column name only if any such values are found
for (col in colnames(full_stats)) {
    x <- sum(full_stats[[col]] == "" | is.na(full_stats[[col]]) | is.null(full_stats[[col]]))
    if (x > 0) {
        print(col)
    }
}
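The same check can be written more compactly with colSums(), which counts flagged values in every column at once. A minimal sketch; note that is.null() is never TRUE for a data frame column, so it can safely be dropped:

# Count NA or empty-string values per column in a single call
colSums(is.na(full_stats) | full_stats == "")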

As mentioned, we need to transform our Salary column into a column of numeric values.

# Remove '$' and ',' from Salary to convert it to numeric
full_stats$Salary <- gsub("\\$", '', full_stats$Salary)
full_stats$Salary <- as.numeric(gsub(',', '', full_stats$Salary))

Now when we run summary() over the Salary column, we get a numeric summary (minimum, quartiles, mean, and maximum), verifying that the column is now a numeric variable.

summary(full_stats$Salary)

Summary statistics

We can also see from the summary above that the minimum value for Salary is 0. Because it doesn't seem reasonable for an MLB player to have a salary of 0 dollars, we decided to subset our data to include only observations where Salary > 0. The minimum value for Salary is now 400,000, which seems much more reasonable.

full_stats <- full_stats[full_stats$Salary > 0,]
summary(full_stats$Salary)

Summary statistics

Removing the observations where Salary = 0 dropped only one row.

dim(full_stats)

Dataset dimensions

Adaptation

Additionally, we are interested in testing the significance of the batting statistics Total Bases (TB) and Home Run Ratio (HRR), which can be calculated from our existing columns. We created a column for each and added it to our data frame.

# Total Bases: singles + 2*doubles + 3*triples + 4*home runs
full_stats$TB <- full_stats$X1B + 2*full_stats$X2B + 3*full_stats$X3B + 4*full_stats$HR
# Home Run Ratio: home runs per at-bat
full_stats$HRR <- full_stats$HR / full_stats$AB
head(full_stats, n = 3)

full_stats dataset

Descriptive Visualization

Before building our regression models, we wanted to create a few descriptive visualizations that show different aspects of our data. Our first visualization is a histogram of the distribution of our response variable, Salary. We can see that the distribution is right-skewed: most MLB players earn below $10 million, with a long tail of higher earners.

options(repr.plot.width = 7.75, repr.plot.height = 6.9)
ggplot(full_stats, aes(Salary)) + geom_histogram(binwidth = 2200000) + theme_classic() +
ggtitle(label = "Distribution of MLB Players' Salaries", subtitle = "In USD$") +
theme(plot.title = element_text(face = "bold", size = 12), plot.subtitle = element_text(size = 10))

Histogram

Our second visualization is a box plot that allows us to identify outliers in our response variable. Below, we can see that there are four MLB players who have a significantly larger salary than the rest of the players.

options(repr.plot.width = 16.5, repr.plot.height = 4.55)
ggplot(full_stats, aes(Salary)) + geom_boxplot() +
ggtitle(label = "Distribution of MLB Players' Salaries", subtitle = "In USD$") +
theme(plot.title = element_text(face = "bold", size = 14), plot.subtitle = element_text(size = 12))

Box plot

One of the variables outside of performance statistics that we are interested in is Season. We assume that salaries will generally increase over time due to inflation, but to verify this we created a violin plot comparing the salary distribution by season. Here, we can see a slight increase at the top end for star players, while the minimum salary generally remained constant.

options(repr.plot.width = 16.5, repr.plot.height = 7)
ggplot(full_stats, aes(as.factor(Season), Salary)) + geom_violin(trim=FALSE) +
ggtitle(label = "Distribution of MLB Players' Salaries per Season", subtitle = "In USD$") +
theme(plot.title = element_text(face = "bold", size = 15), plot.subtitle = element_text(size = 12)) +
geom_jitter(shape=16, position=position_jitter(0.2)) + xlab("Season")

Violin plot

Another variable aside from performance statistics we are interested in is Age. We assume that salaries will generally increase with age. To verify, we created a scatter plot. Here, we can see a positive linear relationship.

ggplot(full_stats, aes(Age, Salary)) + geom_point() + theme_classic() +
ggtitle(label = "MLB Players' Salaries by Age", subtitle = "In USD$") +
theme(plot.title = element_text(face = "bold", size = 15), plot.subtitle = element_text(size = 12))

Scatter plot

Explore Correlations

The next step in our analysis is to observe the correlations between our variables. Because we are working with so many variables, a scatterplot matrix would not give us a clear visualization of the correlations. Instead, we created a correlation plot showing the correlation between Salary and each of the predictors, as well as the correlations the predictors have with one another. Note that we had to transform our POS predictor from factor to numeric for cor() to run; since this encoding assigns arbitrary integers to positions, correlations involving POS should be interpreted with caution.

Focusing on our response variable, we can see that there appears to be a strong positive linear relationship between Salary and Age. There also appears to be a negative linear relationship between Salary and the variables X3B, SB, CS, and SH.

# Get all unique values of the "POS" column
unique_positions <- unique(full_stats$POS)

# Create a mapping from unique positions to numeric values
position_mapping <- setNames(1:length(unique_positions), unique_positions)

# Overwrite POS column with numeric values based on the mapping
full_stats$POS <- position_mapping[full_stats$POS]

# Calculate the correlation matrix including the encoded 'POS' columns
correlation_matrix <- cor(full_stats)

# Create a correlation plot
corrplot(correlation_matrix, method = "color")

Correlation plot

To view the Pearson correlation (r), we created a correlation matrix and subset it to show the Pearson correlations associated with our response variable. Immediately, we can see that Age has the highest Pearson correlation with Salary (aside from Salary itself), which reflects what is shown in the plot above.

# Calculate the correlation matrix
corr_matrix <- cor(full_stats, use = "pairwise.complete.obs")

# Convert the correlation matrix to a dataframe
corr_df <- as.data.frame(corr_matrix)

# Extract the correlations for the "Salary" column
salary_correlations <- corr_df["Salary", ]

# Print the correlations for the "Salary" column
print(salary_correlations)

Pearson correlation (r)
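To make this ranking explicit, we could also sort the variables by the absolute value of their correlation with Salary, reusing the corr_matrix computed above (a quick sketch; Salary itself appears first with a correlation of 1):

# Rank all variables by the strength (absolute value) of their correlation with Salary
sort(abs(corr_matrix["Salary", ]), decreasing = TRUE)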

Response Variable

Our chosen response variable for our regression analysis is Salary, which represents an MLB player’s paid salary during one season. We have chosen this to be our response variable because we are interested in helping the Giants maximize player strength, given an allocated budget. By predicting player salary, the team will be able to choose the best players that still fit within their budget. Through our analysis, we are expecting to find that performance stats such as home-run ratio and total bases are significant indicators of salary – it is reasonable to assume that star players with high stats would be paid more in a season. We are also expecting to find that non-performance-related variables, such as age and season, are also significant indicators of salary – it would make sense that players who have more experience in the MLB would be paid more as well.

Simple Linear Regression

SLR Model 1

For our first model, we were interested in seeing how Age influences a player's salary. From the summary printout, we see that our first simple regression model gives us an intercept coefficient of -24,711,053 and an Age coefficient of 1,110,593. Thus, we can express our line of best fit by the equation:
Y = -24711053 + 1110593X_1

According to our model, if a player's age increases by one year, we would expect their salary to increase by $1,110,593. If X_1 = 0, the model predicts a salary of -$24,711,053; this intercept has no practical interpretation, since no player has an age of 0.

For our independent variable, Model 1 gives us a p-value of less than 2 × 10^-16. This value represents the probability of observing an estimate at least this extreme, assuming there is no relationship between Salary and Age. When testing for significance, it is common to use a benchmark value of 0.05; p-values below this benchmark indicate a relationship between the two variables. Since our p-value for Age is less than our benchmark of 0.05, we can say that Age is a significant indicator of Salary.

Finally, Model 1 gives us an adjusted R^2 value of 0.3778, which means that our model explains 37.78% of the variation we see in Salary. It might be reasonable to add another independent variable to this model to improve it.

slr_mod1 <- lm(Salary ~ Age, data = full_stats)
summary(slr_mod1)

SLR Model 1
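These quantities can also be pulled directly from the fitted model object rather than read off the printout. The sketch below extracts the Age p-value and the adjusted R^2, and uses predict() to estimate the salary of a hypothetical 30-year-old player (the age of 30 is an illustrative value, not taken from the data):

# Extract the Age p-value and adjusted R-squared from the summary object
summary(slr_mod1)$coefficients["Age", "Pr(>|t|)"]
summary(slr_mod1)$adj.r.squared

# Hypothetical example: predicted salary for a 30-year-old player
predict(slr_mod1, newdata = data.frame(Age = 30))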

SLR Model 2

For our second model, we were interested in seeing how Season influences a player’s salary. Our model gives us an intercept coefficient of -522,777,674, and a Season coefficient of 262,976. Thus, we can express our line of best fit by the equation:
Y=-522777674+262976X_1

According to our model, for every one-unit increase in Season (i.e., each successive year), we would expect a player's salary to increase by $262,976. The intercept of -522,777,674 corresponds to Season = 0, which is far outside the range of the data and has no practical meaning.

Model 2 gives us a p-value of 3.42 × 10^-5. Since this value is less than our benchmark of 0.05, we can say that Season is a significant indicator of Salary.

Finally, Model 2 gives us an adjusted R^2 value of 0.01206, which means that our model explains 1.206% of the variation we see in Salary.

slr_mod2 <- lm(Salary ~ Season, data = full_stats)
summary(slr_mod2)

SLR Model 2

SLR Model 3

For our third model, we were interested in seeing how X3B influences a player's salary. Our model gives us an intercept coefficient of 8,829,581 and an X3B coefficient of -614,438. Thus, we can express our line of best fit by the equation:
Y = 8829581 - 614438X_1

According to our model, for every one-unit increase in X3B, we would expect a player's salary to decrease by $614,438; this negative relationship matches the correlation we observed earlier. If X_1 = 0, the model predicts a salary of $8,829,581.

Model 3 gives us a p-value of less than 2 × 10^-16. Since this value is less than our benchmark of 0.05, we can say that X3B is a significant indicator of Salary.

Finally, Model 3 gives us an adjusted R^2 value of 0.05989, which means that our model explains 5.989% of the variation we see in Salary.

slr_mod3 <- lm(Salary ~ X3B, data = full_stats)
summary(slr_mod3)

SLR Model 3

SLR Model 4

For our fourth model, we were interested in seeing how HRR influences a player's salary. Our model gives us an intercept coefficient of 4,619,940 and an HRR coefficient of 64,410,675. Thus, we can express our line of best fit by the equation:
Y = 4619940 + 64410675X_1

According to our model, for every 0.01-unit increase in HRR (one additional home run per 100 at-bats), we would expect a player's salary to increase by $644,106.75. If X_1 = 0, the model predicts a salary of $4,619,940.

Model 4 gives us a p-value of 1.13 × 10^-10. Since this value is less than our benchmark of 0.05, we can say that HRR is a significant indicator of Salary.

Finally, Model 4 gives us an adjusted R^2 value of 0.03, which means that our model explains 3% of the variation we see in Salary.

slr_mod4 <- lm(Salary ~ HRR, data = full_stats)
summary(slr_mod4)

SLR Model 4

SLR Model 5

For our fifth model, we were interested in seeing how TB influences a player's salary. Our final SLR model gives us an intercept coefficient of 1,339,177 and a TB coefficient of 23,217. Thus, we can express our line of best fit by the equation:
Y = 1339177 + 23217X_1

According to our model, for every one-unit increase in TB, we would expect a player's salary to increase by $23,217. If X_1 = 0, the model predicts a salary of $1,339,177.

Model 5 gives us a p-value of 6.21 × 10^-9. Since this value is less than our benchmark of 0.05, we can say that TB is a significant indicator of Salary.

Finally, Model 5 gives us an adjusted R^2 value of 0.02429, which means that our model explains 2.429% of the variation we see in Salary.

slr_mod5 <- lm(Salary ~ TB, data = full_stats)
summary(slr_mod5)

SLR Model 5

Model Comparison

To identify the best SLR model, we select the model that explains the most variation in the dependent variable, Salary, i.e., the model with the highest adjusted R^2 value. The adjusted R^2 values of the 5 SLR models are as follows:

cat("Adjusted R-squared of slr_mod1:", summary(slr_mod1)$adj.r.squared, "\nAdjusted R-squared of slr_mod2:", 
    summary(slr_mod2)$adj.r.squared, "\nAdjusted R-squared of slr_mod3:", summary(slr_mod3)$adj.r.squared, 
    "\nAdjusted R-squared of slr_mod4:", summary(slr_mod4)$adj.r.squared, "\nAdjusted R-squared of slr_mod5:", 
    summary(slr_mod5)$adj.r.squared)

Model comparison
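Equivalently, we could collect the five models in a named list and extract all of the adjusted R^2 values in one call (a sketch):

# Gather the SLR models and pull each adjusted R-squared with sapply()
slr_models <- list(slr_mod1 = slr_mod1, slr_mod2 = slr_mod2, slr_mod3 = slr_mod3,
                   slr_mod4 = slr_mod4, slr_mod5 = slr_mod5)
sapply(slr_models, function(m) summary(m)$adj.r.squared)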

From above, we can see that SLR model 1 has the highest adjusted R^2 value. In other words, SLR model 1 explains the most variation in our response variable, Salary, out of the five. The plot below also gives us a visualization of the model.

ggplot(full_stats, aes(Age, Salary)) + geom_point() + theme_classic() +
ggtitle(label = "MLB Players' Salaries by Age", subtitle = "In USD$") +
theme(plot.title = element_text(face = "bold", size = 15), plot.subtitle = element_text(size = 12)) + 
geom_smooth(method='lm', formula = y~x, col="red")

Scatter plot

Multiple Linear Regression

MLR Model 1

In our first model, we are interested in studying how well Season, Age, G, PA, GDP, SH, and IBB influence an MLB player’s salary. Running lm() on the specified variables and printing the summary, we can make the following observations:

  1. Residuals: The distribution of residuals is centered around a median of -642,452.
  2. Coefficients: The equation of the fitted line is Y = -763020786 + 365783X_1 + 1097837X_2 - 132188X_3 + 31088X_4 + 101560X_5 - 191106X_6 + 308127X_7. This means that when X_1, …, X_7 = 0, the model predicts a salary of -$763,020,786. For every one-unit increase in any one of X_1 through X_7, holding the others fixed, salary changes by the value of the corresponding slope.
  3. P-values: Using the 0.05 benchmark, every variable in the model is a significant indicator of Salary.
  4. Residual standard error: MLR Model 1 has a residual standard error of 4,703,000 on 1327 degrees of freedom.
  5. Adjusted R-squared: MLR Model 1 has an adjusted R^2 of 0.5071. This means that this model accounts for 50.71% of the variation in Salary.

mlr_mod1 <- lm(Salary ~ Season + Age + G + PA + GDP + SH + IBB, data = full_stats)
summary(mlr_mod1)

MLR Model 1

MLR Model 2

In our second model, we are interested in studying how well Season, Age, G, PA, X2B, GDP, SH, and IBB influence an MLB player’s salary. Running lm() on the specified variables and printing the summary, we can make the following observations:

  1. Residuals: The distribution of residuals is centered around a median of -659,182.
  2. Coefficients: The equation of the fitted line is Y = -748628908 + 358702X_1 + 1089378X_2 - 137297X_3 + 35147X_4 - 55298X_5 + 105658X_6 - 223686X_7 + 312266X_8. This means that when X_1, …, X_8 = 0, the model predicts a salary of -$748,628,908. For every one-unit increase in any one of X_1 through X_8, holding the others fixed, salary changes by the value of the corresponding slope.
  3. P-values: Using the 0.05 benchmark, every variable in the model is a significant indicator of Salary.
  4. Residual standard error: MLR Model 2 has a residual standard error of 4,692,000 on 1326 degrees of freedom.
  5. Adjusted R-squared: MLR Model 2 has an adjusted R^2 of 0.5093. This means that this model accounts for 50.93% of the variation in Salary.

mlr_mod2 <- lm(Salary ~ Season + Age + G + PA + X2B + GDP + SH + IBB, data = full_stats)
summary(mlr_mod2)

MLR Model 2

MLR Model 3

In our third model, we are interested in studying how well Season, Age, G, PA, X2B, RBI, GDP, SH, and IBB influence an MLB player's salary. Running lm() on the specified variables and printing the summary, we can make the following observations:

  1. Residuals: The distribution of residuals is centered around a median of -569,414.
  2. Coefficients: The equation of the fitted line is Y = -736921763 + 353000X_1 + 1088806X_2 - 138625X_3 + 32978X_4 - 63191X_5 + 22108X_6 + 98285X_7 - 160172X_8 + 289396X_9. This means that when X_1, …, X_9 = 0, the model predicts a salary of -$736,921,763. For every one-unit increase in any one of X_1 through X_9, holding the others fixed, salary changes by the value of the corresponding slope.
  3. P-values: Using the 0.05 benchmark, every variable in the model is a significant indicator of Salary.
  4. Residual standard error: MLR Model 3 has a residual standard error of 4,682,000 on 1325 degrees of freedom.
  5. Adjusted R-squared: MLR Model 3 has an adjusted R^2 of 0.5114. This means that this model accounts for 51.14% of the variation in Salary.

mlr_mod3 <- lm(Salary ~ Season + Age + G + PA + X2B + RBI + GDP + SH + IBB, data = full_stats)
summary(mlr_mod3)

MLR Model 3
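As a complementary check on significance, we can compute 95% confidence intervals for the MLR Model 3 coefficients; intervals that exclude zero agree with the p-value conclusions above (a sketch):

# 95% confidence intervals for each coefficient in MLR Model 3
confint(mlr_mod3, level = 0.95)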

Model Comparison pt. 2

To identify the best MLR model, we select the model that explains the most variation in the dependent variable, Salary, i.e., the model with the highest adjusted R^2 value. The adjusted R^2 values of the 3 MLR models are as follows:

cat("Adjusted R-squared of mlr_mod1:", summary(mlr_mod1)$adj.r.squared, "\nAdjusted R-squared of mlr_mod2:", 
    summary(mlr_mod2)$adj.r.squared, "\nAdjusted R-squared of mlr_mod3:", summary(mlr_mod3)$adj.r.squared)

Model comparison

From above, we can see that MLR model 3 has the highest adjusted R^2 value. In other words, MLR model 3 explains the most variation in our response variable, Salary. We can also see how MLR model 3 compares with the best SLR model we found, SLR Model 1:

cat("Adjusted R-squared of mlr_mod3:", summary(mlr_mod3)$adj.r.squared, "\nAdjusted R-squared of slr_mod1:", 
    summary(slr_mod1)$adj.r.squared)

Model comparison

There is a substantial improvement in the adjusted R^2 value when using more than one variable to explain the variation in Salary. We can conclude that MLR Model 3 is the best of our candidate models, delivering the highest adjusted R^2 value. One way to formalize this comparison is the partial F-test shown below.
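Because slr_mod1's predictor (Age) is a subset of mlr_mod3's predictors and both models are fit to the same observations, the two models are nested, so anova() performs a valid partial F-test (a sketch):

# Partial F-test: does adding the extra predictors in mlr_mod3
# significantly improve the fit over Salary ~ Age alone?
anova(slr_mod1, mlr_mod3)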

Conclusion

Based on our above analysis and findings, we recommend that the SF Giants use our model to choose the ideal players to improve their roster. If the team has an estimated budget for acquiring new players, our model would be helpful in making the best use of their budget by allowing them to maximize stats of the players they can afford. Furthermore, if the team’s allocated budget changes going forward, they can expand or contract their selection of available players and still use our model to pick the best players under the circumstances. We hope the team finds this analysis useful, and wish them all the best this season.
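As an illustration of how the team might apply MLR Model 3, the sketch below predicts a salary for a hypothetical candidate player; all of the stat values are illustrative, not taken from the data:

# Hypothetical candidate: one season line of stats (illustrative values only)
candidate <- data.frame(Season = 2020, Age = 28, G = 140, PA = 560,
                        X2B = 30, RBI = 80, GDP = 12, SH = 2, IBB = 5)

# Estimated fair salary for the candidate under MLR Model 3
predict(mlr_mod3, newdata = candidate)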

Presentation Slides

The source code is available here.