How Can We Give Our Students a Confidence Boost?
The objective of this study is to build a prediction model, using linear regression, that can inform stakeholders what variables contribute to a student's confidence level.
Objective
It is essential that colleges and universities provide counseling and psychological services to their student body. Mental health difficulties can hinder a student’s academic success; untreated mental health issues may affect a student’s grades and in some cases lead to discontinued enrollment.
Some ways colleges and universities can take action in mental health initiatives include speaking about mental health to reduce stigmas, pursuing partnerships to ensure a campus-wide approach to mental health care, and investing in mental health services ensuring they’re accessible to every student. The most important thing an institution can do, however, is listen and respond to student needs.
The best way for students to communicate their mental health concerns comfortably is through an anonymous survey. The psychology department in the San Francisco County Office of Education conducted a study to analyze certain behaviors in college students. The objective of this study is to build a prediction model, using linear regression, that can inform stakeholders what variables contribute (whether positively or negatively) to a student’s confidence level so that they can provide better services that can boost a student’s confidence.
Libraries
The first step to every exploratory data analysis is to import all necessary libraries.
library(ggplot2)
Data Exploration
Read in the Data
I imported and read the CSV file into R as a dataframe called personality. Additionally, I printed the first 6 rows.
personality <- read.csv(file = "PersonalityTest2020.csv")
head(personality)
personality dataframe
I was curious to know how many survey responses I was working with, so I used nrow() and ncol() to print the dimensions of the dataframe. I can now see that I’m working with 32 survey responses to 50 personality questions. Note that each column represents a different question.
cat("There are", nrow(personality), "rows and", ncol(personality), "columns in this table.")
Dataframe dimensions
Because there are so many questions within the survey, I could not view all of them in the printout of head(personality). To view all the questions, i.e. the columns, I ran colnames(personality).
colnames(personality)
Column names
I also wanted to know the type of data I was working with, so I used the str() function to print the data type of each variable. Here, I can see that only two variables, UNSTRUCTURED and UNREPENTANT, are of the type Factor. I can also see that some columns contain NA values, which will need to be addressed.
str(personality)
Data types
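As a complement to str(), the number of missing values per column can be counted directly. The snippet below is a minimal sketch on a made-up toy dataframe (the values and the two-column layout are assumptions for illustration, not the survey data).

```r
# Toy dataframe standing in for the survey data (values are made up)
toy <- data.frame(
  CONFIDENT  = c(7, NA, 4, 9),
  METHODICAL = c(5, 6, NA, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the names of columns that actually contain missing values
cols_with_na <- names(na_counts[na_counts > 0])
print(cols_with_na)
```

Running the same two calls on personality would list which questions have unanswered responses.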
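Because I need all of my variables to be of numeric type, I used as.numeric() to convert UNSTRUCTURED and UNREPENTANT to numeric type, going through as.character() first so that the recorded values, rather than the factors’ internal level codes, are kept.
# as.numeric() on a factor returns level codes, so convert to character first;
# entries that are not numbers become NA and are handled in the cleaning step
personality$UNSTRUCTURED <- as.numeric(as.character(personality$UNSTRUCTURED))
personality$UNREPENTANT <- as.numeric(as.character(personality$UNREPENTANT))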
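One caveat is worth checking when converting factors: applying as.numeric() directly to a factor returns its internal level codes, not the numbers stored as labels. The sketch below uses a small made-up factor (illustrative values, not survey data) to show the difference.

```r
# A factor whose labels are numbers stored as text (illustrative values)
scores <- factor(c("10", "2", "7"))

# Direct conversion returns the internal level codes
codes <- as.numeric(scores)

# Converting to character first recovers the recorded numbers
values <- as.numeric(as.character(scores))

print(codes)
print(values)
```

Here the direct conversion yields 1, 2, 3 while the two-step conversion yields 10, 2, 7, so the two-step form is the safer habit for survey scores.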
To address input errors and missing values, I created a for loop that runs through every cell to do the following:
- Replace missing values or empty strings with 5
- Replace values above the upper threshold of 10 with 5
- Replace values below the lower threshold of 0 with 0
# Visit every cell: missing, empty, or > 10 becomes 5; negative becomes 0
for (i in 1:nrow(personality)) {
  for (j in 1:ncol(personality)) {
    value <- personality[i, j]
    # Check is.na() first so the comparisons never see a missing value
    if (is.na(value) || value == "" || value > 10) {
      personality[i, j] <- 5
    } else if (value < 0) {
      personality[i, j] <- 0
    }
  }
}
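The same cleaning rules can also be written column by column instead of cell by cell. The following is a sketch only: clean_scores() and its arguments lower, upper, and fill are hypothetical names introduced here, and the toy dataframe is made up.

```r
# Hypothetical helper: apply the three cleaning rules to every column
clean_scores <- function(df, lower = 0, upper = 10, fill = 5) {
  df[] <- lapply(df, function(col) {
    # Coerce to numeric; empty strings and non-numbers become NA
    col <- suppressWarnings(as.numeric(as.character(col)))
    # Missing values and values above the upper threshold get the fill value
    col[is.na(col) | col > upper] <- fill
    # Values below the lower threshold are raised to it
    col[col < lower] <- lower
    col
  })
  df
}

# Toy data with a missing value, an empty string, and out-of-range entries
toy <- data.frame(
  A = c(3, NA, 12, -1),
  B = c("7", "", "4", "15"),
  stringsAsFactors = FALSE
)
cleaned <- clean_scores(toy)
print(cleaned)
```

For a 32 by 50 table the loop is perfectly adequate; the vectorized form mainly helps when the same rules have to be reused on future survey data.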
Simple Linear Regression
The next step of my investigation is to analyze several simple linear regression models on my response variable, CONFIDENT. Because this is a repetitive regression between my response and a different predictor each time, I decided to make two functions. The first, fit_sum(), takes in the predictor and response variables as arguments and delivers the summary of the SLR model. The second, fit_model(), takes in the predictor and response variables and the x- and y-labels and delivers a graph of the SLR model. It also accepts intercept and slope values, but these are not used, because geom_smooth() fits the line itself.
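# Fit a simple linear regression of yvar on xvar and return its summary.
# Callers pass the column vectors themselves (e.g. after attach(personality)).
fit_sum <- function(xvar, yvar){
  slr_mod <- lm(yvar ~ xvar, data = personality)
  summary(slr_mod)
}
# Set the plot size for notebook output
options(repr.plot.width = 4, repr.plot.height = 3.75)
# Scatter plot with the fitted regression line; intercept and slope are accepted but unused
fit_model <- function(xvar, yvar, xlabel, ylabel, intercept, slope){
  ggplot(personality, aes(xvar, yvar)) + geom_point() +
    geom_smooth(method = 'lm', formula = y ~ x, col = "red") +
    ggtitle(label = paste("Regression of", ylabel, "Level"), subtitle = paste("on", xlabel, "Level")) +
    xlab(xlabel) + ylab(ylabel) + theme_classic()
}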
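A possible variation, shown here only as a sketch, is to pass column names as strings and build the formula with reformulate(), which avoids depending on attach(). The function name fit_sum_by_name() and the toy data are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical variant of fit_sum() that takes column names as strings
fit_sum_by_name <- function(data, predictor, response = "CONFIDENT") {
  # Build the formula response ~ predictor from the two names
  model_formula <- reformulate(predictor, response = response)
  summary(lm(model_formula, data = data))
}

# Toy data (made up) with the same column naming style as the survey
toy <- data.frame(METHODICAL = 1:5, CONFIDENT = c(2, 4, 5, 4, 5))
toy_summary <- fit_sum_by_name(toy, "METHODICAL")

# Coefficient table: estimate, standard error, t value, p-value
print(coef(toy_summary))
```

With this form the coefficient rows are labeled with the real column name rather than xvar, which makes the printed summaries easier to tell apart.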
SLR Model 1
In my first model, I am interested in studying how well being sociable influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted line is Y = 4.5845 + 0.1217X. This means when X = 0, a person is predicted to have a confidence level of 4.5845. We can also say that for every 1 unit increase in X, Y is predicted to increase by 0.1217 units.
- P-value: We can also see from the p-value benchmark that SOCIALABLE alone is not a significant indicator of CONFIDENT.
- Residual standard error: Model 1 has a residual standard error of 2.817 on 30 degrees of freedom. Typically, we want to minimize this error.
- Adjusted R-squared: Model 1 has an adjusted R^2 of -0.01361. The negative value means that, once the penalty for the extra parameter is applied, the regression line does no better than using the mean value of CONFIDENT to predict someone’s confidence level.
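To see how an adjusted R^2 can fall below zero, it helps to write out the standard adjustment, 1 - (1 - R^2)(n - 1)/(n - p - 1), for n observations and p predictors. The sketch below uses n = 32 and p = 1 as in the models here; the unadjusted R^2 of 0.02 is an illustrative value, not one reported in the summary.

```r
# Standard adjusted R-squared formula
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = 32 survey responses, p = 1 predictor; r2 = 0.02 is illustrative only
weak_fit <- adjusted_r2(r2 = 0.02, n = 32, p = 1)
print(weak_fit)

# A stronger fit stays positive after the same penalty
strong_fit <- adjusted_r2(r2 = 0.50, n = 32, p = 1)
print(strong_fit)
```

Whenever the unadjusted R^2 is smaller than p / (n - 1), the penalty outweighs the fit and the adjusted value turns negative.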
attach(personality)
slr_mod1 <- fit_sum(SOCIALABLE, CONFIDENT)
slr_mod1
SLR Model 1
To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there does not appear to be a strong linear relationship between someone’s confidence level and sociability level, as indicated in the model summary.
fit_model(personality$SOCIALABLE,personality$CONFIDENT, "Socialable", "Confident")
SLR Model 1 Visualization
SLR Model 2
In my second model, I am interested in studying how well being methodical influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted line is Y = 2.4588 + 0.4091X. This means when X = 0, a person is predicted to have a confidence level of 2.4588. We can also say that for every 1 unit increase in X, someone’s predicted confidence level increases by 0.4091 units.
- P-value: We can also see from the p-value benchmark that METHODICAL is a significant indicator of CONFIDENT.
- Residual standard error: Model 2 has a residual standard error of 2.583 on 30 degrees of freedom.
- Adjusted R-squared: Model 2 has an adjusted R^2 of 0.148. This means that METHODICAL accounts for 14.8% of variation in CONFIDENT.
slr_mod2 <- fit_sum(METHODICAL, CONFIDENT)
slr_mod2
SLR Model 2
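The fitted equation can be turned into predictions by plugging in values of X. The sketch below uses the Model 2 coefficients reported above; the helper name predict_confidence() and the three methodical levels are chosen here purely for illustration.

```r
# Coefficients as reported for SLR Model 2 in the text
intercept <- 2.4588
slope <- 0.4091

# Hypothetical helper: predicted confidence for a given methodical level
predict_confidence <- function(x) {
  intercept + slope * x
}

# Illustrative methodical levels at the bottom, middle, and top of the scale
methodical_levels <- c(0, 5, 10)
predictions <- predict_confidence(methodical_levels)
print(predictions)
```

The predictions are 2.4588, 4.5043, and 6.5498, so moving from the lowest to the highest methodical level raises the predicted confidence by about 4.1 points.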
To better visualize the SLR model, I ran fit_model() with the specified arguments. We can see that there appears to be a slight positive linear relationship between someone’s confidence level and methodical level, as indicated in the model summary. This means the greater a person’s methodical level, the greater their confidence level tends to be.
fit_model(personality$METHODICAL,personality$CONFIDENT, "Methodical", "Confident")
SLR Model 2 Visualization
SLR Model 3
In my third model, I am interested in studying how well being selfish influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted line is Y = 6.0288 - 0.2927X. This means when X = 0, a person is predicted to have a confidence level of 6.0288. We can also say that for every 1 unit increase in X, someone’s predicted confidence level decreases by 0.2927 units.
- P-value: We can also see from the p-value benchmark that SELFISH alone is not a significant indicator of CONFIDENT.
- Residual standard error: Model 3 has a residual standard error of 2.713 on 30 degrees of freedom.
- Adjusted R-squared: Model 3 has an adjusted R^2 of 0.06011. This means that SELFISH accounts for 6.011% of variation in CONFIDENT.
slr_mod3 <- fit_sum(SELFISH, CONFIDENT)
slr_mod3
SLR Model 3
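The significance judgments above are read off the printed summary, but the p-value can also be pulled out programmatically from the coefficient table. This is a sketch on made-up toy data, and the 0.05 cutoff is an assumption, since the text does not state which benchmark was used.

```r
# Toy data (made up) and a simple linear regression on it
toy <- data.frame(x = 1:5, y = c(2, 4, 5, 4, 5))
toy_fit <- lm(y ~ x, data = toy)

# The coefficient table has one row per term and a p-value column
coef_table <- coef(summary(toy_fit))
slope_p <- coef_table["x", "Pr(>|t|)"]
print(slope_p)

# Compare against an assumed 0.05 significance benchmark
is_significant <- slope_p < 0.05
print(is_significant)
```

For this toy fit the slope is not significant at the assumed cutoff, which mirrors the kind of conclusion drawn for SELFISH.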
To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there appears to be a slight negative linear relationship between someone’s confidence level and selfish level, as indicated in the model summary. This means the greater a person’s selfish level, the lower their confidence level tends to be.
fit_model(personality$SELFISH,personality$CONFIDENT, "Selfish", "Confident")
SLR Model 3 Visualization
SLR Model 4
In my fourth model, I am interested in studying how well being goal oriented influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted line is Y = 2.7818 + 0.3799X. This means when X = 0, a person is predicted to have a confidence level of 2.7818. We can also say that for every 1 unit increase in X, someone’s predicted confidence level increases by 0.3799 units.
- P-value: We can also see from the p-value benchmark that GOAL.ORIENTED alone is not a significant indicator of CONFIDENT.
- Residual standard error: Model 4 has a residual standard error of 2.702 on 30 degrees of freedom.
- Adjusted R-squared: Model 4 has an adjusted R^2 of 0.06772. This means that GOAL.ORIENTED accounts for 6.772% of variation in CONFIDENT.
slr_mod4 <- fit_sum(GOAL.ORIENTED, CONFIDENT)
slr_mod4
SLR Model 4
To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there appears to be a slight positive linear relationship between someone’s goal oriented level and confidence level, as indicated in the model summary. This means the greater a person’s goal oriented level, the greater their confidence level tends to be.
fit_model(personality$GOAL.ORIENTED,personality$CONFIDENT, "Goal Oriented", "Confident")
SLR Model 4 Visualization
SLR Model 5
In my last model, I am interested in studying how well being revengeful influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted line is Y = 5.6889 - 0.1071X. This means when X = 0, a person is predicted to have a confidence level of 5.6889. We can also say that for every 1 unit increase in X, someone’s predicted confidence level decreases by 0.1071 units.
- P-value: We can also see from the p-value benchmark that REVENGEFUL alone is not a significant indicator of CONFIDENT.
- Residual standard error: Model 5 has a residual standard error of 2.82 on 30 degrees of freedom.
- Adjusted R-squared: Model 5 has an adjusted R^2 of -0.01554. The negative value means that, once the penalty for the extra parameter is applied, the regression line does no better than using the mean value of CONFIDENT to predict someone’s confidence level.
slr_mod5 <- fit_sum(REVENGEFUL, CONFIDENT)
slr_mod5
SLR Model 5
To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there does not appear to be a strong linear relationship between someone’s revengeful level and confidence level, as indicated in the model summary.
fit_model(personality$REVENGEFUL,personality$CONFIDENT, "Revengeful", "Confident")
SLR Model 5 Visualization
Model Comparison
Because neither SOCIALABLE nor REVENGEFUL shows a linear relationship with CONFIDENT, it is safe to rule them out as significant predictors of someone’s confidence level. In comparing the remaining 3 models, we can see a positive linear relationship in both Model 2 (CONFIDENT on METHODICAL) and Model 4 (CONFIDENT on GOAL.ORIENTED), while Model 3 (CONFIDENT on SELFISH) exhibits a negative linear relationship. To identify the best model out of the 3, we select the independent variable that explains the most variation in the dependent variable, CONFIDENT, i.e. the model with the highest adjusted R^2 value. Based on this criterion, we can conclude that Model 2 (CONFIDENT on METHODICAL) is the best model with an adjusted R^2 value of 0.1480286. This means that METHODICAL accounts for 14.8% of variation in CONFIDENT. Because the proportion is fairly low, it is possible that the variation in someone’s confidence level depends on more than just one variable. In the following section, we can try adding more variables to build a multiple linear regression model that could possibly deliver a greater adjusted R^2 value.
cat("Adjusted R-squared of slr_mod2:", slr_mod2$adj.r.squared, "\nAdjusted R-squared of slr_mod3:",
slr_mod3$adj.r.squared, "\nAdjusted R-squared of slr_mod4:", slr_mod4$adj.r.squared)
Model Comparison
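The selection rule described above, picking the model with the highest adjusted R^2, can be expressed in a couple of lines. The sketch below hard-codes the adjusted R^2 values reported in the text for the three remaining models; the vector and variable names are chosen here for illustration.

```r
# Adjusted R-squared values as reported in the text
adj_r2 <- c(slr_mod2 = 0.1480286, slr_mod3 = 0.06011, slr_mod4 = 0.06772)

# Rank the models from best to worst
ranked <- sort(adj_r2, decreasing = TRUE)
print(ranked)

# The first name in the ranking is the selected model
best_model <- names(ranked)[1]
print(best_model)
```

The ranking puts slr_mod2 first, followed by slr_mod4 and slr_mod3, which agrees with the conclusion that METHODICAL is the strongest single predictor of the three.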
Multiple Linear Regression
MLR Model 1
In my first model, I am interested in studying how well DETAIL.ORIENTED, AFFECTIONATE, METHODICAL, and INTROVERT influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted model is Y = -0.02369 - 0.45722X_1 + 0.41523X_2 + 0.40954X_3 + 0.28809X_4, where X_1, …, X_4 stand for the predictors in the order they appear in the formula. This means when X_1, …, X_4 = 0, a person is predicted to have a confidence level of -0.02369. We can also say that for every one unit increase in any one of X_1, X_2, X_3, or X_4, holding the others constant, someone’s predicted confidence level increases or decreases by the value of the corresponding slope.
- P-value: We can also see from the p-value benchmark that every variable is a significant indicator of CONFIDENT.
- Residual standard error: Model 1 has a residual standard error of 2.126 on 27 degrees of freedom.
- Adjusted R-squared: Model 1 has an adjusted R^2 of 0.4225. This means that MLR model 1 accounts for 42.25% of variation in CONFIDENT.
mlr_mod1 <- lm(CONFIDENT ~ DETAIL.ORIENTED + AFFECTIONATE + METHODICAL + INTROVERT, data = personality)
summary(mlr_mod1)
MLR Model 1
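A multiple regression equation is evaluated the same way as a simple one, by multiplying each coefficient by its predictor value and adding the intercept. The sketch below uses the MLR Model 1 coefficients reported above and assumes they map to the predictors in formula order; the student scoring 5 on every trait is hypothetical.

```r
# Coefficients as reported for MLR Model 1 (assumed to follow formula order)
coefs <- c(
  intercept       = -0.02369,
  DETAIL.ORIENTED = -0.45722,
  AFFECTIONATE    =  0.41523,
  METHODICAL      =  0.40954,
  INTROVERT       =  0.28809
)

# Hypothetical student scoring 5 on every trait; the leading 1 multiplies the intercept
student <- c(1, 5, 5, 5, 5)

# Predicted confidence is the sum of coefficient times value
predicted <- sum(coefs * student)
print(predicted)
```

The result is about 3.25, and changing one entry of student while holding the rest fixed shifts the prediction by exactly that predictor's slope per unit.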
MLR Model 2
In my second model, I am interested in studying how well VIBRANT, TEAM.PLAYER, DETAIL.ORIENTED, METHODICAL, and INTROVERT influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted model is Y = -0.4626 + 0.6161X_1 - 0.4424X_2 - 0.5928X_3 + 0.6492X_4 + 0.3589X_5, where X_1, …, X_5 stand for the predictors in the order they appear in the formula. This means when X_1, …, X_5 = 0, a person is predicted to have a confidence level of -0.4626. We can also say that for every one unit increase in any one of X_1, X_2, X_3, X_4, or X_5, holding the others constant, someone’s predicted confidence level increases or decreases by the value of the corresponding slope.
- P-value: We can also see from the p-value benchmark that every variable is a significant indicator of CONFIDENT.
- Residual standard error: Model 2 has a residual standard error of 1.909 on 26 degrees of freedom.
- Adjusted R-squared: Model 2 has an adjusted R^2 of 0.5343. This means that MLR model 2 accounts for 53.43% of variation in CONFIDENT.
mlr_mod2 <- lm(CONFIDENT ~ VIBRANT + TEAM.PLAYER + DETAIL.ORIENTED + METHODICAL + INTROVERT, data = personality)
summary(mlr_mod2)
MLR Model 2
MLR Model 3
In my third model, I am interested in studying how well VIBRANT, REMORSEFUL, TEAM.PLAYER, DETAIL.ORIENTED, METHODICAL, and SENSE.EXPERIENCED influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted model is Y = 0.1557 + 0.5518X_1 + 0.4473X_2 - 0.3793X_3 - 0.5419X_4 + 0.5888X_5 - 0.2446X_6, where X_1, …, X_6 stand for the predictors in the order they appear in the formula. This means when X_1, …, X_6 = 0, a person is predicted to have a confidence level of 0.1557. We can also say that for every one unit increase in any one of X_1, X_2, X_3, X_4, X_5, or X_6, holding the others constant, someone’s predicted confidence level increases or decreases by the value of the corresponding slope.
- P-value: We can also see from the p-value benchmark that every variable, aside from SENSE.EXPERIENCED, is a significant indicator of CONFIDENT.
- Residual standard error: Model 3 has a residual standard error of 1.847 on 25 degrees of freedom.
- Adjusted R-squared: Model 3 has an adjusted R^2 of 0.5644. This means that MLR model 3 accounts for 56.44% of variation in CONFIDENT.
mlr_mod3 <- lm(CONFIDENT ~ VIBRANT + REMORSEFUL + TEAM.PLAYER + DETAIL.ORIENTED + METHODICAL + SENSE.EXPERIENCED,
data = personality)
summary(mlr_mod3)
MLR Model 3
MLR Model 4
In my fourth model, I am interested in studying how well VIBRANT, REMORSEFUL, SELF.ASSERTING, TEAM.PLAYER, DETAIL.ORIENTED, METHODICAL, and SENSE.EXPERIENCED influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted model is Y = -2.2294 + 0.6608X_1 + 0.5224X_2 + 0.2650X_3 - 0.3738X_4 - 0.5009X_5 + 0.5918X_6 - 0.3186X_7, where X_1, …, X_7 stand for the predictors in the order they appear in the formula. This means when X_1, …, X_7 = 0, a person is predicted to have a confidence level of -2.2294. We can also say that for every one unit increase in any one of X_1, X_2, X_3, X_4, X_5, X_6, or X_7, holding the others constant, someone’s predicted confidence level increases or decreases by the value of the corresponding slope.
- P-value: We can also see from the p-value benchmark that every variable is a significant indicator of CONFIDENT.
- Residual standard error: Model 4 has a residual standard error of 1.695 on 24 degrees of freedom.
- Adjusted R-squared: Model 4 has an adjusted R^2 of 0.6329. This means that MLR model 4 accounts for 63.29% of variation in CONFIDENT.
mlr_mod4 <- lm(CONFIDENT ~ VIBRANT + REMORSEFUL + SELF.ASSERTING + TEAM.PLAYER + DETAIL.ORIENTED + METHODICAL +
SENSE.EXPERIENCED, data = personality)
summary(mlr_mod4)
MLR Model 4
MLR Model 5
In my fifth model, I am interested in studying how well LAID.BACK, REMORSEFUL, DETAIL.ORIENTED, AFFECTIONATE, GOAL.ORIENTED, NON.PHILOSOPHICAL, VULNERABLE, and SENTIMENTAL influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:
- Residuals: The distribution of residuals is centered around a median close to 0.
- Coefficients: The equation of the fitted model is Y = 5.7576 - 0.3384X_1 + 0.5604X_2 - 0.8235X_3 + 0.6033X_4 + 0.5619X_5 - 0.6828X_6 - 0.3707X_7 - 0.2216X_8, where X_1, …, X_8 stand for the predictors in the order they appear in the formula. This means when X_1, …, X_8 = 0, a person is predicted to have a confidence level of 5.7576. We can also say that for every one unit increase in any one of X_1, X_2, X_3, X_4, X_5, X_6, X_7, or X_8, holding the others constant, someone’s predicted confidence level increases or decreases by the value of the corresponding slope.
- P-value: We can also see from the p-value benchmark that every variable, except for SENTIMENTAL, is a significant indicator of CONFIDENT.
- Residual standard error: Model 5 has a residual standard error of 1.632 on 23 degrees of freedom.
- Adjusted R-squared: Model 5 has an adjusted R^2 of 0.6597. This means that MLR model 5 accounts for 65.97% of variation in CONFIDENT.
mlr_mod5 <- lm(CONFIDENT ~ LAID.BACK + REMORSEFUL + DETAIL.ORIENTED + AFFECTIONATE + GOAL.ORIENTED + NON.PHILOSOPHICAL
+ VULNERABLE + SENTIMENTAL, data = personality)
summary(mlr_mod5)
MLR Model 5
Model Comparison pt. 2
Similar to the SLR Model Comparison, in identifying the best MLR model, we select the model that explains the most variation in the dependent variable, CONFIDENT, i.e. the model with the highest adjusted R^2 value. The adjusted R^2 values of the 5 MLR models are as follows:
cat("Adjusted R-squared of mlr_mod1:", summary(mlr_mod1)$adj.r.squared, "\nAdjusted R-squared of mlr_mod2:",
summary(mlr_mod2)$adj.r.squared, "\nAdjusted R-squared of mlr_mod3:", summary(mlr_mod3)$adj.r.squared,
"\nAdjusted R-squared of mlr_mod4:", summary(mlr_mod4)$adj.r.squared, "\nAdjusted R-squared of mlr_mod5:",
summary(mlr_mod5)$adj.r.squared)
Model comparison
From above, we can conclude that MLR Model 5 is the best model, with the highest adjusted R^2 value. Another observation that can be made from comparing the summary printouts is that the residual standard error decreased from each model to the next, from 2.126 for MLR Model 1 down to 1.632 for MLR Model 5. Typically, we want to minimize the residual standard error, so that is a very good sign. Now let’s compare MLR Model 5 with the best model we found from our SLR, SLR Model 2.
cat("Adjusted R-squared of mlr_mod5:", summary(mlr_mod5)$adj.r.squared, "\nAdjusted R-squared of slr_mod2:",
slr_mod2$adj.r.squared)
Model comparison
There is a substantial improvement in the adjusted R^2 value, from 0.148 to 0.6597, when using more than one variable to explain the variation in CONFIDENT. We can conclude that MLR Model 5 is indeed the best model out of all the models tested.
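The two trends noted in this comparison, falling residual standard error and rising adjusted R^2, can be checked together by collecting the figures reported in the five MLR summaries into one table. This is only a sketch; the dataframe and column names are chosen here for illustration.

```r
# Figures as reported in the five MLR model summaries
mlr_results <- data.frame(
  model      = paste0("mlr_mod", 1:5),
  predictors = c(4, 5, 6, 7, 8),
  rse        = c(2.126, 1.909, 1.847, 1.695, 1.632),
  adj_r2     = c(0.4225, 0.5343, 0.5644, 0.6329, 0.6597),
  stringsAsFactors = FALSE
)
print(mlr_results)

# Does the residual standard error fall from each model to the next?
rse_always_falls <- all(diff(mlr_results$rse) < 0)

# Does the adjusted R-squared rise from each model to the next?
adj_r2_always_rises <- all(diff(mlr_results$adj_r2) > 0)

# Model with the highest adjusted R-squared
best <- mlr_results$model[which.max(mlr_results$adj_r2)]
print(best)
```

Both checks come out TRUE and the best model is mlr_mod5, consistent with the conclusion above.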
Summary
In deciding what kind of psychological and counseling services are necessary to help students build the confidence to achieve academically, socially, and personally, it is essential to listen and respond to student needs. Recently, the San Francisco County Office of Education conducted a study to analyze certain behaviors in college students. With this study, I was able to use the data to build a prediction model, with linear regression, that can identify what variables contribute to a student’s confidence.
Before I could build any model, however, it was essential that I first cleansed my data of any incomplete, inaccurate, or unreasonable data. I decided that for any missing values or for any values above the threshold of 10 I would replace these inputs with the balanced value of 5. Because I could not tell if students intended to put a 10 or accidentally misplaced decimal points, I decided to use the value 5 so that it would not skew my data in any way. Additionally, I decided to replace all values below the threshold of 0 with 0. This seemed reasonable for any negative value. Although there weren’t any, this would still help cleanse any other future survey data.
After data cleansing, I began to build my simple linear regression models. Using the variables SOCIALABLE, METHODICAL, SELFISH, GOAL.ORIENTED, and REVENGEFUL, I created 5 different SLR models to see how well each variable influenced a student’s confidence. I found that METHODICAL did the best at explaining the variation in our response variable, CONFIDENT, with an adjusted R^2 of 0.148. This adjusted R^2 seemed fairly low, however, so I decided to add some more variables to the model to build a multiple linear regression that could possibly deliver a better adjusted R^2 value. After some testing, I found that the model with the variables LAID.BACK, REMORSEFUL, DETAIL.ORIENTED, AFFECTIONATE, GOAL.ORIENTED, NON.PHILOSOPHICAL, VULNERABLE, and SENTIMENTAL best predicts a student’s confidence with an adjusted R^2 of 0.6597, a substantial improvement over our SLR model. These are the variables that best explain a student’s confidence among the models tested, and they can help the San Francisco County Office of Education identify the kind of services they should provide to their students.
The source code is available here.