
How Can We Give Our Students a Confidence Boost?

20 Sep 2020

The objective of this study is to build a prediction model, using linear regression, that can inform stakeholders what variables contribute to a student's confidence level.

Objective

It is essential that colleges and universities provide counseling and psychological services to their student body. Mental health difficulties can hinder a student’s academic success; untreated mental health issues may affect a student’s grades and in some cases lead to discontinued enrollment.

Some ways colleges and universities can take action in mental health initiatives include speaking about mental health to reduce stigmas, pursuing partnerships to ensure a campus-wide approach to mental health care, and investing in mental health services ensuring they’re accessible to every student. The most important thing an institution can do, however, is listen and respond to student needs.

The best way for students to communicate their mental health concerns comfortably is through an anonymous survey. The psychology department in the San Francisco County Office of Education conducted a study to analyze certain behaviors in college students. The objective of this study is to build a prediction model, using linear regression, that can inform stakeholders which variables contribute (whether positively or negatively) to a student’s confidence level so that they can provide better services that boost a student’s confidence.

Libraries

The first step to every exploratory data analysis is to import all necessary libraries.

library(ggplot2)

Data Exploration

Read in the Data

I imported and read the CSV file into R as a dataframe called personality. Additionally, I printed the first 6 rows.

personality <- read.csv(file = "PersonalityTest2020.csv")
head(personality)

personality dataframe
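As an aside, blank or placeholder entries can be flagged as missing at import time through the na.strings argument of read.csv(). The sketch below is illustrative only: it reads a tiny made-up CSV through the text argument rather than the real survey file, and the "n/a" placeholder is an assumption, not something known to appear in the data.

```r
# Hypothetical three-row CSV standing in for the survey file
csv_text <- "CONFIDENT,UNSTRUCTURED
7,3
,n/a
4,8"

# Treat empty cells, the string NA and the assumed placeholder n/a as missing
toy <- read.csv(text = csv_text, na.strings = c("", "NA", "n/a"))

# Both columns stay numeric because the placeholders were read as NA
str(toy)
```

Handling placeholders at import keeps a stray text entry from turning a whole column into a non-numeric type.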

I was curious to know how many survey responses I was working with, so I used nrow() and ncol() to print the dimensions of the dataframe. I can now see that I’m working with 32 survey responses to 50 personality questions. Note each column represents a different question.

cat("There are", nrow(personality), "rows and", ncol(personality), "columns in this table.")

Dataframe dimensions

Because there are so many questions within the survey, I could not view all of them in the printout of head(personality). To view all the questions, i.e. the columns, I ran colnames(personality).

colnames(personality)

Column names

I also wanted to know the type of data I was working with, so I used the str() function to print the data type of each variable. Here, I can see that only two variables, UNSTRUCTURED and UNREPENTANT, are of the type Factor. I can also see that some columns contain NA values, which will need to be addressed.

str(personality)

Data types
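With 50 columns, the str() printout is long, so the same two checks can also be summarized programmatically. A minimal sketch on an invented toy data frame (the values are made up; only the column names are borrowed from the survey):

```r
# Hypothetical toy data frame: one column was read as a factor, one has a gap
toy <- data.frame(
    CONFIDENT    = c(7, 4, NA, 9),
    METHODICAL   = c(5, 6, 2, 8),
    UNSTRUCTURED = c("3", "", "8", "1"),
    stringsAsFactors = TRUE
)

# Names of the columns that are not numeric yet
non_numeric <- names(toy)[!sapply(toy, is.numeric)]

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

print(non_numeric)
print(na_counts)
```

Note that an empty string inside a factor is a regular level, not an NA, so it does not show up in the NA counts.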

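Because I need all of my variables to be numeric, I converted UNSTRUCTURED and UNREPENTANT to numeric type. Calling as.numeric() directly on a Factor returns its internal level codes rather than the recorded values, so each column is converted to character first. Entries that are empty or not numeric become NA and are handled in the cleaning step that follows.

personality$UNSTRUCTURED <- as.numeric(as.character(personality$UNSTRUCTURED))
personality$UNREPENTANT <- as.numeric(as.character(personality$UNREPENTANT))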
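One detail worth checking when converting a Factor: as.numeric() applied directly returns the internal level codes, while going through as.character() first returns the recorded values. A minimal sketch on invented scores:

```r
# Hypothetical factor holding survey scores stored as text
scores <- factor(c("7", "10", "", "3"))

# Direct conversion yields the position of each value among the sorted levels
codes <- as.numeric(scores)

# Converting to character first recovers the scores; the empty string becomes NA
values <- as.numeric(as.character(scores))

print(codes)
print(values)
```

The two results differ in every position here, which is why it helps to compare a few converted cells against the raw CSV.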

To address input errors and missing values, I created a for loop that runs through every cell and does the following:

Replace missing values or empty strings with 5
Replace values above the upper threshold of 10 with 5
Replace values below the lower threshold of 0 with 0

# Visit every cell: missing or too-large values become 5, negatives become 0
for (i in seq_len(nrow(personality))) {
    for (j in seq_len(ncol(personality))) {
        value <- personality[i, j]
        # is.na() is checked first so the comparisons never see an NA
        if (is.na(value) || value == "" || value > 10) {
            personality[i, j] <- 5
        } else if (value < 0) {
            personality[i, j] <- 0
        }
    }
}
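The same three rules can also be written without explicit loops. The sketch below is one possible vectorized version, shown on an invented toy data frame; clean_scores is a hypothetical helper name and not part of the original analysis.

```r
# Hypothetical toy survey with a gap, two out-of-range scores and a negative score
toy <- data.frame(
    CONFIDENT  = c(7, NA, 12, 3),
    METHODICAL = c(-2, 5, 10, 100)
)

# Illustrative helper: apply the three cleaning rules to one numeric column
clean_scores <- function(x, low = 0, high = 10, fill = 5) {
    x[is.na(x) | x > high] <- fill   # missing or above the upper threshold
    x[x < low] <- low                # below the lower threshold
    x
}

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, clean_scores)
print(toy)
```

Working column by column keeps the rule in one place, which makes the thresholds easy to change for a future survey.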

Simple Linear Regression

The next step of my investigation is to analyze several simple linear regression models on my response variable, CONFIDENT. Because each model repeats the same regression of the response on a different predictor, I decided to write two functions. The first, fit_sum(), takes the predictor and response variables as arguments and returns the summary of the SLR model. The second, fit_model(), takes the predictor and response variables plus the x- and y-axis labels and draws a scatter plot with the fitted SLR line.

# Fit a simple linear regression of yvar on xvar and return its summary
fit_sum <- function(xvar, yvar){
    slr_mod <- lm(yvar ~ xvar, data = personality)
    summary(slr_mod)
}

# Plot size for notebook output
options(repr.plot.width = 4, repr.plot.height = 3.75)

# Scatter plot of yvar against xvar with the fitted regression line in red
fit_model <- function(xvar, yvar, xlabel, ylabel){
    ggplot(personality, aes(xvar, yvar)) + geom_point() +
    geom_smooth(method = 'lm', formula = y ~ x, col = "red") +
    ggtitle(label = paste("Regression of", ylabel, "Level"), subtitle = paste("on", xlabel, "Level")) +
    xlab(xlabel) + ylab(ylabel) + theme_classic()
}
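A possible variant of the summary helper takes column names as strings and builds the formula with reformulate(), so it does not depend on attach(). This is a sketch on invented data; fit_sum_by_name is a hypothetical name and not a function from the original analysis.

```r
# Hypothetical toy data; the real analysis uses the personality data frame
toy <- data.frame(
    CONFIDENT  = c(2, 4, 5, 4, 7, 8),
    METHODICAL = c(1, 2, 3, 4, 5, 6)
)

# Illustrative variant that takes the data frame and column names as arguments
fit_sum_by_name <- function(data, predictor, response) {
    # reformulate() builds the formula response ~ predictor from the names
    model_formula <- reformulate(predictor, response = response)
    summary(lm(model_formula, data = data))
}

toy_summary <- fit_sum_by_name(toy, "METHODICAL", "CONFIDENT")
print(toy_summary$coefficients)
```

A side benefit of this form is that the coefficient table is labeled with the actual column name instead of a generic argument name.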

SLR Model 1

In my first model, I am interested in studying how well being sociable influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted line is Y = 4.5845 + 0.1217X. This means that when X = 0, the predicted confidence level is 4.5845, and that every 1-unit increase in X is associated with a 0.1217-unit increase in Y.
P-value: We can also see from the p-value benchmark that SOCIALABLE is not a very significant indicator of CONFIDENT alone.
Residual standard error: Model 1 has a residual standard error of 2.817 on 30 degrees of freedom. Typically, we want to minimize this error.
Adjusted R-squared: Model 1 has an adjusted R^2 of -0.01361. A negative value means that, once the penalty for the number of predictors is applied, the model explains no more of the variation in CONFIDENT than simply predicting its mean for everyone.

attach(personality)

slr_mod1 <- fit_sum(SOCIALABLE, CONFIDENT)
slr_mod1

SLR Model 1
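To see how an adjusted R^2 can dip below zero, it helps to compute it by hand from the ordinary R^2 using adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below uses invented data with almost no linear trend, not the survey responses.

```r
# Hypothetical toy data with almost no linear relationship
toy <- data.frame(
    CONFIDENT  = c(5, 3, 8, 4, 6, 4, 7, 3),
    SOCIALABLE = c(1, 2, 3, 4, 5, 6, 7, 8)
)

toy_summary <- summary(lm(CONFIDENT ~ SOCIALABLE, data = toy))

n <- nrow(toy)   # number of observations
p <- 1           # number of predictors

# Adjusted R^2 penalizes R^2 for the number of predictors
r2 <- toy_summary$r.squared
adj_r2_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

cat("R-squared:", r2, "\nAdjusted R-squared:", adj_r2_by_hand, "\n")
```

For reference, with 32 responses and one predictor, the reported adjusted R^2 of -0.01361 corresponds to an unadjusted R^2 of about 0.019.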

To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there does not appear to be a strong linear relationship between someone’s confidence level and sociable level, as indicated in the model summary.

fit_model(personality$SOCIALABLE,personality$CONFIDENT, "Socialable", "Confident")

SLR Model 1 Visualization

SLR Model 2

In my second model, I am interested in studying how well being methodical influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted line is Y = 2.4588 + 0.4091X. This means that when X = 0, the predicted confidence level is 2.4588, and that every 1-unit increase in X is associated with a 0.4091-unit increase in someone’s confidence level.
P-value: We can also see from the p-value benchmark that METHODICAL is a significant indicator of CONFIDENT.
Residual standard error: Model 2 has a residual standard error of 2.583 on 30 degrees of freedom.
Adjusted R-squared: Model 2 has an adjusted R^2 of 0.148. This means that METHODICAL accounts for 14.8% of variation in CONFIDENT.

slr_mod2 <- fit_sum(METHODICAL, CONFIDENT)
slr_mod2

SLR Model 2
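The slope and its p-value can also be pulled out of the summary object instead of being read off the printout. A minimal sketch on invented data; the 0.05 cutoff is an assumption, since the post does not state which benchmark it uses.

```r
# Hypothetical toy data with a clear positive trend
toy <- data.frame(
    CONFIDENT  = c(2, 4, 5, 4, 7, 8, 7, 9),
    METHODICAL = c(1, 2, 3, 4, 5, 6, 7, 8)
)
toy_summary <- summary(lm(CONFIDENT ~ METHODICAL, data = toy))

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(toy_summary)

slope   <- coef_table["METHODICAL", "Estimate"]
p_value <- coef_table["METHODICAL", "Pr(>|t|)"]

# alpha = 0.05 is an assumed cutoff, not one taken from the post
alpha <- 0.05
cat("Slope:", slope, "\np-value:", p_value, "\nSignificant at alpha:", p_value < alpha, "\n")
```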

To better visualize the SLR model, I ran fit_model() with the specified arguments. We can see that there appears to be a slight positive linear relationship between someone’s confidence level and methodical level, as indicated in the model summary. This means the greater a person’s methodical level, the greater their confidence level tends to be.

fit_model(personality$METHODICAL,personality$CONFIDENT, "Methodical", "Confident")

SLR Model 2 Visualization

SLR Model 3

In my third model, I am interested in studying how well being selfish influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted line is Y = 6.0288 - 0.2927X. This means that when X = 0, the predicted confidence level is 6.0288, and that every 1-unit increase in X is associated with a 0.2927-unit decrease in someone’s confidence level.
P-value: We can also see from the p-value benchmark that SELFISH is not a significant indicator of CONFIDENT alone.
Residual standard error: Model 3 has a residual standard error of 2.713 on 30 degrees of freedom.
Adjusted R-squared: Model 3 has an adjusted R^2 of 0.06011. This means that SELFISH accounts for 6.011% of variation in CONFIDENT.

slr_mod3 <- fit_sum(SELFISH, CONFIDENT)
slr_mod3

SLR Model 3
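The residual standard error and its degrees of freedom can be reproduced by hand, which makes clear what the summary line reports. A sketch on invented data:

```r
# Hypothetical toy data with a mild negative trend
toy <- data.frame(
    CONFIDENT = c(6, 7, 4, 5, 3, 5, 2, 4),
    SELFISH   = c(1, 2, 3, 4, 5, 6, 7, 8)
)
toy_model <- lm(CONFIDENT ~ SELFISH, data = toy)

# Residual degrees of freedom: observations minus estimated coefficients
df_resid <- nrow(toy) - length(coef(toy_model))

# Residual standard error: square root of the residual sum of squares over df
rse_by_hand <- sqrt(sum(residuals(toy_model)^2) / df_resid)

cat("Residual standard error:", rse_by_hand, "on", df_resid, "degrees of freedom\n")
```

With 32 responses and 2 estimated coefficients (intercept and slope), the same subtraction gives the 30 degrees of freedom reported for the SLR models.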

To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there appears to be a slight negative linear relationship between someone’s confidence level and selfish level, as indicated in the model summary. This means the greater a person’s selfish level, the lower their confidence level tends to be.

fit_model(personality$SELFISH,personality$CONFIDENT, "Selfish", "Confident")

SLR Model 3 Visualization

SLR Model 4

In my fourth model, I am interested in studying how well being goal oriented influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted line is Y = 2.7818 + 0.3799X. This means that when X = 0, the predicted confidence level is 2.7818, and that every 1-unit increase in X is associated with a 0.3799-unit increase in someone’s confidence level.
P-value: We can also see from the p-value benchmark that GOAL.ORIENTED is not a significant indicator of CONFIDENT alone.
Residual standard error: Model 4 has a residual standard error of 2.702 on 30 degrees of freedom.
Adjusted R-squared: Model 4 has an adjusted R^2 of 0.06772. This means that GOAL.ORIENTED accounts for 6.772% of variation in CONFIDENT.

slr_mod4 <- fit_sum(GOAL.ORIENTED, CONFIDENT)
slr_mod4

SLR Model 4
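A confidence interval for the slope is another way to read a non-significant result: if the interval spans zero, the data are compatible with no linear effect. A sketch using confint() on invented data (not the survey responses):

```r
# Hypothetical toy data with almost no linear relationship
toy <- data.frame(
    CONFIDENT     = c(5, 3, 8, 4, 6, 4, 7, 3),
    GOAL.ORIENTED = c(1, 2, 3, 4, 5, 6, 7, 8)
)
toy_model <- lm(CONFIDENT ~ GOAL.ORIENTED, data = toy)

# 95% confidence interval for each coefficient
ci <- confint(toy_model, level = 0.95)
slope_ci <- ci["GOAL.ORIENTED", ]

# An interval that spans zero is consistent with a non-significant slope
spans_zero <- slope_ci[1] < 0 && slope_ci[2] > 0
print(slope_ci)
cat("Interval spans zero:", spans_zero, "\n")
```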

To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there appears to be a slight positive linear relationship between someone’s confidence level and goal oriented level, as indicated in the model summary. This means the greater a person’s goal oriented level, the greater their confidence level tends to be.

fit_model(personality$GOAL.ORIENTED,personality$CONFIDENT, "Goal Oriented", "Confident")

SLR Model 4 Visualization

SLR Model 5

In my last model, I am interested in studying how well being revengeful influences someone’s confidence. Running fit_sum with the specified arguments, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted line is Y = 5.6889 - 0.1071X. This means that when X = 0, the predicted confidence level is 5.6889, and that every 1-unit increase in X is associated with a 0.1071-unit decrease in someone’s confidence level.
P-value: We can also see from the p-value benchmark that REVENGEFUL is not a significant indicator of CONFIDENT alone.
Residual standard error: Model 5 has a residual standard error of 2.82 on 30 degrees of freedom.
Adjusted R-squared: Model 5 has an adjusted R^2 of -0.01554. A negative value means that, once the penalty for the number of predictors is applied, the model explains no more of the variation in CONFIDENT than simply predicting its mean for everyone.

slr_mod5 <- fit_sum(REVENGEFUL, CONFIDENT)
slr_mod5

SLR Model 5

To better visualize the SLR model, I ran fit_model() with the specified arguments. Here, we can see that there does not appear to be a strong linear relationship between someone’s confidence level and revengeful level, as indicated in the model summary.

fit_model(personality$REVENGEFUL,personality$CONFIDENT, "Revengeful", "Confident")

SLR Model 5 Visualization

Model Comparison

Because neither SOCIALABLE nor REVENGEFUL shows a linear relationship with CONFIDENT, it is safe to rule them out as significant predictors of someone’s confidence level. Comparing the remaining 3 models, we can see a positive linear relationship in both Model 2 (CONFIDENT on METHODICAL) and Model 4 (CONFIDENT on GOAL.ORIENTED), while Model 3 (CONFIDENT on SELFISH) exhibits a negative linear relationship. To identify the best model out of the 3, we select the independent variable that explains the most variation in the dependent variable, CONFIDENT, i.e. the model with the highest adjusted R^2 value. Based on this criterion, we can conclude that Model 2 (CONFIDENT on METHODICAL) is the best model, with an adjusted R^2 value of 0.1480286. This means that METHODICAL accounts for 14.8% of the variation in CONFIDENT. Because this proportion is fairly low, it is likely that someone’s confidence level depends on more than one variable. In the following section, we add more variables to build multiple linear regression models that could deliver a greater adjusted R^2 value.

cat("Adjusted R-squared of slr_mod2:", slr_mod2$adj.r.squared, "\nAdjusted R-squared of slr_mod3:", 
    slr_mod3$adj.r.squared, "\nAdjusted R-squared of slr_mod4:", slr_mod4$adj.r.squared)

Model Comparison
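When there are many candidate predictors, the comparison can be automated by fitting one simple regression per column and collecting the adjusted R^2 values. The sketch below shows the idea on an invented toy data frame; the values are made up and only the column names echo the survey.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
    CONFIDENT  = c(2, 4, 5, 4, 7, 8, 7, 9),
    METHODICAL = c(1, 2, 3, 4, 5, 6, 7, 8),
    SELFISH    = c(8, 5, 7, 6, 3, 4, 5, 1),
    SOCIALABLE = c(4, 1, 5, 2, 6, 3, 2, 5)
)

predictors <- c("METHODICAL", "SELFISH", "SOCIALABLE")

# Fit one simple regression per predictor and keep its adjusted R^2
adj_r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = "CONFIDENT"), data = toy)
    summary(fit)$adj.r.squared
})

# Highest adjusted R^2 first
adj_r2_sorted <- sort(adj_r2, decreasing = TRUE)
print(adj_r2_sorted)
```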

Multiple Linear Regression

MLR Model 1

In my first model, I am interested in studying how well DETAIL.ORIENTED, AFFECTIONATE, METHODICAL, and INTROVERT influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted model is Y = -0.02369 - 0.45722X_1 + 0.41523X_2 + 0.40954X_3 + 0.28809X_4, where X_1, …, X_4 are the predictors in the order listed above. This means that when X_1, …, X_4 are all 0, the predicted confidence level is -0.02369. For every one-unit increase in any one of X_1, X_2, X_3, or X_4, with the other predictors held constant, the predicted confidence level increases or decreases by the value of the corresponding slope.
P-value: We can also see from the p-value benchmark that every variable is a significant indicator of CONFIDENT.
Residual standard error: Model 1 has a residual standard error of 2.126 on 27 degrees of freedom.
Adjusted R-squared: Model 1 has an adjusted R^2 of 0.4225. This means that MLR model 1 accounts for 42.25% of variation in CONFIDENT.

mlr_mod1 <- lm(CONFIDENT ~ DETAIL.ORIENTED + AFFECTIONATE + METHODICAL + INTROVERT, data = personality)
summary(mlr_mod1)

MLR Model 1
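In a multiple regression, each slope describes the change in the predicted response for a one-unit change in that predictor while the others stay fixed. The sketch below checks this with predict() on invented data and two hypothetical students.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
    CONFIDENT    = c(2, 4, 5, 4, 7, 8, 7, 9),
    METHODICAL   = c(1, 2, 3, 4, 5, 6, 7, 8),
    AFFECTIONATE = c(3, 1, 4, 1, 5, 9, 2, 6)
)
toy_model <- lm(CONFIDENT ~ METHODICAL + AFFECTIONATE, data = toy)

# Two hypothetical students who differ by one unit in METHODICAL only
students <- data.frame(METHODICAL = c(4, 5), AFFECTIONATE = c(3, 3))
predicted <- predict(toy_model, newdata = students)

# The gap between the two predictions equals the METHODICAL coefficient
prediction_gap <- unname(predicted[2] - predicted[1])
methodical_coef <- unname(coef(toy_model)["METHODICAL"])
cat("Gap:", prediction_gap, " Coefficient:", methodical_coef, "\n")
```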

MLR Model 2

In my second model, I am interested in studying how well VIBRANT, TEAM.PLAYER, DETAIL.ORIENTED, METHODICAL, and INTROVERT influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted model is Y = -0.4626 + 0.6161X_1 - 0.4424X_2 - 0.5928X_3 + 0.6492X_4 + 0.3589X_5, where X_1, …, X_5 are the predictors in the order listed above. This means that when X_1, …, X_5 are all 0, the predicted confidence level is -0.4626. For every one-unit increase in any one of X_1, X_2, X_3, X_4, or X_5, with the other predictors held constant, the predicted confidence level increases or decreases by the value of the corresponding slope.
P-value: We can also see from the p-value benchmark that every variable is a significant indicator of CONFIDENT.
Residual standard error: Model 2 has a residual standard error of 1.909 on 26 degrees of freedom.
Adjusted R-squared: Model 2 has an adjusted R^2 of 0.5343. This means that MLR model 2 accounts for 53.43% of variation in CONFIDENT.

mlr_mod2 <- lm(CONFIDENT ~ VIBRANT + TEAM.PLAYER + DETAIL.ORIENTED + METHODICAL + INTROVERT, data = personality)
summary(mlr_mod2)

MLR Model 2

MLR Model 3

In my third model, I am interested in studying how well VIBRANT, REMORSEFUL, TEAM.PLAYER, DETAIL.ORIENTED, METHODICAL, and SENSE.EXPERIENCED influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted model is Y = 0.1557 + 0.5518X_1 + 0.4473X_2 - 0.3793X_3 - 0.5419X_4 + 0.5888X_5 - 0.2446X_6, where X_1, …, X_6 are the predictors in the order listed above. This means that when X_1, …, X_6 are all 0, the predicted confidence level is 0.1557. For every one-unit increase in any one of X_1, X_2, X_3, X_4, X_5, or X_6, with the other predictors held constant, the predicted confidence level increases or decreases by the value of the corresponding slope.
P-value: We can also see from the p-value benchmark that every variable, aside from SENSE.EXPERIENCED, is a significant indicator of CONFIDENT.
Residual standard error: Model 3 has a residual standard error of 1.847 on 25 degrees of freedom.
Adjusted R-squared: Model 3 has an adjusted R^2 of 0.5644. This means that MLR model 3 accounts for 56.44% of variation in CONFIDENT.

mlr_mod3 <- lm(CONFIDENT ~ VIBRANT + REMORSEFUL + TEAM.PLAYER + DETAIL.ORIENTED + METHODICAL + SENSE.EXPERIENCED, 
               data = personality)
summary(mlr_mod3)

MLR Model 3

MLR Model 4

In my fourth model, I am interested in studying how well VIBRANT, REMORSEFUL, SELF.ASSERTING, TEAM.PLAYER, DETAIL.ORIENTED, METHODICAL, and SENSE.EXPERIENCED influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted model is Y = -2.2294 + 0.6608X_1 + 0.5224X_2 + 0.2650X_3 - 0.3738X_4 - 0.5009X_5 + 0.5918X_6 - 0.3186X_7, where X_1, …, X_7 are the predictors in the order listed above. This means that when X_1, …, X_7 are all 0, the predicted confidence level is -2.2294. For every one-unit increase in any one of X_1, X_2, X_3, X_4, X_5, X_6, or X_7, with the other predictors held constant, the predicted confidence level increases or decreases by the value of the corresponding slope.
P-value: We can also see from the p-value benchmark that every variable is a significant indicator of CONFIDENT.
Residual standard error: Model 4 has a residual standard error of 1.695 on 24 degrees of freedom.
Adjusted R-squared: Model 4 has an adjusted R^2 of 0.6329. This means that MLR model 4 accounts for 63.29% of variation in CONFIDENT.

mlr_mod4 <- lm(CONFIDENT ~ VIBRANT + REMORSEFUL + SELF.ASSERTING + TEAM.PLAYER + DETAIL.ORIENTED + METHODICAL + 
               SENSE.EXPERIENCED, data = personality)
summary(mlr_mod4)

MLR Model 4
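When one model contains all the predictors of another plus some extra ones, as MLR Model 4 does relative to MLR Model 3, an F-test via anova() is one additional way to judge whether the extra predictor earns its place. This is a sketch of that alternative check on invented data, not a step taken in the original analysis.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
    CONFIDENT      = c(2, 4, 5, 4, 7, 8, 7, 9, 6, 5),
    METHODICAL     = c(1, 2, 3, 4, 5, 6, 7, 8, 5, 3),
    SELF.ASSERTING = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

smaller <- lm(CONFIDENT ~ METHODICAL, data = toy)
larger  <- lm(CONFIDENT ~ METHODICAL + SELF.ASSERTING, data = toy)

# F-test: does the extra predictor reduce the residual sum of squares
# by more than chance alone would suggest?
comparison <- anova(smaller, larger)
print(comparison)

# The p-value of the F-test sits in the second row of the table
f_test_p <- comparison[["Pr(>F)"]][2]
```

A small p-value would favor the larger model, while a large one suggests the added predictor contributes little beyond what is already in the smaller model.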

MLR Model 5

In my fifth model, I am interested in studying how well LAID.BACK, REMORSEFUL, DETAIL.ORIENTED, AFFECTIONATE, GOAL.ORIENTED, NON.PHILOSOPHICAL, VULNERABLE, and SENTIMENTAL influence someone’s confidence. Running lm() on the specified variables and printing the summary, I can make the following observations:

Residuals: The distribution of residuals is centered around a median close to 0.
Coefficients: The equation of the fitted model is Y = 5.7576 - 0.3384X_1 + 0.5604X_2 - 0.8235X_3 + 0.6033X_4 + 0.5619X_5 - 0.6828X_6 - 0.3707X_7 - 0.2216X_8, where X_1, …, X_8 are the predictors in the order listed above. This means that when X_1, …, X_8 are all 0, the predicted confidence level is 5.7576. For every one-unit increase in any one of X_1, X_2, X_3, X_4, X_5, X_6, X_7, or X_8, with the other predictors held constant, the predicted confidence level increases or decreases by the value of the corresponding slope.
P-value: We can also see from the p-value benchmark that every variable, except for SENTIMENTAL, is a significant indicator of CONFIDENT.
Residual standard error: Model 5 has a residual standard error of 1.632 on 23 degrees of freedom.
Adjusted R-squared: Model 5 has an adjusted R^2 of 0.6597. This means that MLR model 5 accounts for 65.97% of variation in CONFIDENT.

mlr_mod5 <- lm(CONFIDENT ~ LAID.BACK + REMORSEFUL + DETAIL.ORIENTED + AFFECTIONATE + GOAL.ORIENTED + NON.PHILOSOPHICAL
               + VULNERABLE + SENTIMENTAL, data = personality)
summary(mlr_mod5)

MLR Model 5

Model Comparison pt. 2

Similar to the SLR Model Comparison, in identifying the best MLR model, we select the model that explains the most variation in the dependent variable, CONFIDENT, i.e. the model with the highest adjusted R^2 value. The adjusted R^2 values of the 5 MLR models are as follows:

cat("Adjusted R-squared of mlr_mod1:", summary(mlr_mod1)$adj.r.squared, "\nAdjusted R-squared of mlr_mod2:", 
    summary(mlr_mod2)$adj.r.squared, "\nAdjusted R-squared of mlr_mod3:", summary(mlr_mod3)$adj.r.squared, 
    "\nAdjusted R-squared of mlr_mod4:", summary(mlr_mod4)$adj.r.squared, "\nAdjusted R-squared of mlr_mod5:", 
    summary(mlr_mod5)$adj.r.squared)

Model comparison
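The adjusted R^2 values and residual standard errors can also be gathered into one table, which makes the side-by-side comparison easier to scan. A minimal sketch with three invented toy models (the names toy_mod1 to toy_mod3 and all values are hypothetical):

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
    CONFIDENT    = c(2, 4, 5, 4, 7, 8, 7, 9, 6, 5),
    METHODICAL   = c(1, 2, 3, 4, 5, 6, 7, 8, 5, 3),
    AFFECTIONATE = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3),
    INTROVERT    = c(9, 7, 8, 5, 6, 2, 4, 1, 3, 6)
)

models <- list(
    toy_mod1 = lm(CONFIDENT ~ METHODICAL, data = toy),
    toy_mod2 = lm(CONFIDENT ~ METHODICAL + AFFECTIONATE, data = toy),
    toy_mod3 = lm(CONFIDENT ~ METHODICAL + AFFECTIONATE + INTROVERT, data = toy)
)

# One row per model: adjusted R^2, residual standard error, residual df
comparison <- data.frame(
    adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
    resid_std_err = sapply(models, function(m) summary(m)$sigma),
    resid_df      = sapply(models, df.residual)
)
print(comparison)
```

Keeping the residual degrees of freedom in the table is a useful reminder of how few observations remain per estimated coefficient as predictors are added.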

From above, we can conclude that MLR Model 5 is the best model, with the highest adjusted R^2 value. Another observation from comparing the summary printouts is that the residual standard error decreased as the models improved. Typically, we want to minimize the residual standard error, so that is a very good sign. Now let’s compare MLR Model 5 with the best model we found from our SLR, SLR Model 2.

cat("Adjusted R-squared of mlr_mod5:", summary(mlr_mod5)$adj.r.squared, "\nAdjusted R-squared of slr_mod2:", 
    slr_mod2$adj.r.squared)

Model comparison

There is a substantial improvement in the adjusted R^2 value when using more than one variable to explain the variation in CONFIDENT. We can conclude that MLR Model 5 is indeed the best model out of all the models tested.

Summary

In deciding what kind of psychological and counseling services are necessary to help students build the confidence to achieve academically, socially, and personally, it is essential to listen and respond to student needs. Recently, the San Francisco County Office of Education conducted a study to analyze certain behaviors in college students. With this study, I was able to use the data to build a prediction model, with linear regression, that can identify which variables contribute to a student’s confidence.

Before I could build any model, however, it was essential that I first cleansed my data of any incomplete, inaccurate, or unreasonable values. I decided that any missing value or any value above the threshold of 10 would be replaced with the balanced value of 5. Because I could not tell if students intended to put a 10 or accidentally misplaced decimal points, I decided to use the value 5 so that it would not skew my data in any way. Additionally, I decided to replace all values below the threshold of 0 with 0. This seemed reasonable for any negative value. Although there weren’t any, this would still help cleanse any future survey data.

After data cleansing, I began to build my simple linear regression models. Using the variables SOCIALABLE, METHODICAL, SELFISH, GOAL.ORIENTED, and REVENGEFUL, I created 5 different SLR models to see how well each variable influenced a student’s confidence. I found that METHODICAL did the best at explaining the variation in our response variable, CONFIDENT, with an adjusted R^2 of 0.148. This value seemed fairly low, however, so I decided to add more variables to build a multiple linear regression that could possibly deliver a better adjusted R^2 value. After some testing, I found that the model with the variables LAID.BACK, REMORSEFUL, DETAIL.ORIENTED, AFFECTIONATE, GOAL.ORIENTED, NON.PHILOSOPHICAL, VULNERABLE, and SENTIMENTAL best predicts a student’s confidence, with an adjusted R^2 of 0.6597, a substantial improvement over our SLR model. These are the variables that best explain a student’s confidence in this survey, and they can help the San Francisco County Office of Education identify the kind of services they should provide to their students.

The source code is available here.
