S.F. Jobs With The Most Competitive Salary Growth
The objective of this investigation is to provide The San Francisco Chronicle with a report that depicts the main insights regarding the income in the city.
Objective
One of the goals of The San Francisco Controller’s Office is to promote financial and economic security, which means that they try to be proactive in ensuring that employees receive the appropriate amount of pay. They study and maintain a database of the salary and benefits paid to City employees in order to identify any significant trends. We can use the trends we see in income to compare salary growth amongst occupations, as well as identify the occupations that exhibit the highest or lowest salary growth.
The objective of this investigation is to provide The San Francisco Chronicle with a report that depicts the main insights regarding the income in the city (based on the data from The Controller’s Office) that would be particularly useful for any student or jobseeker looking for a secure occupation with a competitive salary growth rate. In doing so, I conducted my research to answer the following questions:
Questions of Investigation:
- Which jobs exhibited positive salary growth from 2015 to 2019?
- Which jobs exhibited negative salary growth from 2015 to 2019?
- What are the jobs with the highest salary growth rate from 2015 to 2019?
- What are the jobs with the lowest salary growth rate from 2015 to 2019?
- Which jobs are in the bottom 25th percentile based on salary growth rate? Upper 25th percentile? In between?
Libraries
The first step to every exploratory data analysis is to import all necessary libraries.
# install.packages("plyr")
# install.packages("formattable")
# install.packages("dplyr")
# install.packages("rlang")
library(plyr)
library(formattable)
About the Dataset
The San Francisco Controller’s Office maintains a database of the salary and benefits paid to City employees since fiscal year 2013.
Data Exploration
Read in the Data
Because Github has a limit on file size, I have to read the dataset from a zipped file using unz()
and viewing the first 6 rows.
Because the dimensions are so large, head()
could not display every column. I will retrieve the column names in the next few steps. Another thing I immediately notice is that there are some missing values that I would like to address in my data cleaning subsection.
# Open a connection to the zip file
unzip_conn <- unz("employee-compensation.csv.zip", "employee-compensation.csv")
# Read the CSV data from the zip archive
employee <- read.csv(unzip_conn)
# Close the connection
close.(unzip_conn)
head(employee)
The first 6 rows of the dataset
Because I am unable to view the full dimensions, I used nrow()
and ncol()
to check the dimensions. I now know that I am working with a 886,102 by 22 dataframe.
cat("There are", nrow(employee), "rows and", ncol(employee), "columns in this table.")
The dimensions of the dataset
Additionally, I used colnames()
to retrieve the column names. For the purpose of my investigation, I am only interested in the variables Year
, Job
, Salaries
, which I will subset in a later section.
colnames(employee)
The column names of the dataset
Data Cleaning
The next step to my analysis is to check for NULL values and empty strings within the data frame. To do so, I utilized the count()
function to provide a count for the number of NULL and empty strings within a particular column. The logic expression within the function checks to see if an input is NULL or an empty string.
count(employee["Job"] == "" | is.na(employee["Job"]))
Null values in the ‘Job’ column
Because I wanted to check for NULL values and empty strings in every column, I built a for
loop to do just that. The conditional expression within the loop returns the column name if it contains NULL values or empty strings.
for (col in colnames(employee)) {
x <- count(employee[col] == "" | is.na(employee[col]))
if (x[1,2] < 886102) {
print(col)
}
}
Columns with Null values
R has a useful function called na.omit()
that can be applied to an entire data frame to clean NULL values from data frames. Running it, however, I learned that my data frame does not contain any NULL values because the dimensions remained the same.
clean_employee <- na.omit(employee)
dim(clean_employee)
Dimensions of the dataset
Because Job
was the only one of my three variables of interest that contained empty strings, I decided to filter only that one column to remove those rows. I decided not to filter the remaining four columns that contained empty strings because they could potentially contain data that could be of use to my investigation. Filtering the dataset resulted in the removal of just 4 rows.
filtered_employee <- clean_employee[!(employee$Job==""),]
nrow(clean_employee) - nrow(filtered_employee)
dim(filtered_employee)
Dimensions of the dataset
Subsets of Data
Subset 1: Dataframe of Yearly Median Salary & Percent Change
Because I am observing occupational salary growth, the only relevant columns I need are Year
, Job
, and Salaries
. I subsetted my data accordingly and printed a summary to verify. In the summary, I noticed that Salaries
contains negative inputs that could be input errors.
# subset 1
subset <- filtered_employee[, c("Year", "Job", "Salaries")]
summary(subset)
Subset 1
I am only interested in observing occupational salary growth within full time positions so I subsetted my data to only contain rows where with salaries above $30,000.
fulltime <- subset[subset$Salaries > 30000,]
summary(fulltime)
Filtered Subset 1
I decided to look at the occupational salary growth between 2015 and 2019. Using a for
loop, I subsetted the fulltime
data frame based on the Year specified within the loop. The second command within the loop then creates another subset where the data is grouped Job
.
The third command within the loop then creates another subset that contains a column for Job
and another column for the median salary. I decided to use median salary as measurement for comparison because it is a measure of central tendency regarding compensation.
The forth command within the loop just renames the median salary income to specify the Year. The fifth command adds the subset to a list called dataframes
that will be used to merge all the data frames.
dataframes <- list()
x <- 1
for (i in 2015:2019) {
salary <- fulltime[fulltime$Year == i,]
by_job <- salary_2015 %>%
group_by(Job) %>%
summarise(Salary = median(Salaries, na.rm = TRUE))
colnames(by_job)[2] <- paste("Salary", i, sep=".")
dataframes[[x]] <- by_job
x <- x + 1
}
I assigned the first dataframe in dataframes
to the variable full
. Using a for
loop, I merged the remaining dataframes to the first data frame by Job
. This creates a general data frame that contains the median salary for every year between 2015 and 2019. Additionally, I created a column called Percent.Change
that contains the salary growth from 2015 to 2019.
full <- dataframes[[1]]
for (i in 2:length(dataframes)) {
full <- merge(full, dataframes[[i]], by="Job")
}
full$Percent.Change <- percent((full$Salary.2019 - full$Salary.2015) / full$Salary.2015)
head(full)
Yearly Median Salary and Percent Change
Subset 2: Positive Salary Growth
I am interested in occupations that exhibited a positive salary growth. To make this observation, I created another subset called positive
that contains data for the occupations that exhibited a positive salary growth.
positive <- full[full$Percent.Change > 0,]
positive <- head(positive, n=5)
positive
Positive Salary Growth
Subset 3: Negative Salary Growth
negative <- full[full$Percent.Change < 0,]
negative <- head(negative, n=5)
negative
Negative Salary Growth
Subset 4: Jobs with the Highest Salary Growth Rate
I am interested in the jobs with the highest salary growth from 2015 to 2019. To make this I ordered the full
data frame in a descending manner based on Percent.Change
. I then created another subset called fast_growth
that contains data for the top 5 jobs with the highest salary growth.
by_percent_change <- full[order(-full$Percent.Change),]
fast_growth <- head(by_percent_change, n=5)
fast_growth
Fast Salary Growth
Subset 5: Jobs with the Lowest Salary Growth Rate
I am interested in the jobs with the lowest salary growth from 2015 to 2019. I created another subset called slow_growth
that contains data for the top 5 jobs with the lowest salary growth.
slow_growth <- tail(by_percent_change, n=5)
slow_growth
Lowest Salary Growth
Single Vector
Top 5 Jobs with the Highest Salary Growth Rate
I am interested in the top 5 occupations that have the highest salary growth rate. To retrieve this information, I stored the Job
column from the data frame fast_growth
into a vector using as.vector()
. This information is particularly helpful for any jobseeker or student looking to find a secure job with the fastest-rising salary growth rate in the future.
fast_jobs <- as.vector(fast_growth[,"Job"])
fast_jobs
Highest Salary Growth
Top 5 Jobs with the Lowest Salary Growth Rate
I am interested in the top 5 occupations that have the lowest salary growth rate. To retrieve this information, I stored the Job
column from the data frame slow_growth
into a vector using as.vector()
. This information is particularly helpful because these are the jobs that any jobseeker or student may want to stray away from since their salary growth rate is very low.
slow_jobs <- as.vector(slow_growth[,"Job"])
slow_jobs
Lowest Salary Growth
5 different Lists
I am interested in dividing the Jobs from the full
data frame into 5 different lists based on their salary growth rate from 2015 to 2019 using the following criteria:
Criteria Used for Listing:
- Lower 25th Percentile
- Between 25th and 50th Percentile
- Between 50th and 75th Percentile
- Upper 25th Percentile
- Positive Salary Growth
To get a statistical summary—minimum, first quartile, median, mean, third quartile, and maximum—of Percent.Change
, I ran the summary()
function over the column.
summary(full$Percent.Change)
Summary
To begin my organization, I first made four different lists for the lower, middle.1 (between 25th and 50th percentile), middle.2 ((between 50th and 75th percentile), and upper quartiles. I then made a for
loop that will organize Jobs from the full
data frame into these five lists based on their salary growth rate from 2015 to 2019. Additionally, I created a function clean
to first filter the lists of any possible NULL values and then print the first 5 elements.
lower_quartile <- list()
middle.1 <- list()
middle.2 <- list()
upper_quartile <- list()
for (i in 1:nrow(full)){
if (full[i,"Percent.Change"] < 0.03561){
lower_quartile[[x]] <- full[i,"Job"]
} else if (full[i,"Percent.Change"] >= 0.03561 & full[i,7] < 0.12208) {
middle.1[[x]] <- full[i,"Job"]
} else if (full[i,"Percent.Change"] >= 0.12208 & full[i,7] < 0.14230){
middle.2[[x]] <- full[i,"Job"]
} else {
upper_quartile[[x]] <- full[i,"Job"]
}
x <- x + 1
}
clean <- function(list.name) {
filtered_list <- list.clean(list.name, fun = is.null, recursive = TRUE)
filtered_list[1:5]
}
Lower 25th Percentile
I am interested in the jobs that are in the lower 25th percentile based on their salary growth rate from 2015 to 2019. I used the clean()
function to filter the list and print the first 5 elements. This information is particularly useful because we can see the jobs that have the lowest salary growth rate.
clean(lower_quartile)
Lower Quartile
Between 25th and 50th Percentile
I am interested in the jobs that are in between the lower 25th and 50th percentile based on their salary growth rate from 2015 to 2019. I used the clean()
function to filter the list and print the first 5 elements.
clean(middle.1)
Middle Quartile 1
Between 50th and 75th Percentile
I am interested in the jobs that are in between the lower 50th and 75th percentile based on their salary growth rate from 2015 to 2019. I used the clean()
function to filter the list and print the first 5 elements.
clean(middle.2)
Middle Quartile 2
Upper 25th Percentile
I am interested in the jobs that are in the upper 25th percentile based on their salary growth rate from 2015 to 2019. I used the clean()
function to filter the list and print the first 5 elements. This information is particularly useful because we can see the jobs that have the highest salary growth rate.
clean(upper_quartile)
Upper Quartile
Summary
In comparing an occupation’s compensation tendencies to that of other occupations within the same market, I had to aggregate my data in such a manner that would allow me to observe the changes in median salary over the course of a few years. I decided to narrow my study so that I would only focus my study on the salary growth rate for full time employees from 2015 to 2019. In doing so, I created several subsets that contained the yearly median salary per occupation, as well as the salary growth rate. These subsets were built to be of particular use for any student or jobseeker looking for a secure occupation with a competitive salary growth rate.
In the positive
subset, for example, these students or jobseekers can learn that jobs like Account Clerk, Accountant II, Accountant III, etc., all exhibited a positive growth rate. A student or jobseeker may want to know which jobs exhibited negative salary growth so that they can stray clear from these occupations. In this case, they can see from the negative
subset that jobs like Admin Analyst 3, Administrative Services Mgr, Airport Noise Abatement Spec, etc., all exhibited a negative growth rate.
A student or jobseeker may also be interested in the occupations with the highest or lowest salary growth rate. To find the occupations that exhibited the highest salary growth rate, they can see from the fast_jobs
vector that these include Telecommunications Tech Supv, Sr Airport Noise Abatement Spe, Statistician, etc. To find the occupations that exhibited the lowest salary growth rate, they can see from the slow_jobs
vector that these include Pr Investigator, Tax Collector, Chief Surveyor, Sprv Adult Prob Ofc (SFERS), etc.
Recommendations
When observing changes in income over a period of time, we can see how an occupation’s compensation tendencies compare to other occupations within the same market. This is particularly helpful for any student or jobseeker that is looking for a secure occupation with a competitive salary growth rate. For these students or jobseekers, I would recommend them to look at the jobs with the highest salary growth rate which can be found in the fast_jobs
vector.
Additionally, I would recommend them to look at the jobs with salary growth rates in the upper 50th percentile. These can be found in the middle.2
and upper_quartile
lists. I would also recommend them to look at the jobs that only exhibited a positive salary growth rate.
Lastly, I would recommend them to stray away from jobs with the lowest salary growth rate or with a growth rate in the lower 50th percentile. These can be found in the slow_jobs
vector as well as the middle.1
and lower_quartile
lists.
The source code is available here.