r eda data-cleaning data-manipulation project

S.F. Jobs With The Most Competitive Salary Growth

12 Sep 2020

The objective of this investigation is to provide The San Francisco Chronicle with a report that depicts the main insights regarding the income in the city.

S.F. Jobs With The Most Competitive Salary Growth

Objective

One of the goals of The San Francisco Controller’s Office is to promote financial and economic security, which means that they try to be proactive in ensuring that employees receive the appropriate amount of pay. They study and maintain a database of the salary and benefits paid to City employees in order to identify any significant trends. We can use the trends we see in income to compare salary growth amongst occupations, as well as identify the occupations that exhibit the highest or lowest salary growth.

The objective of this investigation is to provide The San Francisco Chronicle with a report that depicts the main insights regarding the income in the city (based on the data from The Controller’s Office) that would be particularly useful for any student or jobseeker looking for a secure occupation with a competitive salary growth rate. In doing so, I conducted my research to answer the following questions:

Questions of Investigation:

Which jobs exhibited positive salary growth from 2015 to 2019?
Which jobs exhibited negative salary growth from 2015 to 2019?
What are the jobs with the highest salary growth rate from 2015 to 2019?
What are the jobs with the lowest salary growth rate from 2015 to 2019?
Which jobs are in the bottom 25th percentile based on salary growth rate? Upper 25th percentile? In between?

Libraries

The first step to every exploratory data analysis is to import all necessary libraries.

# install.packages("plyr")
# install.packages("formattable")
# install.packages("dplyr")
# install.packages("rlang")
library(plyr)
library(formattable)

About the Dataset

The San Francisco Controller’s Office maintains a database of the salary and benefits paid to City employees since fiscal year 2013.

Data Exploration

Read in the Data

Because Github has a limit on file size, I have to read the dataset from a zipped file using unz() and viewing the first 6 rows.

Because the dimensions are so large, head() could not display every column. I will retrieve the column names in the next few steps. Another thing I immediately notice is that there are some missing values that I would like to address in my data cleaning subsection.

# Open a connection to the zip file
unzip_conn <- unz("employee-compensation.csv.zip", "employee-compensation.csv")

# Read the CSV data from the zip archive
employee <- read.csv(unzip_conn)

# Close the connection
close.(unzip_conn)

head(employee)

The first 6 rows of the dataset

Because I am unable to view the full dimensions, I used nrow() and ncol() to check the dimensions. I now know that I am working with a 886,102 by 22 dataframe.

cat("There are", nrow(employee), "rows and", ncol(employee), "columns in this table.")

The dimensions of the dataset

Additionally, I used colnames() to retrieve the column names. For the purpose of my investigation, I am only interested in the variables Year, Job, Salaries, which I will subset in a later section.

colnames(employee)

The column names of the dataset

Data Cleaning

The next step to my analysis is to check for NULL values and empty strings within the data frame. To do so, I utilized the count() function to provide a count for the number of NULL and empty strings within a particular column. The logic expression within the function checks to see if an input is NULL or an empty string.

count(employee["Job"] == "" | is.na(employee["Job"]))

Null values in the ‘Job’ column

Because I wanted to check for NULL values and empty strings in every column, I built a for loop to do just that. The conditional expression within the loop returns the column name if it contains NULL values or empty strings.

for (col in colnames(employee)) {
    x <- count(employee[col] == "" | is.na(employee[col]))
    
    if (x[1,2] < 886102) {
        print(col)
    }
}

Columns with Null values

R has a useful function called na.omit() that can be applied to an entire data frame to clean NULL values from data frames. Running it, however, I learned that my data frame does not contain any NULL values because the dimensions remained the same.

clean_employee <- na.omit(employee) 
dim(clean_employee)

Dimensions of the dataset

Because Job was the only one of my three variables of interest that contained empty strings, I decided to filter only that one column to remove those rows. I decided not to filter the remaining four columns that contained empty strings because they could potentially contain data that could be of use to my investigation. Filtering the dataset resulted in the removal of just 4 rows.

filtered_employee <- clean_employee[!(employee$Job==""),]
nrow(clean_employee) - nrow(filtered_employee)
dim(filtered_employee)

Dimensions of the dataset

Subsets of Data

Subset 1: Dataframe of Yearly Median Salary & Percent Change

Because I am observing occupational salary growth, the only relevant columns I need are Year, Job, and Salaries. I subsetted my data accordingly and printed a summary to verify. In the summary, I noticed that Salaries contains negative inputs that could be input errors.

# subset 1
subset <- filtered_employee[, c("Year", "Job", "Salaries")]
summary(subset)

Subset 1

I am only interested in observing occupational salary growth within full time positions so I subsetted my data to only contain rows where with salaries above $30,000.

fulltime <- subset[subset$Salaries > 30000,]
summary(fulltime)

Filtered Subset 1

I decided to look at the occupational salary growth between 2015 and 2019. Using a for loop, I subsetted the fulltime data frame based on the Year specified within the loop. The second command within the loop then creates another subset where the data is grouped Job.

The third command within the loop then creates another subset that contains a column for Job and another column for the median salary. I decided to use median salary as measurement for comparison because it is a measure of central tendency regarding compensation.

The forth command within the loop just renames the median salary income to specify the Year. The fifth command adds the subset to a list called dataframes that will be used to merge all the data frames.

dataframes <- list()
x <- 1

for (i in 2015:2019) {
    salary <- fulltime[fulltime$Year == i,]
    
    by_job <- salary_2015 %>%
              group_by(Job) %>%
              summarise(Salary = median(Salaries, na.rm = TRUE))
    colnames(by_job)[2] <- paste("Salary", i, sep=".")
    
    dataframes[[x]] <- by_job
    x <- x + 1
}

I assigned the first dataframe in dataframes to the variable full. Using a for loop, I merged the remaining dataframes to the first data frame by Job. This creates a general data frame that contains the median salary for every year between 2015 and 2019. Additionally, I created a column called Percent.Change that contains the salary growth from 2015 to 2019.

full <- dataframes[[1]]

for (i in 2:length(dataframes)) {
    full <- merge(full, dataframes[[i]], by="Job")
}

full$Percent.Change <- percent((full$Salary.2019 - full$Salary.2015) / full$Salary.2015)
head(full)

Yearly Median Salary and Percent Change

Subset 2: Positive Salary Growth

I am interested in occupations that exhibited a positive salary growth. To make this observation, I created another subset called positive that contains data for the occupations that exhibited a positive salary growth.

positive <- full[full$Percent.Change > 0,]
positive <- head(positive, n=5)
positive

Positive Salary Growth

Subset 3: Negative Salary Growth

negative <- full[full$Percent.Change < 0,]
negative <- head(negative, n=5)
negative

Negative Salary Growth

Subset 4: Jobs with the Highest Salary Growth Rate

I am interested in the jobs with the highest salary growth from 2015 to 2019. To make this I ordered the full data frame in a descending manner based on Percent.Change. I then created another subset called fast_growth that contains data for the top 5 jobs with the highest salary growth.

by_percent_change <- full[order(-full$Percent.Change),]
fast_growth <- head(by_percent_change, n=5)
fast_growth

Fast Salary Growth

Subset 5: Jobs with the Lowest Salary Growth Rate

I am interested in the jobs with the lowest salary growth from 2015 to 2019. I created another subset called slow_growth that contains data for the top 5 jobs with the lowest salary growth.

slow_growth <- tail(by_percent_change, n=5)
slow_growth

Lowest Salary Growth

Single Vector

Top 5 Jobs with the Highest Salary Growth Rate

I am interested in the top 5 occupations that have the highest salary growth rate. To retrieve this information, I stored the Job column from the data frame fast_growth into a vector using as.vector(). This information is particularly helpful for any jobseeker or student looking to find a secure job with the fastest-rising salary growth rate in the future.

fast_jobs <- as.vector(fast_growth[,"Job"])
fast_jobs

Highest Salary Growth

Top 5 Jobs with the Lowest Salary Growth Rate

I am interested in the top 5 occupations that have the lowest salary growth rate. To retrieve this information, I stored the Job column from the data frame slow_growth into a vector using as.vector(). This information is particularly helpful because these are the jobs that any jobseeker or student may want to stray away from since their salary growth rate is very low.

slow_jobs <- as.vector(slow_growth[,"Job"])
slow_jobs

Lowest Salary Growth

5 different Lists

I am interested in dividing the Jobs from the full data frame into 5 different lists based on their salary growth rate from 2015 to 2019 using the following criteria:

Criteria Used for Listing:

Lower 25th Percentile
Between 25th and 50th Percentile
Between 50th and 75th Percentile
Upper 25th Percentile
Positive Salary Growth

To get a statistical summary—minimum, first quartile, median, mean, third quartile, and maximum—of Percent.Change, I ran the summary() function over the column.

summary(full$Percent.Change)

Summary Summary

To begin my organization, I first made four different lists for the lower, middle.1 (between 25th and 50th percentile), middle.2 ((between 50th and 75th percentile), and upper quartiles. I then made a for loop that will organize Jobs from the full data frame into these five lists based on their salary growth rate from 2015 to 2019. Additionally, I created a function clean to first filter the lists of any possible NULL values and then print the first 5 elements.

lower_quartile <- list()
middle.1 <- list()
middle.2 <- list()
upper_quartile <- list()

for (i in 1:nrow(full)){
    if (full[i,"Percent.Change"] < 0.03561){
        lower_quartile[[x]] <- full[i,"Job"]
    } else if (full[i,"Percent.Change"] >= 0.03561 & full[i,7] < 0.12208) {
        middle.1[[x]] <- full[i,"Job"]
    } else if (full[i,"Percent.Change"] >= 0.12208 & full[i,7] < 0.14230){
        middle.2[[x]] <- full[i,"Job"]
    } else  {
        upper_quartile[[x]] <- full[i,"Job"]
    }
    x <- x + 1
}


clean <- function(list.name) {
    filtered_list <- list.clean(list.name, fun = is.null, recursive = TRUE)
    filtered_list[1:5]
}

Lower 25th Percentile

I am interested in the jobs that are in the lower 25th percentile based on their salary growth rate from 2015 to 2019. I used the clean() function to filter the list and print the first 5 elements. This information is particularly useful because we can see the jobs that have the lowest salary growth rate.

clean(lower_quartile)

Lower Quartile

Between 25th and 50th Percentile

I am interested in the jobs that are in between the lower 25th and 50th percentile based on their salary growth rate from 2015 to 2019. I used the clean() function to filter the list and print the first 5 elements.

clean(middle.1)

Middle Quartile 1

Between 50th and 75th Percentile

I am interested in the jobs that are in between the lower 50th and 75th percentile based on their salary growth rate from 2015 to 2019. I used the clean() function to filter the list and print the first 5 elements.

clean(middle.2)

Middle Quartile 2

Upper 25th Percentile

I am interested in the jobs that are in the upper 25th percentile based on their salary growth rate from 2015 to 2019. I used the clean() function to filter the list and print the first 5 elements. This information is particularly useful because we can see the jobs that have the highest salary growth rate.

clean(upper_quartile)

Upper Quartile

Summary

In comparing an occupation’s compensation tendencies to that of other occupations within the same market, I had to aggregate my data in such a manner that would allow me to observe the changes in median salary over the course of a few years. I decided to narrow my study so that I would only focus my study on the salary growth rate for full time employees from 2015 to 2019. In doing so, I created several subsets that contained the yearly median salary per occupation, as well as the salary growth rate. These subsets were built to be of particular use for any student or jobseeker looking for a secure occupation with a competitive salary growth rate.

In the positive subset, for example, these students or jobseekers can learn that jobs like Account Clerk, Accountant II, Accountant III, etc., all exhibited a positive growth rate. A student or jobseeker may want to know which jobs exhibited negative salary growth so that they can stray clear from these occupations. In this case, they can see from the negative subset that jobs like Admin Analyst 3, Administrative Services Mgr, Airport Noise Abatement Spec, etc., all exhibited a negative growth rate.

A student or jobseeker may also be interested in the occupations with the highest or lowest salary growth rate. To find the occupations that exhibited the highest salary growth rate, they can see from the fast_jobs vector that these include Telecommunications Tech Supv, Sr Airport Noise Abatement Spe, Statistician, etc. To find the occupations that exhibited the lowest salary growth rate, they can see from the slow_jobs vector that these include Pr Investigator, Tax Collector, Chief Surveyor, Sprv Adult Prob Ofc (SFERS), etc.

Recommendations

When observing changes in income over a period of time, we can see how an occupation’s compensation tendencies compare to other occupations within the same market. This is particularly helpful for any student or jobseeker that is looking for a secure occupation with a competitive salary growth rate. For these students or jobseekers, I would recommend them to look at the jobs with the highest salary growth rate which can be found in the fast_jobs vector.

Additionally, I would recommend them to look at the jobs with salary growth rates in the upper 50th percentile. These can be found in the middle.2 and upper_quartile lists. I would also recommend them to look at the jobs that only exhibited a positive salary growth rate.

Lastly, I would recommend them to stray away from jobs with the lowest salary growth rate or with a growth rate in the lower 50th percentile. These can be found in the slow_jobs vector as well as the middle.1 and lower_quartile lists.

The source code is available here.

Share this article

S.F. Jobs With The Most Competitive Salary Growth

Objective

Libraries

About the Dataset

Data Exploration

Read in the Data

Data Cleaning

Subsets of Data

Subset 1: Dataframe of Yearly Median Salary & Percent Change

Subset 2: Positive Salary Growth

Subset 3: Negative Salary Growth

Subset 4: Jobs with the Highest Salary Growth Rate

Subset 5: Jobs with the Lowest Salary Growth Rate

Single Vector

Top 5 Jobs with the Highest Salary Growth Rate

Top 5 Jobs with the Lowest Salary Growth Rate

5 different Lists

Lower 25th Percentile

Between 25th and 50th Percentile

Between 50th and 75th Percentile

Upper 25th Percentile

Summary

Recommendations

Menu

Explore tags

S.F. Jobs With The Most Competitive Salary Growth

Objective

Libraries

About the Dataset

Data Exploration

Read in the Data

Data Cleaning

Subsets of Data

Subset 1: Dataframe of Yearly Median Salary & Percent Change

Subset 2: Positive Salary Growth

Subset 3: Negative Salary Growth

Subset 4: Jobs with the Highest Salary Growth Rate

Subset 5: Jobs with the Lowest Salary Growth Rate

Single Vector

Top 5 Jobs with the Highest Salary Growth Rate

Top 5 Jobs with the Lowest Salary Growth Rate

5 different Lists

Lower 25th Percentile

Between 25th and 50th Percentile

Between 50th and 75th Percentile

Upper 25th Percentile

Summary

Recommendations

You may also like

A Call to Action for Public Health Care

Syllabi Parsing

Human Resources and Candidate Search

Get interesting news

Explore tags