Human Resources and Candidate Search
The objective of this investigation is to write a Python script that can filter the most fit candidates for a Data Scientist role at Apple.
Objective
Human Resources is an essential department of any company and/or organization. HR works in many areas including office management, administrative functions, and most importantly, recruiting. One of the key responsibilities of HR is to attract, select, and recommend the most qualified candidates given any pool size. In doing so, HR needs to understand the needs and expectations of any given position so that they select candidates that meet those needs.
While recruitment sounds straightforward, it is not a simple task. Much research and consulting must be done to make sure the right candidates are being selected. Furthermore, HR may receive hundreds of applications that they must review. Done manually, this can take days, if not weeks. This is where a Python script can make things much simpler for HR. In this notebook, I will show how we can use a Python script to filter the most fit candidates for a Data Scientist role at Apple.
Data Sources
One of the first steps to every exploratory data analysis is to import libraries and to familiarize ourselves with our data sources. A general outline of our input sources is as follows:
- 15 text files
- One .csv file called Candidates.csv
- 1 flat file (.txt)
1. 15 text files
From scratch, I created 15 text files, each containing the resume for one candidate. The process included selecting resume samples from the web, copying and pasting these resumes into a .doc file, and converting this .doc file into a .txt file. Within this candidate pool, I thought I’d throw my own resume in just for fun. As expected, these resumes contain, among other data, the skills of the candidate. A resume example can be seen below:
# Import libraries
import pandas as pd
# Set options for displaying pandas dataframes
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 30)
example = open("data/resumes/MarisolHernandezResume.txt", "r", encoding='utf-8')
# Print first 3 lines
for i in range(3):
    line = example.readline()
    print(line)
example.close()
Example resume
type(example)
Data type
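A slightly more defensive way to preview a file is to let a `with` block close it automatically, even if reading fails partway through. The sketch below is only an illustration: `preview_lines` is a hypothetical helper name, and it is demonstrated on a throwaway temporary file with made-up contents rather than on an actual resume.

```python
import os
import tempfile
from itertools import islice


def preview_lines(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file without trailing newlines."""
    with open(path, "r", encoding=encoding) as handle:
        # islice stops after n lines, or earlier if the file is shorter
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demonstration on a temporary file (the contents are made up)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Jane Doe\nData Scientist\nPython, SQL\nExtra line\n")
    sample_path = tmp.name

preview = preview_lines(sample_path, n=3)
print(preview)
os.remove(sample_path)
```

Returning a list instead of printing inside the helper makes the preview easy to reuse for both the resumes and the job description.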
2. One .csv file called Candidates.csv
Our second data source is a .csv file called Candidates.csv which contains a list of our 15 candidates, including the name of each candidate and the name of the text file containing their resume. Let’s see what this looks like:
candidates = pd.read_csv("data/Candidates.csv")
candidates
Candidates
From the printout above, we can see that some of our candidates are missing data under the Name column, as indicated by NaN. Let’s go ahead and address this with some data cleaning:
In this first step, we want to subset our rows where Name is missing.
missing_data = candidates[candidates["Name"].isnull()]
missing_data
Missing data
In this next step, we open the resume of each of these 4 candidates, extract their name (generally, the first two words of a resume are the candidate’s first and last name), and save it as their Name entry in the candidates dataframe loaded from Candidates.csv.
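for i in missing_data["File"]:
    # Read the whole resume and split it into whitespace-separated words
    with open(f"data/resumes/{i}.txt", "r", encoding='utf-8') as myfile:
        content = myfile.read().split()
    # The first two words are assumed to be the first and last name
    candidates.loc[candidates["File"] == i, "Name"] = content[0] + " " + content[1]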
Now let’s check to see if this worked:
candidates
Candidates
Yes, it does indeed work.
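The "first two words" rule is a heuristic, so it can help to isolate it in a small function that fails gracefully on very short or empty files. The following is a minimal sketch under that assumption; `extract_name` is a hypothetical helper, and the sample text is made up.

```python
def extract_name(resume_text):
    """Guess a candidate's name from the first two words of a resume.

    Returns None when the text contains fewer than two words.
    """
    words = resume_text.split()
    if len(words) < 2:
        return None
    return f"{words[0]} {words[1]}"


sample_resume = "Jane Doe\nData Scientist\nSan Francisco, CA"  # made-up text
print(extract_name(sample_resume))
print(extract_name(""))
```

Names with more than two parts, or resumes that begin with a title line, would still need manual review.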
3. 1 flat file (.txt)
Our last data source is one flat file in .txt format. This text file contains the job description, as well as a general list of preferred skills:
job = open("data/JobDescription.txt", "r")
# Print first 15 lines
for _ in range(15):
    line = job.readline()
    print(line)
job.close()
Job description
For the sake of space, I didn’t print out the entire text file, but if you open it within your own system you will see that a list of preferred skills follows.
Preferred Skills
Now that I’ve introduced our input files, my next objective is to write a Python script that reads through the job description and gathers the required skills. To begin, I create an empty list called skills to hold the required skills. I then open the job description, read the lines following the string "Skills", and add each skill to my empty list.
You may notice two things. First, I apply lower() to each skill so that it converts each character to lowercase. In the following section, you will see that I also apply lower() when reading each resume so that we don’t miss any skills due to uppercase / lowercase discrepancies. Second, I also had to strip the newline character '\n' from each skill. We can now see the required skills stored in a list.
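skills = []
with open("data/JobDescription.txt", "r") as file:
    for line in file:
        if "Skills" in line:
            # Every remaining line after the "Skills" heading is one skill
            for line in file:
                skills.append(line.lower().rstrip('\n'))
# No explicit close() is needed: the with block closes the file
print(skills)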
Preferred skills
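One caveat with reading every line after the "Skills" marker is that a blank line would be stored as an empty string, and an empty string is a substring of every resume, so it would count as a matched skill for everyone. A minimal sketch of a parser that skips blank lines is shown below; `parse_skills` and the sample lines are hypothetical and stand in for the real job description.

```python
def parse_skills(lines, marker="Skills"):
    """Collect lowercase skills from the lines that follow the marker line."""
    skills = []
    found_marker = False
    for line in lines:
        if not found_marker:
            # Keep scanning until the line containing the marker is seen
            found_marker = marker in line
            continue
        skill = line.strip().lower()
        if skill:  # skip blank lines
            skills.append(skill)
    return skills


# Made-up job description lines for demonstration
sample_description = [
    "Data Scientist\n",
    "Preferred Skills\n",
    "Python\n",
    "\n",
    "Machine Learning\n",
]
print(parse_skills(sample_description))
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.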
Resume Filtering
In this next section, I want to write a Python script that loops through Candidates.csv and checks whether each candidate holds any of the skills listed above so that we can identify the most fit candidates. In the script below, I create two empty lists. The first will hold the number of skills each candidate meets. The second will hold the skills each candidate meets. The outer for loop iterates over the File column of Candidates.csv, opening each resume for us to read. The nested for loop then checks whether each skill is mentioned in the contents of the resume. If so, it increments the tally and appends the skill to an initially empty string. Once the tally and the skills are recorded, the tally is added to the num_skills list and the recorded skills are added to the skills_by_candidate list.
num_skills = []
skills_by_candidate = []
for i in candidates["File"]:
    myfile = open(f"data/resumes/{i}.txt", "r", encoding='utf-8')
    content = myfile.read().lower()
    x = 0
    empty = str()
    # Use a separate loop variable so the outer file name i is not overwritten
    for skill in skills:
        if skill in content:
            x += 1
            empty = empty + skill[0].upper() + skill[1:] + ", "
    num_skills.append(x)
    skills_by_candidate.append(empty)
    myfile.close()
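The `in` check above is a plain substring test, so a short skill such as "r" would match any resume containing the letter r. One possible refinement, sketched here as an assumption rather than as part of the original approach, is to require that a skill appear as a whole word or phrase. `find_skills` and the sample text are hypothetical.

```python
import re


def find_skills(resume_text, skills):
    """Return the skills that occur in resume_text as whole words or phrases."""
    text = resume_text.lower()
    matched = []
    for skill in skills:
        # (?<!\w) and (?!\w) stop a match from starting or ending inside a longer word
        pattern = r"(?<!\w)" + re.escape(skill.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            matched.append(skill)
    return matched


sample_text = "Built dashboards in Tableau and automated reporting with Python."
print(find_skills(sample_text, ["python", "r", "sql", "tableau"]))
```

Here `re.escape` keeps skills that contain punctuation, such as "c++", from being interpreted as regular expression syntax.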
I then create a dictionary consisting of the candidate names, the number of skills met, and the skills, which is then transformed into an ordered dataframe based on the number of skills met. Printing the first five rows gives us our 5 most qualified candidates.
dictionary = {"Name" : candidates["Name"], "Num. Of Skills Met" : num_skills, "Skills" : skills_by_candidate}
dictionary = pd.DataFrame(dictionary)
dictionary = dictionary.sort_values(by="Num. Of Skills Met", ascending=False)
dictionary.head()
Dataframe
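For readers who want to see the ranking step without pandas, a plain-Python analogue is sketched below. It is not the method used in this notebook: `rank_candidates` is a hypothetical helper and the names and counts are made up. One property worth noting is that Python's `sorted` is stable, so candidates with equal counts keep their original order.

```python
def rank_candidates(names, skill_counts, top_n=5):
    """Return the top_n (name, count) pairs, highest count first."""
    paired = list(zip(names, skill_counts))
    # sorted() is stable, so candidates with equal counts keep their input order
    ranked = sorted(paired, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Made-up names and counts for demonstration
sample_names = ["Ana", "Ben", "Cho", "Dev"]
sample_counts = [3, 5, 3, 1]
print(rank_candidates(sample_names, sample_counts, top_n=3))
```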
From the printout above, I immediately notice an error within the Skills column. There seems to be a trailing ‘, ‘ following the last skill listed. Running the following script strips the ‘, ‘.
for i in range(len(dictionary)):
    dictionary.iloc[i,2] = dictionary.iloc[i,2].strip(', ')
dictionary.head()
Dataframe
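The trailing separator can also be avoided at the source by collecting the matched skills in a list and joining them once at the end, since `str.join` only places the separator between items. A minimal sketch, with `format_skills` as a hypothetical helper name:

```python
def format_skills(matched_skills):
    """Capitalize the first letter of each skill and join them with ', '."""
    # skill[:1] is safe even for an empty string, unlike skill[0]
    return ", ".join(skill[:1].upper() + skill[1:] for skill in matched_skills)


print(format_skills(["python", "machine learning", "sql"]))
print(repr(format_skills([])))
```

With this approach a candidate who matches no skills gets an empty string, and no cleanup pass over the dataframe is needed.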
Top 5 Candidates
Because I am only interested in the top 5 candidates, I subset the first five rows:
top = dictionary[0:5]
top
Top 5 Candidates
I want to create a text file called ToInterview.csv that contains a list of the 5 most fit candidates in order of priority, along with their skills. To do so, I open a .csv file to write to and I add the candidate names, followed by the skills.
top5 = open("data/ToInterview.csv", "w")
for i in range(0,5):
    top5.write(str(top.iloc[i,0] + ", " + top.iloc[i,2] + "\n"))
top5.close()
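Because the skills themselves are separated by commas, a program reading ToInterview.csv cannot easily tell where the name ends and the skills begin. One option, shown here as a sketch rather than as the approach used above, is Python's built-in `csv` module, which quotes any field containing a comma. `write_interview_list` and the sample rows are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def write_interview_list(rows, handle):
    """Write (name, skills) pairs as two-column CSV rows."""
    writer = csv.writer(handle)
    for name, skills in rows:
        # Fields containing commas are quoted automatically
        writer.writerow([name, skills])


# Made-up candidates for demonstration
sample_rows = [("Jane Doe", "Python, Sql"), ("John Smith", "Tableau")]
buffer = io.StringIO()
write_interview_list(sample_rows, buffer)
print(buffer.getvalue())

# Reading the text back recovers exactly two fields per row
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed)
```

When writing to a real file with the `csv` module, the Python documentation recommends opening it with `newline=""` so that line endings are not translated twice.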
Results
The objective of this investigation was to write a Python script that would identify the 5 most qualified candidates for a Data Scientist role at Apple. In doing so, I wrote a text file, ToInterview.csv, that contains a list of the 5 candidates, along with their skills. We can view the results by opening and reading the .csv file:
to_interview = open("data/ToInterview.csv", "r")
for line in to_interview:
    print(line)
to_interview.close()
Top 5 Candidates
The source code is available here.