python data-parsing data-wrangling data-cleaning project

Human Resources and Candidate Search

24 Jan 2021

The objective of this investigation is to write a Python script that can filter the most fit candidates for a Data Scientist role at Apple.

Objective

Human Resources is an essential department of any company and/or organization. HR works in many areas including office management, administrative functions, and most importantly, recruiting. One of the key responsibilities of HR is to attract, select, and recommend the most qualified candidates given any pool size. In doing so, HR needs to understand the needs and expectations of any given position so that they select candidates that meet those needs.

While recruitment sounds straightforward, it is not a simple task. Much research and consulting must be done to make sure the right candidates are being selected. Furthermore, HR may receive hundreds of applications to review. Done manually, this can take days, if not weeks. This is where a Python script can make things much simpler for HR. In this notebook, I will show how we can use a Python script to filter the most fit candidates for a Data Scientist role at Apple.

Data Sources

One of the first steps to every exploratory data analysis is to import libraries and to familiarize ourselves with our data sources. A general outline of our input sources is as follows:

15 text files
One .csv file called Candidates.csv
1 flat file (.txt)

1. 15 text files

From scratch, I created 15 text files, each containing the resume for one candidate. The process included selecting resume samples from the web, copying and pasting these resumes into a .doc file, and converting this .doc file into a .txt file. Within this candidate pool, I thought I’d throw my own resume in just for fun. As expected, these resumes contain, among other data, the skills of each candidate. A resume example can be seen below:

# Import libraries
import pandas as pd

# Set options for displaying pandas dataframes
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 30)

example = open("data/resumes/MarisolHernandezResume.txt", "r", encoding='utf-8')

# Print first 3 lines
for i in range(3):
    line = example.readline()
    print(line)
    
example.close()

Example resume

type(example)

Data type

2. One .csv file called Candidates.csv

Our second data source is a .csv file called Candidates.csv which contains a list of our 15 candidates, including the name of each candidate and the name of the text file containing their resume. Let’s see what this looks like:

candidates = pd.read_csv("data/Candidates.csv")
candidates

Candidates

From the print out above, we can see that some of our candidates are missing data under the Name column, as indicated by NaN. Let’s go ahead and address this with some data cleaning:

In this first step, we want to subset our rows where Name is missing.

missing_data = candidates[candidates["Name"].isnull()]
missing_data

Missing data

In this next step, we open the resume of each of these 4 candidates, extract their name (generally, the first two words of a resume are the candidate's first and last name), and save it as their Name entry in the candidates dataframe.

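for i in missing_data["File"]:
    # Read the resume and split it into words; the with block closes the file
    with open(f"data/resumes/{i}.txt", "r", encoding='utf-8') as myfile:
        content = myfile.read().split()
    # Assume the first two words are the candidate's first and last name
    candidates.loc[candidates["File"] == i, "Name"] = content[0] + " " + content[1]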

Now let’s check to see if this worked:

candidates

Candidates

Yes, it does indeed work.
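The name-extraction step relies on the assumption that a resume starts with the candidate's first and last name. As a minimal sketch of that assumption, the helper below (`name_from_resume` is a hypothetical name, not part of the original notebook) returns the first two words of a text and falls back to `None` when the text is too short, so an empty or malformed file does not raise an `IndexError`. The sample text is made up for illustration.

```python
def name_from_resume(text, n_words=2):
    """Return the first `n_words` whitespace-separated words, or None if too few."""
    words = text.split()
    if len(words) < n_words:
        return None
    return " ".join(words[:n_words])


# Hypothetical resume text used only for illustration
sample = "Jane Doe\n123 Example Street\nData Scientist"
print(name_from_resume(sample))  # Jane Doe
```

A resume that opens with a title, a middle name, or a header line would still give the wrong result, so the extracted names are worth checking by eye.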

3. 1 flat file (.txt)

Our last data source is a flat file in .txt format. This text file contains the job description, as well as a general list of preferred skills:

job = open("data/JobDescription.txt", "r")

# Print first 15 lines
for _ in range(15):
    line = job.readline()
    print(line)
    
job.close()

Job description

For the sake of space, I didn’t print out the entire text file, but if you open it within your own system you will see that a list of preferred skills follows.

Preferred Skills

Now that I’ve introduced our input files, my next objective is to write a Python script that reads through the job description and gathers the required skills. To begin, I create an empty list called skills to hold the required skills. I then open the job description, read the lines following the string "Skills", and add each skill to my empty list.

You may notice two things. First, I apply lower() to each skill so that every character is converted to lowercase. In the following section, you will see that I also apply lower() when reading each resume so that we don’t miss any skills due to uppercase / lowercase discrepancies. Second, I strip the newline character '\n' from each skill. We can now see the required skills stored in a list.

skills = []

with open("data/JobDescription.txt", "r") as file:
    for line in file:
        if "Skills" in line:
            # Every remaining line after the "Skills" heading is one skill
            for line in file:
                skills.append(line.lower().rstrip('\n'))
                
print(skills)

Preferred skills
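The skill-gathering logic described above can also be written as a small function that works on any iterable of lines. This is a sketch rather than the notebook's own code: `parse_skills` and the sample lines are hypothetical, and skipping blank lines and trimming surrounding whitespace are additions that the original loop does not perform.

```python
def parse_skills(lines, marker="Skills"):
    """Collect the non-empty lines that follow the first line containing `marker`."""
    skills = []
    found = False
    for line in lines:
        if found:
            skill = line.strip().lower()
            if skill:  # skip blank lines
                skills.append(skill)
        elif marker in line:
            found = True
    return skills


# Hypothetical job description used only for illustration
sample_lines = ["Data Scientist\n", "Preferred Skills:\n", "Python\n", "\n", "SQL \n"]
print(parse_skills(sample_lines))  # ['python', 'sql']
```

Because an open file is itself an iterable of lines, the same function could be called as `parse_skills(file)` inside a `with open(...)` block.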

Resume Filtering

In this next section, I want to write a Python script that loops through the File column of Candidates.csv and checks whether each candidate holds any of the skills listed above, so that we can identify the most fit candidates. In the script below, I create two empty lists. The first will hold the number of skills each candidate meets. The second will hold the skills each candidate meets. The outer for loop goes through the file names, opening each resume for us to read. The nested for loop then checks whether each skill is mentioned in the contents of the resume. If so, it increments a tally and appends the skill to a string that starts out empty. Once all skills have been checked, the tally is added to the num_skills list and the recorded skills are added to the skills_by_candidate list.

num_skills = []
skills_by_candidate = []

for file_name in candidates["File"]:
    # Read the resume in lowercase so that matching is case-insensitive
    with open(f"data/resumes/{file_name}.txt", "r", encoding='utf-8') as myfile:
        content = myfile.read().lower()
    
    x = 0
    empty = str()
    
    for skill in skills:
        if skill in content:
            x += 1
            # Capitalize the first letter of the skill for display
            empty = empty + skill[0].upper() + skill[1:] + ", "
    num_skills.append(x)
    skills_by_candidate.append(empty)
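One caveat of checking skills with the `in` operator is that it matches substrings, so a short skill such as "r" would be found inside almost any resume. The sketch below shows a whole-word alternative using regular expressions. It is a different technique from the one used in this notebook, and `match_skills` and the sample inputs are hypothetical.

```python
import re


def match_skills(resume_text, skills):
    """Return the skills that appear in `resume_text` as whole words or phrases."""
    text = resume_text.lower()
    matched = []
    for skill in skills:
        # (?<!\w) and (?!\w) reject matches that sit inside a longer word;
        # re.escape keeps characters such as "+" in "c++" literal
        pattern = r"(?<!\w)" + re.escape(skill.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            matched.append(skill)
    return matched


# Hypothetical inputs used only for illustration
resume = "Experienced in Python, SQL and data storytelling."
print(match_skills(resume, ["python", "r", "sql", "c++"]))  # ['python', 'sql']
```

With plain substring matching, "r" would have counted as a met skill here because it occurs inside "experienced".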

I then create a dictionary consisting of the candidate names, the number of skills met, and the skills, which is then transformed into an ordered dataframe based on the number of skills met. Printing the first five rows gives us our 5 most qualified candidates.

dictionary = {"Name" : candidates["Name"], "Num. Of Skills Met" : num_skills, "Skills" : skills_by_candidate}
dictionary = pd.DataFrame(dictionary)
dictionary = dictionary.sort_values(by="Num. Of Skills Met", ascending=False)
dictionary.head()

Dataframe

From the printout above, I immediately notice an error within the Skills column. There is a trailing ', ' following the last skill listed. Running the following script strips the ', '.

# Strip leading and trailing commas and spaces from every entry in the Skills column
dictionary["Skills"] = dictionary["Skills"].str.strip(', ')
dictionary.head()

Dataframe
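The trailing separator can also be avoided at the source by collecting the matched skills in a list and joining them once at the end. The following is a minimal sketch of that idea, assuming the matched skills are available as a list of lowercase strings; `format_skills` is a hypothetical helper name.

```python
def format_skills(matched):
    """Capitalize the first letter of each skill and join them with ', '."""
    return ", ".join(skill[:1].upper() + skill[1:] for skill in matched)


print(format_skills(["python", "machine learning", "sql"]))
# Python, Machine learning, Sql
```

Since `str.join` only places the separator between items, no clean-up step is needed afterwards, and an empty list simply yields an empty string.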

Top 5 Candidates

Because I am only interested in the top 5 candidates, I subset the first five rows:

top = dictionary[0:5]
top

Top 5 Candidates

I want to create a file called ToInterview.csv that contains a list of the 5 most fit candidates in order of priority, along with their skills. To do so, I open a .csv file for writing and, for each candidate, write the name followed by the skills.

top5 = open("data/ToInterview.csv", "w")

for i in range(0,5):
    top5.write(str(top.iloc[i,0] + ", " + top.iloc[i,2] + "\n"))
    
top5.close()
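Because the skills themselves are separated by commas, a spreadsheet or `pd.read_csv` would split each line above into more than two columns. One possible remedy, sketched below under the assumption that a two-column layout is wanted, is Python's built-in `csv` module, which quotes any field that contains a comma. The function name `write_top_candidates`, the header row, and the sample row are all hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def write_top_candidates(rows, handle):
    """Write (name, skills) pairs as CSV, quoting fields that contain commas."""
    writer = csv.writer(handle)
    writer.writerow(["Name", "Skills"])  # header row is an assumption
    for name, skills in rows:
        writer.writerow([name, skills])


# Hypothetical row; the buffer stands in for data/ToInterview.csv
buffer = io.StringIO()
write_top_candidates([("Jane Doe", "Python, Sql")], buffer)
print(buffer.getvalue())
```

To write to disk instead, the handle could come from `open("data/ToInterview.csv", "w", newline="", encoding="utf-8")`, and the rows from `top[["Name", "Skills"]].itertuples(index=False)`.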

Results

The objective of this investigation was to write a Python script that would identify the 5 most qualified candidates for a Data Scientist role at Apple. In doing so, I wrote a file, ToInterview.csv, that contains a list of the 5 candidates along with their skills. We can view the results by opening and reading the .csv file:

to_interview = open("data/ToInterview.csv", "r")

for line in to_interview:
    print(line)
    
to_interview.close()

Top 5 Candidates

The source code is available here.
