Syllabi Parsing
The objective of this investigation is to write a Python script that can parse and extract important features from a collection of syllabi.
Objective
A syllabus is a fundamental resource students should not take for granted. A written contract between students and the instructor(s), the syllabus conveys the essential information about a class in one document, including class times, instructor contact information, due dates, class resources, and much more. If a student has any questions regarding the class expectations, an instructor will most likely direct them to the syllabus.
It is very common for students to take multiple classes at once. In both semester and quarter systems, students are typically enrolled in 4-5 classes. That means 4-5 syllabi that students must become thoroughly familiar with. With multiple syllabi of various lengths, this can be somewhat overwhelming. But what if there were a way to parse these syllabi so that students can collect only the most essential information? In this notebook, I will show you how we can use a Python script to parse these documents and extract important features from each syllabus.
Getting started
To begin, I load the libraries necessary for this investigation. Then, I use os.listdir() to get a list of all files stored in the syllabi folder. My program will later loop through this list and read the contents of each file. Here each file represents one syllabus.
import os
from pdfminer.high_level import extract_text
import re
import numpy as np
import pandas as pd
# Get all file names from syllabi folder
files = os.listdir(r"syllabi/")
print(files)
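One thing to watch for: os.listdir() returns every entry in the folder, which can include hidden files or subfolders that are not syllabi. Below is a minimal sketch of one way to keep only PDF files. The helper name list_pdf_files and the demo file names are my own assumptions for illustration and are not part of the original program.

```python
import os
import tempfile

def list_pdf_files(folder):
    """Hypothetical helper: return the sorted names of the PDF files in `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder, name))
    )

# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["stats101.pdf", "notes.txt", "BIO200.PDF"]:
        open(os.path.join(demo_folder, name), "w").close()
    demo_files = list_pdf_files(demo_folder)

print(demo_files)
```

Sorting the names also makes the row order of the final table reproducible, since os.listdir() does not guarantee any particular order.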
Feature Extraction
In this section, I define one function per feature. First, I create an empty list for each of the 8 features that I will extract from each syllabus. The features are as follows:
- Instructor(s) Name(s)
- Instructor Emails
- Phone Numbers
- URLs
- Dates
- Lecture Times
- Whether a Textbook is Required
- Percentages
instructor_names, instructor_emails, phone_numbers, important_urls = [],[],[],[]
important_dates, important_times, requires_textbook, important_percentages = [],[],[],[]
I first define a function that extracts the instructor name(s) from the syllabus. My regular expression searches for the term instructors or instructor, followed by 0-5 non-digit characters (to account for punctuation, spacing, and indentation), an optional title, and then two words representing the first and last name. Using the regular expression, I wrote an if-else statement that appends the instructor name to my instructor_names list if it exists, and appends NaN if it does not.
# Find instructor(s) name(s)
def my_function(content):
    # Raw string so that \D and \s reach the regex engine unchanged
    my_pattern = re.compile(r"(?:instructors|instructor)\D{0,5}(?:[a-z]\.|)(?:\s|)[a-z]*\s[a-z]*")
    name_exists = my_pattern.search(content.lower())
    if name_exists:
        # Drop the leading "instructor" token and restore capitalization
        name = name_exists.group().split()[1:]
        full_name = ' '.join(name).title()
        instructor_names.append(full_name)
    else:
        instructor_names.append(np.nan)
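To see how the pattern described above behaves, here is a small sketch that applies it to a few made-up lines. The helper find_instructor_name and the sample strings are assumptions for illustration; the helper returns None instead of appending NaN to a list so that it can be tried on its own.

```python
import re

# Same pattern as described above, written as a raw string
NAME_PATTERN = re.compile(
    r"(?:instructors|instructor)\D{0,5}(?:[a-z]\.|)(?:\s|)[a-z]*\s[a-z]*"
)

def find_instructor_name(content):
    """Hypothetical helper: return the matched name, or None when nothing matches."""
    match = NAME_PATTERN.search(content.lower())
    if match is None:
        return None
    # Drop the leading "instructor" token, then restore capitalization
    return " ".join(match.group().split()[1:]).title()

plain_name = find_instructor_name("Instructor: Jane Doe\nOffice: Room 12")
short_title = find_instructor_name("Instructor: Dr. Jane Doe")
long_title = find_instructor_name("Instructor: Professor Maria Lopez")
no_name = find_instructor_name("Course Syllabus")

print(plain_name)   # Jane Doe
print(short_title)  # Dr. Jane Doe
print(long_title)   # Professor Maria
print(no_name)      # None
```

The third call shows a limitation: a long title uses up one of the two word slots, so the match ends before the surname.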
My second function is designed to extract important emails from the syllabus, belonging to the instructor and/or TAs. My regular expression searches for a sequence of non-whitespace characters, followed by @, followed by another sequence of non-whitespace characters. Using the regular expression, I wrote an if-else statement that appends the email(s) to my instructor_emails list if any exist, and appends NaN if none do.
# Find important emails
def my_function2(content):
    # Raw string so that \S reaches the regex engine unchanged
    my_pattern2 = re.compile(r"\S+@\S+")
    email_exists = my_pattern2.search(content)
    if email_exists:
        # Collect every match and store them as one comma-separated string
        email = my_pattern2.findall(content)
        all_emails = ', '.join(email)
        instructor_emails.append(all_emails)
    else:
        instructor_emails.append(np.nan)
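Because the pattern accepts any non-whitespace characters around the @, punctuation that touches an address in running text is captured along with it. The sketch below shows this on a made-up sentence and strips the most common offenders afterwards. The helper find_emails, the sample text, and the set of characters to strip are my own assumptions.

```python
import re

EMAIL_PATTERN = re.compile(r"\S+@\S+")

def find_emails(content):
    """Hypothetical helper: return addresses without surrounding punctuation."""
    raw_matches = EMAIL_PATTERN.findall(content)
    # Strip punctuation that commonly clings to an address in running text
    return [match.strip(".,;:()<>[]") for match in raw_matches]

sample = "Contact: Jane Doe (jdoe@university.edu), TA: sam.lee@university.edu."
raw_emails = EMAIL_PATTERN.findall(sample)
clean_emails = find_emails(sample)

print(raw_emails)    # ['(jdoe@university.edu),', 'sam.lee@university.edu.']
print(clean_emails)  # ['jdoe@university.edu', 'sam.lee@university.edu']
```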
My third function is designed to extract phone numbers from the syllabus. My regular expression searches for phone numbers in the following formats:
- (000) - 000 - 0000 or (000)-000-0000
- (000) . 000 . 0000 or (000).000.0000
- 000 - 000 - 0000 or 000-000-0000
- 000 . 000 . 0000 or 000.000.0000
Using the regular expression, I wrote an if-else statement that appends the phone number(s) to my phone_numbers list if any exist, and appends NaN if none do.
# Find phone numbers
def my_function3(content):
    # Optional parentheses around the area code; "-" or "." separators with optional spaces
    my_pattern3 = re.compile(r"(?:\(|)\d{3}(?:\)|)(?:\s|)(?:-|\.)(?:\s|)\d{3}(?:\s|)(?:-|\.)(?:\s|)\d{4}")
    phone_exists = my_pattern3.search(content)
    if phone_exists:
        # Collect every match and store them as one comma-separated string
        phone = my_pattern3.findall(content)
        all_phones = ', '.join(phone)
        phone_numbers.append(all_phones)
    else:
        phone_numbers.append(np.nan)
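The phone pattern can be written more compactly with the ? quantifier and a character class. The sketch below compares the two spellings on made-up numbers; the compact form is my own suggestion and the sample numbers are placeholders. It also shows one common format that neither spelling covers.

```python
import re

ORIGINAL = re.compile(
    r"(?:\(|)\d{3}(?:\)|)(?:\s|)(?:-|\.)(?:\s|)\d{3}(?:\s|)(?:-|\.)(?:\s|)\d{4}"
)
# Suggested shorter spelling: "(?:x|)" becomes "x?" and "(?:-|\.)" becomes "[-.]"
COMPACT = re.compile(r"\(?\d{3}\)?\s?[-.]\s?\d{3}\s?[-.]\s?\d{4}")

matching = ["(555) - 123 - 4567", "(555)-123-4567", "555.123.4567", "555 - 123 - 4567"]
not_matching = "(555) 123-4567"  # no "-" or "." after the area code

for number in matching:
    print(repr(number), ORIGINAL.findall(number), COMPACT.findall(number))

print(repr(not_matching), ORIGINAL.findall(not_matching))
```

A number written as (555) 123-4567 is skipped because both patterns require a hyphen or period between the area code and the next three digits.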
My fourth function is designed to extract URLs from the syllabus, either directing to the class website or additional resources. My regular expression searches for URLs beginning with either http:// or https://. Using the regular expression, I wrote an if-else statement that appends the URL(s) to my important_urls list if any exist, and appends NaN if none do.
# Find URLs
def my_function4(content):
    # Groups: scheme, "://", host name, and an optional path/query
    my_pattern4 = re.compile(r"(http|https)(://)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?")
    urls_exist = my_pattern4.search(content)
    if urls_exist:
        # findall returns one tuple of groups per URL, so join each tuple back together
        url = my_pattern4.findall(content)
        full_urls = [''.join(tuples) for tuples in url]
        all_urls = ', '.join(full_urls)
        important_urls.append(all_urls)
    else:
        important_urls.append(np.nan)
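Since the URL pattern contains capturing groups, findall() returns a tuple of pieces for each URL, which is why the pieces are joined back together. An alternative sketch, using finditer() and group(0) to get each full match directly, is shown below. The helper find_urls and the example addresses are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(
    r"(http|https)(://)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

def find_urls(content):
    """Hypothetical helper: return each full URL via match.group(0)."""
    return [match.group(0) for match in URL_PATTERN.finditer(content)]

sample = (
    "Course site: https://canvas.example.edu/courses/42. "
    "Readings: http://example.org/readings?week=1"
)
urls = find_urls(sample)
joined_tuples = ["".join(parts) for parts in URL_PATTERN.findall(sample)]

print(urls)
print(urls == joined_tuples)  # both approaches give the same strings here
```

Note that the period ending the first sentence is not included in the first URL, because the pattern does not allow a URL to end in a period.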
My fifth function is designed to extract dates from the syllabus. My regular expression searches for dates in the following formats:
- \d{1,2}/\d{1,2}/\d{2,4} (e.g. 00/0/0000)
- \d{1,2}-\d{1,2}-\d{2,4} (e.g. 00-00-00)
- \d{1,2}\.\d{1,2}\.\d{2,4} (e.g. 0.0.0000)
- \d{1,2}/\d{1,4} (e.g. 0/0000)
- \d{1,2}-\d{1,4} (e.g. 00-00)
- \d{1,2}\.\d{1,4} (e.g. 0.0)
Using the regular expression, I wrote an if-else statement that appends the date(s) to my important_dates list if any exist, and appends NaN if none do.
# Find dates
def my_function5(content):
    # [/.-] matches one separator; "|" inside a character class would be a literal pipe
    my_pattern5 = re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{1,2}[/.-]\d{1,4}")
    dates_exist = my_pattern5.search(content)
    if dates_exist:
        # Collect every match and store them as one comma-separated string
        dates = my_pattern5.findall(content)
        all_dates = ', '.join(dates)
        important_dates.append(all_dates)
    else:
        important_dates.append(np.nan)
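The sketch below shows one way to encode the six date formats listed above in a single pattern and tries it on made-up text. The pattern spelling, the name DATE_PATTERN, and the sample sentences are my assumptions. The second example shows that a short numeric format such as 0.0 also matches ordinary decimals.

```python
import re

# Assumed encoding of the six listed formats: separators are /, - or .
DATE_PATTERN = re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{1,2}[/.-]\d{1,4}")

schedule = "Midterm: 10/14/2024. Final: 12-9-24. Drop deadline 9/30."
found_dates = DATE_PATTERN.findall(schedule)
print(found_dates)     # ['10/14/2024', '12-9-24', '9/30']

false_positive = DATE_PATTERN.findall("A GPA of 3.5 is required.")
print(false_positive)  # ['3.5']
```

The three-part alternative is listed first so that a full date is not cut short by the two-part alternative.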
My sixth function is designed to extract time intervals from the syllabus, typically representing lecture times. Using the regular expression, I wrote an if-else statement that appends the time(s) to my important_times list if any exist, and appends NaN if none do.
# Find times
def my_function6(content):
    # Two clock times with optional am/pm markers, separated by an optional dash
    my_pattern6 = re.compile(r"\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)(?:\s|)(?:–|-|)(?:\s|)\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)")
    times_exist = my_pattern6.search(content.lower())
    if times_exist:
        times = my_pattern6.findall(content.lower())
        # Remove all whitespace inside each match, e.g. "10:00 am - 10:50 am" -> "10:00am-10:50am"
        times = [''.join(time.split()) for time in times]
        all_times = ', '.join(times)
        important_times.append(all_times)
    else:
        important_times.append(np.nan)
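Here is a short sketch of what the time pattern picks up, using made-up schedule text. The helper find_time_ranges is an assumption for illustration and returns a list rather than appending to important_times.

```python
import re

TIME_RANGE = re.compile(
    r"\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)(?:\s|)(?:–|-|)(?:\s|)"
    r"\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)"
)

def find_time_ranges(content):
    """Hypothetical helper: return time ranges with inner whitespace removed."""
    matches = TIME_RANGE.findall(content.lower())
    return ["".join(match.split()) for match in matches]

lecture_times = find_time_ranges("Lecture: MWF 10:00 AM - 10:50 AM, Lab: Tu 2:10–4:00 p.m.")
single_time = find_time_ranges("Office hours start at 5:00 pm")

print(lecture_times)  # ['10:00am-10:50am', '2:10–4:00p.m.']
print(single_time)    # []
```

The pattern needs two clock times, so a single time such as a 5:00 pm deadline is not captured.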
My seventh function checks whether a textbook is required for the class. Using a regular expression, I check whether the pattern textbook exists in the syllabus. If it does, I append Yes to my requires_textbook list. If it doesn't, I append No.
# Requires textbook?
def my_function7(content):
    my_pattern7 = re.compile(r"textbook")
    # Case-insensitive check: any mention of "textbook" counts as required
    textbook_exists = my_pattern7.search(content.lower())
    if textbook_exists:
        requires_textbook.append("Yes")
    else:
        requires_textbook.append("No")
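Checking only for the word textbook means that a sentence such as "There is no required textbook" is also recorded as Yes. As a rough sketch of one possible refinement, the code below looks for a few negation cues before answering. The cue list, the helper name requires_textbook_guess, and the sample sentences are all my own assumptions, and real syllabi will phrase this in many other ways.

```python
import re

TEXTBOOK = re.compile(r"textbook")
# Hypothetical negation cues; far from exhaustive
NEGATED = re.compile(
    r"\bno\s+(?:required\s+)?textbook|textbook\s+is\s+(?:not|optional)"
)

def requires_textbook_guess(content):
    """Hypothetical helper: 'Yes' only if textbook is mentioned without a negation cue."""
    text = content.lower()
    if not TEXTBOOK.search(text):
        return "No"
    return "No" if NEGATED.search(text) else "Yes"

required = requires_textbook_guess("Required textbook: OpenIntro Statistics")
negated = requires_textbook_guess("There is no required textbook for this course.")
absent = requires_textbook_guess("Bring a laptop to every lecture.")

print(required, negated, absent)  # Yes No No
```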
My eighth function is designed to extract percentages from the syllabus, typically representing the grading rubric. Using the regular expression, I wrote an if-else statement that appends the percentage(s) to my important_percentages list if any exist, and appends NaN if none do.
# Find all percentages
def my_function8(content):
    # One or two digits followed by a percent sign
    my_pattern8 = re.compile(r"[0-9]{1,2}%")
    percentage_exists = my_pattern8.search(content)
    if percentage_exists:
        # Collect every match and store them as one comma-separated string
        percentage = my_pattern8.findall(content)
        all_percentages = ', '.join(percentage)
        important_percentages.append(all_percentages)
    else:
        important_percentages.append(np.nan)
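Because the percentage pattern allows at most two digits, values such as 100% or 12.5% are only partly captured. The sketch below compares it with a slightly wider pattern on a made-up grading line. The wider pattern and the sample text are my assumptions, not part of the original program.

```python
import re

TWO_DIGIT = re.compile(r"[0-9]{1,2}%")
# Suggested variant: up to three digits and an optional decimal part
WIDER = re.compile(r"\d{1,3}(?:\.\d+)?%")

grading = "Homework 20%, Midterm 30%, Final 50%. Labs count 12.5%; 100% attendance expected."
two_digit_matches = TWO_DIGIT.findall(grading)
wider_matches = WIDER.findall(grading)

print(two_digit_matches)  # ['20%', '30%', '50%', '5%', '00%']
print(wider_matches)      # ['20%', '30%', '50%', '12.5%', '100%']
```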
Syllabi Parsing
Now that I have defined all eight functions, I can run them on each of the syllabi stored in the syllabi folder. Using a for loop, my program reads the text of each syllabus and extracts the eight features.
for file_name in files:
    # Read the full text of one syllabus, then run every extractor on it
    text = extract_text(os.path.join("syllabi", file_name))
    my_function(text)
    my_function2(text)
    my_function3(text)
    my_function4(text)
    my_function5(text)
    my_function6(text)
    my_function7(text)
    my_function8(text)
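An alternative to appending to eight separate lists is to have one function return a complete row per syllabus, which keeps every value tied to its file. The sketch below illustrates the idea with only two of the features and with plain strings standing in for the text returned by extract_text(). The names PATTERNS and extract_features, the file names, and the sample texts are all assumptions, and None is used in place of NaN to keep the example free of dependencies.

```python
import re

# Hypothetical registry mapping a column name to a compiled pattern
PATTERNS = {
    "Instructor Emails": re.compile(r"\S+@\S+"),
    "Percentages": re.compile(r"[0-9]{1,2}%"),
}

def extract_features(text):
    """Return one row: each feature joined with ', ', or None when absent."""
    row = {}
    for column, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        row[column] = ", ".join(matches) if matches else None
    return row

# Stand-ins for extracted PDF text; real use would loop over the syllabi files
sample_texts = {
    "stats101.pdf": "Email: jdoe@university.edu Homework 40% Exams 60%",
    "art200.pdf": "No contact details listed.",
}
rows = [{"File": name, **extract_features(text)} for name, text in sample_texts.items()]

for row in rows:
    print(row)
```

A list of dictionaries like rows can be passed directly to pd.DataFrame(), and the File column records which syllabus each row came from.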
The for loop above has now populated the 8 lists I defined. Using these lists, I create a dictionary. I then transform this dictionary into a pandas dataframe as shown below. We can now see that the dataframe consists of the features we have extracted.
data = {"Instructor Names": instructor_names,
"Instructor Emails": instructor_emails,
"Phone Numbers": phone_numbers,
"URLs": important_urls,
"Significant Dates": important_dates,
"Lecture Times": important_times,
"Requires Textbook?": requires_textbook,
"Percentages": important_percentages}
df = pd.DataFrame(data)
df
CSV File
The last step of this program is to save the dataframe as a CSV file containing the extracted features:
df.to_csv('output/features-retrieved-by-MarisolHernandez.csv', index=False)
We can verify by reading in the csv file:
pd.read_csv('output/features-retrieved-by-MarisolHernandez.csv')
Summary
In summary, regular expressions can be extremely useful for extracting essential information from files in many formats. In this notebook, I have shown how they can be used to extract features from a collection of syllabi. Though my program accounts for missing features, there is still room for improvement. Every syllabus is different, so it was quite difficult to write a universal regular expression for every feature. However, as I mentioned, if my regular expression does not pick up on the selected feature, the program records the NaN value instead. I tried to account for any missing or undetected features. Overall, I am very happy and proud of the work I have accomplished.
The source code is available here.