python data-parsing data-wrangling data-cleaning regex project

Syllabi Parsing

31 Jan 2021

The objective of this investigation is to write a Python script that can parse and extract important features from a collection of syllabi.

Objective

A syllabus is a fundamental resource students should not take for granted. A written contract between students and the instructor(s), the syllabus conveys the essential information about a class in one document, including class times, instructor contact information, due dates, class resources, and much more. If a student has any questions about class expectations, an instructor will most likely direct them to the syllabus.

It is very common for students to take multiple classes at once. In both semester and quarter systems, students are typically enrolled in 4-5 classes, which means 4-5 syllabi they must thoroughly familiarize themselves with. With multiple syllabi of various lengths, this can be somewhat overwhelming. But what if there were a way to parse these syllabi so that students collect only the most essential information? In this notebook, I will show how a Python script can parse these documents and extract the important features of each syllabus.

Getting started

To begin, I load some libraries necessary for this investigation. Then, I use os.listdir() to get a list of all files stored in the syllabi folder. My program will later loop through this list and read the contents of each file. Here each file represents one syllabus.

import os
from pdfminer.high_level import extract_text
import re
import numpy as np
import pandas as pd

# Get all file names from syllabi folder
files = os.listdir(r"syllabi/")
print(files)
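
Note that os.listdir() returns every entry in the folder, including hidden files and subfolders, and the later loop passes each one to a PDF reader. If the folder might hold anything besides PDFs, a filter along these lines could help. This is a minimal sketch; list_pdf_files is a hypothetical helper and is not part of my original script.

```python
import os

def list_pdf_files(folder):
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".pdf")                # keep .pdf and .PDF
        and os.path.isfile(os.path.join(folder, name))  # skip subfolders
    )
```

Calling list_pdf_files("syllabi") could then stand in for os.listdir(r"syllabi/") when the folder is not guaranteed to be clean.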


Feature Extraction

In this next section, I define a function for each feature extraction. To start, I first define a list for each of the 8 features that I will be extracting from each of the syllabi. The features I will be extracting are as follows:

Instructor(s) Name(s)
Instructor Emails
Phone Numbers
URLs
Dates
Lecture Times
Whether a Textbook is Required
Percentages

instructor_names, instructor_emails, phone_numbers, important_urls = [],[],[],[]
important_dates, important_times, requires_textbook, important_percentages = [],[],[],[]

I first define a function that extracts the instructor(s) name(s) from the syllabus. My regular expression searches for the term instructor or instructors, followed by up to five non-digit characters (to account for punctuation, spacing, and indentation), an optional abbreviated title, and then two words representing the first and last name. If the pattern is found, the function appends the first matching name to my instructor_names list; otherwise it appends NaN.

# Find instructor(s) name(s)
def my_function(content):
    my_pattern = re.compile(r"(?:instructors|instructor)\D{0,5}(?:[a-z]\.|)(?:\s|)[a-z]*\s[a-z]*")
    name_exists = my_pattern.search(content.lower())
    
    if name_exists:
        name = my_pattern.findall(content.lower())
        name = name[0].split()[1:]
        full_name = ' '.join(name).title()
        instructor_names.append(full_name)
    else:
        instructor_names.append(np.nan)
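
To see what this pattern accepts, it helps to run it on a couple of short strings. The sketch below reuses the same regular expression but returns the name instead of appending to a list; extract_instructor_name and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the function above, written as a raw string
NAME_PATTERN = re.compile(
    r"(?:instructors|instructor)\D{0,5}(?:[a-z]\.|)(?:\s|)[a-z]*\s[a-z]*"
)

def extract_instructor_name(content):
    """Return the first instructor-like match in title case, or None."""
    match = NAME_PATTERN.search(content.lower())
    if match is None:
        return None
    # Drop the leading "instructor:" token and keep the remaining words
    return " ".join(match.group(0).split()[1:]).title()

print(extract_instructor_name("Instructor: Jane Doe"))             # Jane Doe
print(extract_instructor_name("The instructor will post grades"))  # Will Post Grades
```

The second call shows the main weakness: the pattern has no way to tell a name from ordinary words that happen to follow "instructor", so a sentence in the body of a syllabus can produce a false positive.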

My second function extracts important emails from the syllabus, belonging to the instructor and/or TAs. My regular expression searches for a sequence of non-whitespace characters, followed by @, followed by another sequence of non-whitespace characters. If any matches are found, the function appends the email(s) to my instructor_emails list; otherwise it appends NaN.

# Find important emails
def my_function2(content):
    my_pattern2 = re.compile(r"\S+@\S+")
    email_exists = my_pattern2.search(content)
    
    if email_exists:
        email = my_pattern2.findall(content)
        all_emails = ', '.join(email)
        instructor_emails.append(all_emails)
    else:
        instructor_emails.append(np.nan)
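
Because \S+ matches any non-whitespace character, this pattern also picks up punctuation that touches an address, such as a trailing comma or a closing parenthesis. One possible cleanup is to strip that punctuation from each match. This is only a sketch: extract_emails is a hypothetical helper and the addresses are placeholders.

```python
import re

EMAIL_PATTERN = re.compile(r"\S+@\S+")

def extract_emails(content):
    """Return the email-like tokens in `content` without surrounding punctuation."""
    return [token.strip(".,;:()<>[]") for token in EMAIL_PATTERN.findall(content)]

print(extract_emails("Email: jdoe@example.edu, TA: (ta@example.edu)."))
# ['jdoe@example.edu', 'ta@example.edu']
```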

My third function extracts phone numbers from the syllabus. My regular expression searches for phone numbers in any of the following formats:

(000) - 000 - 0000 or (000)-000-0000
(000) . 000 . 0000 or (000).000.0000
000 - 000 - 0000 or 000-000-0000
000 . 000 . 0000 or 000.000.0000

If any matches are found, the function appends the phone number(s) to my phone_numbers list; otherwise it appends NaN.

# Find phone numbers
def my_function3(content):
    my_pattern3 = re.compile(r"(?:\(|)\d{3}(?:\)|)(?:\s|)(?:-|\.)(?:\s|)\d{3}(?:\s|)(?:-|\.)(?:\s|)\d{4}")
    phone_exists = my_pattern3.search(content)
    
    if phone_exists:
        phone = my_pattern3.findall(content)
        all_phones = ', '.join(phone)
        phone_numbers.append(all_phones)
    else:
        phone_numbers.append(np.nan)
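
A quick check of the pattern against sample numbers makes its coverage concrete. The sketch below uses the same regular expression in a hypothetical extract_phone_numbers helper; the phone numbers are invented.

```python
import re

PHONE_PATTERN = re.compile(
    r"(?:\(|)\d{3}(?:\)|)(?:\s|)(?:-|\.)(?:\s|)\d{3}(?:\s|)(?:-|\.)(?:\s|)\d{4}"
)

def extract_phone_numbers(content):
    """Return every phone number in `content` that fits the listed formats."""
    return PHONE_PATTERN.findall(content)

print(extract_phone_numbers("Call (858) - 555 - 0123 or 858.555.0123"))
# ['(858) - 555 - 0123', '858.555.0123']
print(extract_phone_numbers("(858) 555-0123"))
# []
```

The second call returns nothing because the pattern requires a hyphen or period after the area code, so the common "(000) 000-0000" layout is not covered.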

My fourth function extracts URLs from the syllabus, which typically point to the class website or additional resources. My regular expression searches for URLs beginning with either http:// or https://. If any matches are found, the function appends the URL(s) to my important_urls list; otherwise it appends NaN.

# Find URLs
def my_function4(content):
    my_pattern4 = re.compile(r"(http|https)(://)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?")
    urls_exist = my_pattern4.search(content)
    
    if urls_exist:
        url = my_pattern4.findall(content)
        full_urls = [''.join(tuples) for tuples in url] 
        all_urls = ', '.join(full_urls)
        important_urls.append(all_urls)
    else:
        important_urls.append(np.nan)
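
The URL pattern contains capturing groups, so findall() returns one tuple of pieces per URL, which is why the function joins each tuple back together. An alternative is to iterate over match objects and take the whole match with group(0). The sketch below shows that variant; extract_urls is a hypothetical name and the URLs are placeholders.

```python
import re

URL_PATTERN = re.compile(
    r"(http|https)(://)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

def extract_urls(content):
    """Return each full URL matched in `content`."""
    # group(0) is the entire match, so no tuple joining is needed
    return [match.group(0) for match in URL_PATTERN.finditer(content)]

print(extract_urls("See https://example.edu/courses/cs101 and http://example.org."))
# ['https://example.edu/courses/cs101', 'http://example.org']
```

The last character class in the pattern excludes periods and commas, which is what keeps the sentence-ending period out of the second URL.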

My fifth function extracts dates from the syllabus. My regular expression searches for dates in any of the following formats:

\d{1,2}/\d{1,2}/\d{2,4} (e.g. 00/0/0000)
\d{1,2}-\d{1,2}-\d{2,4} (e.g. 00-00-00)
\d{1,2}\.\d{1,2}\.\d{2,4} (e.g. 0.0.0000)
\d{1,2}/\d{1,4} (e.g. 0/0000)
\d{1,2}-\d{1,4} (e.g. 00-00)
\d{1,2}\.\d{1,4} (e.g. 0.0)

If any matches are found, the function appends the date(s) to my important_dates list; otherwise it appends NaN.

# Find dates
def my_function5(content):
    # Three-part dates are tried first, then the shorter two-part form
    my_pattern5 = re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{1,2}[/.-]\d{1,4}")
    dates_exist = my_pattern5.search(content)
    
    if dates_exist:
        dates = my_pattern5.findall(content)
        all_dates = ', '.join(dates)
        important_dates.append(all_dates)
    else:
        important_dates.append(np.nan)
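
The six formats listed above can be folded into one pattern with two alternatives. Inside a character class, | is a literal character rather than alternation, so the separators are listed directly as [/.-]. The following is a sketch of one way to write it, with a hypothetical extract_dates helper and invented sample text.

```python
import re

# Three-part dates first, then the shorter two-part form
DATE_PATTERN = re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{1,2}[/.-]\d{1,4}")

def extract_dates(content):
    """Return every date-like token in `content`."""
    return DATE_PATTERN.findall(content)

print(extract_dates("Midterm: 02/14/2021. Final: 3-15. Drop by 1.8.21"))
# ['02/14/2021', '3-15', '1.8.21']
print(extract_dates("a GPA of 3.5"))
# ['3.5']
```

The second call illustrates the cost of supporting the short formats: any pair of numbers joined by a period, slash, or hyphen looks like a date to this pattern.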

My sixth function extracts time intervals from the syllabus, typically representing lecture times. My regular expression searches for two clock times, each with an optional am or pm marker, separated by an optional dash (e.g. 10:00am-10:50am). If any matches are found, the function removes the whitespace from each interval and appends the time(s) to my important_times list; otherwise it appends NaN.

# Find times
def my_function6(content):
    my_pattern6 = re.compile(r"\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)(?:\s|)(?:–|-|)(?:\s|)\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)")
    times_exist = my_pattern6.search(content.lower())
    
    if times_exist:
        times = my_pattern6.findall(content.lower())
        for i in range(len(times)):
            times[i] = ''.join(times[i].split())
        all_times = ', '.join(times)
        important_times.append(all_times)
    else:
        important_times.append(np.nan)
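
Since the time pattern is long, a small test run is the easiest way to see what it captures. This sketch wraps the same regular expression in a hypothetical extract_time_ranges helper; the schedule text is invented.

```python
import re

TIME_PATTERN = re.compile(
    r"\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)(?:\s|)(?:–|-|)(?:\s|)"
    r"\d{1,2}:\d{2}(?:\s|)(?:am|pm|a\.m\.|p\.m\.|)"
)

def extract_time_ranges(content):
    """Return each time interval in `content`, lowercased and without whitespace."""
    matches = TIME_PATTERN.findall(content.lower())
    return ["".join(match.split()) for match in matches]

print(extract_time_ranges("MWF 10:00 AM - 10:50 AM, office hours 2:00-3:00pm"))
# ['10:00am-10:50am', '2:00-3:00pm']
print(extract_time_ranges("Exam at 9:30 am"))
# []
```

A single time such as "9:30 am" is skipped because the pattern always expects a start time and an end time.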

My seventh function checks whether a textbook is required for the class. Using a regular expression, I check whether the word textbook appears anywhere in the syllabus. If it does, I append Yes to my requires_textbook list; if it doesn’t, I append No. This is only a proxy: a syllabus stating that no textbook is required would still be marked Yes.

# Requires textbook?
def my_function7(content):
    my_pattern7 = re.compile("textbook")
    textbook_exists = my_pattern7.search(content.lower())
    
    if textbook_exists:
        requires_textbook.append("Yes")
    else:
        requires_textbook.append("No")

My eighth function extracts percentages from the syllabus, typically representing the grading rubric. If any matches are found, the function appends the percentage(s) to my important_percentages list; otherwise it appends NaN.

# Find all percentages
def my_function8(content):
    # Up to three digits, so that 100% is captured whole rather than as 00%
    my_pattern8 = re.compile(r"\d{1,3}%")
    percentage_exists = my_pattern8.search(content)
    
    if percentage_exists:
        percentage = my_pattern8.findall(content)
        all_percentages = ', '.join(percentage)
        important_percentages.append(all_percentages)
    else:
        important_percentages.append(np.nan)

Syllabi Parsing

Now that I have defined all eight functions in the previous section, I can run them on each of the syllabi stored in the syllabi folder. Using a for loop, my program reads the text of each syllabus and extracts the eight features.

for file_name in files:
    text = extract_text(os.path.join("syllabi", file_name))

    my_function(text)
    my_function2(text)
    my_function3(text)
    my_function4(text)
    my_function5(text)
    my_function6(text)
    my_function7(text)
    my_function8(text)
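
One thing to keep in mind with this design: the eight functions append to lists defined once at the top of the notebook, so running the loop a second time adds every syllabus again. An alternative is to build one dictionary per syllabus and return it. The sketch below shows the idea for two of the features only; parse_syllabus and PATTERNS are hypothetical names, and math.nan stands in for np.nan so the example needs only the standard library.

```python
import math
import re

# Patterns without capturing groups, so findall() returns plain strings
PATTERNS = {
    "Instructor Emails": re.compile(r"\S+@\S+"),
    "Percentages": re.compile(r"[0-9]{1,2}%"),
}

def parse_syllabus(text, patterns=PATTERNS):
    """Return one row for a syllabus: feature name -> joined matches, or NaN."""
    row = {}
    for feature, pattern in patterns.items():
        matches = pattern.findall(text)
        row[feature] = ", ".join(matches) if matches else math.nan
    return row

print(parse_syllabus("Contact jdoe@example.edu Homework 30% Final 40%"))
# {'Instructor Emails': 'jdoe@example.edu', 'Percentages': '30%, 40%'}
```

A list of such rows, one per file, can be passed straight to pd.DataFrame(), and rerunning the cell rebuilds the rows from scratch instead of extending old ones.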

The for loop above has now populated the 8 lists I defined. Using these lists, I create a dictionary. I then transform this dictionary into a pandas dataframe as shown below. We can now see that the dataframe consists of the features we have extracted.

data = {"Instructor Names": instructor_names,
       "Instructor Emails": instructor_emails,
       "Phone Numbers": phone_numbers,
       "URLs": important_urls,
       "Significant Dates": important_dates,
       "Lecture Times": important_times,
       "Requires Textbook?": requires_textbook,
       "Percentages": important_percentages}

df = pd.DataFrame(data)
df


CSV File

The last step of this program is to save the dataframe as a CSV file containing the extracted features:

df.to_csv('output/features-retrieved-by-MarisolHernandez.csv', index=False)

We can verify by reading in the csv file:

pd.read_csv('output/features-retrieved-by-MarisolHernandez.csv')


Summary

In summary, regular expressions can be extremely useful for extracting essential information from files of many formats. In this notebook, I have shown how they can be used to extract features from a collection of syllabi. Though my program accounts for missing features, there is still room for improvement. Every syllabus is different, so it was quite difficult to write a universal regular expression for every feature. When a regular expression does not pick up a feature, the program records NaN instead, which covers both missing and undetected features. Overall, I am very happy with and proud of the work I have accomplished.

The source code is available here.
