Team Kellogg's Next Olympic Athlete

Objective

There are many reasons why companies establish athlete sponsorships: to increase sales, develop new markets, broaden their customer base, launch new products, and so on. Sponsoring an athlete lets a company benefit from that athlete's influence, especially when their popularity is at its highest. An athlete sponsorship can be one way to appeal to a market of fans and, if successful, can multiply a company’s brand visibility.

The objective of this investigation is to select the right sport and athlete for Kellogg’s to sponsor for the next Olympic Games. In doing so, I conducted my research based on the following criteria:

Criteria Used:

  • Kellogg’s Worldwide Market Share
  • The Upcoming Olympic Games (Summer vs. Winter)
  • The Most Watched Sports Within the Targeted Market
  • The Top Olympic Gold Medalists Within the Aforementioned Sport

Libraries

The first step in any exploratory data analysis is to import the necessary libraries. The only library required for this project is plyr, which provides the count() function used in later sections.

# install.packages('plyr')
library(plyr)

About Dataset

The dataset, athlete_events.csv, is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016.

Data Exploration

Before making any recommendations, it is essential to assess the structure and summary of the data. To import and read the CSV file in R, I used the read.csv(...) function so that I could explore and analyze the data.

Because I did not yet know how much data I was working with, I used head() to print only the first six rows. From this, I can see that there are some missing values, shown as <NA>, under the Medal column. It is safe to assume that these athletes did not win a medal.

# Read CSV file and print first six rows
athletes <- read.csv(file = "athlete_events.csv")
head(athletes)

Figure: The first 6 rows of the dataset

Because I am only interested in athletes who earned a medal, I filtered the Medal column with !is.na(), i.e. the negation of is.na(). Doing so drops all the athletes who did not earn a medal, i.e. athletes with <NA> under the Medal column.

athletes <- athletes[!is.na(athletes$Medal),]
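
To make the effect of this filter concrete, here is a minimal sketch on a small made-up data frame. The toy rows and the names toy, toy.medalists, and n.dropped are invented for illustration and are not taken from athlete_events.csv. It tabulates the Medal column including missing values, applies the same !is.na() filter, and reports how many rows were dropped.

```r
# Toy data standing in for athlete_events.csv (rows invented for illustration)
toy <- data.frame(
  Name  = c("A", "B", "C", "D"),
  Medal = c("Gold", NA, "Bronze", NA),
  stringsAsFactors = FALSE
)

# Tabulate medals, counting missing values as their own category
print(table(toy$Medal, useNA = "ifany"))

# Keep only the rows where Medal is not missing
toy.medalists <- toy[!is.na(toy$Medal), ]
print(toy.medalists)

# Number of rows removed by the filter
n.dropped <- nrow(toy) - nrow(toy.medalists)
print(n.dropped)
```

Running the same table() call on the real Medal column before filtering would show how many rows the filter is about to remove.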

I still wanted to know how much data I was working with so I used dim() to check the dimensions of the data frame. In addition, I used colnames() to retrieve the column names.

I now know that I am working with a 39,783 by 15 data frame. At a glance, I already know that I will not need every column for my investigation, but I will address this in a later section.

dim(athletes)
colnames(athletes)

Figure: Dataset dimensions and column names

I was also curious about the types of variables I was working with, so I used str() to check them.

Here, I can see that this data frame contains data from 51 different Olympic Games. I can also see a discrepancy between the number of levels of Team and the number of levels of NOC. Some entries under Team may be inconsistent, so I will use NOC (the National Olympic Committee code) to subset my data according to my country of interest.

According to Kellogg’s 2019 Annual Report, the United States is Kellogg’s largest market. I will therefore focus my research on athletes who represented the United States in the 2016 Summer Olympic Games and placed first, i.e. won Gold, and subset my data accordingly in the following section.

str(athletes)

Figure: Dataset data types
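
One way to look more closely at the Team versus NOC discrepancy is to count how many distinct Team labels appear under each NOC code. The sketch below does this on invented rows. The Team values are assumptions made up for illustration, and teams.per.noc is a hypothetical name.

```r
# Toy rows (invented): one NOC code can appear under several Team labels
toy <- data.frame(
  Team = c("United States", "United States-1", "China", "China"),
  NOC  = c("USA", "USA", "CHN", "CHN"),
  stringsAsFactors = FALSE
)

# Compare the number of distinct values in each column
n.teams <- length(unique(toy$Team))
n.nocs  <- length(unique(toy$NOC))
print(c(teams = n.teams, nocs = n.nocs))

# Count the distinct Team labels that appear under each NOC code
teams.per.noc <- tapply(toy$Team, toy$NOC, function(x) length(unique(x)))
print(teams.per.noc)
```

Any NOC with more than one Team label is a case where subsetting on Team could silently miss rows, which is the reason for subsetting on NOC instead.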

One column selection

I am interested in the top 10 gold medalists from the United States at the 2016 Summer Olympic Games. To retrieve this information, I created a function medalists() which transforms this notebook into an interactive environment. Running this function will deliver three prompts that ask for the NOC, games, and medal. These are stored in variables country, games, and medal. These variables are then used to subset the data frame athletes according to the country, games, and medal specified.

Within the function, I also created an ordered data frame, a, with a count for each athlete. The last line of the function prints only the first 10 rows of a, which are the ten athletes with the most occurrences, i.e. the 10 athletes who earned the most medals of the specified type.

Because I am interested in finding the top 10 Olympic gold medalists from the 2016 Summer Olympic Games for the United States, I entered USA, 2016 Summer, and Gold at the prompts. The beauty of this function is that users can enter their desired country, games, and medal to retrieve the corresponding top 10 Olympic medalists.

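medalists <- function() {
    # Prompt for the three filters, e.g. USA, 2016 Summer, Gold
    country <- readline(prompt = "Enter NOC: ")
    games <- readline(prompt = "Enter games: ")
    medal <- readline(prompt = "Enter medal: ")
    
    # Keep only the rows that match all three filters
    country.athletes <- athletes[athletes$NOC == country & athletes$Games == games & athletes$Medal == medal,]
    
    # Count rows per athlete with plyr::count(), then sort from most to fewest
    a <- count(country.athletes, "Name")
    a <- a[order(-a$freq),]
    # Print the ten athletes with the most medals
    print(head(a, n=10))
}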

medalists()

Figure: medalists() function output
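
Because readline() only waits for input in an interactive session, a version that takes its filters as arguments can be easier to rerun in a script. The following is a minimal sketch of that idea and not the function used in this analysis. The name top.medalists and its parameters are hypothetical, it uses base R's table() in place of plyr's count(), and the toy rows are invented.

```r
# Hypothetical argument-based variant of medalists(); uses base R only
top.medalists <- function(data, country, games, medal, n = 10) {
  # Logical index of rows that match all three filters
  keep <- data$NOC == country & data$Games == games & data$Medal == medal
  keep[is.na(keep)] <- FALSE  # treat rows with missing values as non-matches

  # Count rows per athlete; columns are named Name and freq
  counts <- as.data.frame(table(Name = as.character(data$Name[keep])),
                          responseName = "freq", stringsAsFactors = FALSE)

  # Sort from most to fewest medals and return the first n rows
  counts <- counts[order(-counts$freq), ]
  head(counts, n = n)
}

# Toy rows (invented for illustration)
toy <- data.frame(
  Name  = c("Ann", "Ann", "Bea", "Cy", "Ann", "Bea"),
  NOC   = c("USA", "USA", "USA", "CHN", "USA", "USA"),
  Games = c("2016 Summer", "2016 Summer", "2016 Summer",
            "2016 Summer", "2012 Summer", "2016 Summer"),
  Medal = c("Gold", "Gold", "Gold", "Gold", "Gold", "Silver"),
  stringsAsFactors = FALSE
)

result <- top.medalists(toy, "USA", "2016 Summer", "Gold")
print(result)
```

Passing the filters as arguments also records the exact values used, which the prompt-based version leaves out of the notebook.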

Series

The only variables of value to my investigation are Name, NOC, Games, Sport, and Medal, so I created a function series() that subsets the data frame athletes to those five columns and provides a summary of each.

Just like the previous section, running this function will deliver three prompts that ask for the NOC, games, and medal. These are stored in variables country, games, and medal. These variables are then used to subset the data frame athletes according to the country, games, and medal specified.

This information is particularly helpful in finding the right sport and athlete Kellogg’s should sponsor because we can see which sports and athletes earned the most medals of interest. We can also verify that we are looking at data for our country, games, and medal of interest.

series <- function() {
    country <- readline(prompt = "Enter NOC: ")
    games <- readline(prompt = "Enter games: ")
    medal <- readline(prompt = "Enter medal: ")
    
    country.athletes <- athletes[athletes$NOC == country & athletes$Games == games & athletes$Medal == medal,
                                c("Name", "NOC", "Games", "Sport", "Medal")]
    
    summary(country.athletes)  
}

series()

Figure: series() function output
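
One caveat, stated as an assumption about the reader's setup rather than the setup used here: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so text columns arrive as character vectors. In that case summary() reports only the length and class of each column instead of per-value counts, and str() shows no factor levels. The sketch below, on invented rows, converts the columns to factors first so that summary() tabulates them.

```r
# Toy rows (invented) with character columns, as read.csv() returns in R >= 4.0.0
toy <- data.frame(
  Name  = c("Ann", "Ann", "Bea"),
  Sport = c("Swimming", "Swimming", "Gymnastics"),
  stringsAsFactors = FALSE
)

# Character columns: summary() shows Length, Class, and Mode only
print(summary(toy))

# Convert every column to a factor so that summary() counts each value
toy.factors <- as.data.frame(lapply(toy, factor))
print(summary(toy.factors))

# Counts for a single column as a named integer vector
sport.counts <- summary(toy.factors$Sport)
print(sport.counts)
```

An alternative is to pass stringsAsFactors = TRUE to read.csv() when loading the file.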

Plot

While the series() function provides a numeric summary of the top Olympic medalists, visualizing the data can help deliver the bigger picture. To plot the data, I created a function olympics(). Running this function will deliver four prompts that ask for the NOC, games, sport, and medal. These are stored in variables country, games, sport, and medal. These variables are then used to subset the data frame athletes according to the country, games, sport, and medal specified.

Within the function, I also created a data frame, x, with a count for each Olympic medalist to account for the number of medals of the specified type they earned. Lastly, I used the barplot() function to plot the data from x so that I can visually identify the top Olympic medalists.

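olympics <- function() {
    # Prompt for the four filters, e.g. USA, 2016 Summer, Gymnastics, Gold
    country <- readline(prompt = "Enter NOC: ")
    games <- readline(prompt = "Enter games: ")
    sport <- readline(prompt = "Enter sport: ")
    medal <- readline(prompt = "Enter medal: ")
    
    # Subset into a new local variable so the global athletes is not shadowed
    sport.athletes <- athletes[athletes$NOC == country & athletes$Games == games & 
                               athletes$Sport == sport & athletes$Medal == medal,]
    
    # Count medals per athlete with plyr::count()
    x <- count(sport.athletes, "Name")
    
    # Widen the left margin so long athlete names fit beside the bars
    par(mar=c(5,18,3,1))
    barplot(height= x$freq, names.arg = x$Name, 
            main=paste("Olympic", medal, "Medalists\n in", sport), 
            xlab= paste(medal, "Medals Earned"), horiz=TRUE, las=1, col='navy')
}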
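
The bars above appear in whatever order count() returns the names. As an optional refinement, the counts can be sorted before plotting so that the top medalist is easy to spot. This is a minimal sketch on invented rows using base R's table() and sort(). The names toy and counts are hypothetical, and the plot is drawn only in an interactive session.

```r
# Toy rows (invented): one row per medal won
toy <- data.frame(
  Name = c("Ann", "Bea", "Ann", "Cy", "Ann", "Bea"),
  stringsAsFactors = FALSE
)

# Count medals per athlete and sort ascending; with horiz = TRUE the bars
# are drawn bottom to top, so the largest count ends up at the top
counts <- sort(table(toy$Name))
print(counts)

# Draw the chart only when a user is present to view it
if (interactive()) {
  par(mar = c(5, 8, 3, 1))
  barplot(height = as.vector(counts), names.arg = names(counts),
          xlab = "Medals Earned", horiz = TRUE, las = 1, col = "navy")
}
```

The same sort() step could be applied to the freq column inside olympics() with order().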

As with any marketing initiative, it is important to know our audience. The goal of this investigation is to reach Kellogg’s largest market, the United States, so it helps to know which 2016 Summer Olympic sports were the most watched within the United States. According to Forbes, gymnastics, swimming, and athletics (track and field) “have generally been the most-watched sports in the U.S. during the last few Olympic Games.”

With this information, we will create a plot for each of the top 3 most watched sports in the U.S., starting with Gymnastics.

olympics()

Figure: olympics() function output (Gymnastics)

In the plot above, we can see that Olympic gymnast Simone Arianne Biles took home the most gold medals for the United States. It would be no surprise if she were drowning in sponsors, so I decided to play it safe and look for alternative options within the other 2 sports.

Once again, I ran the function to visualize the top Olympic gold medalists in Swimming for the United States.

olympics()

Figure: olympics() function output (Swimming)

In the plot above, we can see that Olympic swimmer Michael Fred Phelps, II took home the most gold medals for the United States. However, Michael Phelps retired in 2016. We might want to look at the next top gold-medal swimmers, such as Kathleen Genevieve “Katie” Ledecky or Ryan Murphy, as possible options.

To be thorough, I ran the function one last time to visualize the top Olympic gold medalists in Athletics, the third most watched Olympic sport in the United States.

olympics()

Figure: olympics() function output (Athletics)

In the plot above, we can see that Olympic runners Allyson Michelle Felix and Tianna Madison-Bartoletta took home the most gold medals for the United States, making them viable options.

Summary

To find the right Olympic sport and athletes for Kellogg’s to sponsor, I had to narrow the focus of my study. First, I wanted to know Kellogg’s largest market because I wanted to choose an athlete that appealed to that particular market. I learned, unsurprisingly, that the United States is Kellogg’s largest market, so I used NOC as my first criterion to subset my data frame accordingly. Secondly, I wanted to focus on athletes relevant to the upcoming Olympic Games, 2021 Summer, so I used Games as my second criterion to subset my data frame to U.S. Olympic athletes who competed in 2016 Summer, the most recent Summer Games in the dataset.

Thirdly, I wanted to find the top three most-watched Summer Olympic sports within my targeted market, the U.S. I learned that these are gymnastics, swimming, and athletics, so I used Sport as my third criterion to subset my data frame accordingly. Lastly, I only considered athletes who placed first, i.e. won Gold, because they are the athletes whose popularity is at its highest, which could be used to multiply Kellogg’s brand visibility. Therefore, I used Medal as my fourth and last criterion to subset my data frame accordingly.

I created a general function olympics() that takes user input to subset and plot my data frame based on the aforementioned criteria. With this function, I can visualize the United States’ top Olympic gold medalists for gymnastics, swimming, and athletics. Although I had specific criteria for my investigation, the beauty of my function olympics() is that Kellogg’s executives can change the user input to their desired country, games, sport, and medal and retrieve the top Olympic medalists. This is particularly useful if Kellogg’s decided to market to another country based on a different games, sport, or medal. For example, if Kellogg’s chose to market to China, they might look for the top Gold medalists in Table Tennis, a very popular sport in China. All they would have to do is change their user input.

Recommendations

Returning to our targeted market, the United States: when choosing a sport or athlete to sponsor, Kellogg’s must weigh many factors before entering into a sports sponsorship agreement. In my investigation, I based my recommendations on the most watched Olympic sports and the athletes who won the most Gold medals in those sports, coupled with the intangible marketability of those individuals.

Based on these criteria, I have provided a ranked list of recommendations.

Top Recommendations

  1. Simone Arianne Biles
  2. Kathleen Genevieve “Katie” Ledecky
  3. Ryan Murphy
  4. Allyson Michelle Felix
  5. Tianna Madison-Bartoletta

I wanted to provide a ranked list of recommendations in case my top choice is unavailable. However, if I had to choose only one athlete, I would recommend Simone Arianne Biles because of her popularity, the popularity of her sport (gymnastics) among the U.S. audience, and her marketability.

The source code is available here.