Author

Checho

Published

February 5, 2025

The Simpsons Data

Let’s load the data with the tidytuesdayR package (or directly with the raw files if it doesn’t work.

Ver código
# tuesdata <- tidytuesdayR::tt_load('2025-02-04')
# ## OR
# tuesdata <- tidytuesdayR::tt_load(2025, week = 5)
# 
# simpsons_characters <- tuesdata$simpsons_characters
# simpsons_episodes <- tuesdata$simpsons_episodes
# simpsons_locations <- tuesdata$simpsons_locations
# simpsons_script_lines <- tuesdata$simpsons_script_lines

simpsons_characters <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-04/simpsons_characters.csv')
simpsons_episodes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-04/simpsons_episodes.csv')
simpsons_locations <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-04/simpsons_locations.csv')
simpsons_script_lines <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-04/simpsons_script_lines.csv')

Pairs of characters

My first analysis involves finding out which combinations of characters generate the best IMDb ratings and the highest number of views.

For that, I’m going to use the data from episodes and script_lines and select the columns I need.

Ver código
# Libraries
library(tidyverse)
library(extrafont)

loadfonts()

# Subset de episodes and script_lines
episodes <- simpsons_episodes %>% 
  select(id, imdb_rating, 
         year = original_air_year, 
         us_viewers = us_viewers_in_millions,
         title)

# Keep only rows where character have a dialogue
script_lines <- simpsons_script_lines %>% 
  filter(speaking_line == TRUE) 

# Join dataframes
df <- left_join(script_lines, episodes, by = c("episode_id" = "id"))

# Data frame exploration
glimpse(df)
Rows: 26,169
Columns: 17
$ id                 <dbl> 132917, 134285, 127110, 126516, 126517, 126520, 126…
$ episode_id         <dbl> 473, 478, 452, 450, 450, 450, 450, 450, 450, 450, 4…
$ number             <dbl> 128, 198, 61, 1, 2, 5, 35, 6, 7, 9, 10, 11, 12, 13,…
$ raw_text           <chr> "Fbi Supervisor: No.", "Willis: Oh my.", "Homer Sim…
$ timestamp_in_ms    <dbl> 607000, 1150000, 231000, 40000, 53000, 58000, 20400…
$ speaking_line      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ character_id       <dbl> 5734, 5816, 2, 2, 2, 1, 31, 1, 9, 8, 31, 33, 1, 33,…
$ location_id        <dbl> 6, 2117, 2, 3656, 3656, 3656, 354, 3656, 3656, 3656…
$ raw_character_text <chr> "Fbi Supervisor", "Willis", "Homer Simpson", "Homer…
$ raw_location_text  <chr> "KITCHEN", "COTTAGE", "Car", "Seaworld-type Park", …
$ spoken_words       <chr> "No.", "Oh my.", "Holy moly!", "Oh, I love going to…
$ normalized_text    <chr> "no", "oh my", "holy moly", "oh i love going to aqu…
$ word_count         <dbl> 1, 2, 2, 25, 7, 27, 13, 43, 4, 9, 18, 2, 2, 9, 21, …
$ imdb_rating        <dbl> 7.0, 7.2, 7.1, 6.8, 6.8, 6.8, 6.8, 6.8, 6.8, 6.8, 6…
$ year               <dbl> 2010, 2011, 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ us_viewers         <dbl> 7.18, 6.35, 5.11, 8.65, 8.65, 8.65, 8.65, 8.65, 8.6…
$ title              <chr> "Donnie Fatso", "Angry Dad: The Movie", "Million Do…

Now we are going to create a list of characters that appear per episode.

Ver código
personajes_por_episodio <- df %>% 
  filter(raw_character_text != "Man") %>% 
  group_by(episode_id) %>% 
  summarise(personajes = list(unique(raw_character_text)))

This step generates a data frame that contains one row per episode, and then a list with the names of all the characters who have at least one line of dialogue in each episode.

The next step is to process the list of characters to find the pairs within each episode.

Ver código
# Generate all combinations of couples per episode
duplas_por_episodio <- personajes_por_episodio %>%
  rowwise() %>% # Makes the analysis by row instead of columns
  mutate(duplas = list(combn(personajes, 2,   # Creates combinations of 2 characters
                             simplify = FALSE))) %>%
  unnest(duplas) %>% # Transform the list into a column
  mutate(dupla = map_chr(duplas, ~ paste(sort(.x), collapse = "-"))) %>%
  select(episode_id, dupla)

I’ll be honest with you. I used DeepSeek for this step, so I’ll share the explanation of the code because I found it to be a very interesting approach:

Code Explanation

1. personajes_por_episodio

This is the dataframe that contains, for each episode (episode_id), a list of unique character_ids that appear in that episode. For example:

episode_id characters
1 [1, 2, 3]
2 [2, 4]

Here, characters is a list-type column, where each element is a vector of character_ids.

2. rowwise()

The rowwise() function is used to apply operations row by row in a dataframe. In this case, we want to generate character combinations for each episode (i.e., for each row).

3. mutate(duplas = list(combn(characters, 2, simplify = FALSE)))

  • combn(): This function generates all possible combinations of a set of elements. Here, we want combinations of 2 characters (2 is the second argument).

  • characters: This is the list of character_ids for the current episode.

  • simplify = FALSE: This ensures that the result is a list of combinations instead of a matrix.

  • list(): Wraps the result in a list to make it compatible with mutate().

  • mutate(duplas = ...): Creates a new column called duplas, which contains a list of all combinations of 2 characters for that episode.

For example, if characters = [1, 2, 3], the combinations would be:

  • [1, 2]

  • [1, 3]

  • [2, 3]

4. unnest(duplas)

The unnest() function is used to “unnest” a column that contains lists. In this case, duplas is a column of lists, where each element is a combination of 2 characters. By using unnest(), each combination becomes a separate row.

For example, if we had:

episode_id duplas
1 [[1,2],[1,3],[2,3]][[1,2],[1,3],[2,3]]

After unnest(), we would have:

episode_id duplas
1 [1, 2]
1 [1, 3]
1 [2, 3]

5. mutate(dupla = map_chr(duplas, ~ paste(sort(.x), collapse = "-")))

  • map_chr(): Applies a function to each element of a list and returns a character vector.

  • sort(.x): Sorts the character_ids in each pair. This ensures that the pair “1-2” is the same as “2-1”.

  • paste(..., collapse = "-"): Converts the sorted pair into a text string, separating the character_ids with a hyphen (-).

  • mutate(dupla = ...): Creates a new column called dupla, which contains the text representation of each combination.

For example, if duplas = [1, 2], after this step we would have dupla = "1-2".

6. select(episode_id, dupla)

Finally, we select only the columns episode_id and dupla to keep a clean dataframe that contains, for each episode, all the pairs of characters that appear together.

Calculate the Average Rating by Pairs

We’ll clean the data a bit more, keeping only the pairs that appear at least 10 times.

Ver código
top_duplas <- duplas_por_episodio %>% 
  count(dupla, name = "cuenta") %>% 
  filter(cuenta >= 10)

# Reducimos el dataframe
duplas_por_episodio <- duplas_por_episodio %>% 
  filter(dupla %in% top_duplas$dupla)

Now we can join the data from duplas_por_episodio and in that way, calculate the average rating for echar character duo.

Ver código
duplas_con_rating <- duplas_por_episodio %>% 
  inner_join(episodes, by = c("episode_id" = "id")) %>% 
  group_by(dupla) %>% 
  summarise(imdb_promedio = mean(imdb_rating, na.rm = TRUE),
            episodios = n())

# Filter couples with at least 10 episode appearances
duplas_con_rating <- duplas_con_rating %>%
  filter(episodios >= 10) %>%
  arrange(desc(imdb_promedio))

And now we can make a plot of the 10 couples with the best average score of imdb_ranking.

Ver código
# Select the best 10 duos
top_10_duplas <- duplas_con_rating %>% 
  head(10)


# Chart
ggplot(top_10_duplas, aes(y = reorder(dupla, imdb_promedio), x = imdb_promedio)) +
  geom_point(size = 3, color = "#4f76df") +
  geom_segment(aes(x = 0, xend = imdb_promedio,
                   y = dupla, yend = dupla), color = "#4f76df") +
  geom_text(aes(label = paste0("Rating: ", round(imdb_promedio, 2), " Episodes: ", episodios)),nudge_y = 0.35,
            size = 3.5, 
            face = "bold",
            color = "#4f76df", 
            family = "Atma Medium") +
  labs(
    title = "Top 10 Character Pairs with the Highest Average IMDb Rating",
    y = "Character Pair",
    x = "Average IMDb Rating"
  ) +
  theme(panel.grid = element_blank(),
                 plot.background = element_rect(fill = "#ffd90f"),
                 panel.background = element_blank(),
                 panel.grid.major.x = element_line(color = "#70d1ff"),
                 text = element_text(face = "bold", family = "Atma Medium"),
                 plot.title.position = "plot") +
  scale_x_continuous(limits = c(0,8.5))

Ver código
ggsave("en_top_duplas.png", dpi = 300)

What if we use donuts instead of points?

Ver código
# Library
library(ggimage)

# Add a column with the name of the picture
top_10_duplas <- top_10_duplas %>% 
  mutate(imagen = "dona.png")

# Chart
ggplot(top_10_duplas, aes(y = reorder(dupla, imdb_promedio), x = imdb_promedio)) +
  geom_segment(aes(x = 0, xend = imdb_promedio,
                   y = dupla, yend = dupla), color = "#4f76df") +
  geom_image(aes(image = imagen), size = 0.06) +
  geom_text(aes(label = paste0("Rating: ", round(imdb_promedio, 2), " Episodes: ", episodios)),
            nudge_y = 0.15,
            nudge_x = -2.15,
            size = 3.7,
            family = "Atma Medium",
            face = "bold",
            color = "#4f76df") +
  labs(
    title = "Top 10 Character Pairs with the Highest Average IMDb Rating",
    y = "Character Pair",
    x = "Average IMDb Rating"
  ) +
  theme(panel.grid = element_blank(),
                 plot.background = element_rect(fill = "#ffd90f"),
                 panel.background = element_blank(),
                 panel.grid.major.x = element_line(color = "#70d1ff"),
                 text = element_text(face = "bold", family = "Atma Medium"),
                 plot.title.position = "plot") +
  scale_x_continuous(limits = c(0,8.5))

Ver código
ggsave("top_duplas_dona.png", dpi = 300)