Advanced Plots with str_glue()

May 3, 2020
stringr visualization data wrangling

Quick Overview

Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).

Let’s Dive Into an Example

I’d like to share a technique using str_glue() that I learned from Matt Dancho, a Data-Science instructor at Business Science University. Check out my favorite course here: Business Analysis With R.

str_glue() from the stringr package is one of my favorite functions in R. It’s super helpful for wrangling and manipulating text in preparation for building advanced plots.
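
Before we get to real data, here is a minimal sketch of what str_glue() does. The variable names and values below are made up for illustration and are not part of the SF Trees data.

```r
library(stringr)

# Hypothetical values, used only to illustrate interpolation
tree_name  <- "Coast Redwood"
tree_count <- 3

# Anything inside {} is evaluated as R code and pasted into the string
single_label <- str_glue("{tree_name}: {tree_count} trees ({tree_count * 2} after doubling)")

# str_glue() is vectorized: one output string per input element
many_labels <- str_glue("Species {c('A', 'B')} has {c(10, 20)} trees")

single_label
many_labels
```

Because the braces accept any R expression, the same pattern works for plot titles, subtitles and labels.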

Let’s Get Some Data 🤓

The Tidy Tuesday Project is an awesome repository of useful data for practicing your data-wrangling skills.

We will work with the San Francisco Trees data as a case-study for using str_glue() for advanced plotting techniques.

library(tidyverse)
library(stringr)
library(tidyquant)
library(scales)
library(DataExplorer)

tuesdata <- tidytuesdayR::tt_load('2020-01-28') 
sf_trees_data_raw_tbl <- tuesdata$sf_trees

Data Exploration

Let’s take a quick peek and inspect the SF Trees Data.

plot_missing(
    sf_trees_data_raw_tbl,
    ggtheme = theme_tq(),
    # nrow() returns a single number; count() would return a one-row tibble
    title = str_glue(
    'Exploring Missing Data (N = {nrow(sf_trees_data_raw_tbl)})'))
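
A title like the one above can also format the number on the way in. This is only a sketch: n_rows is a hypothetical stand-in for the real row count, and comma() comes from the scales package.

```r
library(stringr)
library(scales)

# Hypothetical row count, standing in for nrow() of the real table
n_rows <- 192987

# comma() adds thousands separators before the value is glued into the title
title_text <- str_glue("Exploring Missing Data (N = {comma(n_rows)})")

title_text
```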

This is a fairly small data-set with only 12 columns. For the purpose of demonstration, let’s see what we can do with just the species column.

The Coastal Redwoods in the SF area are incredible and one of my favorite species. I’m curious whether other species of Redwoods grow in SF and, if so, in what proportions relative to the Coastal Redwoods.

Data Wrangling

# Data Wrangling
redwood_tbl <- sf_trees_data_raw_tbl %>% 
    
    # select species and filter to redwood only
    select(species) %>% 
    filter(str_detect(species, pattern = 'Redwood')) %>% 
    
    # break up species and common-name into separate columns
    separate(
        col = species,
        into = c("species", "common_name"),
        sep  = ' :: ') %>% 
    
    # calculate absolute and relative distributions
    count(species, common_name) %>% 
    mutate(pct = n / sum(n),
           pct_text = percent(pct)) %>% 
    arrange(desc(pct))
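
If you want to try the wrangling steps without downloading the full dataset, here is a small sketch on made-up rows. The toy table only mimics the "scientific name :: common name" format of the species column; the counts are invented.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(scales)

# Made-up rows that mimic the format of the species column
toy_trees_tbl <- tibble(
    species = c(
        "Sequoia sempervirens :: Coast Redwood",
        "Sequoia sempervirens :: Coast Redwood",
        "Sequoia sempervirens :: Coast Redwood",
        "Metasequoia glyptostroboides :: Dawn Redwood",
        "Acer rubrum :: Red Maple"
    )
)

toy_redwood_tbl <- toy_trees_tbl %>%
    # keep redwoods only
    filter(str_detect(species, "Redwood")) %>%
    # split scientific and common names on the ' :: ' delimiter
    separate(species, into = c("species", "common_name"), sep = " :: ") %>%
    # counts, shares and a printable percent label
    count(species, common_name) %>%
    mutate(pct      = n / sum(n),
           pct_text = percent(pct, accuracy = 1)) %>%
    arrange(desc(pct))

toy_redwood_tbl
```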

Summary Table

Let’s take a look at our findings.

rmarkdown::paged_table(redwood_tbl %>% select(-pct))

As I expected, the Coastal Redwoods are the dominant species in San Francisco.

And while the table is good, let’s craft an awesome plot to display these results.

The Power of str_glue()

With a couple of lines of code we can build our label for plotting. Inside str_glue(), anything wrapped in curly brackets {} is evaluated as R code, so we can pull columns from our table (or expressions such as sum(n)) straight into the text. Honestly, the options are endless for how creative you can get with stringing together text and adding labels to your plots.

# Data Wrangling
redwood_labeled_text_tbl <- redwood_tbl %>% 
    
    # label text
    mutate(label_text = str_glue(
        'Scientific Name:
        {species}
        Count: {n} of {sum(n)}
        Pct (%): {pct_text}'),
    
    # insert a newline ('\n') at the first space to wrap the name on our plot
    common_name = str_replace(common_name, pattern = ' ', 
                              replacement = ' \n ')) %>% 
    
    # reorder factors based on percent rank
    mutate(common_name = common_name %>% fct_reorder(pct))
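
To see the labeling step in isolation, here is a sketch on a two-row toy table (the counts are made up). It shows that str_glue() inside mutate() builds one label per row, while sum(n) is computed over the whole column.

```r
library(dplyr)
library(stringr)

# Made-up counts, for illustration only
toy_tbl <- tibble(
    common_name = c("Coast Redwood", "Dawn Redwood"),
    n           = c(3L, 1L)
)

toy_labeled_tbl <- toy_tbl %>%
    # {n} is the row's own value; {sum(n)} is the total across all rows
    mutate(label_text = str_glue("{common_name}\nCount: {n} of {sum(n)}"))

toy_labeled_tbl$label_text
```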

Manipulated Text (ready to plot)

Here is the manipulated text that we will plot. This setup is critical for the labels on our final plot: for example, each \n becomes a line break, so the text wraps at those locations.

redwood_labeled_text_tbl %>% 
    select(label_text) %>% 
    rmarkdown::paged_table()
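
One detail worth knowing about the wrapping trick: str_replace() only changes the first match, while str_replace_all() changes every match. The sketch below uses a plain "\n" replacement and a made-up three-word name to show the difference.

```r
library(stringr)

# str_replace() changes only the first space, so the name wraps once
wrapped_once <- str_replace("Coast Redwood", pattern = " ", replacement = "\n")

# str_replace_all() wraps at every space (the name here is hypothetical)
wrapped_all <- str_replace_all("Giant Sierra Redwood", pattern = " ", replacement = "\n")

# writeLines() prints the string with its line breaks rendered
writeLines(wrapped_once)
writeLines(wrapped_all)
```

For two-word names like the ones in our table, both functions give the same result.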

Data Visualization

Now that we’ve done our Data Wrangling, let’s get into a bit of Data Visualization.

# Save Plot
g <- redwood_labeled_text_tbl %>% 
    
    # Canvas
    ggplot(aes(pct, common_name)) +
    
    # Geometries (color is set per geom; ggplot() ignores a bare color argument)
    geom_segment(aes(xend = 0, yend = common_name), size = 2, color = '#2c3e50') +
    geom_point(size = 5, color = '#2c3e50') +
    geom_label(aes(label = label_text), hjust = "inward", size = 3) +

    # Formatting
    scale_x_continuous(labels = scales::percent_format()) +
    theme_tq() + 
    labs(
      title = str_glue("Redwood Trees of San Francisco"),
      subtitle = str_glue(
        "As expected, the Coastal Redwoods make up the largest proportion.
        Dawn Redwoods were once thought to be extinct (low #s not surprising).
        Sierra Redwoods grow at high elev. and so low numbers are expected."),
      caption = str_glue("The Coastal Redwood is the dominant species in SF."),
      x = "", y = "") +
    theme(legend.position = "none",
          plot.title = element_text(face = "bold"))

Display Awesome Plot

And here is our plot with the engineered labels from the last few steps. And that’s just one example of why I love str_glue() - Simply Awesome!
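
If you’d like to experiment with the label technique without the TidyTuesday download, here is a stripped-down sketch on made-up proportions. It is not the plot from this post, just the same idea: glue the label, reorder the factor, then hand the label column to geom_label().

```r
library(dplyr)
library(stringr)
library(forcats)
library(ggplot2)

# Made-up proportions, for illustration only
toy_plot_tbl <- tibble(
    common_name = c("Coast Redwood", "Dawn Redwood"),
    pct         = c(0.75, 0.25)
) %>%
    mutate(
        # build the label first, while common_name is still plain text
        label_text  = str_glue("{common_name}\nPct: {pct * 100}%"),
        # order the y-axis by proportion
        common_name = fct_reorder(common_name, pct)
    )

toy_plot <- ggplot(toy_plot_tbl, aes(pct, common_name)) +
    geom_segment(aes(xend = 0, yend = common_name)) +
    geom_point(size = 5) +
    geom_label(aes(label = label_text), hjust = "inward", size = 3)

toy_plot
```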

Wrap Up

That’s it for today!

We used str_glue() to manipulate our text and add awesome labels to our ggplot() - the plot is now Business-Ready 👍

Stay tuned for more posts on Wrangling + Exploring-Data with R.

Get the code here: GitHub Repo.

PS: Be Kind and Tidy your Data 😎
