“Exploring-Data” is a blog in which I share easily digestible content aimed at making the wrangling and exploration of data more efficient and more fun.
Sign up HERE to join other subscribers who also nerd-out on tips for exploring data using R
.
Click on these links to see where I am improving my skills in handling data using R
:
Join me on the journey 🏃♂ 🏃♀
New Series: Exploring R {packages}
When I discover new and helpful functions
I light up 💡 with excitement.
Interestingly, I have a pattern of finding one or two useful functions
in a {package}
, but rarely explore further to discover other useful functionality
.
That’s what this New Series is all about - Exploration
.
In each post, I will share a bit about how I was using a {package}
and then use a case-study to highlight other functionality
I discovered to be useful.
Leave a Comment
Leave a comment at the end to let me know if you like this style of post. The feedback will be considered in determining the direction of this series.
Examples of feedback:
- Was the post to long?
- Do you like the case-study approach?
- Could I have just shared the functions without the case-study?
- Was sharing about the pre-processing pro-tip helpful or distracting?
janitor {package}
This a wonderful {package}
built by Sam Firke.
Take the time to explore the Github Page for the {janitor:package}
, where Sam describes it as follow:
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
There are many more functions (20+) in the package that we will not cover - head over to the following page to learn more about them: Overview of janitor functions.
Case-Study
The case-study will provide and illustrate the following:
- A pro-tip for setting up a pre-processing data pipepline.
- The
function
I use often:clean_names()
. - Newly discovered
functions
from{janitor}
.
Let’s dive in…
Imagine being tasked with doing an analysis on Starbucks coffee locations. Your manager has provided you with raw-data
from coffee chains and requested that you:
- QA the data for duplicates (by store and by location).
- Tabulate the various types of Starbucks Ownership:
- Worldwide &
- US (lower 48)
- Deliver a US map that identifies patterns in ownership types.
To streamline your efforts and get swiftly to making that map, you decide to leverage the {janitor:package}
.
Load our Libraries
library(tidyverse) # Work-Horse Package
library(janitor) # Data cleaning (+tabulating data)
library(janitor) # Business Ready Plots
library(ggthemes) # Clean ggplot theme for Maps
library(USAboundaries) # Get state name/code mapping
Let’s Get Some Data
For our case-study we are using data from the Tidy Tuesday Project archive.
# Import Data ----
tuesdata <- tidytuesdayR::tt_load("2018-05-07")
Pro-Tip: Pre-Processing Pipeline
When working with new data, I’ll typically setup up a pre-processing step at the beginning of the script. It typically starts out with no steps and then they get added as I move through my analysis.
The idea is that as you conduct your Exploratory Data Analysis (EDA)
, you will discover pre-processing steps that need to be added to your pipeline.
In this post, I’ll illustrate this technique by adding to our pipeline as we go; however, this data pipeline would live near the top of the script and would not move.
Step 1
Save raw
data to a variable.
coffee_chains_raw <- tuesdata$week6_coffee_chains
Step 2
Immediately save the raw
data to new variable labeled with the suffix, processed
.
# Beginning of Pre-Processing Pipeline
coffee_chains_processed <- coffee_chains_raw
This obviously has ZERO pre-processing done to the data at this point. The point though is that as you discover areas of your data that require attention, you then can circle back to this pipeline and add those steps.
This may seem odd, but the beauty comes in not having to get further along in your analysis before realizing that you need to do data cleaning steps; if you approach it that way, then you have to go back and rename your variables created along the way - this method allows you to keep working with your processed
data as you move swiftly through your analysis.
I picked up this pro-tip
while watching David Robinson
in his Tidy Tuesday Screencasts
- check those out here: Tidy Tuesday R Screencasts
# Hat-Tip to D-Rob
Step 3
Begin Exploring
your Data
and conducting your analysis.
At this point, I’ll do a bit of EDA
to familiarize myself with the data I’m working with; this process is always to get a high-level understanding of the data so that I can pick up on nuances along with data integrity issues that need attention (dealt with in the pre-processing pipeline).
Initial Exploration
Let’s look at these raw
data using the tibble::glimpse()
function.
The glimpse()
function allows us to quickly assess column names, data-types, and also view a sample of the values contained in each column - you can read more about the glimpse()
function in my archived post, Examining Data with glimpse().
coffee_chains_processed %>%
tibble::glimpse()
## Rows: 25,600
## Columns: 13
## $ Brand <chr> "Starbucks", "Starbucks", "Starbucks", "Starbucks", …
## $ `Store Number` <chr> "47370-257954", "22331-212325", "47089-256771", "221…
## $ `Store Name` <chr> "Meritxell, 96", "Ajman Drive Thru", "Dana Mall", "T…
## $ `Ownership Type` <chr> "Licensed", "Licensed", "Licensed", "Licensed", "Lic…
## $ `Street Address` <chr> "Av. Meritxell, 96", "1 Street 69, Al Jarf", "Sheikh…
## $ City <chr> "Andorra la Vella", "Ajman", "Ajman", "Abu Dhabi", "…
## $ `State/Province` <chr> "7", "AJ", "AJ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ",…
## $ Country <chr> "AD", "AE", "AE", "AE", "AE", "AE", "AE", "AE", "AE"…
## $ Postcode <chr> "AD500", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "31…
## $ `Phone Number` <chr> "376818720", NA, NA, NA, NA, NA, NA, NA, "26670052",…
## $ Timezone <chr> "GMT+1:00 Europe/Andorra", "GMT+04:00 Asia/Dubai", "…
## $ Longitude <dbl> 1.53, 55.47, 55.47, 54.38, 54.54, 54.49, 54.49, 54.6…
## $ Latitude <dbl> 42.51, 25.42, 25.39, 24.48, 24.51, 24.40, 24.40, 24.…
Immediately, we can see that our column names are not optimal for analysis. Personally, I’m VERY biased towards snake_case
and therefore always like to get column names into that format.
janitor::clean_names()
In comes {janitor::clean_names}
to the rescue ⛑️
By default, clean_names()
outputs column naming with the snake_case
format - maybe this is one of the reasons that it’s in my top 10 for favorite functions in R
.
Let’s test it out on our coffee data.
# clean_names() with default naming
coffee_chains_processed %>%
janitor::clean_names() %>%
base::names()
## [1] "brand" "store_number" "store_name" "ownership_type"
## [5] "street_address" "city" "state_province" "country"
## [9] "postcode" "phone_number" "timezone" "longitude"
## [13] "latitude"
Awesome!
You’ll notice the function
took care of the /
in State/Province
and replaced it with an underscore - simply amazing 😎
Naming Convention Options
If you prefer a different naming convention - I’m not sure why you would 🙄 - then you can use the case
argument.
# clean_names() with diff. naming convention
coffee_chains_processed %>%
clean_names(case = "small_camel") %>%
names()
## [1] "brand" "storeNumber" "storeName" "ownershipType"
## [5] "streetAddress" "city" "stateProvince" "country"
## [9] "postcode" "phoneNumber" "timezone" "longitude"
## [13] "latitude"
Pre-Processing Addition
Now let’s add this step to our pre-processing pipeline.
# Adding to our Pre-Processing Pipeline
coffee_chains_processed <- coffee_chains_raw %>%
# clean up column names
janitor::clean_names()
janitor::get_dupes()
get_dupes()
is at the top of the list for newly discovered functionality
within the {janitor}
package.
This is one of those things you need to do often (check for duplicates) and {janitor}
makes it simple.
Going back to our case-study
, our manager asked us to check for duplicated records (a common data-cleaning
and EDA
step).
Let’s subset our data and investigate.
coffee_chains_processed %>%
# subset data by store and by location
dplyr::select(brand, store_number,
city, state_province, country) %>%
# identify duplicated records
janitor::get_dupes()
## # A tibble: 2 x 6
## brand store_number city state_province country dupe_count
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 Starbucks 19773-160973 Seoul 11 KR 2
## 2 Starbucks 19773-160973 Seoul 11 KR 2
Using janitor::get_dupes()
we’ve quickly identified a potential issue: store number 19773-160973
has duplicated records.
Let’s investigate further.
# filter to store with dupes
coffee_chains_processed %>%
# filter to store and glimpse data
dplyr::filter(store_number == "19773-160973") %>%
glimpse()
## Rows: 2
## Columns: 13
## $ brand <chr> "Starbucks", "Starbucks"
## $ store_number <chr> "19773-160973", "19773-160973"
## $ store_name <chr> "Yoido IFC Mall - 1F", "Yoido IFC Mall - 1F"
## $ ownership_type <chr> "Joint Venture", "Joint Venture"
## $ street_address <chr> "23 & 23-1, Yoido-Dong, Yongdongpo-Gu, 1F, #101", "23 …
## $ city <chr> "Seoul", "Seoul"
## $ state_province <chr> "11", "11"
## $ country <chr> "KR", "KR"
## $ postcode <chr> "153-023", "153-023"
## $ phone_number <chr> NA, NA
## $ timezone <chr> "GMT+09:00 Asia/Seoul", "GMT+09:00 Asia/Seoul"
## $ longitude <dbl> NA, 126.92
## $ latitude <dbl> NA, 37.53
Look carefully and you’ll notice that the latitude/longitude are missing for one of these records.
We need lat/long for mapping and so we will want to prioritize the records with those data. Also, we don’t want duplicated records to interfere with our tabulations later on in this analysis.
Let’s quickly look and see how much data is missing from the lat/long columns.
# plot missing data (using raw data)
DataExplorer::plot_missing(
title = "% of Missing Data (filtered to cols w/missing data)",
data = coffee_chains_raw,
ggtheme = tidyquant::theme_tq(),
missing_only = TRUE)
The plot shows that 0%
of data are missing for lat/long leading me to believe that the store identified earlier is the only record with missing data (insignificant amount when plotted).
We will filter that record out in our data-cleaning
step.
Pre-Processing Addition
Now let’s add this step to our pre-processing pipeline
# Adding to our Pre-Processing Pipeline
coffee_chains_processed <- coffee_chains_raw %>%
# clean up column names
janitor::clean_names() %>%
# filter out records missing lat/long values
dplyr::filter(!is.na(latitude), !is.na(longitude))
Let’s use get_dupes()
to confirm the problem is solved
coffee_chains_processed %>%
# subset data
dplyr::select(brand, store_number, city, state_province, country) %>%
# identify duplicated records
janitor::get_dupes()
## # A tibble: 0 x 6
## # … with 6 variables: brand <chr>, store_number <chr>, city <chr>,
## # state_province <chr>, country <chr>, dupe_count <int>
Starbucks Analysis
Now that we’ve done our due diligence in being sure we’ve dealt with data issues, let’s knock out this analysis by tabulating these data and compiling a map, or two 🤓
Before doing so, let’s add one final step to our pre-processing data pipeline.
Pre-Processing Addition
The final step is to subset the columns needed to complete the analysis.
# Adding to our Pre-Processing Pipeline
coffee_chains_processed <- coffee_chains_raw %>%
# clean up column names
janitor::clean_names() %>%
# filter out records missing lat/long values
dplyr::filter(!is.na(latitude), !is.na(longitude)) %>%
# subset columns for analysis
dplyr::select(brand, ownership_type, country,
state_province, latitude, longitude)
View Data
# view first 5 rows
coffee_chains_processed %>% head(5)
## # A tibble: 5 x 6
## brand ownership_type country state_province latitude longitude
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Starbucks Licensed AD 7 42.5 1.53
## 2 Starbucks Licensed AE AJ 25.4 55.5
## 3 Starbucks Licensed AE AJ 25.4 55.5
## 4 Starbucks Licensed AE AZ 24.5 54.4
## 5 Starbucks Licensed AE AZ 24.5 54.5
Tabulate Data (worldwide)
Let’s start with looking at Ownership Types worldwide.
janitor::tabyl
stuck out to me because the ease with which to generate frequency tables.
Check it out.
# generate frequency table
coffee_chains_processed %>%
# filter data
dplyr::filter(brand == "Starbucks") %>%
# tabulate and arrange data
janitor::tabyl(ownership_type) %>%
arrange(desc(percent)) %>%
# formatting
janitor::adorn_totals() %>%
janitor::adorn_pct_formatting() %>%
rmarkdown::paged_table()
Using just the tabyl
function we were able to generate frequencies along with the percent of total.
However, {janitor}
is packed full of other goodies - the creator(s) have crafted a number of adorn
options for formatting our outputs. I used the adorn_totals
and adorn_pct_formatting
to tidy up and make our table ready for presentation.
Simply Amazing 😎
Tabulate Data (US, lower 48)
# generate frequency table
coffee_chains_processed %>%
# filter data
dplyr::filter(brand == "Starbucks",
country == "US",
state_province != "AK",
state_province != "HI") %>%
# tabulate and arrange data
janitor::tabyl(ownership_type) %>%
arrange(desc(percent)) %>%
# formatting
janitor::adorn_totals() %>%
janitor::adorn_pct_formatting() %>%
rmarkdown::paged_table()
All Starbucks are either company owned, which is almost all of them, or else they’re “licensed” locations, which are the Starbucks in airports, supermarkets, etc. - Charles Partrick
Map Starbucks Locations
Now lets make those maps and get this analysis wrapped up.
Lets start by getting a general sense of where in the US these Starbucks are located.
Data Manipulation
# Data Manipulation
starbucks_lower_48 <- coffee_chains_processed %>%
# filter data
dplyr::filter(brand == "Starbucks",
country == "US",
state_province != "AK",
state_province != "HI")
Data Visualization
# Data Visualization
starbucks_lower_48 %>%
# setup ggplot canvas + US borders
ggplot(aes(longitude, latitude, color = ownership_type)) +
# add geometries
borders("state") +
geom_point(size = .75, alpha = 0.5) +
# formatting
ggthemes::theme_map() + # remove x/y for tidy map
coord_map() + # scales map (simple approach)
scale_color_manual(values = c("#2c3e50", "#18BC9C")) +
labs(title = "Starbucks Locations by Ownership Type (Lower 48)",
color = "Ownership Type")
That’s a solid map but I think we can do better to identify patterns in ownership types.
Let’s calculate the ratio of Corporate (Company Owned) vs. Licensed ownership and map that at the state level.
Data Acquisition (state boundaries)
# Get state level lat/long table
states <- ggplot2::map_data("state") %>%
tibble() %>%
mutate(region = str_to_title(region))
Data Manipulation
# Data Manipulation
ownership_ratios_by_state <- starbucks_lower_48 %>%
# count ownership types by state
group_by(state_province, ownership_type) %>%
summarize(n = n()) %>%
ungroup() %>%
# pivot data and calculate ratios
pivot_wider(names_from = ownership_type,
values_from = n) %>%
clean_names() %>%
mutate(corp_vs_lic = company_owned/licensed) %>%
# join to get state names from codes
left_join(USAboundaries::state_codes %>%
select(state_name, state_abbr),
by = c("state_province" = "state_abbr")) %>%
# reorder columns
select(state_name, everything())
View Data
ownership_ratios_by_state %>% head()
## # A tibble: 6 x 5
## state_name state_province company_owned licensed corp_vs_lic
## <chr> <chr> <int> <int> <dbl>
## 1 Alabama AL 48 36 1.33
## 2 Arkansas AR 35 19 1.84
## 3 Arizona AZ 196 283 0.693
## 4 California CA 1943 839 2.32
## 5 Colorado CO 227 250 0.908
## 6 Connecticut CT 83 35 2.37
Data Visualization
ownership_ratios_by_state %>%
# join to get state boundaries (lat/long)
left_join(states, by = c("state_name" = "region")) %>%
# setup ggplot canvas + US borders
ggplot(aes(long, lat, fill = corp_vs_lic, group = group)) +
# add geometries
geom_polygon() +
ggplot2::borders("state") +
# formatting
ggthemes::theme_map() + # remove x/y for tidy map
theme(legend.position = c(.9, .05)) +
coord_map(projection = "mercator") + # scales map projection
scale_fill_gradient2(low = "white", high = "#18BC9C", ) +
labs(title = "Ratio of Corporate vs. Licensed Starbucks in the US (Lower 48)",
subtitle = "Darker green equates to more corporate locations compared to licensed establishments.",
fill = "Ratio of\nCorporate/\nLicensed")
This represent the data in a way that helps us identify patterns - our manager will be pleased 👍
Wrap Up
I hope you enjoyed the first post in this new series.
Leave a comment and let me know.
Get the code here: Github Repo.
Learn R Fast
I’ve been learning Data Science at Business Science University.
Join me on the journey.
Check out this link to get 15% off of the courses that are helping 1000s of analytics professionals take their careers to the next level: Business Science Courses
Good luck.