Exploratory Data Analysis Guide

July 15, 2020
r4ds visualization eda

Quick Overview

Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).

Sign up Here to join the many other subscribers who also nerd out on new tips and tricks 🤓

And if you enjoy the post be sure to share it

R for Data Science

Hadley Wickham and Garrett Grolemund wrote an incredible book called R for Data Science (R4DS). In the book they teach how to “turn raw data into understanding, insight, and knowledge.” The authors do this by being laser focused on the tools that help the data-practitioner import, tidy, transform, visualize, and model data (+communicate findings):

R4DS Workflow

I dug into the chapter on Exploratory Data Analysis (EDA) this past week.

The chapter is 🔥 and I highly recommend it 😎

Exploring Data

I was super excited about the chapter and the knowledge it packed!

In my own path to improving my EDA skills I thought I’d capture what stuck out most in the form of a blog post.

EDA Overview

The authors of R4DS explain EDA as the process of “using visualization and transformation to Explore your Data in a systematic way.” They recommend an iterative cycle as follows:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modeling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

Favorite Quotes (from Intro)

The chapter was packed full of content - here are the quotes that stood out the most from the Introduction:

  • “EDA is not a formal process with a strict set of rules.”

  • “EDA is a state of mind.” - I love this one 😎

  • “During the initial phases of EDA you should feel free to investigate every idea that occurs to you.”

  • “As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.”

Questions Drive EDA

One of my biggest take always from the chapter was the emphasis on EDA being a creative process that is driven by asking questions - and lots of them 🤔

Section on Questions

The section titled Questions was packed full of nuggets - here are a few of the quotes that stuck out:

  • “EDA is fundamentally a creative process.”

  • “Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation.”

  • “the key to asking quality questions is to generate a large quantity of questions.” - So Good 😎

  • “each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.”

  • “There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:”

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

Section on Variation

We are always interested in discovering what explains the variation seen in data and so the section on Variation was very interesting. Here are a few of the quotes that stood out:

  • “Variation is the tendency of the values of a variable to change from measurement to measurement.”

  • “Every variable has its own pattern of variation, which can reveal interesting information.”

  • “The best way to understand that pattern is to visualize the distribution of the variable’s values.”

Visualizing Distributions

This post wouldn’t be complete without a bit of code and a few visualizations 🤓

# load libraries
library(tidyverse)
library(tidyquant)
library(patchwork)
theme_set(theme_tq())

Categorical Variables

  • “To examine the distribution of a categorical variable, use a bar chart:”
p1 <- ggplot(diamonds) +
    geom_bar(aes(x = cut)) +
    labs(title = "Bar Charts for Categorical Variables")
p1

Continuous Variables

  • “To examine the distribution of a continuous variable, use a histogram:”

  • “You can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable.”

p2 <- ggplot(diamonds %>% filter(carat < 3)) +
    geom_histogram(aes(x = carat), binwidth = 1.0) +
    labs(title = "Binwidth: 1.0") 

p3 <- ggplot(diamonds %>% filter(carat < 3)) +
    geom_histogram(aes(x = carat), binwidth = 0.5) +
    labs(title = "Binwidth = 0.5") 

p4 <- ggplot(diamonds %>% filter(carat < 3)) +
    geom_histogram(aes(x = carat), binwidth = 0.1) +
    labs(title = "Binwidth = 0.1") 

p5 <- ggplot(diamonds %>% filter(carat < 3)) +
    geom_histogram(aes(x = carat), binwidth = 0.01) +
    labs(title = "Binwidth = 0.01") 
  • “You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.”
# Show plots
patchwork <- p2 + p3 + p4 + p5

patchwork + plot_annotation(
  title   = "Histograms for Continuous Variables",
  caption = "Notice the different patterns that are revealed."
)

Typical Values

Remember how the author’s emphasize the use of questions to drive EDA?

This section on Typical Values continues that thread and suggests the following questions to ask when Visualizing Distributions:

  • “Which values are the most common? Why?”

  • “Which values are rare? Why? Does that match your expectations?”

  • “Can you see any unusual patterns? What might explain them?”

Another learning from this section is to focus on the clusters revealed when Visualizing your Distributions. For example, take a look at the Histogram using a Binwidth of 0.01; obvious clusters are revealed that are not obvious in the other 3 plots.

p5

The author’s point out that “clusters of similar values suggest that subgroups exist in your data.”

What do you think they suggest doing in this situation? Ask more questions 🤔

Understand the subgroups by asking:

  • “How are the observations within each cluster similar to each other?”

  • “How are the observations in separate clusters different from each other?”

  • “How can you explain or describe the clusters?”

  • “Why might the appearance of clusters be misleading?”

Wrap Up

The chapter goes on to discuss unusual values, missing values, covariation, and briefly goes into using models to further your EDA. I highly recommend giving it a read and trying out the exercises in the chapter.

Get the code here: Github Repo.

Questions. Questions. Questions.

The biggest learning I took from this chapter, and what I’ve started using in my own EDA process, is getting better at asking questions. Not only that, but really to let my curiosity drive the EDA early on in the process.

And my favorite quote is probably this one:

“the key to asking quality questions is to generate a large quantity of questions.”

Watch an Expert do EDA

This is a bonus for making it this far in the post 😉

David Robinson is an expert when it comes to all things EDA and R. For the last year or so he has been recording weekly screencasts where he Explores Data he has never seen before.

I watch these videos often to get insights into how to think analytically when doing EDA.

You can check them out here at his YouTube channel: Tidy Tuesday R Screencasts

Subscribe + Share

Enter your Email Here to get the latest from Exploring-Data in your inbox.

PS: Be Kind and Tidy your Data 😎

Learn R Quickly

I’ve expedited my R and Data-Science journey using the courses over at Business Science University

The instructor, Matt Dancho, has given me a 15% discount to share with my audience. Get the discount and join me on the journey.

Link to my favorite R course (with 15% off discount): Data Science for Business 101

How to Explore Data: {DataExplorer} Package

September 16, 2020
r packages visualization eda

How to Clean Data: {janitor} Package

September 2, 2020
r packages data cleaning eda

Quick Tips for Data Cleaning in R

August 25, 2020
data cleaning data wrangling eda
comments powered by Disqus