Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).
Sign up Here to join the many other subscribers who also nerd out on new tips and tricks 🤓
And if you enjoy the post be sure to share itTweet
R for Data Science
Hadley Wickham and Garrett Grolemund wrote an incredible book called R for Data Science (R4DS). In the book they teach how to “turn raw data into understanding, insight, and knowledge.” The authors do this by being laser focused on the tools that help the data-practitioner import, tidy, transform, visualize, and model data (+communicate findings):
I dug into the chapter on Exploratory Data Analysis (EDA) this past week.
The chapter is 🔥 and I highly recommend it 😎
I was super excited about the chapter and the knowledge it packed!
In my own path to improving my EDA skills I thought I’d capture what stuck out most in the form of a blog post.
The authors of R4DS explain EDA as the process of “using visualization and transformation to
Data in a systematic way.” They recommend an iterative cycle as follows:
questionsabout your data.
- Search for
answersby visualizing, transforming, and modeling your data.
- Use what you learn to refine your
questionsand/or generate new
Favorite Quotes (from Intro)
The chapter was packed full of content - here are the quotes that stood out the most from the Introduction:
“EDA is not a formal process with a strict set of rules.”
“EDA is a state of mind.” - I love this one 😎
“During the initial phases of EDA you should feel free to investigate every idea that occurs to you.”
“As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.”
Questions Drive EDA
One of my biggest take always from the chapter was the emphasis on EDA being a creative process that is
driven by asking
questions - and lots of them 🤔
Section on Questions
The section titled
Questions was packed full of nuggets - here are a few of the quotes that stuck out:
“EDA is fundamentally a creative process.”
“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use
questionsas tools to guide your investigation.”
“the key to asking quality
questionsis to generate a large quantity of
questions.” - So Good 😎
questionthat you ask will expose you to a new aspect of your data and increase your chance of making a discovery.”
“There is no rule about which
questionsyou should ask to guide your research. However, two types of
questionswill always be useful for making discoveries within your data. You can loosely word these
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
Section on Variation
We are always interested in discovering what explains the variation seen in data and so the section on
Variation was very interesting. Here are a few of the quotes that stood out:
“Variation is the tendency of the values of a variable to change from measurement to measurement.”
“Every variable has its own pattern of variation, which can reveal interesting information.”
“The best way to understand that pattern is to visualize the distribution of the variable’s values.”
This post wouldn’t be complete without a bit of code and a few visualizations 🤓
# load libraries library(tidyverse) library(tidyquant) library(patchwork) theme_set(theme_tq())
- “To examine the distribution of a categorical variable, use a bar chart:”
p1 <- ggplot(diamonds) + geom_bar(aes(x = cut)) + labs(title = "Bar Charts for Categorical Variables") p1
“To examine the distribution of a continuous variable, use a histogram:”
“You can set the width of the intervals in a histogram with the
binwidthargument, which is
measured in the units of the x variable.”
p2 <- ggplot(diamonds %>% filter(carat < 3)) + geom_histogram(aes(x = carat), binwidth = 1.0) + labs(title = "Binwidth: 1.0") p3 <- ggplot(diamonds %>% filter(carat < 3)) + geom_histogram(aes(x = carat), binwidth = 0.5) + labs(title = "Binwidth = 0.5") p4 <- ggplot(diamonds %>% filter(carat < 3)) + geom_histogram(aes(x = carat), binwidth = 0.1) + labs(title = "Binwidth = 0.1") p5 <- ggplot(diamonds %>% filter(carat < 3)) + geom_histogram(aes(x = carat), binwidth = 0.01) + labs(title = "Binwidth = 0.01")
- “You should always
explore a variety of binwidthswhen working with histograms, as different binwidths can reveal different patterns.”
# Show plots patchwork <- p2 + p3 + p4 + p5 patchwork + plot_annotation( title = "Histograms for Continuous Variables", caption = "Notice the different patterns that are revealed." )
Remember how the author’s emphasize the use of
questions to drive EDA?
This section on
Values continues that thread and suggests the following
questions to ask when
“Which values are the most common? Why?”
“Which values are rare? Why? Does that match your expectations?”
“Can you see any unusual patterns? What might explain them?”
Another learning from this section is to focus on the clusters revealed when
Distributions. For example, take a look at the
Histogram using a
0.01; obvious clusters are revealed that are not obvious in the other 3 plots.
The author’s point out that “clusters of similar values suggest that subgroups exist in your data.”
What do you think they suggest doing in this situation? Ask more
Understand the subgroups by asking:
“How are the observations within each cluster similar to each other?”
“How are the observations in separate clusters different from each other?”
“How can you explain or describe the clusters?”
“Why might the appearance of clusters be misleading?”
The chapter goes on to discuss unusual values, missing values, covariation, and briefly goes into using models to further your EDA. I highly recommend giving it a read and trying out the exercises in the chapter.
Get the code here: Github Repo.
The biggest learning I took from this chapter, and what I’ve started using in my own EDA process, is getting better at asking
questions. Not only that, but really to let my curiosity drive the EDA early on in the process.
And my favorite quote is probably this one:
“the key to asking quality questions is to generate a large quantity of questions.”
Watch an Expert do EDA
This is a
bonus for making it this far in the post 😉
David Robinson is an expert when it comes to all things
R. For the last year or so he has been recording weekly screencasts where he
Data he has never seen before.
I watch these videos often to get insights into how to think analytically when doing EDA.
You can check them out here at his YouTube channel: Tidy Tuesday R Screencasts
Learn R Quickly
I’ve expedited my
Data-Science journey using the courses over at Business Science University
The instructor, Matt Dancho, has given me a 15% discount to share with my audience. Get the discount and join me on the journey.
Link to my favorite
R course (with 15% off discount): Data Science for Business 101