The goal of this lab is to acquire skills in running descriptive statistics and creating graphs using R. Make sure you’ve read and fully understood Handout 4 as this guide tracks closely with the material presented there. In this lab, we will be working with census tract data from PolicyMap. As described in Handout 3, census tracts are the traditional measure of neighborhoods in the United States. The objectives of the guide are as follows
This lab guide follows closely and supplements the material presented in Chapters 1,3, 7 and 28 in the textbook R for Data Science (RDS) and the class Handout 4.
Assignment 4 is due by 10:00 am, February 7th on
Canvas. See here for
assignment guidelines. You must submit an .Rmd
file and its
associated .html
file. Name the files:
yourLastName_firstInitial_asgn04. For example: brazil_n_asgn04.
Download the Lab
template into an appropriate folder on your hard drive (preferably,
a folder named ‘Lab 4’), open it in R Studio, and type and run your code
there. The template is also located on Canvas under Files. Change the
title (“Lab 4”) and insert your name and date. Don’t change anything
else inside the YAML (the stuff at the top in between the
---
). Also keep the grey chunk after the YAML. For a
rundown on the use of R Markdown, see the assignment
guidelines
We will be installing two new packages in this lab. Install the following packages. These packages will be needed to create presentation ready tables. Remember, you only do this once, and never within your R Markdown.
install.packages("flextable")
install.packages("webshot")
Load the following packages using library()
. Remember,
you need to do this every time you run an R session, so the following
code should appear at the top of your R Markdown file.
library(tidyverse)
library(flextable)
library(webshot)
You will be working with census tract data for the cities of
Sacramento, San Francisco, San Jose, and Oakland, the four largest
cities in Northern California. Let’s get some more practice working with
data from PolicyMap. To save us time, I downloaded data from PolicyMap,
cleaned the file, and uploaded it on GitHub. The data file and its
metadata are also located on Canvas in the Week 4 Lab folder under
Files. Let’s bring the csv file into R using
read_csv()
.
ncal.tracts <- read_csv("https://raw.githubusercontent.com/crd150/data/master/pmap_lab3.csv")
Make sure you take a look at any dataset you bring into R.
glimpse(ncal.tracts)
## Rows: 599
## Columns: 8
## $ fips <dbl> 6067005205, 6067007204, 6067009601, 6067007106, 6067005301, …
## $ oppzone <chr> "Designated Qualified Opportunity Zone", "Designated Qualifi…
## $ city <chr> "Sacramento", "Sacramento", "Sacramento", "Sacramento", "Sac…
## $ levratio <dbl> 2.280000, 3.220000, 3.450000, 2.750000, 3.367861, 2.990000, …
## $ medincome <dbl> 30585, 56806, 52750, 98056, 12355, 61061, 41614, 30309, 5137…
## $ phisp <dbl> 0.23168831, 0.25734355, 0.29462923, 0.25339747, 0.34152140, …
## $ pblk <dbl> 0.09298701, 0.10451256, 0.21833995, 0.10547204, 0.30427892, …
## $ mhisp <chr> "Not Majority", "Not Majority", "Not Majority", "Not Majorit…
The dataset contains tract-level median household income, percent Hispanic, percent Black, whether the tract is designated as an Opportunity Zone (a high-poverty neighborhood eligible for federal economic development funding), the home mortgage loan-to-income ratio (also known as the leverage ratio), and whether the tract is “Majority” Hispanic (phisp > 50%) or “Not Majority” Hispanic. Make sure you explore PolicyMap on your own, as it offers a wealth of data that may be useful for your final projects. Check the PolicyMap tutorial for steps to downloading and cleaning PolicyMap data.
Recall from Handout 4 our two important data types: categorical and numeric. Let’s first summarize a single numeric variable - neighborhood median household income - using some basic descriptive statistics.
We can use the function summarize()
to calculate mean
neighborhood income. The first argument inside summarize()
is the data object ncal.tracts and the second argument is the
function calculating the specific summary statistic, in this case
mean()
, which unsurprisingly calculates the mean of the
variable you indicate in between the parentheses.
ncal.tracts %>%
summarize(mean(medincome))
## # A tibble: 1 × 1
## `mean(medincome)`
## <dbl>
## 1 NA
We get the value NA, which as we learned in Lab 3
represents a missing value. If a variable has missing values, functions
like mean()
will return an NA
. If we use the
function summary()
, we find that medincome has 8
tracts with missing median income values
summary(ncal.tracts)
## fips oppzone city levratio
## Min. :6.001e+09 Length:599 Length:599 Min. :2.050
## 1st Qu.:6.067e+09 Class :character Class :character 1st Qu.:3.140
## Median :6.075e+09 Mode :character Mode :character Median :3.368
## Mean :6.063e+09 Mean :3.368
## 3rd Qu.:6.086e+09 3rd Qu.:3.570
## Max. :6.086e+09 Max. :5.460
##
## medincome phisp pblk mhisp
## Min. : 12171 Min. :0.0000 Min. :0.00000 Length:599
## 1st Qu.: 49690 1st Qu.:0.1039 1st Qu.:0.01798 Class :character
## Median : 77188 Median :0.1944 Median :0.04365 Mode :character
## Mean : 80407 Mean :0.2447 Mean :0.09279
## 3rd Qu.:104670 3rd Qu.:0.3411 3rd Qu.:0.11576
## Max. :211442 Max. :0.8172 Max. :0.63884
## NA's :8
In order to calculate the mean (or any numeric descriptive statistic)
for a variable with missing values, use the argument
na.rm = TRUE
, which will calculate the mean of the variable
without the missing values.
summarize(ncal.tracts, mean(medincome, na.rm = TRUE))
## # A tibble: 1 × 1
## `mean(medincome, na.rm = TRUE)`
## <dbl>
## 1 80407.
Does the average neighborhood income differ by city? We need to pair
summarize()
with the function group_by()
to
answer this question. The function group_by()
tells R to
run subsequent functions on the data object by a group
characteristic (such as gender, educational attainment, or in this case,
city). Let’s use our new best friend %>%
, who we met in
Lab 2, to
accomplish this task.
ncal.tracts %>%
group_by(city) %>%
summarize(mean(medincome, na.rm = TRUE))
## # A tibble: 4 × 2
## city `mean(medincome, na.rm = TRUE)`
## <chr> <dbl>
## 1 Oakland 66559.
## 2 Sacramento 53798.
## 3 San Francisco 91052.
## 4 San Jose 91805.
The first pipe sends ncal.tracts into the function group_by(), which tells R to group ncal.tracts by the variable city.
ncal.tracts %>%
group_by(city)
How do you know the tibble is grouped? Because it tells you
The second pipe takes this grouped dataset and sends it into the
summarize()
command, which calculates the mean neighborhood
income (by city, because the dataset is grouped by city).
We can calculate more than one summary statistic within
summarize()
. For example, to get the mean, median, standard
deviation and interquartile range (IQR) of median income, and give
column labels for the variables in the resulting summary table, we type
in
ncal.tracts %>%
group_by(city) %>%
summarize(Mean = mean(medincome, na.rm = TRUE),
Median = median(medincome, na.rm = TRUE),
SD = sd(medincome, na.rm = TRUE),
IQR = IQR(medincome, na.rm = TRUE))
## # A tibble: 4 × 5
## city Mean Median SD IQR
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Oakland 66559. 51438. 41417. 40686.
## 2 Sacramento 53798. 49552 24415. 28390.
## 3 San Francisco 91052. 89184. 37913. 50444
## 4 San Jose 91805. 88299 34650. 48412.
Remember from Handout 4 that the IQR is the difference between the
75th and 25th percentiles. It is a measure of spread, and more
generally, an indicator of inequality. Another measure of spread or
inequality is the 90/10 ratio. To calculate this ratio, we’ll first need
to calculate the 90th and 10th percentiles using the
quantile()
command, where we indicate the percentile using
the argument p =
. We can do all of this inside
summarize()
. Make sure you understand what each function in
the code below is doing.
ncal.tracts %>%
group_by(city) %>%
summarize(p90 = quantile(medincome, p = 0.90, na.rm = TRUE),
p10 = quantile(medincome, p = 0.10, na.rm = TRUE),
Ratio9010 = p90/p10) %>%
select(-(c(p90,p10)))
## # A tibble: 4 × 2
## city Ratio9010
## <chr> <dbl>
## 1 Oakland 4.39
## 2 Sacramento 3.15
## 3 San Francisco 3.49
## 4 San Jose 2.73
Let’s next summarize a single categorical variable. oppzone indicates whether a tract is designated as an Opportunity Zone neighborhood. The variable has two categories: designated and not designated as an Opportunity Zone.
To get the proportion of tracts that are Opportunity Zone
neighborhoods, you’ll need to combine the functions
group_by()
, summarize()
and
mutate()
using %>%
.
ncal.tracts %>%
group_by(oppzone) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
freq = n / total)
## # A tibble: 2 × 4
## oppzone n total freq
## <chr> <int> <int> <dbl>
## 1 Designated Qualified Opportunity Zone 75 599 0.125
## 2 Not Designated 524 599 0.875
Let’s break up this chunk of code to show exactly what was done here.
First, ncal.tracts was piped into the group_by()
function. Next, group_by(oppzone)
separates the
neighborhoods by Opportunity Zone designation. We then used
summarize()
to count the number of neighborhoods by
Opportunity Zone designation. The function to get a count is
n()
, and we saved this count in a variable named
n. This gave us the following table.
ncal.tracts %>%
group_by(oppzone) %>%
summarize (n = n())
## # A tibble: 2 × 2
## oppzone n
## <chr> <int>
## 1 Designated Qualified Opportunity Zone 75
## 2 Not Designated 524
Remember, we are trying to get the proportion of neighborhoods that
are and are not designated as Opportunity zones. This means we need a
numerator - the number of neighborhoods that are and are not designated
as Opportunity Zones. This is what n = n()
gives us. There
are 75 neighborhoods that are designated as an Opportunity Zone.
Next, this table is piped into mutate()
, which creates a
variable total which gives you the denominator - the total
number of neighborhoods. The code sum(n)
adds the values of
n: 524+75 = 599. We then create a variable freq, which
divides the value of each n by this sum: 75/599 = 0.125 and
524/599 = 0.875. This yields the proportion of all neighborhoods by
Opportunity Zone designation (how would you transform this to a
percentage?).
ncal.tracts %>%
group_by(oppzone) %>%
summarize (n = n()) %>%
mutate(total = sum(n),
freq = n / total)
## # A tibble: 2 × 4
## oppzone n total freq
## <chr> <int> <int> <dbl>
## 1 Designated Qualified Opportunity Zone 75 599 0.125
## 2 Not Designated 524 599 0.875
We can add city to the group_by()
function to
disaggregate the above result by city.
ncal.tracts %>%
group_by(city, oppzone) %>%
summarize (n = n()) %>%
mutate(total = sum(n),
freq = n / total)
## # A tibble: 8 × 5
## # Groups: city [4]
## city oppzone n total freq
## <chr> <chr> <int> <int> <dbl>
## 1 Oakland Designated Qualified Opportunity Zone 30 113 0.265
## 2 Oakland Not Designated 83 113 0.735
## 3 Sacramento Designated Qualified Opportunity Zone 23 100 0.23
## 4 Sacramento Not Designated 77 100 0.77
## 5 San Francisco Designated Qualified Opportunity Zone 11 197 0.0558
## 6 San Francisco Not Designated 186 197 0.944
## 7 San Jose Designated Qualified Opportunity Zone 11 189 0.0582
## 8 San Jose Not Designated 178 189 0.942
Which city has the highest proportion of Opportunity Zone neighborhoods? Lowest?
The functions we’ve gone through so far describe one variable. It is often the case that we are interested in understanding whether two variables are associated with one another.
Let’s go through the ways we can describe the association between: (1) two categorical variables; (2) one categorical variable and one numeric variable; and (3) two numeric variables.
To summarize the relationship between two categorical variables,
you’ll need to find the proportion of observations for each combination,
also known as a cross tabulation. Let’s create a cross tabulation of the
categorical variables oppzone and mhisp. We do this by
using both variables in the group_by()
command.
ncal.tracts %>%
group_by(oppzone, mhisp) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
freq = n / total)
## # A tibble: 4 × 5
## # Groups: oppzone [2]
## oppzone mhisp n total freq
## <chr> <chr> <int> <int> <dbl>
## 1 Designated Qualified Opportunity Zone Majority 16 75 0.213
## 2 Designated Qualified Opportunity Zone Not Majority 59 75 0.787
## 3 Not Designated Majority 47 524 0.0897
## 4 Not Designated Not Majority 477 524 0.910
A much higher proportion of Opportunity Zone neighborhoods are Majority Hispanic (0.213) compared to non Opportunity Zone neighborhoods (0.0897).
A typical way of summarizing the relationship between a categorical variable and a numeric variable is to take the mean of the numeric variable for each level of the categorical variable. We can get the mean median household income for neighborhoods designated and not designated as an Opportunity Zone using the following code.
ncal.tracts %>%
group_by(oppzone) %>%
summarize("Mean Income" = mean(medincome, na.rm = TRUE))
## # A tibble: 2 × 2
## oppzone `Mean Income`
## <chr> <dbl>
## 1 Designated Qualified Opportunity Zone 40694.
## 2 Not Designated 86003.
Let’s separate by city by adding city to the
group_by()
function.
ncal.tracts %>%
group_by(city, oppzone) %>%
summarize("Mean Income" = mean(medincome, na.rm = TRUE))
## # A tibble: 8 × 3
## # Groups: city [4]
## city oppzone `Mean Income`
## <chr> <chr> <dbl>
## 1 Oakland Designated Qualified Opportunity Zone 36980.
## 2 Oakland Not Designated 77380.
## 3 Sacramento Designated Qualified Opportunity Zone 33468.
## 4 Sacramento Not Designated 59606.
## 5 San Francisco Designated Qualified Opportunity Zone 53810
## 6 San Francisco Not Designated 93291.
## 7 San Jose Designated Qualified Opportunity Zone 53305.
## 8 San Jose Not Designated 93993.
You can summarize the relationship between two numeric variables with
the correlation coefficient. To calculate the correlation coefficient,
use the function cor()
. The first two arguments in
cor()
are the two numeric variables you want to calculate
the correlation for. Let’s calculate the correlation between
neighborhood income and percent race, and neighborhood loan-to-income
ratio and percent black. Note that the argument
use = "complete.obs"
removes the missing values in
medincome.
ncal.tracts %>%
summarize(blk_inc = cor(medincome,pblk, use = "complete.obs"))
## # A tibble: 1 × 1
## blk_inc
## <dbl>
## 1 -0.445
Group these correlations by city.
ncal.tracts %>%
group_by(city) %>%
summarize(blk_inc = cor(medincome,pblk, use = "complete.obs"))
## # A tibble: 4 × 2
## city blk_inc
## <chr> <dbl>
## 1 Oakland -0.487
## 2 Sacramento -0.490
## 3 San Francisco -0.340
## 4 San Jose -0.236
Make sure you understand what these values mean (see Handout 4).
The output from the descriptive statistics we’ve ran so far is not presentation ready. For example, taking a screenshot of the following results table produces unnecessary information that is confusing and messy.
ncal.tracts %>%
group_by(city) %>%
summarize(Mean = mean(medincome, na.rm = TRUE),
Median = median(medincome, na.rm = TRUE),
SD = sd(medincome, na.rm = TRUE),
IQR = IQR(medincome, na.rm = TRUE))
## # A tibble: 4 × 5
## city Mean Median SD IQR
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Oakland 66559. 51438. 41417. 40686.
## 2 Sacramento 53798. 49552 24415. 28390.
## 3 San Francisco 91052. 89184. 37913. 50444
## 4 San Jose 91805. 88299 34650. 48412.
Furthermore, you would like to show a table, say, in your final project that does not require you to take a screenshot, but instead can be produced via code, that way it can be fixed if there is an issue, and is reproducible.
One way of producing presentation tables in R is through the flextable package. First, before creating any table, run the following code to ensure that the tables you save will have a transparent or white background (the default is gray).
set_flextable_defaults(background.color = "white")
Next, you will need to save the tibble or data frame of results into an object. For example, let’s save the above results into an object named ncal.summary
ncal.summary <- ncal.tracts %>%
group_by(city) %>%
summarize(Mean = mean(medincome, na.rm = TRUE),
Median = median(medincome, na.rm = TRUE),
SD = sd(medincome, na.rm = TRUE),
IQR = IQR(medincome, na.rm = TRUE))
You then input the object into the function flextable()
.
Save it into an object called my_table
my_table <- flextable(ncal.summary)
If you type in my_table in the console, you should see a
relatively clean table pop up in the Viewer window. We can progressively
pipe the my_table object through flextable
formatting functions. For example, you can change the column header
names using the function set_header_labels()
and center the
header names using the function align()
my_table <- my_table %>%
set_header_labels(
city = "City",
Mean = "Mean",
Median = "Median",
SD = "Standard Deviation",
IQR = "IQR") %>%
colformat_double(digits = 1) %>%
align(align = "center")
my_table
City | Mean | Median | Standard Deviation | IQR |
---|---|---|---|---|
Oakland | 66,558.9 | 51,437.5 | 41,417.3 | 40,686.5 |
Sacramento | 53,797.9 | 49,552.0 | 24,414.8 | 28,390.5 |
San Francisco | 91,052.1 | 89,184.5 | 37,913.3 | 50,444.0 |
San Jose | 91,805.1 | 88,299.0 | 34,649.9 | 48,412.2 |
There are a slew of options for formatting your table, including adding footnotes, borders, shade and other features. Check out this useful tutorial for an explanation of some of these features.
Once you’re done formatting your table, you can then export it to
Word, PowerPoint or HTML or as an image (PNG) files. To do this, use one
of the following functions: save_as_docx()
,
save_as_pptx()
, save_as_image()
, and
save_as_html()
. For the final project, you will likely be
saving your tables as images. Before saving, make sure your working
diirectory is set to the appropriate folder. Then use the
save_as_image()
function
save_as_image(my_table, path = "ncal_income.png")
You first put in the table my_table, and set the file name
with the png extension. Check your working directory (type in
getwd()
). You should see the file ncal_income.png
in the folder.
Another way of summarizing variables and their relationships is
through graphs and charts. The main package for R graphing is
ggplot2 which is a part of the
tidyverse package. The graphing function is
ggplot()
and it takes on the basic template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(x, y)) +
<OPTIONS>()
ggplot()
is the base function where you specify your
dataset using the data = <DATA>
argument.
You then need to build on this base by using the plus operator
+
and <GEOM_FUNCTION>()
where
<GEOM_FUNCTION>()
is a unique geom
function indicating the type of graph you want to plot. Each unique
function has its unique set of mapping arguments which you specify using
the mapping = aes()
argument. Charts and graphs have an
x-axis, y-axis, or both. Check this
ggplot cheat sheet for all possible geoms.
<OPTIONS>()
are a set of functions you can
specify to change the look of the graph, for example relabelling the
axes or adding a title.
The basic idea is that a ggplot graphic layers geometric objects (circles, lines, etc), themes, and scales on top of data.
You first start out with the base layer. It represents the empty
ggplot layer defined by the ggplot()
function with the data object whose variable(s) you want to graph.
ggplot(ncal.tracts)
We get an empty plot. We haven’t told ggplot()
what type
of geometric object(s) we want to plot, nor how the variables should be
mapped to the geometric objects, so we just have a blank plot. We need
geoms
to paint the blank canvas.
From here, we add a geom
layer to the
ggplot object. Layers are added to
ggplot objects using +
, instead of
%>%
, since you are not explicitly piping an object into
each subsequent layer, but adding layers on top of one another. Each
geom
is associated with a specific type of graph. For
example, below is code that creates a histogram
ggplot(ncal.tracts) +
geom_histogram(mapping = aes(x=medincome))
ncal.tracts is <DATA>
,
geom_histogram()
is the
<GEOM_FUNCTION>()
, and x=medincome
is
the variable in ncal.tracts we are graphing. There is no
y argument specified because a histogram only plots one
variable. Let’s go through how to create the graphs outlined in Handout
4.
Recall from Handout 4 that we use bar charts to summarize categorical
variables. Bar charts, also known as bar plots, show either the number
or frequency of each category. To create a bar chart, use
geom_bar()
for <GEOM_FUNCTION>()
. Let’s
show a bar chart of oppzone. We can borrow from the code we
used earlier to create our oppzone frequency table and pipe
this table directly into ggplot()
.
ncal.tracts %>%
group_by(oppzone) %>%
summarize (n = n()) %>%
mutate(total = sum(n),
freq = n / total) %>%
ggplot() +
geom_bar(mapping=aes(x=oppzone, y=freq),stat="identity")
We didn’t need to specify data = <DATA>
in
ggplot()
because it was piped in. Within
aes()
, we specified the categorical variable
oppzone on the x-axis and then the proportion of neighborhoods
freq on the y-axis. The argument stat = "identity"
tells ggplot()
to plot the exact value listed for the
variable freq.
The X and Y axes labels are not so great. Interpretable labels are
important for getting your message across clearly. We can relabel the
axes using the xlab()
and ylab()
functions,
which are examples of <OPTIONS>()
functions.
ncal.tracts %>%
group_by(oppzone) %>%
summarize (n = n()) %>%
mutate(total = sum(n),
freq = n / total) %>%
ggplot() +
geom_bar(mapping=aes(x=oppzone, y=freq),stat="identity") +
xlab("Opportunity Zone") +
ylab("Proportion")
Histograms are used to summarize a single numeric variable. To create
a histogram, use geom_histogram()
for
<GEOM_FUNCTION()>
. Let’s create a histogram of median
household income.
ggplot(ncal.tracts) +
geom_histogram(mapping = aes(x=medincome)) +
xlab("Median Household Income")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite outside the scale range
## (`stat_bin()`).
As described earlier, because a single variable is plotted on the
x-axis, we specify x =
in aes()
but not a
y =
. The message before the plot tells us that we can use
the bins =
argument to change the number of bins used to
produce the histogram. You can increase the number of bins to make the
bins narrower and thus get a finer grain of detail. Or you can decrease
the number of bins to get a broader visual summary of the shape of the
variable’s distribution. Try changing the number of bins and see what
you get.
We can use a boxplot to visually summarize the distribution of a
single variable or the relationship between a categorical and numeric
variable. Use geom_boxplot()
for
<GEOM_FUNCTION()>
to create a boxplot. Let’s examine
median household income.
ggplot(ncal.tracts) +
geom_boxplot(mapping = aes(y = medincome))+
ylab("Median Household Income")
Remember from Handout 4 that the points outside the whiskers represent outliers. Outliers are defined as having values that are either larger than the 75th percentile plus 1.5 times the IQR or smaller than the 25th percentile minus 1.5 times the IQR. The IQR is $55,103, the 75th percentile is $104,670 and the 25th percentile is $49,568. While we don’t see outliers at the bottom, we do see outliers at the top - these are neighborhoods with median income values greater than $104,670 + 1.5*$55,103 = $187,324.5
Let’s examine the distribution of median income by Opportunity Zone. Because we are examining the association between two variables, we need to specify x and y variables.
ggplot(ncal.tracts) +
geom_boxplot(mapping = aes(x = oppzone, y = medincome)) +
xlab("Opportunity Zone") +
ylab("Median Household Income")
The boxplot is for all neighborhoods combined. We can use the
facet_wrap()
function to separate by city
ggplot(ncal.tracts) +
geom_boxplot(mapping = aes(x = oppzone, y = medincome)) +
xlab("Opportunity Zone") +
ylab("Median Household Income") +
facet_wrap(~city)
Note the tilde operator ~
before city.
The labels for oppzone are really long. We can change the
label names (as an exercise, try this on your own) or we can create
horizontal boxplots. To create horizontal boxplots, add the
coord_flip()
function at the end.
ggplot(ncal.tracts) +
geom_boxplot(mapping = aes(x = oppzone, y = medincome)) +
facet_wrap(~city) +
ylab("Median Household Income") +
xlab("Opportunity Zone") +
coord_flip()
The scatterplot is the traditional graph for visualizing the
association between two numeric variables. For scatterplots, we use
geom_point()
for <GEOM_FUNCTION>()
.
Because we are plotting two variables, we specify an x and
y variable. Does median household income change with greater
percent Hispanic in the neighborhood?
ggplot(ncal.tracts) +
geom_point(mapping = aes(x = phisp, y = medincome)) +
xlab("Percent Hispanic") +
ylab("Median Household Income")
And for each city?
ggplot(ncal.tracts) +
geom_point(mapping = aes(x = phisp, y = medincome)) +
xlab("Percent Hispanic") +
ylab("Median Household Income") +
facet_wrap(~city)
What do these scatter plots suggest about the relationship between income and percent Hispanic across these four cities?
ggplot()
is a powerful function, and you can make a lot
of visually captivating graphs. We have just scratched the surface of
its functions and features. The list of all possible plots for
<GEOM_FUNCTION>()
can be found here. You can also
make your graphs really “pretty” and professional looking by altering
graphing features using <OPTIONS()
, including colors,
labels, titles and axes. For a list of options that alter various
features of a graph, check out Chapter 28
in RDS.
Here’s your ggplot2 badge. Wear it with pride!
You will, on occasion, need to save a plot to a specific file.
Specifically, we expect you to create plots and graphs, save them, and
upload them for your final project. Don’t use the built-in “Export”
button! If you do, your figure is not reproducible – no one will know
how your plot was exported. Instead, use ggsave()
by
explicitly creating the figure and exporting
Let’s save the scatterplot of percent Hispanic and median household
income as a .png file named “phisp_inc.png”. First, we save the plot
produced by ggplot()
into an R object named
phisp_inc
phisp_inc <- ggplot(ncal.tracts) +
geom_point(mapping = aes(x = phisp, y = medincome)) +
xlab("Percent Hispanic") +
ylab("Median Household Income")
We then save phisp_inc using ggsave()
ggsave("phisp_inc.png", phisp_inc)
Navigate to your working directory folder (type in
getwd()
to find what this folder is) and you should see
phisp_inc.png.
This
work is licensed under a
Creative
Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil