In this guide you will learn the fundamentals of the statistical software program R. Because R is not a prerequisite for the class, this guide assumes no background in the language. The objectives of the guide are as follows:
This lab guide follows closely and supplements the material presented in Chapters 2, 4, and 20 in the textbook R for Data Science (RDS).
Assignment 1 is due by 10:00 am, January 17th on Canvas. See here for assignment guidelines. You must submit an .Rmd file and its associated .html file. Name the files: yourLastName_firstInitial_asgn01. For example: brazil_n_asgn01
Lab guides are self-contained and self-guided. The expectation for each guide is to get through all of it either on your own or collaboratively. You do not need to turn in lab guides. However, it is important that you do not skip guides, as each lab builds on the material in the previous ones.
R is a free, open-source statistical programming language and computing environment for Windows, Macintosh, UNIX, and Linux platforms. It is useful for data cleaning, analysis, and visualization. R is an interpreted language, not a compiled one. This means that you type a command into R and it runs it right away, with no separate compilation step. It is both command-line software and a programming environment. Because it is open source, users are free to distribute, study, change, and improve the software.
R can be downloaded from one of the “CRAN” (Comprehensive R Archive Network) sites. In the US, the main site is at http://cran.us.r-project.org/. Look in the “Download and Install R” area. Click on the appropriate link based on your operating system.
If you already have R on your computer, make sure you have the most updated version of R on your personal computer (4.3.2 “Eye Holes”).
On the “R for macOS” page, there are multiple packages that could be downloaded. If you have a Mac with Apple Silicon, click on R-4.3.2-arm64.pkg; if you don’t have Silicon and are running on macOS 11 (Big Sur) or higher, click on R-4.3.2-x86_64.pkg; if you are running an earlier version of OS X, download the appropriate version listed under “Binary for legacy macOS/OS X systems.”
After the package finishes downloading, locate the installer on your hard drive, double-click on the installer package, and after a few screens, select a destination for the installation of the R framework (the program) and the R.app GUI. Note that you will have to supply your computer’s Administrator’s password. Close the window when the installation is done.
An application will appear in the Applications folder: R.app.
Browse to the XQuartz download page. Click on the most recent version of XQuartz to download the application.
Run the XQuartz installer. XQuartz is needed to create windows to display many types of R graphics. It used to be included in macOS until version 10.8, but now must be downloaded separately.
On the “R for Windows” page, click on the “base” link, which should take you to the “R-4.3.2 for Windows” page.
On this page, click “Download R 4.3.2 for Windows”, and save the exe file to your hard disk when prompted. Saving to the desktop is fine.
To begin the installation, double-click on the downloaded file. Don’t be alarmed if you get unknown publisher type warnings. Windows’ User Account Control will also worry about an unidentified program wanting access to your computer. Click on “Run” or “Yes”.
Select the proposed options in each part of the install dialog. When the “Select Components” screen appears, just accept the standard choices (the default). For Startup options, keep the default.
Note: Older installers (before R 4.2.0) placed both a 64-bit and a 32-bit build of R on Windows machines, and you could run either. R 4.3.2 for Windows is 64-bit only, so the installer sets up a single version and there is nothing to choose between.
If you click on the R program you just downloaded, you will find a very basic user interface. For example, below is what I get on a Mac
We will not use R’s direct interface to run analyses. Instead, we will use the program RStudio. RStudio gives you a true integrated development environment (IDE), where you can write code in a window, see results in other windows, see locations of files, see objects you’ve created, and so on. To clarify which is which: R is the name of the programming language itself and RStudio is a convenient interface.
To download and install RStudio, follow the directions below
Note that the most recent version of RStudio works only for certain operating systems (OS). If you have an older OS, you will need to download an older version of RStudio, which you can find here.
Open up RStudio. It may ask you to connect to R, which you’ve already downloaded. You should see the interface shown in the figure below which has three windows.
The > prompt in the console is an invitation from R to enter its world. This is where you type code in, press enter to execute the code, and see the results.
There is actually a fourth window, but we’ll get to it a little later (if you read the assignment guidelines you already know what this fourth window is).
For more information on each tab, check this resource.
While not required, I strongly suggest that you change preferences in RStudio to never save the workspace so you always open with a clean environment. See Ch. 8.1 of RDS for some more background.
The reason for making these changes is that it is preferable for reproducibility to start each R session with a clean environment. You can restore a previous environment either by rerunning code or by manually loading a previously saved session.
The RStudio environment is modified when you execute code from files or from the console. If you always start fresh, you do not need to be concerned about things not working because of something you typed in the console but did not save in a file.
You only need to set these preferences once.
Let’s now explore what R can do. R is really just a big fancy calculator. For example, type in the following mathematical expression next to the > in the R console (left window)
1+1
Note that spacing does not matter: 1+1 will generate the same answer as 1 + 1. Can you say hello to the world?
hello world
## Error: <text>:1:7: unexpected symbol
## 1: hello world
## ^
Nope. What is the problem here? We need to put quotes around it.
"hello world"
## [1] "hello world"
“hello world” is a character value, and R recognizes character values only if they have quotes around them. This brings us to the topic of basic data types in R. There are four basic data types in R: character, logical, numeric, and factor (there are two others - complex and raw - but we won’t cover them because they are rarely used).
Characters are used to represent words or letters in R. We saw this above with “hello world”. Character values are also known as strings. You might think that the value "1" is a number. Well, with quotes around it, it isn’t! Anything with quotes will be interpreted as a character. No ifs, ands or buts about it.
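To see the difference for yourself, here is a minimal sketch. It uses typeof(), a command covered later in this guide, to ask R how it stores each value.

```r
# typeof() reports how R stores a value
typeof(1)     # "double": a number
typeof("1")   # "character": the quotes make it text

# Numbers can be added. The commented-out line would fail with a
# "non-numeric argument to binary operator" error.
1 + 1
# "1" + 1
```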
A logical takes on two values: FALSE or TRUE. Logicals are usually constructed with comparison operators, which we’ll go through more carefully in Lab 2. Think of a logical as the answer to a question like “Is this value greater than (lower than/equal to) this other value?” The answer will be either TRUE or FALSE. TRUE and FALSE are logical values in R. For example, typing in the following
3 > 2
## [1] TRUE
gives us a true. What about the following?
"prof visser" == "prof cannon"
## [1] FALSE
Numerics are separated into two types: integer and double. The distinction between integers and doubles is usually not important. R treats numerics as doubles by default because it is a less restrictive data type. You can do any mathematical operation on numeric values. We added one and one above. We can also multiply using the * operator
2*3
## [1] 6
Divide
4/2
## [1] 2
And even take the logarithm!
log(1)
## [1] 0
log(0)
## [1] -Inf
Uh oh. What is -Inf? Well, the logarithm of 0 is undefined: as a number gets closer to 0, its logarithm heads toward negative infinity, so R returns the special value -Inf rather than an ordinary number. The value -Inf is one of several special values that you can get in R. We’ll go through this and other weirdo values in Lab 2.
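If you are curious before Lab 2, the short sketch below pokes at the result of log(0). The functions is.numeric() and is.finite() are base R checks; nothing here is needed for the rest of the guide.

```r
# log(0) returns negative infinity, which R still stores as a numeric value
is.numeric(log(0))   # TRUE
is.finite(log(0))    # FALSE: the value is not a finite number
log(0) == -Inf       # TRUE

# Dividing a positive number by zero gives the positive counterpart
1 / 0                # Inf
```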
Think of a factor as a categorical variable. It is sort of like a character, but not really. It is actually a numeric code with character-valued levels. Think of a character as a true string and a factor as a set of categories represented as characters. We won’t use factors too much in this course.
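The idea of a numeric code with character-valued levels is easier to see in a small example. This is only a sketch: the survey answers and the object names answers and answers_f are made up for illustration, and the <- arrow and c() command used here are explained further down in this guide.

```r
# Hypothetical survey answers stored as plain character values
answers <- c("low", "high", "low", "medium")

# factor() turns them into categories
answers_f <- factor(answers)

levels(answers_f)      # the categories, sorted alphabetically: "high" "low" "medium"
as.integer(answers_f)  # the numeric code behind each value: 2 1 2 3
table(answers_f)       # how many values fall in each category
```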
You learned that R has four basic data types. Now, let’s go through how we can store data in R. That is, you type in the character “hello world” or the number 3, and you want to store these values. You do this by using R’s various data structures.
A vector is the most common and basic R data structure and is pretty much the workhorse of the language. A vector is simply a sequence of values which can be of any data type but all of the same type. There are a number of ways to create a vector depending on the data type, but the most common is to insert the data you want to save in a vector into the command c(). For example, to save the values 4, 16 and 9 in a vector type in
c(4, 16, 9)
## [1] 4 16 9
You can also have a vector of character values
c("martin", "anne", "clare")
## [1] "martin" "anne" "clare"
The above code does not actually “save” the values 4, 16, and 9 - it just presents them on the screen in a vector. If you want to use these values again without having to type out c(4, 16, 9), you can save them in a data object. At the heart of almost everything you will do (or are ever likely to do) in R is the concept that everything in R is an object. These objects can be almost anything, from a single number or character string (like a word) to highly complex structures like the output of a plot, a map, or a summary of your statistical analysis.
You assign data to an object using the arrow sign <-. This will create an object in R’s memory that can be called back into the command window at any time. For example, you can save “hello world” to a vector called b by typing in
b <- "hello world"
b
## [1] "hello world"
You can pronounce the above as “b becomes ‘hello world’”.
Similarly, you can save the numbers 4, 16 and 9 into a vector called v1
v1 <- c(4, 16, 9)
v1
## [1] 4 16 9
You should see the objects b and v1 pop up in the Environment tab on the top right window of your RStudio interface.
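Once values are saved in an object, you can reuse them by name. The sketch below recreates v1 so it runs on its own and applies a few base R operations to it. The choice of sum() and sqrt() is just for illustration.

```r
v1 <- c(4, 16, 9)

# Functions are applied to the stored values
sum(v1)     # 29
sqrt(v1)    # 2 4 3

# Arithmetic is applied to each element in turn
v1 * 2      # 8 32 18

# The original object is unchanged unless you reassign it
v1          # 4 16 9
```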
Note that the name v1 is nothing special here. You could have named the object x or crd150 or your pet’s name (mine was charlie). You can’t, however, name objects using special characters (e.g. !, @, $) or only numbers (although you can combine numbers and letters, but a number cannot be at the beginning e.g. 2d2). For example, you’ll get an error if you save the vector c(4,16,9) to an object with the following names
123 <- c(4, 16, 9)
!!! <- c(4, 16, 9)
## Error: <text>:2:5: unexpected assignment
## 1: 123 <- c(4, 16, 9)
## 2: !!! <-
## ^
Also note that to distinguish a character value from a variable name, it needs to be quoted. “v1” is a character value whereas v1 is a variable. One of the most common mistakes for beginners is to forget the quotes.
brazil
## Error in eval(expr, envir, enclos): object 'brazil' not found
The error occurs because R tries to print the value of the object brazil, but there is no such object. So remember that any time you get the error message object 'something' not found, the most likely reason is that you forgot to quote a character value. If not, it probably means that you have misspelled, or not yet created, the object that you are referring to. I’ve included the common pitfalls and R tips in this class resource.
Every vector has two key properties: type and length. The type property indicates the data type that the vector is holding. Use the command typeof() to determine the type
typeof(b)
## [1] "character"
typeof(v1)
## [1] "double"
Note that a vector cannot hold values of different types. If different data types exist, R will coerce the values into the highest type based on its internal hierarchy: logical < integer < double < character. Type in test <- c("r", 6, TRUE) in your R console. What is the vector type of test?
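To watch the hierarchy at work on other examples, here is a small sketch. The values are arbitrary, and the L suffix (as in 2L) is simply how you ask R for an integer rather than a double.

```r
# Each vector mixes two types; R keeps the higher one in the hierarchy
typeof(c(TRUE, 2L))    # "integer": the logical becomes 1L
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "a"))    # "character"

# The stored values change along with the type
c(TRUE, 2L)            # 1 2
c(3.5, "a")            # "3.5" "a"
```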
The command length() determines the number of data values that the vector is storing
length(b)
## [1] 1
length(v1)
## [1] 3
You can also directly determine if a vector is of a specific data type by using the command is.X() where you replace X with the data type. For example, to find out if b and v1 are numeric, type in
is.numeric(b)
## [1] FALSE
is.numeric(v1)
## [1] TRUE
There is also is.logical(), is.character(), and is.factor(). You can also coerce a vector of one data type to another. For example, save the value “1” and “2” (both in quotes) into a vector named x1
x1 <- c("1", "2")
typeof(x1)
## [1] "character"
To convert x1 into a numeric, use the command as.numeric()
x2 <- as.numeric(x1)
typeof(x2)
## [1] "double"
There is also as.logical(), as.character(), and as.factor().
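Coercion only works when R can make sense of the values. As a hedged illustration (the vector x_text and its contents are made up), here is what happens when one of the character values is not a number:

```r
# Hypothetical responses typed in as text
x_text <- c("1", "2", "three")

# "three" cannot be read as a number, so it becomes NA and R prints a warning
x_num <- as.numeric(x_text)
x_num           # 1 2 NA
is.na(x_num)    # FALSE FALSE TRUE
```

The NA is R’s way of marking a missing value, so a warning like this is a cue to go back and check the original data.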
An important practice you should adopt early is to keep only necessary objects in your current R Environment. For example, we will not be using x2 any longer in this guide. To remove this object from R forever, use the command rm()
rm(x2)
The vector x2 should have disappeared from the Environment tab. Bye bye!
Also note that when you close down RStudio, the objects you created above will disappear for good. Unless you save them onto your hard drive (we’ll touch on saving data in Lab 2), all data objects you create in your current R session will go bye bye when you exit the program.
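You can also check what is in your environment from the console instead of looking at the Environment tab. This sketch uses the base R commands ls() and exists(); the object temp_value is a throwaway name chosen for illustration.

```r
# A throwaway object
temp_value <- 10

exists("temp_value")   # TRUE: the object is in the environment
ls()                   # lists the names of all objects you have created

rm(temp_value)
exists("temp_value")   # FALSE: it is gone
```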
We learned that data values can be stored in data structures known as vectors. The next step is to learn how to store vectors into an even higher level data structure. The data frame can do this. Data frames store vectors of the same length. Create a vector called v2 storing the values 5, 12, and 25
v2 <- c(5,12,25)
We can create a data frame using the command data.frame() storing the vectors v1 and v2 as columns
data.frame(v1, v2)
## v1 v2
## 1 4 5
## 2 16 12
## 3 9 25
Store this data frame in an object called df1
df1<-data.frame(v1, v2)
df1 should pop up in your Environment window. You’ll notice a small icon next to df1. This tells you that df1 possesses or holds more than one object. Click on the icon and you’ll see the two vectors we saved into df1. Another neat thing you can do is directly click on df1 from the Environment window to bring up an Excel style worksheet on the top left window of your RStudio interface. You can also type in
View(df1)
to bring the worksheet up. You can’t edit this worksheet directly, but it allows you to see the values that a higher level R data object contains.
We can store different types of vectors in a data frame. For example, we can save the numeric vector v1 with a character vector v3.
v3 <- c("martin", "anne", "clare")
df2 <- data.frame(v1, v3)
df2
For higher level data structures like a data frame, use the function class() to figure out what kind of object you’re working with.
class(df2)
## [1] "data.frame"
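To confirm that each column kept its own data type, you can ask for the class of every column. This is a sketch that rebuilds df2 so it runs on its own. It assumes R 4.0.0 or later, where data.frame() leaves character vectors as characters instead of converting them to factors.

```r
v1 <- c(4, 16, 9)
v3 <- c("martin", "anne", "clare")
df2 <- data.frame(v1, v3)

# class() applied to each column, one result per column
sapply(df2, class)
# In R 4.0.0 and later: v1 is "numeric" and v3 is "character"

# str() prints a compact summary of the whole data frame
str(df2)
```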
The function length() is not very useful on a data frame: it returns the number of columns rather than the number of values. Instead, a data frame has dimensions - the number of rows and columns. You can find the number of rows in a data frame using nrow()
nrow(df1)
## [1] 3
The number of columns using ncol()
ncol(df1)
## [1] 2
and the number of rows and columns by using the command dim()
dim(df1)
## [1] 3 2
Here, the data frame df1 has 3 rows and 2 columns. Data frames also have column names, which are characters.
colnames(df1)
## [1] "v1" "v2"
In this case, the data frame used the vector names for the column names.
We can extract columns from data frames by referring to their names using the $ sign.
df1$v1
## [1] 4 16 9
We can also extract data from data frames using brackets [ , ]
df1[,1]
## [1] 4 16 9
The value before the comma indicates the row, which you leave empty if you are not selecting by row. The value after the comma indicates the column, which you leave empty if you are not selecting by column. The above line of code selected the first column. Let’s select the 2nd row.
df1[2,]
## v1 v2
## 2 16 12
What is the value in the 2nd row and 1st column?
df1[2,1]
## [1] 16
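The brackets accept more than single numbers. The sketch below rebuilds df1 so it runs on its own and shows a few other selections that base R allows: a column name in quotes, several rows at once, and a logical condition.

```r
v1 <- c(4, 16, 9)
v2 <- c(5, 12, 25)
df1 <- data.frame(v1, v2)

# Select a column by name inside the brackets
df1[, "v2"]          # 5 12 25

# Select several rows at once
df1[c(1, 3), ]       # rows 1 and 3, both columns

# Select the rows that meet a condition: v1 greater than 5
df1[df1$v1 > 5, ]    # rows 2 and 3
```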
We’ve been talking about the values in vectors and data frames rather abstractly. In practice, values, vectors and data frames have specific meaning in the context of data analysis. Let’s make things concrete. Take a look at this website showing crimes in California cities in 2016. Sacramento had 3,549 violent crime incidents. This is a data value (numeric!). The collection of violent crime counts for each city is a vector. The data frame has California cities as rows and the population, violent crime, homicide, and so on as columns. You learned about these elements in Handout 1. Now you see them in action in the R environment.
Let’s take a step back and talk about functions (also known as commands). An R function is a packaged recipe that converts one or more inputs (called arguments) into a single output. You execute most of your tasks in R using functions. We have already used a couple of functions above including typeof() and colnames(). Every function in R will have the following basic format
functionName(arg1 = val1, arg2 = val2, ...)
In R, you type in the function’s name and set a number of options or parameters within parentheses that are separated by commas. Some options need to be set by the user - i.e. the function will spit out an error because a required option is blank - whereas others can be set but are not required because there is a default value established.
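The base R function round() is a handy illustration of required versus optional arguments, so here is a small sketch using it. The number being rounded is arbitrary, and tryCatch() appears only so the example keeps running after the error.

```r
# round() requires a value to round; digits is optional and defaults to 0
round(3.14159)               # 3
round(3.14159, digits = 2)   # 3.14

# Leaving out the required value is an error.
# tryCatch() catches it and returns a short message instead of stopping.
tryCatch(round(), error = function(e) "round() needs a value to round")
```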
Let’s use the function seq() which makes regular sequences of numbers. You can find out what a function does and its options by calling up its help documentation by typing ? and the function name. The documentation should also provide some examples of the function at the bottom of the page.
? seq
The help documentation should have popped up in the bottom right window of your RStudio interface. The function contains from, to, by, and other arguments. Under the Arguments section you can find what each of these parameters means.
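There are a couple of other ways to get at the same information from the console. The sketch below is one option, not the only one: help() is the long form of ?, and formals() lists the arguments of seq.default(), the version of seq() that R uses for ordinary numbers.

```r
# help("seq") opens the same page as ? seq.
# It is wrapped in interactive() so nothing opens when run as a script.
if (interactive()) help("seq")

# List the argument names of the default seq() method
names(formals(seq.default))
# includes "from", "to", "by", "length.out", and "along.with"

# To run the examples from the bottom of the help page, uncomment:
# example(seq)
```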
The description of the arguments from and to is “the starting and (maximal) end values of the sequence. Of length 1 unless just from is supplied as an unnamed argument.” Type the arguments from = 1, to = 10 inside the parentheses of seq()
seq(from = 1, to = 10)
## [1] 1 2 3 4 5 6 7 8 9 10
You should get the same result if you type in
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
The code above demonstrates something about how R resolves function arguments. When you use a function, you can always specify all the arguments in arg = value form. But if you do not, R attempts to resolve by position. So in the code seq(1, 10), it is assumed that we want a sequence from = 1 that goes to = 10 because we typed 1 before 10. Type in 10 before 1 and see what happens.
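Naming arguments also frees you from worrying about their order. Here is a brief sketch with seq(); the particular numbers are arbitrary, and by and length.out are two of the other arguments listed in the help documentation.

```r
# Named arguments can be given in any order
seq(to = 10, from = 1)                  # 1 2 3 4 5 6 7 8 9 10

# Positional and named arguments can be mixed
seq(1, 10, by = 3)                      # 1 4 7 10

# length.out asks for a number of evenly spaced values instead of a step size
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```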
Each argument requires a certain data type. For example, you’ll get an error when you use a character in seq()
seq("p", "w")
## Error in seq.default("p", "w"): 'from' must be a finite number
Although the lab guides and course textbooks should get you through a lot of the functions that are needed to successfully accomplish tasks for this class, you will need to rely on the help documentation to better understand how functions work. There are also a number of useful online resources on R and RStudio that you can look into if you get stuck or want to learn more. We outline these resources here. If you ever get stuck, check this resource out first to troubleshoot before immediately asking a friend or the instructor/TA.
In running the few lines of code above, we’ve asked you to work directly in the R Console and issue commands in an interactive way. That is, you type a command after >, you hit enter/return, R responds, you type the next command, hit enter/return, R responds, and so on. As described in Handout 1, instead of writing the command directly into the console, you should write it in a script. The process is now: Type your command in the script. Run the code from the script. R responds. You get results. You can write two commands in a script. Run both simultaneously. R responds. You get results. This is the basic flow. In your homework assignments, we will be asking you to submit code in an R Markdown file. R Markdown allows you to create documents that serve as a neat record of your analysis. Think of it as a Word document, but instead of sentences in an essay, you are writing code for a data analysis.
Rather than copying and pasting code from the lab guides into the R Console as you’ve been doing up to this point, type it into an R Markdown file and then run the code from there. Even though you do not need to turn in the labs, running the lab code in your own R Markdown file will give you practice for your assignments. Plus, the code is in your document, so you can add explanatory text or supplement the guide’s code with your own code.
Just like for each assignment, I will provide an R Markdown template for each lab. Download the R Markdown Lab template into an appropriate folder on your hard drive. It’s best to set up on your hard drive a clean and efficient file management structure for this class as described in the assignment guidelines. For example, below is where I would save Lab 1’s R Markdown file on my Mac laptop (I named the file “Lab 1”).
When you knit this R Markdown file, the resulting html file will be located in this folder.
Open the file in RStudio by clicking on File from the top menu, click on Open File, navigate to your Lab 1 folder, and click on the Lab 1 R Markdown file you downloaded. Once you do this, if one isn’t already open, a fourth window should pop up in the top left showing you an R Markdown file.
In this file, change the title to “Lab 1” and insert your name and date. Don’t change anything else inside the YAML (the stuff at the top in between the ---). Also don’t change the following chunk.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message = FALSE)
```
All your code should go inside areas designated as
```{r}
#Type your code here
```
For example, you would write the code 1+1 as
```{r}
1+1
```
From the file, run the code. R responds. You get results.
Now is a good time to read through the class assignment guidelines as they go through the basics of R Markdown files. Go through this guide carefully as you will need to submit all your homework assignments using R Markdown.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil