Introduction to R

Intro to basics

Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.

Creating objects in R

You can get output from R simply by typing math in the console:

5 + 4
[1] 9
15/3
[1] 5

To perform useful tasks, we often need to assign values to objects in R. To create an object, we specify its name followed by the assignment operator <-, and then provide the desired value:

weight_lb <- 110
height_cm = 167

The assignment operator <- assigns values on the right to objects on the left. For instance, executing x <- 5 results in the value of x being set to 5. The = operator can also be used to assign a value to an object, as in height_cm = 167 above.

In RStudio, on a PC, pressing Alt + - simultaneously will produce <- in a single keystroke. On a Mac, pressing Option + - simultaneously achieves the same result.

Objects vs. variables

What R calls objects are known as variables in many other programming languages. The terms “object” and “variable” can carry distinct meanings in some contexts, but in this lesson they are used interchangeably.

Once you have created an object, you can do arithmetic with it:

height_cm/100
[1] 1.67

You can also assign a new value to the object:

height_cm = 176

We can create a new object from another object. Creating the new object doesn’t change the value of the original object:

height_m = height_cm/100
print(height_m)
[1] 1.76
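
A minimal sketch of the point above, reusing the lesson's height_cm and height_m names (the value 180 is made up for illustration). It also shows that the new object stores a computed value, so it is not updated when the original object changes later:

```r
height_cm <- 176
height_m <- height_cm / 100   # create a new object from height_cm

print(height_cm)              # height_cm is unchanged: 176
print(height_m)               # 1.76

height_cm <- 180              # reassigning height_cm later...
print(height_m)               # ...does not update height_m, which is still 1.76
```

To update height_m, the calculation has to be run again.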

Saving your code

So far, your code has been executed directly in the console, which is convenient for quick queries but less helpful for revisiting your work later on. To keep a record of your code, write it in a script file. To create one, press Ctrl + Shift + N (Command + Shift + N on a Mac). Once your script is open, save it right away by pressing Ctrl + S (Command + S on a Mac), which opens a dialogue box where you can choose where to save your script and give it a name. The .R file extension is added automatically, ensuring the file opens correctly in RStudio.

Remember to save your work regularly by pressing Ctrl + S.

Comments

In R, the comment character is #. Anything written to the right of a # in a script is disregarded by R, making it helpful for leaving notes and explanations. RStudio offers a keyboard shortcut for commenting or uncommenting several lines at once: after selecting the lines you want to comment, press Ctrl + Shift + C (Command + Shift + C on a Mac). If you only need to comment out one line, you can position the cursor anywhere on that line and press the same shortcut.
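
As a small illustration (the values are arbitrary), the sketch below shows a full-line comment, a trailing comment, and a commented-out line of code:

```r
# A full-line comment: R ignores this line entirely
weight_lb <- 110          # a trailing comment: R runs the code to its left

# weight_lb <- 120        # commented-out code is not executed
print(weight_lb)          # prints 110
```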

Arithmetic with R

In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:

Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%

The last two might need some explaining:

  • The ^ operator raises the number to its left to the power of the number to its right: for example 3^2 is 9.
  • The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 is 2.

With this knowledge, follow the instructions to complete the exercise.

Instructions
  • Type 2^5 in the editor to calculate 2 to the power 5.
  • Type 28 %% 6 to calculate 28 modulo 6.
  • Have a look at the R output in the console.
  • Note how the # symbol is used to add comments on the R code.
# An addition
5 + 5 
[1] 10
# A subtraction
5 - 5 
[1] 0
# A multiplication
3 * 5
[1] 15
# A division
(5 + 5) / 2 
[1] 5
# Exponentiation
2^5
[1] 32
# Modulo
28 %% 6
[1] 4
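
One common use of the modulo operator is testing whether a number is divisible by another. A minimal sketch, using the == comparison that appears later in this lesson (the names remainder, is_even_28 and is_even_27 are made up for illustration):

```r
remainder <- 28 %% 6          # remainder after dividing 28 by 6
print(remainder)              # 4

is_even_28 <- 28 %% 2 == 0    # a remainder of 0 means 28 is divisible by 2
is_even_27 <- 27 %% 2 == 0
print(is_even_28)             # TRUE
print(is_even_27)             # FALSE
```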

Variable assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign a value 4 to a variable my_var with the command

my_var <- 4

Instructions
  • Over to you: complete the code in the editor so that it assigns the value 42 to the variable x. Notice that when you ask R to print x, the value 42 appears.
# Assign the value 42 to x
x = 42

# Print out the value of the variable x
print(x)
[1] 42

Variable assignment (2)

Suppose you have a fruit basket with five apples. As a data analyst in training, you want to store the number of apples in a variable with the name my_apples.

Instructions
  • Type the code shown below in the editor.
  • Look at the output: you see that the number 5 is printed. So R now links the variable my_apples to the value 5.
# Assign the value 5 to the variable my_apples
my_apples = 5

# Print out the value of the variable my_apples
print(my_apples)
[1] 5

Variable assignment (3)

Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable my_oranges and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:

my_apples + my_oranges

Instructions
  • Assign to my_oranges the value 6.
  • Add the variables my_apples and my_oranges and have R simply print the result.
  • Assign the result of adding my_apples and my_oranges to a new variable my_fruit.
# Assign a value to the variables my_apples and my_oranges
my_apples = 5
my_oranges = 6

# Add these two variables together
print(my_apples + my_oranges)
[1] 11
# Create the variable my_fruit
my_fruit = my_apples + my_oranges

Apples and oranges

Common knowledge tells you not to add apples and oranges. But hey, that is what you just did, no :-)? The my_apples and my_oranges variables both contained a number in the previous exercise. The + operator works with numeric variables in R. If you really tried to add “apples” and “oranges”, and assigned a text value to the variable my_oranges (see the editor), you would be trying to assign the addition of a numeric and a character variable to the variable my_fruit. This is not possible.

Instructions
  • Submit the answer and read the error message. Make sure to understand why this did not work.
# Assign a value to the variable my_apples
my_apples <- 5  

# Assign a value to the variable  my_oranges
my_oranges <- "six"

# Create the variable my_fruit and print it out
my_fruit <- my_apples + my_oranges 
my_fruit
  • Adjust the code so that R knows you have 6 oranges and thus a fruit basket with 11 pieces of fruit.
# Assign a value to the variable my_apples
my_apples <- 5  

# Fix the assignment of my_oranges
my_oranges <- 6

# Create the variable my_fruit and print it out
my_fruit <- my_apples + my_oranges 
my_fruit
[1] 11
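
If you would like to see the error without stopping a script, one option is base R's tryCatch(). The sketch below is an illustration, not part of the exercise. It stores the error message in a variable called msg (a name chosen here):

```r
my_apples <- 5
my_oranges <- "six"

# tryCatch() runs the addition; if it fails, the error is passed to our function,
# which returns the text of the error message instead of stopping
msg <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)
print(msg)
```

In an English-language R session the message reads "non-numeric argument to binary operator", which is R's way of saying that + needs numbers on both sides.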

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

Decimal values like 4.5 are called numerics.
Whole numbers stored as such, like 4L, are called integers. Integers are also numerics. (A whole number typed without the L suffix, like 4, is stored as a numeric.)
Boolean values (TRUE or FALSE) are called logical.
Text (or string) values are called characters.

Note how the quotation marks in the editor indicate that “some text” is a string.
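
A quick way to see these types is the class() function. The sketch below also shows that a whole number typed without the L suffix is stored as a numeric rather than an integer:

```r
class(4.5)           # "numeric"
class(4)             # "numeric": a plain 4 is stored as a decimal (double) value
class(4L)            # "integer": the L suffix asks for an integer
is.numeric(4L)       # TRUE: integers also count as numeric
class(TRUE)          # "logical"
class("some text")   # "character"
```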

Instructions

Change the value of the:

  • my_numeric variable to 42.
  • my_character variable to “universe”. Note that the quotation marks indicate that “universe” is a character.
  • my_logical variable to FALSE.

Note that R is case sensitive!

# Change my_numeric to be 42
my_numeric <- 42

# Change my_character to be "universe"
my_character <- "universe"

# Change my_logical to be FALSE
my_logical <- FALSE

What’s that data type?

Do you remember that when you added 5 + “six”, you got an error due to a mismatch in data types? You can avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function, as the code in the editor shows.

Instructions
  • Complete the code in the editor and also print out the classes of my_character and my_logical.
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE 

# Check class of my_numeric
class(my_numeric)
[1] "numeric"
# Check class of my_character
class(my_character)
[1] "character"
# Check class of my_logical
class(my_logical)
[1] "logical"

Functions and their arguments

What are functions?

Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example is the function sqrt(). The input (the argument) must be a number, and the return value (the output) is the square root of that number.

Calling a function

Executing a function (‘running it’) is called calling the function. An example of a function call is:

x = sqrt(25)

Here, the value 25 is passed to the sqrt() function, which calculates the square root and returns the result. That result is then assigned to the object x.

Function argument(s)

  • Function arguments are the values passed to a function when it is called.
  • They provide the necessary input for the function to perform its task.
  • Functions can have zero or more arguments.
  • Arguments can be of any data type, such as integers, strings, lists, or even other functions.
x = sqrt(21) # 21 is the argument for sqrt()

print(x)
[1] 4.582576
# Function with multiple arguments
round(x, digits = 2)
[1] 4.58
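
Arguments can be matched by name or by position. A minimal sketch using round(), whose digits argument defaults to 0 (the names by_name, by_position and default_digits are made up for illustration):

```r
x <- sqrt(21)

by_name <- round(x, digits = 2)     # argument named explicitly
by_position <- round(x, 2)          # same call, matched by position
default_digits <- round(x)          # digits left at its default of 0

print(by_name)          # 4.58
print(by_position)      # 4.58
print(default_digits)   # 5
```

Naming arguments takes a little more typing but makes the call easier to read later.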

Function return value

  • The return value of a function is the result that the function produces after performing its task.
  • It is the output of the function that is returned to the part of the program that called the function.
  • Functions can return values of any data type, including numbers, strings, vectors, lists, and more.
  • A function returns a single object. To return several values at once, bundle them in a vector or a list.
# Define a function called add_numbers
add_numbers <- function(num1, num2) {
  sum <- num1 + num2
  return(sum)
}

# Call the add_numbers function with arguments 5 and 3
result <- add_numbers(5, 3)
print(result)  # Output: 8
[1] 8

In this R example, num1 and num2 are the function arguments, and sum is the return value of the add_numbers() function. When the function is called with arguments 5 and 3, it returns the sum of the two numbers, which is then assigned to the variable result and printed.
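
An R function returns a single object, so several results are usually bundled in a list. The sketch below uses a hypothetical helper, sum_and_product(), written for this illustration. It also relies on the fact that the value of the last expression in a function body is returned even without an explicit return():

```r
# Hypothetical helper that returns two results bundled in a named list
sum_and_product <- function(num1, num2) {
  list(sum = num1 + num2, product = num1 * num2)
}

result <- sum_and_product(5, 3)
print(result$sum)       # 8
print(result$product)   # 15
```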

Vectors

In R, a vector is a basic data structure that contains elements of the same type. It can hold numeric, character, logical, or other data types. It serves as a versatile container for storing and manipulating data.

Create a vector

To create a vector, you can use the c() function, which concatenates its arguments into a vector.

# Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
print(numeric_vector)
[1] 1 2 3 4 5
# Creating a character vector
character_vector <- c("apple", "banana", "orange")
print(character_vector)
[1] "apple"  "banana" "orange"

The quotes around “apple”, “banana”, etc. are essential here. Without the quotes, R will assume that objects called apple, banana and orange have been created. As these objects don’t exist in R’s memory, there will be an error message.

Inspecting a vector

In R, several functions help you inspect the content of a vector:

  • length() : Returns the number of elements in a vector.
  • class() : Returns the class or data type of an object.
  • str() : Displays the internal structure of an R object.
  • c() : Adds other elements to your vector.

Here’s how you can use these functions with examples:

# Creating a numeric vector
numeric_vector = c(1, 2, 3, 4, 5)

# Using length() to determine the number of elements
print(length(numeric_vector))
[1] 5
# Using class() to determine the data type
print(class(numeric_vector))
[1] "numeric"
# Using str() to display the internal structure
str(numeric_vector)
 num [1:5] 1 2 3 4 5
# Add an element to the numeric_vector
numeric_vector = c(numeric_vector, 10)
print(numeric_vector)
[1]  1  2  3  4  5 10

In this example, we created a numeric vector numeric_vector and used the length(), class(), and str() functions to inspect its properties. The length() function returns the number of elements in the vector, class() returns its data type (which is “numeric” in this case), and str() displays the structure of the vector, indicating it has 5 numeric elements.

Types of vectors

In R, there are several types of vectors, each suited for different types of data:

  • Numeric vectors: These vectors contain numeric (integer or decimal) values.
  • Character vectors: These vectors contain strings of characters.
  • Logical vectors: These vectors contain logical (TRUE/FALSE) values.
  • Integer vectors: These vectors specifically contain integer values.
  • Complex vectors: These vectors contain complex numbers (real and imaginary parts).
  • Raw vectors: These vectors contain raw bytes of data.

Each type of vector is optimized for its specific data type and provides different functionalities and operations in R.

You can check the type of your vector using the typeof() function and inputting your vector as the argument.

typeof(numeric_vector)
[1] "double"
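
For comparison, here is a sketch of typeof() applied to a small example of each vector type listed above (the values are arbitrary):

```r
typeof(c(1.5, 2, 3))         # "double"
typeof(c(1L, 2L, 3L))        # "integer"
typeof(c("a", "b"))          # "character"
typeof(c(TRUE, FALSE))       # "logical"
typeof(c(1 + 2i, 3 - 1i))    # "complex"
typeof(as.raw(c(1, 255)))    # "raw"
```
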
Q:

We’ve seen that vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?

R implicitly converts (coerces) all of the elements to the same type.

Q:
  • What will happen in each of these examples? (hint: use class() to check the data type of your objects):
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
  • Why do you think it happens?

Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.
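
Running the four examples with class() shows the coercion at work. This is a sketch of what to expect, and trying it yourself first is still worthwhile:

```r
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")

class(num_char)       # "character": the numbers become "1", "2", "3"
class(num_logical)    # "numeric": TRUE becomes 1
class(char_logical)   # "character": TRUE becomes "TRUE"
class(tricky)         # "character": "4" is text, so 1, 2, 3 become text too
```

Logical values can always be written as numbers, and numbers can always be written as text, so coercion moves in that direction.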

Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets ([]).

Indices or indexes refer to the position of elements within a data structure, such as a vector. In R, indices start from 1, meaning that the first element of a vector is located at index 1, the second element at index 2, and so on.

For instance:

# Create a numeric vector
numbers <- c(10, 20, 30, 40, 50)

# Subsetting using numerical indices
subset_numeric <- numbers[c(1, 3, 5)]
print(subset_numeric)  # Output: 10 30 50
[1] 10 30 50
# Create a character vector
fruits <- c("apple", "banana", "orange", "mango", "grape")

# Subsetting using numerical indices
subset_fruits <- fruits[c(1, 3, 5)]
print(subset_fruits)  # Output: "apple" "orange" "grape"
[1] "apple"  "orange" "grape" 
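
A few more index forms, sketched with the same numbers vector. The range operator : and negative indices are not covered above and are shown here only as an illustration:

```r
numbers <- c(10, 20, 30, 40, 50)

numbers[2]            # a single index: 20
numbers[2:4]          # a range of indices: 20 30 40
numbers[-1]           # a negative index drops that element: 20 30 40 50
numbers[c(1, 1, 5)]   # indices can repeat: 10 10 50
```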

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

# Create a numeric vector
numbers <- c(10, 20, 30, 40, 50)

# Subsetting using logical condition
subset_condition <- numbers[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
print(subset_condition)  # Output: 10 40 50
[1] 10 40 50

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 20:

numbers > 20
[1] FALSE FALSE  TRUE  TRUE  TRUE
## so we can use this to select only the values above 20
# Subsetting using logical condition
subset_condition <- numbers[numbers > 20]
print(subset_condition)  # Output: 30 40 50
[1] 30 40 50

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

numbers>30 & numbers<50
[1] FALSE FALSE FALSE  TRUE FALSE
numbers[numbers>30 & numbers<50]
[1] 40
numbers<30 & numbers==50
[1] FALSE FALSE FALSE FALSE FALSE
numbers[numbers<30 & numbers==50]
numeric(0)
# Create a character vector
fruits = c("apple", "banana", "orange", "mango","banana", "orange","avo","banana","apple", "grape")

fruits[fruits=="apple" | fruits == "orange"]
[1] "apple"  "orange" "orange" "apple" 
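
The | operator works on numeric tests in the same way. A minimal sketch with the numbers vector, where the name outside is chosen for illustration:

```r
numbers <- c(10, 20, 30, 40, 50)

numbers < 20 | numbers > 40              # TRUE FALSE FALSE FALSE TRUE
outside <- numbers[numbers < 20 | numbers > 40]
print(outside)                           # 10 50
```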

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The %in% operator tests, for each element of a search vector, whether it is found in another vector:

# Create a character vector
fruits = c("apple", "banana", "orange", "mango", "grape")

"avo" %in% fruits
[1] FALSE
c("avo","apple","mango") %in% fruits
[1] FALSE  TRUE  TRUE
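
Because %in% returns a logical vector, it can also be used for subsetting. The sketch below reuses the longer fruits vector from earlier, and the name wanted is made up for illustration:

```r
fruits <- c("apple", "banana", "orange", "mango", "banana",
            "orange", "avo", "banana", "apple", "grape")

# Same result as fruits[fruits == "apple" | fruits == "orange"]
wanted <- c("apple", "orange")
selected <- fruits[fruits %in% wanted]
print(selected)    # "apple" "orange" "orange" "apple"
```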

Missing data

In R, missing data are represented as NA, a special value built into the language. Unlike many other programming languages, R lets NA appear directly inside a vector alongside the other values.

During numerical operations, functions typically return NA if any missing values are present in the data being analyzed. This behavior ensures that missing data are not overlooked. To calculate results as if missing values were excluded, include the argument na.rm = TRUE. The “rm” in na.rm stands for “remove”: missing values are removed before the computation.

heights <- c(2, 4, 4, NA, 6)
mean(heights)
[1] NA
max(heights)
[1] NA
mean(heights, na.rm = TRUE)
[1] 4
max(heights, na.rm = TRUE)
[1] 6

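If your data include missing values, you may want to become familiar with the functions is.na() and na.omit(). is.na() returns TRUE for each missing element, so heights[!is.na(heights)] keeps only the non-missing values. na.omit() returns the vector with the missing values dropped and records their positions in the na.action attribute: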

heights[!is.na(heights)]
[1] 2 4 4 6
na.omit(heights)
[1] 2 4 4 6
attr(,"na.action")
[1] 4
attr(,"class")
[1] "omit"
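
A short sketch of how is.na() can be used to count and locate missing values in the same heights vector (the name clean is chosen here for illustration):

```r
heights <- c(2, 4, 4, NA, 6)

is.na(heights)                      # FALSE FALSE FALSE TRUE FALSE
sum(is.na(heights))                 # 1 missing value (TRUE counts as 1)
which(is.na(heights))               # the missing value is at position 4

clean <- heights[!is.na(heights)]   # keep only the non-missing values
mean(clean)                         # 4, the same as mean(heights, na.rm = TRUE)
```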

Other data structures in R

In R, vectors are just one of the many data structures available for organizing and working with data. Let’s briefly explore some of the other essential data structures:

1. Matrices:

Matrices are two-dimensional arrays where each element is of the same data type. They are created using the matrix() function. Here’s a simple example:

# Create a matrix
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)
print(matrix_data)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
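
As the output shows, matrix() fills the matrix column by column by default. The sketch below contrasts this with byrow = TRUE and shows how to pick out elements with [row, column] (the names by_col and by_row are made up for illustration):

```r
by_col <- matrix(1:9, nrow = 3, ncol = 3)                 # default: fill column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)   # fill row by row instead

by_col[2, 3]   # row 2, column 3 of the column-filled matrix: 8
by_row[2, 3]   # row 2, column 3 of the row-filled matrix: 6
by_row[1, ]    # the whole first row: 1 2 3
dim(by_row)    # 3 3
```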

2. Lists:

Lists are collections of objects that can be of different types. They are created using the list() function. Here’s an example:

# Create a list
list_data <- list(name = "John", age = 30, is_student = TRUE)
print(list_data)
$name
[1] "John"

$age
[1] 30

$is_student
[1] TRUE
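
Elements of a named list can be pulled out with $ or with double square brackets. A minimal sketch using the same list_data:

```r
list_data <- list(name = "John", age = 30, is_student = TRUE)

list_data$name            # "John"
list_data[["age"]]        # 30
list_data[["age"]] + 1    # 31: the extracted element is an ordinary number
names(list_data)          # "name" "age" "is_student"
class(list_data["age"])   # "list": single brackets return a smaller list
```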

3. Arrays:

Arrays are similar to matrices but can have more than two dimensions. They are created using the array() function. Here’s a demonstration:

# Create an array
array_data <- array(1:12, dim = c(2, 3, 2))
print(array_data)
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

4. Data Frames:

Data frames are tabular structures with rows and columns, where each column can have a different data type. They are commonly used for representing datasets. Here’s an example:

# Create a data frame
df_data <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                      Age = c(25, 30, 35),
                      Gender = c("Female", "Male", "Male"))
print(df_data)
     Name Age Gender
1   Alice  25 Female
2     Bob  30   Male
3 Charlie  35   Male
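
Each column of a data frame is a vector, so the vector tools from earlier sections apply. A sketch using the same df_data, where the name older is chosen for illustration:

```r
df_data <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                      Age = c(25, 30, 35),
                      Gender = c("Female", "Male", "Male"))

df_data$Age                          # the Age column as a vector: 25 30 35
mean(df_data$Age)                    # 30
nrow(df_data)                        # 3 rows
ncol(df_data)                        # 3 columns
older <- df_data[df_data$Age > 26, ] # keep the rows where Age is above 26
print(older)                         # the rows for Bob and Charlie
str(df_data)                         # column names and types
```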

These are just a few examples of the many data structures available in R. Understanding these structures and how to work with them is essential for effective data analysis and manipulation in R.

Importing data into R

In this section, we will learn how to import data into R from text (txt) and comma-separated values (csv) files. Text and CSV files are common formats for storing tabular data, with CSV being a widely used standard due to its simplicity and compatibility with various software applications.

Understanding Text (txt) and CSV Files

Text File (txt):
A text file is a simple file format that stores data as plain text. Each line in a text file typically represents a record, with fields separated by delimiters such as commas, tabs, or spaces.

Comma-Separated Values (CSV) File:
A CSV file is a type of text file that uses commas to separate values in each row. It’s a popular format for storing spreadsheet or database data and can be easily imported into R using built-in functions.

Example: Importing Data from a Text (txt) File

To import data from a text file into R, you can use the read.table() function. Let’s assume we have a text file named “data.txt” with tab-separated values (TSV). Here’s how you can import it:

# Import data from a text file
data <- read.table("datasets/data.txt", header = TRUE, sep = "\t")

In this example, we specify the path to the file (“datasets/data.txt”), set header = TRUE to indicate that the first row contains column names, and specify sep = "\t" to indicate that the values are tab-separated.

Example: Importing Data from a CSV File

To import data from a CSV file into R, you can use the read.csv() function. Let’s assume we have a CSV file named “data.csv”. Here’s how you can import it:

# Import data from a CSV file
data <- read.csv("datasets/data.csv")

In this example, we simply specify the path to “data.csv”. By default, read.csv() uses a comma as the delimiter and treats the first row as column names, and it reads the data into a data frame.
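
To try both functions without any existing files, the sketch below first writes two tiny files to a temporary location. The file contents and the names csv_path, tsv_path, people and people_tsv are made up for illustration:

```r
# Write a tiny CSV file to a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Alice,25", "Bob,30"), csv_path)

# read.csv() assumes a comma delimiter and a header row by default
people <- read.csv(csv_path)
print(people)

# The same data, tab-separated, read with read.table()
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("name\tage", "Alice\t25", "Bob\t30"), tsv_path)
people_tsv <- read.table(tsv_path, header = TRUE, sep = "\t")
print(people_tsv)
```

Both calls produce a data frame with two rows and the columns name and age.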

Reading and Exploring a CSV File in R

We’ll use the gapminder.csv dataset, which contains information about life expectancy, GDP per capita, and population for various countries over multiple years. The CSV file is in the datasets folder.

Reading the Data

Now, let’s read the data from the CSV file into R. We’ll use the read.csv() function for this purpose:

# Read the data from the CSV file
gapminder = read.csv("datasets/gapminder.csv")

This command reads the data from ‘gapminder.csv’ and stores it in a data frame named gapminder.

Exploring the Data

To get an overview of the data, let’s first view the first few rows using the head() function:

# View the first few rows of the data
head(gapminder)
      country continent year lifeExp      pop gdpPercap
1 Afghanistan      Asia 1952  28.801  8425333  779.4453
2 Afghanistan      Asia 1957  30.332  9240934  820.8530
3 Afghanistan      Asia 1962  31.997 10267083  853.1007
4 Afghanistan      Asia 1967  34.020 11537966  836.1971
5 Afghanistan      Asia 1972  36.088 13079460  739.9811
6 Afghanistan      Asia 1977  38.438 14880372  786.1134

This will display the first six rows of the gapminder dataset, allowing us to inspect the structure and contents of the data.

Now that we’ve imported the data, let’s understand what each column represents:

  • country: The name of the country.
  • continent: The continent to which the country belongs.
  • year: The year of observation.
  • lifeExp: Life expectancy at birth (in years).
  • pop: Population.
  • gdpPercap: GDP per capita (in USD).
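
A few more functions are handy for a first look at a data frame. So that it runs without the CSV file, the sketch below rebuilds only the six rows printed by head(gapminder) above in a data frame called gap_head (a name chosen here). With the full file loaded, the same calls would work on gapminder:

```r
# Rebuild the six rows shown by head(gapminder) so the example runs on its own
gap_head <- data.frame(
  country = rep("Afghanistan", 6),
  continent = rep("Asia", 6),
  year = c(1952, 1957, 1962, 1967, 1972, 1977),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438),
  pop = c(8425333, 9240934, 10267083, 11537966, 13079460, 14880372),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134)
)

dim(gap_head)               # number of rows and columns: 6 6
str(gap_head)               # column names and data types
summary(gap_head$lifeExp)   # minimum, quartiles, mean and maximum
gap_head[gap_head$year > 1970, c("year", "lifeExp")]   # rows after 1970, two columns
```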