Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.
You can get output from R simply by typing math in the console:
5 + 4
[1] 9
15/3
[1] 5
To perform useful tasks, we often need to assign values to objects in R. To create an object, we specify its name followed by the assignment operator <-, and then provide the desired value:
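A minimal sketch of an assignment, using the value 5 that the text refers to; the object name x is only an example.

```r
# Create an object named x and give it the value 5
x <- 5

# Typing the name of the object prints its value
x
```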
The assignment operator <- assigns values on the right to objects on the left. For instance, executing x <- 5 results in the value of x being set to 5. The = operator can also be used to assign a value to an object, but <- is the more common convention in R code.
In RStudio, on a PC, pressing Alt + - simultaneously will produce <- in a single keystroke. On a Mac, pressing Option + - simultaneously achieves the same result.
In R, what are referred to as objects are commonly known as variables in many other programming languages. Although in different contexts, the terms “object” and “variable” may carry distinct meanings, in this lesson, they are used interchangeably.
Once you have created the object, you can do arithmetic with it:
You can also assign a new value to the object:
We can create a new object from another object; creating the new object doesn’t change the value of the original object:
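The three steps described above (arithmetic on an object, reassignment, and creating one object from another) can be sketched as follows. The names x and y and all values are illustrative assumptions.

```r
# Create an object (name and value are illustrative)
x <- 5

# Do arithmetic with the object; x itself is not changed by this
x * 2

# Assign a new value to the same object
x <- 10

# Create a new object from an existing one
y <- x + 1

# Changing y afterwards leaves x untouched
y <- 100
x
```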
So far, your code has been executed directly in the console, which is convenient for quick queries but less helpful for revisiting your work later on. To create a script file, simply press Ctrl + Shift + N (for Mac, Command + Shift + N). Once your script is open, it’s important to save it right away. You can do this by pressing Ctrl + S (for Mac, Command + S), which will prompt a dialogue box where you can choose where to save your script and give it a name. The .R file extension is automatically added, ensuring it opens correctly in RStudio.
Remember to save your work regularly by pressing Ctrl + S.
In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:
Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%
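The last two might need some explaining: the ^ operator raises the number to its left to the power of the number to its right (for example, 3^2 is 9), and the modulo operator %% returns the remainder of dividing the number to its left by the number to its right (for example, 5 %% 3 is 2).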
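As a quick illustration, each operator from the list above can be tried on small numbers in the console. The numbers are arbitrary examples.

```r
# Each arithmetic operator applied to small example numbers
7 + 2    # addition
7 - 2    # subtraction
7 * 2    # multiplication
7 / 2    # division
3^2      # exponentiation: 3 to the power 2
5 %% 3   # modulo: remainder of 5 divided by 3
```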
With this knowledge, follow the instructions to complete the exercise.
Type 2^5 in the editor to calculate 2 to the power 5.
Type 28 %% 6 to calculate 28 modulo 6.
A basic concept in (statistical) programming is called a variable.
A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.
You can assign a value 4 to a variable my_var with the command
my_var <- 4
Suppose you have a fruit basket with five apples. As a data analyst in training, you want to store the number of apples in a variable with the name my_apples.
Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable my_oranges and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:
my_apples + my_oranges
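Put together, the fruit basket described above could be coded as in the sketch below. The variable my_fruit for the total is an assumption borrowed from the discussion that follows.

```r
# Five apples and six oranges, as described above
my_apples <- 5
my_oranges <- 6

# Total number of pieces of fruit
my_fruit <- my_apples + my_oranges
my_fruit
```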
Common knowledge tells you not to add apples and oranges. But hey, that is what you just did, no :-)? The my_apples and my_oranges variables both contained a number in the previous exercise. The + operator works with numeric variables in R. If you really tried to add “apples” and “oranges”, and assigned a text value to the variable my_oranges (see the editor), you would be trying to assign the addition of a numeric and a character variable to the variable my_fruit. This is not possible.
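To see what this failure looks like, here is a small sketch that assumes my_oranges holds the text "six". The tryCatch() wrapper is an addition so that the sketch runs to completion; it is not part of the exercise.

```r
my_apples <- 5
my_oranges <- "six"   # a character value, not a number

# Typing my_apples + my_oranges directly would stop with an error.
# tryCatch() captures the error message as text instead.
result <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)
result
```

The captured message reports that a non-numeric argument was passed to a binary operator, which is how R states that + needs numbers on both sides.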
R works with numerous data types. Some of the most basic types to get started are:
Decimal values like 4.5 are called numerics.
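Whole numbers like 4 are called integers. Integers are also numerics. (R stores a plain 4 as a decimal number unless you write it as 4L, which gives a value of integer type.)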
Boolean values (TRUE or FALSE) are called logical.
Text (or string) values are called characters.
Note how the quotation marks in the editor indicate that “some text” is a string.
Change the value of the:
my_numeric variable to 42.
my_character variable to “universe”. Note that the quotation marks indicate that “universe” is a character.
my_logical variable to FALSE.
Note that R is case sensitive!
Do you remember that when you added 5 + “six”, you got an error due to a mismatch in data types? You can avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function, as the code in the editor shows.
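A short sketch of such a check, assuming the variable names and values from the exercise above:

```r
# Values taken from the exercise above
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE

# Check the data type of each variable
class(my_numeric)
class(my_character)
class(my_logical)
```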
Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number.
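For instance, a call to sqrt() might look like the sketch below. The argument 16 and the name root are arbitrary choices.

```r
# sqrt() takes one numeric argument and returns its square root
root <- sqrt(16)
root
```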
# Define a function called add_numbers
add_numbers <- function(num1, num2) {
sum <- num1 + num2
return(sum)
}
# Call the add_numbers function with arguments 5 and 3
result <- add_numbers(5, 3)
print(result) # Output: 8
[1] 8
In this R example, num1 and num2 are the function arguments, and sum is the return value of the add_numbers() function. When the function is called with arguments 5 and 3, it returns the sum of the two numbers, which is then assigned to the variable result and printed.
In R, a vector is a basic data structure that contains elements of the same type. It can hold numeric, character, logical, or other data types. It serves as a versatile container for storing and manipulating data.
To create a vector, you can use the c() function, which concatenates its arguments into a vector.
# Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
print(numeric_vector)
[1] 1 2 3 4 5
# Creating a character vector
character_vector <- c("apple", "banana", "orange")
print(character_vector)
[1] "apple" "banana" "orange"
The quotes around “apple”, “banana”, etc. are essential here. Without the quotes, R will assume that objects called apple, banana and orange have been created. As these objects don’t exist in R’s memory, there will be an error message.
In R, several functions help you inspect the content of a vector:
length(): Returns the number of elements in a vector.
class(): Returns the class or data type of an object.
str(): Displays the internal structure of an R object.
c(): Adds other elements to your vector.
Here’s how you can use these functions with examples:
# Creating a numeric vector
numeric_vector = c(1, 2, 3, 4, 5)
# Using length() to determine the number of elements
print(length(numeric_vector))
[1] 5
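# Using class() to determine the data type
print(class(numeric_vector))
[1] "numeric"
# Using str() to display the structure
str(numeric_vector)
 num [1:5] 1 2 3 4 5
# Using c() to add another element to the vector
numeric_vector <- c(numeric_vector, 10)
print(numeric_vector)
[1]  1  2  3  4  5 10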
In this example, we created a numeric vector numeric_vector and used the length(), class(), and str() functions to inspect its properties. The length() function returns the number of elements in the vector, class() returns its data type (which is “numeric” in this case), and str() displays the structure of the vector, indicating it has 5 numeric elements.
In R, there are several types of vectors, each suited for different types of data, including numeric (double), integer, character, and logical vectors.
Each type of vector is optimized for its specific data type and provides different functionalities and operations in R.
You can check the type of your vector using the typeof() function and inputting your vector as the argument.
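A brief sketch of typeof() applied to the four basic vector types. The vector names and contents are illustrative.

```r
# One small vector of each basic type
dbl_vec <- c(1.5, 2, 3)
int_vec <- c(1L, 2L, 3L)    # the L suffix marks integer values
chr_vec <- c("a", "b")
lgl_vec <- c(TRUE, FALSE)

typeof(dbl_vec)   # "double"
typeof(int_vec)   # "integer"
typeof(chr_vec)   # "character"
typeof(lgl_vec)   # "logical"
```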
We’ve seen that vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
R implicitly converts them to all be the same type.
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.
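The following sketch shows this coercion on a few mixed vectors. The names and values are made up for illustration.

```r
# Mixing numbers, text and logicals: everything becomes character
mixed_chr <- c(1, "apple", TRUE)
mixed_chr
typeof(mixed_chr)

# Mixing numbers and logicals: TRUE becomes 1
mixed_num <- c(1, 2.5, TRUE)
mixed_num
typeof(mixed_num)

# Mixing integers and logicals: FALSE becomes 0
mixed_int <- c(1L, 2L, FALSE)
mixed_int
typeof(mixed_int)
```

In each case R picks the most general type present, so no element has to be dropped.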
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets ([]).
Indices or indexes refer to the position of elements within a data structure, such as a vector. In R, indices start from 1, meaning that the first element of a vector is located at index 1, the second element at index 2, and so on.
For instance:
# Create a numeric vector
numbers <- c(10, 20, 30, 40, 50)
# Subsetting using numerical indices
subset_numeric <- numbers[c(1, 3, 5)]
print(subset_numeric) # Output: 10 30 50
[1] 10 30 50
# Create a character vector
fruits <- c("apple", "banana", "orange", "mango", "grape")
# Subsetting using numerical indices
subset_numeric <- fruits[c(1, 3, 5)]
print(subset_numeric) # Output: "apple" "orange" "grape"
[1] "apple" "orange" "grape"
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:
# Create a numeric vector
numbers <- c(10, 20, 30, 40, 50)
# Subsetting using logical condition
subset_condition <- numbers[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
print(subset_condition) # Output: 10 40 50
[1] 10 40 50
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 20:
numbers > 20
[1] FALSE FALSE  TRUE  TRUE  TRUE
## so we can use this to select only the values above 20
# Subsetting using logical condition
subset_condition <- numbers[numbers > 20]
print(subset_condition) # Output: 30 40 50
[1] 30 40 50
You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):
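As a sketch of combined tests with &, the expressions below are assumptions chosen to be consistent with the numbers vector used earlier. Run in this order, they print the four output lines that follow.

```r
numbers <- c(10, 20, 30, 40, 50)

# AND: values that are above 30 and below 50
numbers > 30 & numbers < 50
numbers[numbers > 30 & numbers < 50]

# AND with conditions that can never both hold: nothing is selected
numbers < 10 & numbers > 50
numbers[numbers < 10 & numbers > 50]
```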
[1] FALSE FALSE FALSE TRUE FALSE
[1] 40
[1] FALSE FALSE FALSE FALSE FALSE
numeric(0)
# Create a character vector
fruits <- c("apple", "banana", "orange", "mango", "banana", "orange", "avo", "banana", "apple", "grape")
fruits[fruits == "apple" | fruits == "orange"]
[1] "apple" "orange" "orange" "apple"
A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The function %in% allows you to test if any of the elements of a search vector are found:
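A minimal sketch of %in%, reusing the fruits vector from the previous example. The names wanted and is_wanted are illustrative.

```r
fruits <- c("apple", "banana", "orange", "mango", "banana",
            "orange", "avo", "banana", "apple", "grape")

# The values to search for
wanted <- c("apple", "orange")

# TRUE for each element of fruits that appears in wanted
is_wanted <- fruits %in% wanted
is_wanted

# Use the logical vector to subset
fruits[is_wanted]
```

This selects the same elements as the version with |, but the search vector can grow without adding more comparisons.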
In R, missing data are represented as NA. Because R was designed for analyzing datasets, the concept of missing data is built into the language, and, unlike in many other programming languages, vectors can hold missing values directly.
During numerical operations, functions typically return NA if any missing values are present in the dataset being analyzed. This behavior ensures that missing data are not overlooked. To calculate results as if missing values were excluded (i.e., removed), you can include the argument na.rm = TRUE. The “rm” in na.rm stands for “ReMoved,” indicating that missing values are removed before computation.
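The sketch below illustrates this behavior. The vector heights and its values are hypothetical, chosen so that the four calls print the output lines that follow.

```r
# A numeric vector with one missing value (hypothetical data)
heights <- c(2, 4, 4, NA, 6)

# Without na.rm, the missing value propagates to the result
mean(heights)
max(heights)

# With na.rm = TRUE, missing values are dropped before computing
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)
```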
[1] NA
[1] NA
[1] 4
[1] 6
If your data include missing values, you may want to become familiar with the functions is.na() and na.omit().
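A short sketch of both functions on a hypothetical vector; the name heights and its values are assumptions.

```r
heights <- c(2, 4, 4, NA, 6)

# is.na() returns TRUE where a value is missing
is.na(heights)

# Keep only the elements that are not missing
heights[!is.na(heights)]

# na.omit() also drops missing values (and records which were removed)
clean_heights <- na.omit(heights)
clean_heights
```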
In R, vectors are just one of the many data structures available for organizing and working with data. Let’s briefly explore some of the other essential data structures:
Matrices are two-dimensional arrays where each element is of the same data type. They are created using the matrix() function. Here’s a simple example:
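A minimal sketch of a matrix; the name m and the values 1 to 6 are arbitrary.

```r
# Create a matrix with 2 rows and 3 columns; values fill column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

# dim() returns the number of rows and columns
dim(m)
```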
Lists are collections of objects that can be of different types. They are created using the list() function. Here’s an example:
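A minimal sketch of a list; the name person and its contents are made up for illustration.

```r
# A list holding a character value, a number and a numeric vector
person <- list(name = "Alice", age = 25, scores = c(90, 85, 77))

# Access elements by name with $ or [[ ]]
person$name
person[["scores"]]

# str() gives a compact overview of the list
str(person)
```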
Arrays are similar to matrices but can have more than two dimensions. They are created using the array() function. Here’s a demonstration:
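A minimal sketch of a three-dimensional array; the name arr and the dimensions are arbitrary choices.

```r
# An array with 2 rows, 3 columns and 4 layers, filled with 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))

# dim() returns the size of each dimension
dim(arr)

# Select one element by row, column and layer
arr[1, 2, 3]
```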
Data frames are tabular structures with rows and columns, where each column can have a different data type. They are commonly used for representing datasets. Here’s an example:
# Create a data frame
df_data <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                      Age = c(25, 30, 35),
                      Gender = c("Female", "Male", "Male"))
print(df_data)
     Name Age Gender
1   Alice  25 Female
2     Bob  30   Male
3 Charlie  35   Male
These are just a few examples of the many data structures available in R. Understanding these structures and how to work with them is essential for effective data analysis and manipulation in R.
In this section, we will learn how to import data into R from text (txt) and comma-separated values (csv) files. Text and CSV files are common formats for storing tabular data, with CSV being a widely used standard due to its simplicity and compatibility with various software applications.
Text File (txt):
A text file is a simple file format that stores data as plain text. Each line in a text file typically represents a record, with fields separated by delimiters such as commas, tabs, or spaces.
Comma-Separated Values (CSV) File:
A CSV file is a type of text file that uses commas to separate values in each row. It’s a popular format for storing spreadsheet or database data and can be easily imported into R using built-in functions.
To import data from a text file into R, you can use the read.table() function. Let’s assume we have a text file named “data.txt” with tab-separated values (TSV). Here’s how you can import it:
In this example, we specify the filename “data.txt”, set header = TRUE to indicate that the first row contains column names, and specify sep = "\t" to indicate that the values are tab-separated.
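The call described above can be sketched as follows. To keep the sketch self-contained it first writes a small tab-separated file to a temporary location; the file contents and the names txt_path and data_txt are assumptions. With a real file you would pass its path (for example "data.txt") instead.

```r
# Write a small tab-separated file so the example can run on its own
txt_path <- tempfile(fileext = ".txt")
writeLines(c("name\tage", "Alice\t25", "Bob\t30"), txt_path)

# Read the file: first row holds column names, values are tab-separated
data_txt <- read.table(txt_path, header = TRUE, sep = "\t")
print(data_txt)
```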
To import data from a CSV file into R, you can use the read.csv() function. Let’s assume we have a CSV file named “data.csv”. Here’s how you can import it:
In this example, we simply specify the filename “data.csv”. The read.csv() function uses a comma as its default delimiter and treats the first row as column names, and it reads the data into a data frame.
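A sketch of such an import. As an assumption for illustration, it creates a small CSV file in a temporary location first; with a real file you would pass its path (for example "data.csv") instead. The names csv_tmp and data_csv are hypothetical.

```r
# Write a small CSV file so the example can run on its own
csv_tmp <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Alice,25", "Bob,30"), csv_tmp)

# read.csv() defaults to header = TRUE and sep = ","
data_csv <- read.csv(csv_tmp)
print(data_csv)
```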
We’ll use the gapminder.csv dataset, which contains information about life expectancy, GDP per capita, and population for various countries over multiple years. The CSV file is in the datasets folder.
Now, let’s read the data from the CSV file into R. We’ll use the read.csv() function for this purpose:
This command reads the data from ‘gapminder.csv’ and stores it in a data frame named gapminder.
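The command described above might look like the sketch below. It assumes that the datasets folder sits inside the current working directory; the file.exists() check is an addition so that a wrong path produces a readable message.

```r
# Path to the file, relative to the working directory (an assumption)
csv_path <- file.path("datasets", "gapminder.csv")

if (file.exists(csv_path)) {
  # Read the CSV file into a data frame named gapminder
  gapminder <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path,
          " (check the working directory with getwd())")
}
```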
To get an overview of the data, let’s first view the first few rows using the head() function:
head(gapminder)
      country continent year lifeExp      pop gdpPercap
1 Afghanistan      Asia 1952  28.801  8425333  779.4453
2 Afghanistan      Asia 1957  30.332  9240934  820.8530
3 Afghanistan      Asia 1962  31.997 10267083  853.1007
4 Afghanistan      Asia 1967  34.020 11537966  836.1971
5 Afghanistan      Asia 1972  36.088 13079460  739.9811
6 Afghanistan      Asia 1977  38.438 14880372  786.1134
This will display the first six rows of the gapminder dataset, allowing us to inspect the structure and contents of the data.
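Now that we’ve imported the data, let’s understand what each column represents:
country: the name of the country.
continent: the continent the country belongs to.
year: the year of the observation.
lifeExp: life expectancy.
pop: population.
gdpPercap: GDP per capita.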
Comments
In R, the comment character is #. Anything written to the right of a # in a script is disregarded by R, making it helpful for leaving notes and explanations. RStudio offers a keyboard shortcut for commenting or uncommenting a paragraph: after selecting the lines you want to comment, press Ctrl + Shift + C simultaneously. Alternatively, if you only need to comment out one line, you can position the cursor anywhere on that line and then press Ctrl + Shift + C.