Data wrangling

Before you can work with the gapminder dataset, you’ll need to load two R packages that contain the tools for working with it, then display the gapminder dataset so that you can see what it contains.

Exercise
  • Use the library() function to load the tidyverse package, just like we’ve loaded the gapminder package for you.
  • Type gapminder, on its own line, to look at the gapminder dataset.
  • How many observations (rows) are in the dataset?
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

The gapminder object contains structured data in the form of a data frame, which organizes data into rows and columns akin to a spreadsheet or a SQL database table. Data analyses in R, including those in this course, predominantly revolve around data frames. While the gapminder data is presented as a special type of data frame known as a tibble, the distinction is not crucial at this point.

R initially displays the first ten rows of the data frame, offering a glimpse of its contents along with a brief description. This description indicates that the tibble comprises 1,704 rows, referred to as observations, and six columns, termed variables. Understanding the meaning of each observation, or row, is pivotal in analysis. In this dataset, each observation signifies a unique combination of country and year. For instance, the first observation pertains to Afghanistan’s statistics in 1952, followed by subsequent observations for the same country in different years.

For every country-year combination, the dataset provides several variables detailing the country’s demographics. These variables include the continent (e.g., Asia), life expectancy, population, and GDP per capita. GDP per capita denotes a country’s total economic output (Gross Domestic Product) divided by its population, serving as a common metric for assessing a country’s wealth. Each variable adheres to a consistent data type: numeric for measures like life expectancy and population, and categorical for attributes like country and continent.
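
The row count, column count, and column types described above can also be checked directly. The following is a minimal sketch, assuming the gapminder and dplyr packages are installed; the names n_obs, n_vars, and col_types are chosen here for illustration and do not come from the course.

```r
library(gapminder)  # provides the gapminder data frame
library(dplyr)      # provides glimpse()

# Number of observations (rows) and variables (columns)
n_obs  <- nrow(gapminder)
n_vars <- ncol(gapminder)
n_obs
n_vars

# Class of each column: factor, integer, or numeric (double)
col_types <- sapply(gapminder, class)
col_types

# Compact overview with one line per column
glimpse(gapminder)
```

Here nrow() and ncol() report the same dimensions that appear in the "# A tibble" header line, and the classes correspond to the <fct>, <int>, and <dbl> labels printed under the column names.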

Even from this limited view of the data, one can glean insights. For instance, Afghanistan’s life expectancy rises in every year shown, and its population grows overall despite a dip in 1982, while GDP per capita fluctuates. Throughout the course, you’ll learn how to leverage R to draw numerous conclusions about the social and economic histories of countries worldwide.

In the context of the tidyverse, “verbs” refer to the core functions or actions used for data manipulation and transformation. These verbs are part of the dplyr package, which is a key component of the tidyverse ecosystem.

The primary verbs in dplyr include:

  • filter(): Selects rows of a data frame based on certain conditions.
  • mutate(): Adds new columns or modifies existing ones.
  • select(): Picks specific columns from a data frame.
  • arrange(): Sorts rows of a data frame based on one or more variables.
  • summarize(): Computes summary statistics for groups of rows.
  • group_by(): Groups the data by one or more variables for further operations.

These verbs allow for efficient and expressive data manipulation workflows, enabling users to perform common data tasks with ease and clarity.

The filter verb

Now that you’re familiar with the gapminder data, you’ll begin learning the tools to manipulate it. In the remainder of this chapter, you’ll delve into the “verbs” within the dplyr package. These verbs represent the fundamental steps used to transform data. The initial verb you’ll explore is filter.

gapminder %>% 
  filter(year == 2007)

A pipe (%>%) says “take whatever is before it, and feed it into the next step.” After the pipe comes our first verb. Here we have data spanning multiple years, but we want to narrow it down to a single year. Suppose we filter for 2007, the most recent year in the dataset. The condition “year equals equals 2007” is the criterion for keeping observations. The “equals equals” may look peculiar; it is a “logical equals” that compares each year with the value 2007. A single equals sign means something different in R, which we’ll explore later. Altogether, this expression says we want to keep only the observations from 2007.

Executing this code results in a dataset containing only 142 rows, corresponding to the number of countries in the dataset. It’s essential to understand that we’re not altering the original gapminder data; it remains intact for other analyses. Instead, the filter function generates a new dataset, with fewer rows, which is then displayed on the screen.

gapminder %>% 
  filter(year == 2007)
# A tibble: 142 × 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Australia   Oceania    2007    81.2  20434176    34435.
 7 Austria     Europe     2007    79.8   8199783    36126.
 8 Bahrain     Asia       2007    75.6    708573    29796.
 9 Bangladesh  Asia       2007    64.1 150448339     1391.
10 Belgium     Europe     2007    79.4  10392226    33693.
# ℹ 132 more rows
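
To make the pipe and the “logical equals” more concrete, here is a small sketch, assuming the gapminder and dplyr packages are installed. The names piped and direct are chosen here for illustration; they are not part of the course material.

```r
library(gapminder)
library(dplyr)

# Piped form: gapminder becomes the first argument of filter()
piped <- gapminder %>%
  filter(year == 2007)

# Equivalent call written without the pipe
direct <- filter(gapminder, year == 2007)

# Both forms give the same rows
isTRUE(all.equal(piped, direct))

# year == 2007 is a logical vector with one TRUE or FALSE per row;
# filter() keeps the rows where it is TRUE
sum(gapminder$year == 2007)

# The original data is untouched; the filtered copy lives in `piped`
nrow(gapminder)
nrow(piped)
```

Assigning the result with <- is what keeps the filtered data around for later steps; without the assignment it is only printed.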

A filtering condition can also be based on a different variable, such as the country instead of the year. For instance, let’s say we want only the data for the United States. We express this as “filter country equals equals ‘United States’”, which yields only the 12 observations for that country. The quotation marks around ‘United States’ are crucial; without them, R would try to read United and States as variable names rather than as text. Unlike numbers such as 2007, text must be wrapped in quotation marks in R.

gapminder %>% 
  filter(country == "United States")
# A tibble: 12 × 6
   country       continent  year lifeExp       pop gdpPercap
   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
 1 United States Americas   1952    68.4 157553000    13990.
 2 United States Americas   1957    69.5 171984000    14847.
 3 United States Americas   1962    70.2 186538000    16173.
 4 United States Americas   1967    70.8 198712000    19530.
 5 United States Americas   1972    71.3 209896000    21806.
 6 United States Americas   1977    73.4 220239000    24073.
 7 United States Americas   1982    74.6 232187835    25010.
 8 United States Americas   1987    75.0 242803533    29884.
 9 United States Americas   1992    76.1 256894189    32004.
10 United States Americas   1997    76.8 272911760    35767.
11 United States Americas   2002    77.3 287675526    39097.
12 United States Americas   2007    78.2 301139947    42952.

Lastly, we can give filter multiple conditions, separated by commas. Here we ask for only the observations where the year is 2007 and, after the comma, where the country is the United States. Each “equals equals” expression is a separate argument, and a row is kept only if it satisfies all of them. Combining two conditions like this is a handy way to extract a single observation of interest. You’ll have the opportunity to practice this technique in the forthcoming exercises.

gapminder %>% 
  filter(year == 2007, country == "United States")
# A tibble: 1 × 6
  country       continent  year lifeExp       pop gdpPercap
  <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
1 United States Americas   2007    78.2 301139947    42952.
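
Comma-separated conditions in filter() are combined with a logical “and”. The sketch below, assuming the gapminder and dplyr packages are installed, compares the comma form with R’s & operator, which is not used elsewhere in this chapter. The names with_comma and with_and are chosen here for illustration.

```r
library(gapminder)
library(dplyr)

# Two conditions separated by a comma: a row must satisfy both
with_comma <- gapminder %>%
  filter(year == 2007, country == "United States")

# The same filter written with the logical AND operator
with_and <- gapminder %>%
  filter(year == 2007 & country == "United States")

# Both versions return the same single observation
nrow(with_comma)
isTRUE(all.equal(with_comma, with_and))
```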

Exercise: the filter verb

Exercise: Filtering for one year

The filter verb extracts particular observations based on a condition. In this exercise you’ll filter for observations from a particular year.

  • Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.
# Filter the gapminder dataset for the year 1957
gapminder %>%
  filter(year == 1957)
# A tibble: 142 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1957    30.3  9240934      821.
 2 Albania     Europe     1957    59.3  1476505     1942.
 3 Algeria     Africa     1957    45.7 10270856     3014.
 4 Angola      Africa     1957    32.0  4561361     3828.
 5 Argentina   Americas   1957    64.4 19610538     6857.
 6 Australia   Oceania    1957    70.3  9712569    10950.
 7 Austria     Europe     1957    67.5  6965860     8843.
 8 Bahrain     Asia       1957    53.8   138655    11636.
 9 Bangladesh  Asia       1957    39.3 51365468      662.
10 Belgium     Europe     1957    69.2  8989111     9715.
# ℹ 132 more rows
Exercise: Filtering for one country and one year

You can also use the filter() verb to set two conditions, which could retrieve a single observation.

Just like in the last exercise, you can do this in two lines of code, starting with gapminder %>% and having the filter() on the second line. Keeping one verb on each line helps keep the code readable. Note that each time, you’ll put the pipe %>% at the end of the first line (like gapminder %>%); putting the pipe at the beginning of the second line will throw an error.

  • Filter the gapminder data to retrieve only the observation from China in the year 2002.
# Filter for China in 2002
gapminder %>%
  filter(country == "China", year == 2002)
# A tibble: 1 × 6
  country continent  year lifeExp        pop gdpPercap
  <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
1 China   Asia       2002    72.0 1280400000     3119.

The arrange verb

arrange sorts the observations in a dataset, in ascending or descending order based on one of its variables. This is useful, for example, when you want to know the most extreme values in a dataset.

Similar to the filter verb, you utilize the arrange verb after the pipe operator in R. You would specify the gapminder object, followed by the pipe operator (%>%), and then arrange. Inside the parentheses of arrange, you indicate the column by which you want to sort the observations.

gapminder %>% 
  arrange(gdpPercap)
# A tibble: 1,704 × 6
   country          continent  year lifeExp      pop gdpPercap
   <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
 1 Congo, Dem. Rep. Africa     2002    45.0 55379852      241.
 2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.
 3 Lesotho          Africa     1952    42.1   748747      299.
 4 Guinea-Bissau    Africa     1952    32.5   580653      300.
 5 Congo, Dem. Rep. Africa     1997    42.6 47798986      312.
 6 Eritrea          Africa     1952    35.9  1438760      329.
 7 Myanmar          Asia       1952    36.3 20092996      331 
 8 Lesotho          Africa     1957    45.0   813338      336.
 9 Burundi          Africa     1952    39.0  2445618      339.
10 Eritrea          Africa     1957    38.0  1542611      344.
# ℹ 1,694 more rows

After executing this, the observations are arranged in ascending order based on the specified column, with the lowest GDP per capita appearing first. Take note of the rightmost column: observe that it begins with 241, the smallest value in the dataset, and increases thereafter. From this, it’s evident that the country-year pair with the lowest GDP per capita was the Democratic Republic of the Congo in 2002.

Just like with filter, the original gapminder object remains unchanged; arrange simply provides you with a new dataset that is sorted accordingly.

arrange also lets you sort in descending order. To do that, wrap the variable you’re sorting by in desc(). This shows that the country-year pair with the highest GDP per capita was Kuwait in 1957.

gapminder %>% 
  arrange(desc(gdpPercap))
# A tibble: 1,704 × 6
   country   continent  year lifeExp     pop gdpPercap
   <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
 1 Kuwait    Asia       1957    58.0  212846   113523.
 2 Kuwait    Asia       1972    67.7  841934   109348.
 3 Kuwait    Asia       1952    55.6  160000   108382.
 4 Kuwait    Asia       1962    60.5  358266    95458.
 5 Kuwait    Asia       1967    64.6  575003    80895.
 6 Kuwait    Asia       1977    69.3 1140357    59265.
 7 Norway    Europe     2007    80.2 4627926    49357.
 8 Kuwait    Asia       2007    77.6 2505559    47307.
 9 Singapore Asia       2007    80.0 4553009    47143.
10 Norway    Europe     2002    79.0 4535591    44684.
# ℹ 1,694 more rows
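
One way to confirm that sorting puts the extreme values first is to compare the first row of each sorted result with min() and max(). This is a sketch, assuming the gapminder and dplyr packages are installed; lowest_first and highest_first are names chosen here for illustration.

```r
library(gapminder)
library(dplyr)

lowest_first  <- gapminder %>% arrange(gdpPercap)
highest_first <- gapminder %>% arrange(desc(gdpPercap))

# The first row of each sorted result holds the extreme value
lowest_first$gdpPercap[1]  == min(gapminder$gdpPercap)
highest_first$gdpPercap[1] == max(gapminder$gdpPercap)

# arrange() returns a new data frame; gapminder keeps its original order
as.character(gapminder$country[1])
```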

Examining all countries and years simultaneously might not be particularly informative. Suppose you were interested in identifying the countries with the highest GDP per capita for a specific year. To do that, you can combine the two verbs you’ve already learned: filter, and arrange. You start with the gapminder dataset, then a pipe to give the dataset to filter. Then you specify that you want to filter for year equals equals 2007. Then you use another pipe. This takes the result of the filter, and gives it to arrange. You specify that you want to sort in descending order of GDP per capita. This shows you that the countries with the highest GDP per capita in 2007 were Norway, Kuwait, Singapore, and the United States. We can explore many such questions with various combinations of dplyr verbs. Over the course of these lessons, you’ll learn to pipe together multiple simple operations to create a rich and informative data analysis.

gapminder %>% 
  filter(year == 2007) %>% 
  arrange(desc(gdpPercap))
# A tibble: 142 × 6
   country          continent  year lifeExp       pop gdpPercap
   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
 1 Norway           Europe     2007    80.2   4627926    49357.
 2 Kuwait           Asia       2007    77.6   2505559    47307.
 3 Singapore        Asia       2007    80.0   4553009    47143.
 4 United States    Americas   2007    78.2 301139947    42952.
 5 Ireland          Europe     2007    78.9   4109086    40676.
 6 Hong Kong, China Asia       2007    82.2   6980412    39725.
 7 Switzerland      Europe     2007    81.7   7554661    37506.
 8 Netherlands      Europe     2007    79.8  16570613    36798.
 9 Canada           Americas   2007    80.7  33390141    36319.
10 Iceland          Europe     2007    81.8    301931    36181.
# ℹ 132 more rows
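
To see what the two pipes are doing, it can help to write the same analysis step by step with an intermediate result. This is a sketch, assuming the gapminder and dplyr packages are installed; gap_2007, richest_steps, and richest_piped are names chosen here for illustration.

```r
library(gapminder)
library(dplyr)

# Step by step, storing the intermediate result
gap_2007      <- filter(gapminder, year == 2007)
richest_steps <- arrange(gap_2007, desc(gdpPercap))

# The same analysis as one pipeline
richest_piped <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))

# Both approaches give the same result
isTRUE(all.equal(richest_steps, richest_piped))

# Show only the four richest countries
head(richest_piped, 4)
```

The pipeline avoids having to invent a name for every intermediate data frame, which is why the rest of the chapter prefers it.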

Exercise: the arrange verb

Exercise: Arranging observations by life expectancy

You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the dataset based on the lifeExp variable.

  • Sort the gapminder dataset in ascending order of life expectancy (lifeExp).
  • Sort the gapminder dataset in descending order of life expectancy.
# Sort in ascending order of lifeExp
gapminder %>%
  arrange(lifeExp)
# A tibble: 1,704 × 6
   country      continent  year lifeExp     pop gdpPercap
   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
 1 Rwanda       Africa     1992    23.6 7290203      737.
 2 Afghanistan  Asia       1952    28.8 8425333      779.
 3 Gambia       Africa     1952    30    284320      485.
 4 Angola       Africa     1952    30.0 4232095     3521.
 5 Sierra Leone Africa     1952    30.3 2143249      880.
 6 Afghanistan  Asia       1957    30.3 9240934      821.
 7 Cambodia     Asia       1977    31.2 6978607      525.
 8 Mozambique   Africa     1952    31.3 6446316      469.
 9 Sierra Leone Africa     1957    31.6 2295678     1004.
10 Burkina Faso Africa     1952    32.0 4469979      543.
# ℹ 1,694 more rows
# Sort in descending order of lifeExp
gapminder %>%
  arrange(desc(lifeExp))
# A tibble: 1,704 × 6
   country          continent  year lifeExp       pop gdpPercap
   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
 1 Japan            Asia       2007    82.6 127467972    31656.
 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
 3 Japan            Asia       2002    82   127065841    28605.
 4 Iceland          Europe     2007    81.8    301931    36181.
 5 Switzerland      Europe     2007    81.7   7554661    37506.
 6 Hong Kong, China Asia       2002    81.5   6762476    30209.
 7 Australia        Oceania    2007    81.2  20434176    34435.
 8 Spain            Europe     2007    80.9  40448191    28821.
 9 Sweden           Europe     2007    80.9   9031088    33860.
10 Israel           Asia       2007    80.7   6426679    25523.
# ℹ 1,694 more rows
Exercise: Filtering and arranging

You’ll often need to use the pipe operator (%>%) to combine multiple dplyr verbs in a row. In this case, you’ll combine a filter() with an arrange() to find the highest population countries in a particular year.

  • Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).
# Filter for the year 1957, then arrange in descending order of population
gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))
# A tibble: 142 × 6
   country        continent  year lifeExp       pop gdpPercap
   <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
 1 China          Asia       1957    50.5 637408000      576.
 2 India          Asia       1957    40.2 409000000      590.
 3 United States  Americas   1957    69.5 171984000    14847.
 4 Japan          Asia       1957    65.5  91563009     4318.
 5 Indonesia      Asia       1957    39.9  90124000      859.
 6 Germany        Europe     1957    69.1  71019069    10188.
 7 Brazil         Americas   1957    53.3  65551171     2487.
 8 United Kingdom Europe     1957    70.4  51430000    11283.
 9 Bangladesh     Asia       1957    39.3  51365468      662.
10 Italy          Europe     1957    67.8  49182000     6249.
# ℹ 132 more rows

The mutate verb

The mutate verb is used to create new columns or modify existing columns in a data frame. It allows you to apply functions or operations to each row of the data frame to generate new values for the specified columns.

gapminder %>% 
  mutate(pop = pop/1000000)

Similar to filter or arrange, you use mutate after a pipe operator in R. Inside the mutate statement, what’s on the right of the equals sign is what’s being calculated, while what’s on the left is what’s being replaced. In this example, you’re calculating “population divided by one million” using the slash operator for division. On the left, you’re indicating that you want to replace the existing pop column by writing pop =.

As a result, you obtain the same table, but with the pop column replaced by a new value, one that’s much smaller than it was before. This demonstrates how you can manipulate existing variables in the table, a task often required during data processing and cleaning. Like filter and arrange, you’re not modifying the original gapminder data; instead, you’re altering the value in the new data frame that’s being returned.

gapminder %>% 
  mutate(pop = pop/1000000)
# A tibble: 1,704 × 6
   country     continent  year lifeExp   pop gdpPercap
   <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8.43      779.
 2 Afghanistan Asia       1957    30.3  9.24      821.
 3 Afghanistan Asia       1962    32.0 10.3       853.
 4 Afghanistan Asia       1967    34.0 11.5       836.
 5 Afghanistan Asia       1972    36.1 13.1       740.
 6 Afghanistan Asia       1977    38.4 14.9       786.
 7 Afghanistan Asia       1982    39.9 12.9       978.
 8 Afghanistan Asia       1987    40.8 13.9       852.
 9 Afghanistan Asia       1992    41.7 16.3       649.
10 Afghanistan Asia       1997    41.8 22.2       635.
# ℹ 1,694 more rows
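
Because mutate returns a new data frame, the change is lost unless you store the result. The sketch below, assuming the gapminder and dplyr packages are installed, keeps the modified copy under the name gapminder_millions, which is chosen here for illustration.

```r
library(gapminder)
library(dplyr)

# Store the modified copy under a new name
gapminder_millions <- gapminder %>%
  mutate(pop = pop / 1000000)

# The original column is still an integer count of people
class(gapminder$pop)

# Dividing produced a decimal (double) column in the copy
class(gapminder_millions$pop)

# The column was replaced, not added, so there are still six columns
ncol(gapminder_millions)
```

This matches the printed output above, where the type label under pop changes from <int> to <dbl>.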

Alternatively, you might want to introduce a new variable. For example, you currently have the GDP per capita, which is the Gross Domestic Product of the country divided by its population. However, for your analysis, you might be interested in knowing the total GDP. To obtain this, you would multiply the population by the GDP per capita.

You would use mutate in a similar manner as before. You pipe your gapminder data to the mutate verb. In R, the asterisk represents multiplication, so you write gdpPercap * pop to multiply the two columns. The name of the new column, gdp, goes to the left of the equals sign. Column names are easiest to work with as single words without spaces; a name containing spaces would have to be wrapped in backticks every time you use it.

gapminder %>% 
  mutate(gdp = gdpPercap * pop)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap          gdp
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
 7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
 8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
 9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ℹ 1,694 more rows
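
The new column is computed row by row from the two existing columns. The sketch below, assuming the gapminder and dplyr packages are installed, checks this and also shows what a column name containing a space would require. The names with_gdp and with_spaced_name, and the column name `total gdp`, are chosen here for illustration only.

```r
library(gapminder)
library(dplyr)

with_gdp <- gapminder %>%
  mutate(gdp = gdpPercap * pop)

# Every gdp value is the product of that row's gdpPercap and pop
all(with_gdp$gdp == with_gdp$gdpPercap * with_gdp$pop)

# The original six columns are kept and gdp is added as a seventh
ncol(with_gdp)

# A name containing a space must be wrapped in backticks
with_spaced_name <- gapminder %>%
  mutate(`total gdp` = gdpPercap * pop)
names(with_spaced_name)
```

A single-word name such as gdp avoids the backticks, which is why the chapter uses it.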

Upon inspecting the results, you’ll observe a brand new gdp column, significantly larger than the GDP per capita.

Let’s integrate the three verbs you’ve learned in this chapter to address a question about our data. Suppose we aim to identify the countries with the highest total GDP in the year 2007. We can accomplish this in three steps: creating the column, filtering for 2007, and then sorting.

gapminder %>% 
  mutate(gdp = gdpPercap * pop) %>% 
  filter(year == 2007) %>% 
  arrange(desc(gdp))

First, we use mutate to create the total GDP column. Next, we use filter to keep only the year 2007. Finally, we use arrange to sort in descending order of the newly created gdp variable. This sequence gives us the answer.

In 2007, the United States had the highest total GDP, about 13 trillion dollars (1.29e13 in the output). It is followed by China, Japan, India, and Germany. As you become proficient with dplyr, you’ll be equipped to explore answers to such questions and tackle even more complex inquiries using your own data.

gapminder %>% 
  mutate(gdp = gdpPercap * pop) %>% 
  filter(year == 2007) %>% 
  arrange(desc(gdp))
# A tibble: 142 × 7
   country        continent  year lifeExp        pop gdpPercap     gdp
   <fct>          <fct>     <int>   <dbl>      <int>     <dbl>   <dbl>
 1 United States  Americas   2007    78.2  301139947    42952. 1.29e13
 2 China          Asia       2007    73.0 1318683096     4959. 6.54e12
 3 Japan          Asia       2007    82.6  127467972    31656. 4.04e12
 4 India          Asia       2007    64.7 1110396331     2452. 2.72e12
 5 Germany        Europe     2007    79.4   82400996    32170. 2.65e12
 6 United Kingdom Europe     2007    79.4   60776238    33203. 2.02e12
 7 France         Europe     2007    80.7   61083916    30470. 1.86e12
 8 Brazil         Americas   2007    72.4  190010647     9066. 1.72e12
 9 Italy          Europe     2007    80.5   58147733    28570. 1.66e12
10 Mexico         Americas   2007    76.2  108700891    11978. 1.30e12
# ℹ 132 more rows
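
In this particular pipeline the first two steps can be swapped, because gdp is calculated separately for each row. The sketch below, assuming the gapminder and dplyr packages are installed, compares the two orderings; by_mutate_first and by_filter_first are names chosen here for illustration. The swap is not safe in general, for example when a new column depends on values from other rows.

```r
library(gapminder)
library(dplyr)

# Order used in the text: create the column, then filter
by_mutate_first <- gapminder %>%
  mutate(gdp = gdpPercap * pop) %>%
  filter(year == 2007) %>%
  arrange(desc(gdp))

# Alternative order: filter first, then create the column
by_filter_first <- gapminder %>%
  filter(year == 2007) %>%
  mutate(gdp = gdpPercap * pop) %>%
  arrange(desc(gdp))

# Both orders give the same table here
isTRUE(all.equal(by_mutate_first, by_filter_first))

# The five largest economies in 2007
as.character(head(by_filter_first$country, 5))
```

Filtering first means mutate only has to work on 142 rows instead of 1,704, which matters little here but can help with larger data.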

Exercise: the mutate verb

Exercise: Using mutate to change or create a column

Suppose we want life expectancy to be measured in months instead of years: you’d have to multiply the existing value by 12. You can use the mutate() verb to change this column, or to create a new column that’s calculated this way.

  • Use mutate() to change the existing lifeExp column, by multiplying it by 12: 12 * lifeExp.
  • Use mutate() to add a new column, called lifeExpMonths, calculated as 12 * lifeExp.
# Use mutate to change lifeExp to be in months
gapminder %>%
  mutate(lifeExp = lifeExp * 12)
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    346.  8425333      779.
 2 Afghanistan Asia       1957    364.  9240934      821.
 3 Afghanistan Asia       1962    384. 10267083      853.
 4 Afghanistan Asia       1967    408. 11537966      836.
 5 Afghanistan Asia       1972    433. 13079460      740.
 6 Afghanistan Asia       1977    461. 14880372      786.
 7 Afghanistan Asia       1982    478. 12881816      978.
 8 Afghanistan Asia       1987    490. 13867957      852.
 9 Afghanistan Asia       1992    500. 16317921      649.
10 Afghanistan Asia       1997    501. 22227415      635.
# ℹ 1,694 more rows
# Use mutate to create a new column called lifeExpMonths
gapminder %>%
  mutate(lifeExpMonths = lifeExp * 12)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap lifeExpMonths
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.          346.
 2 Afghanistan Asia       1957    30.3  9240934      821.          364.
 3 Afghanistan Asia       1962    32.0 10267083      853.          384.
 4 Afghanistan Asia       1967    34.0 11537966      836.          408.
 5 Afghanistan Asia       1972    36.1 13079460      740.          433.
 6 Afghanistan Asia       1977    38.4 14880372      786.          461.
 7 Afghanistan Asia       1982    39.9 12881816      978.          478.
 8 Afghanistan Asia       1987    40.8 13867957      852.          490.
 9 Afghanistan Asia       1992    41.7 16317921      649.          500.
10 Afghanistan Asia       1997    41.8 22227415      635.          501.
# ℹ 1,694 more rows
Exercise: Combining filter, mutate, and arrange

In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

  • In one sequence of pipes on the gapminder dataset:
  • filter() for observations from the year 2007,
  • mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and
  • arrange() in descending order of that new column
# Filter for the year 2007,
# mutate to create lifeExpMonths (life expectancy in months),
# then arrange in descending order of lifeExpMonths
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))
# A tibble: 142 × 7
   country          continent  year lifeExp       pop gdpPercap lifeExpMonths
   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>         <dbl>
 1 Japan            Asia       2007    82.6 127467972    31656.          991.
 2 Hong Kong, China Asia       2007    82.2   6980412    39725.          986.
 3 Iceland          Europe     2007    81.8    301931    36181.          981.
 4 Switzerland      Europe     2007    81.7   7554661    37506.          980.
 5 Australia        Oceania    2007    81.2  20434176    34435.          975.
 6 Spain            Europe     2007    80.9  40448191    28821.          971.
 7 Sweden           Europe     2007    80.9   9031088    33860.          971.
 8 Israel           Asia       2007    80.7   6426679    25523.          969.
 9 France           Europe     2007    80.7  61083916    30470.          968.
10 Canada           Americas   2007    80.7  33390141    36319.          968.
# ℹ 132 more rows