Grouping and summarizing

In this chapter, you’ll return to the topic of data transformation with dplyr to learn more ways to explore your data.

You’ve learned to use the filter verb to pull out individual observations, such as statistics for the United States in 2007. Now you’ll learn how to summarize many observations into a single data point.

The summarize verb

Suppose you want to know the average life expectancy across all countries and years in the dataset. You would do this with the summarize verb. Take your gapminder data, pipe it into summarize, and specify that you’re creating a summary column called meanLifeExp. The expression mean(lifeExp) is worth examining: it calls the function mean on the variable lifeExp. The mean function takes the average of a set of values, and R comes with many built-in functions like this. Notice that summarize collapses the entire table down into one row. In the output, we see the answer to our question: the mean life expectancy was about 59.47 years. If you think about it, though, it doesn’t really make sense to summarize across all countries and all years at once. It may make more sense to ask about averages in a particular year, such as 2007.

gapminder %>%
  summarize(meanLifeExp = mean(lifeExp))
# A tibble: 1 × 1
  meanLifeExp
        <dbl>
1        59.5
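
As a quick sanity check, the same number can be computed without dplyr by calling mean() directly on the column. The sketch below assumes the gapminder and dplyr packages are installed; the names with_dplyr and with_base are illustrative only.

```r
library(gapminder)
library(dplyr)

# summarize() returns a one-row tibble holding the summary column
with_dplyr <- gapminder %>%
  summarize(meanLifeExp = mean(lifeExp))

# Base R: call mean() directly on the lifeExp column
with_base <- mean(gapminder$lifeExp)

# Both routes average the same 1,704 values, so they should agree
nrow(with_dplyr)
all.equal(with_dplyr$meanLifeExp, with_base)
```

The dplyr version pays off once the summary is combined with other verbs such as filter, which the base R one-liner does not chain as naturally.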

Summarizing one year

To answer a question about a single year, you can combine the summarize verb with filter: filter your data for a particular year first, then summarize the result. This shows you that the average life expectancy in the year 2007 was about 67 years. You can also create multiple summaries at once with the summarize verb.

gapminder %>%
  filter(year == 2007) %>%
  summarize(meanLifeExp = mean(lifeExp))
# A tibble: 1 × 1
  meanLifeExp
        <dbl>
1        67.0

Summarizing into multiple columns

For example, suppose that along with finding the average life expectancy in 2007, you want to find the total population in that year. To do that, you add a comma after the mean of the life expectancy, and specify another column that you’re creating. You could give it a useful name like totalPop, and say that it’s equal to the sum (another built-in function) of the pop variable.

gapminder %>%
  filter(year == 2007) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))
# A tibble: 1 × 2
  meanLifeExp   totalPop
        <dbl>      <dbl>
1        67.0 6251013179

Functions you can use for summarizing

  • mean and sum are just two of the built-in functions you could use to summarize a variable within a dataset.
  • Another example is median: the median represents the point in a set of numbers where half the numbers are above that point and half of the numbers are below.
  • Two others are min, for minimum, and max, for maximum. In the exercises, you’ll use several of these functions.
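
To see several of these functions side by side, here is a minimal sketch that applies min, median, mean, and max to life expectancy in 2007. It assumes the gapminder and dplyr packages are installed; the object name life_2007 and the column names are illustrative choices, not part of the course material.

```r
library(gapminder)
library(dplyr)

# Several built-in summary functions applied to lifeExp in 2007
life_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(minLifeExp = min(lifeExp),
            medianLifeExp = median(lifeExp),
            meanLifeExp = mean(lifeExp),
            maxLifeExp = max(lifeExp))

# One row, one column per summary function
life_2007
```

Each argument to summarize becomes one column, so the result is still a single row.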

Exercise: Summarizing the median life expectancy

You’ve seen how to find the mean life expectancy and the total population across a set of observations, but mean() and sum() are only two of the functions R provides for summarizing a collection of numbers. Here, you’ll learn to use the median() function in combination with summarize().

Instructions
  • Use the median() function within a summarize() to find the median life expectancy. Save it into a column called medianLifeExp.
library(gapminder)
library(dplyr)

# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 × 1
  medianLifeExp
          <dbl>
1          60.7

Exercise: Summarizing the median life expectancy in 1957

Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you’ll find the median in the year 1957.

Instructions
  • Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.
library(gapminder)
library(dplyr)

# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 × 1
  medianLifeExp
          <dbl>
1          48.4

Exercise: Summarizing multiple variables in 1957

The summarize() verb allows you to summarize multiple variables at once. In this case, you’ll use the median() function to find the median life expectancy and the max() function to find the maximum GDP per capita.

Instructions
  • Find both the median life expectancy (lifeExp) and the maximum GDP per capita (gdpPercap) in the year 1957, calling them medianLifeExp and maxGdpPercap respectively. You can use the max() function to find the maximum.
library(gapminder)
library(dplyr)

# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
# A tibble: 1 × 2
  medianLifeExp maxGdpPercap
          <dbl>        <dbl>
1          48.4      113523.

The group_by verb

In the last set of exercises, you learned to use the summarize verb to answer questions about the entire dataset, or about a particular year. For example, here you’re finding the average life expectancy and the total population in the year 2007.

gapminder %>%
  filter(year == 2007) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))
# A tibble: 1 × 2
  meanLifeExp   totalPop
        <dbl>      <dbl>
1        67.0 6251013179

What if we weren’t interested just in the average for the year 2007, but for each of the years in the dataset? You could rerun this code and change the year each time, but that’s very tedious. Instead, you can use the group_by verb, which tells dplyr to summarize within groups instead of summarizing the entire dataset.
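
To make the tedium concrete, here is a sketch of the rerun-per-year approach for just three years. The helper summarize_year is hypothetical (it is not part of dplyr or of the course code), and the sketch assumes the gapminder and dplyr packages are installed.

```r
library(gapminder)
library(dplyr)

# Hypothetical helper: the filter-then-summarize pipeline for a single year
summarize_year <- function(y) {
  gapminder %>%
    filter(year == y) %>%
    summarize(year = y,
              meanLifeExp = mean(lifeExp))
}

# Manual approach: one call per year, with the results stacked by hand
manual <- bind_rows(summarize_year(1952),
                    summarize_year(1957),
                    summarize_year(1962))

manual
```

Covering all twelve years this way would take twelve calls, which is the repetition that group_by removes.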

Summarizing by year

gapminder %>%
  group_by(year) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))
# A tibble: 12 × 3
    year meanLifeExp   totalPop
   <int>       <dbl>      <dbl>
 1  1952        49.1 2406957150
 2  1957        51.5 2664404580
 3  1962        53.6 2899782974
 4  1967        55.7 3217478384
 5  1972        57.6 3576977158
 6  1977        59.6 3930045807
 7  1982        61.5 4289436840
 8  1987        63.2 4691477418
 9  1992        64.2 5110710260
10  1997        65.0 5515204472
11  2002        65.7 5886977579
12  2007        67.0 6251013179

Notice that this replaces filter(year == 2007) with group_by(year). group_by(year) tells the summarize step that it should perform the summary within each year: within 1952, then within 1957, then within 1962, and so on, and then combine the results. Instead of getting one row overall, you now get one row for each year. There’s now a year variable along with the new meanLifeExp and totalPop variables. This shows us that the total population started at about 2.4 billion in 1952 and went up to 6.25 billion in 2007. We can also see that average life expectancy went up from 49 years in 1952 to 67 years in 2007.

Summarizing by continent

You can summarize by other variables besides year. Suppose you’re interested in the average life expectancy and the total population in 2007 within each continent. You can find this by first filtering for the year 2007, grouping by continent (instead of year), and then performing your summary.

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))
# A tibble: 5 × 3
  continent meanLifeExp   totalPop
  <fct>           <dbl>      <dbl>
1 Africa           54.8  929539692
2 Americas         73.6  898871184
3 Asia             70.7 3811953827
4 Europe           77.6  586098529
5 Oceania          80.7   24549947

This results in a table with one row for each continent, with columns for mean life expectancy and total population. We can see that Europe and Oceania have the highest life expectancy, and that Asia and Africa have the lowest. Now that you’ve calculated these statistics for each continent in 2007, you might be interested in how they changed for each continent over time.
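
One way to convince yourself of what group_by does is to compare a single row of the grouped summary with the same figure computed through filter alone. The sketch below does this for Europe in 2007; it assumes the gapminder and dplyr packages are installed, and the object names are illustrative.

```r
library(gapminder)
library(dplyr)

# Grouped summary: one row per continent for 2007
continent_means_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(meanLifeExp = mean(lifeExp))

# The same figure for Europe alone, using filter instead of group_by
europe_2007 <- gapminder %>%
  filter(year == 2007, continent == "Europe") %>%
  summarize(meanLifeExp = mean(lifeExp))

# Pull the Europe row out of the grouped result and compare
grouped_europe <- continent_means_2007 %>%
  filter(continent == "Europe")

all.equal(grouped_europe$meanLifeExp, europe_2007$meanLifeExp)
```

Each row of the grouped result is the summary you would have obtained by filtering down to that group first.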

Summarizing by continent and year

gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(pop),
            meanLifeExp = mean(lifeExp))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 60 × 4
# Groups:   year [12]
    year continent   totalPop meanLifeExp
   <int> <fct>          <dbl>       <dbl>
 1  1952 Africa     237640501        39.1
 2  1952 Americas   345152446        53.3
 3  1952 Asia      1395357351        46.3
 4  1952 Europe     418120846        64.4
 5  1952 Oceania     10686006        69.3
 6  1957 Africa     264837738        41.3
 7  1957 Americas   386953916        56.0
 8  1957 Asia      1562780599        49.3
 9  1957 Europe     437890351        66.7
10  1957 Oceania     11941976        70.3
# ℹ 50 more rows

To see how these statistics changed within each continent over time, you can summarize by both year and continent by passing both variables to group_by, as in group_by(year, continent). Now the output has one row for each combination of year and continent. For example, we see the total population and average life expectancy in 1952 for Africa, the Americas, Asia, Europe, and Oceania, followed by each of the continent-level summaries for 1957. In the next section, you’ll learn how to visualize this per-year, per-continent data to understand trends over time.
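
The message printed with this output says that the result is still grouped by year and that the .groups argument can override this. As a minimal sketch, assuming dplyr 1.0.0 or later and the gapminder package, passing .groups = "drop" returns an ungrouped table and silences the message. The object names below are illustrative.

```r
library(gapminder)
library(dplyr)

# Default behaviour: summarize() removes only the last grouping variable
# (continent), so the result is still grouped by year
still_grouped <- gapminder %>%
  group_by(year, continent) %>%
  summarize(meanLifeExp = mean(lifeExp))

group_vars(still_grouped)

# .groups = "drop" removes all grouping from the result
ungrouped <- gapminder %>%
  group_by(year, continent) %>%
  summarize(meanLifeExp = mean(lifeExp), .groups = "drop")

is_grouped_df(ungrouped)
```

Both tables contain the same 60 rows; they differ only in whether later verbs would operate within each year or on the table as a whole.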

Exercise: Summarizing by year

In a previous exercise, you found the median life expectancy and the maximum GDP per capita in the year 1957. Now, you’ll perform those two summaries within each year in the dataset, using the group_by verb.

Instructions
  • Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each year, saving them into medianLifeExp and maxGdpPercap, respectively.
library(gapminder)
library(dplyr)

# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
# A tibble: 12 × 3
    year medianLifeExp maxGdpPercap
   <int>         <dbl>        <dbl>
 1  1952          45.1      108382.
 2  1957          48.4      113523.
 3  1962          50.9       95458.
 4  1967          53.8       80895.
 5  1972          56.5      109348.
 6  1977          59.7       59265.
 7  1982          62.4       33693.
 8  1987          65.8       31541.
 9  1992          67.7       34933.
10  1997          69.4       41283.
11  2002          70.8       44684.
12  2007          71.9       49357.

Exercise: Summarizing by continent

You can group by any variable in your dataset to create a summary. Rather than comparing across time, you might be interested in comparing among continents. You’ll want to do that within one year of the dataset: let’s use 1957.

Instructions
  • Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.
library(gapminder)
library(dplyr)

# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
# A tibble: 5 × 3
  continent medianLifeExp maxGdpPercap
  <fct>             <dbl>        <dbl>
1 Africa             40.6        5487.
2 Americas           56.1       14847.
3 Asia               48.3      113523.
4 Europe             67.6       17909.
5 Oceania            70.3       12247.

Exercise: Summarizing by continent and year

Instead of grouping just by year, or just by continent, you’ll now group by both continent and year to summarize within each.

Instructions
  • Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.
library(gapminder)
library(dplyr)

# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
  group_by(year, continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 60 × 4
# Groups:   year [12]
    year continent medianLifeExp maxGdpPercap
   <int> <fct>             <dbl>        <dbl>
 1  1952 Africa             38.8        4725.
 2  1952 Americas           54.7       13990.
 3  1952 Asia               44.9      108382.
 4  1952 Europe             65.9       14734.
 5  1952 Oceania            69.3       10557.
 6  1957 Africa             40.6        5487.
 7  1957 Americas           56.1       14847.
 8  1957 Asia               48.3      113523.
 9  1957 Europe             67.6       17909.
10  1957 Oceania            70.3       12247.
# ℹ 50 more rows

Visualizing summarized data

In the previous section, you learned to use the group_by and summarize verbs to summarize the gapminder data by year, by continent, or by both. Now you’ll learn how to turn those summaries into informative visualizations by returning to the ggplot2 package.

Summarizing by year

In the last section we summarized data by year, to find the change in population and in mean life expectancy over time. Now instead of viewing the summarized data as a table, let’s save it as an object called by_year, so you can visualize the data using ggplot2.

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(totalPop = sum(pop),
            meanLifeExp = mean(lifeExp))

by_year
# A tibble: 12 × 3
    year   totalPop meanLifeExp
   <int>      <dbl>       <dbl>
 1  1952 2406957150        49.1
 2  1957 2664404580        51.5
 3  1962 2899782974        53.6
 4  1967 3217478384        55.7
 5  1972 3576977158        57.6
 6  1977 3930045807        59.6
 7  1982 4289436840        61.5
 8  1987 4691477418        63.2
 9  1992 5110710260        64.2
10  1997 5515204472        65.0
11  2002 5886977579        65.7
12  2007 6251013179        67.0

Visualizing population over time

You would construct the graph with the three steps of ggplot2:
  • The data, which is by_year.
  • The aesthetics, which put year on the x-axis and total population on the y-axis.
  • The type of graph, which in this case is a scatter plot, represented by geom_point.

ggplot(by_year, aes(x = year, y = totalPop)) +
  geom_point()

Notice that the steps are the same as when you were graphing countries in a scatter plot, even though it’s a new dataset. The resulting graph of population by year shows the change in the total population, which is going up over time. ggplot2 puts the y-axis in scientific notation, since showing each value with nine zeros would be hard to read. The global population starts at about 2.4 × 10^9 (2.4 billion), a little under the 3 billion mark, and goes up to more than 6 billion.

Starting y-axis at zero

You might notice that the graph is a little misleading because it doesn’t include zero: you don’t have a sense of how much the population grew relative to where it was when it started. This is a good time to introduce another graphing option.

ggplot(by_year, aes(x = year, y = totalPop)) +
  geom_point() +
  expand_limits(y = 0)

By adding expand_limits(y = 0) to the end of the ggplot call, you can specify that you want the y-axis to include zero. Notice that you added it to the end just like you would with scale_x_log10 or facet_wrap. Now the graph makes it clearer that the population more than doubled during this time, growing roughly 2.6-fold from about 2.4 billion to 6.25 billion.

You could have created other graphs of summarized data, such as a graph of the average life expectancy over time, by changing the y aesthetic.
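
As a sketch of that variation, the code below rebuilds a by-year summary and maps meanLifeExp to the y aesthetic instead of totalPop. It assumes the gapminder, dplyr, and ggplot2 packages are installed; the name life_plot is illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Per-year summary with the column we want on the y-axis
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(meanLifeExp = mean(lifeExp))

# Same three steps as before; only the y aesthetic has changed
life_plot <- ggplot(by_year, aes(x = year, y = meanLifeExp)) +
  geom_point() +
  expand_limits(y = 0)

life_plot
```

Only the column named in aes changes; the data, the geom, and the expand_limits call stay the same.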

Summarizing by year and continent

So far you’ve been graphing the by-year summarized data. But you have also learned to summarize after grouping by both year and continent, to see how the changes in population have occurred separately within each continent.

by_year_continent = gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(pop),
            meanLifeExp = mean(lifeExp))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
by_year_continent 
# A tibble: 60 × 4
# Groups:   year [12]
    year continent   totalPop meanLifeExp
   <int> <fct>          <dbl>       <dbl>
 1  1952 Africa     237640501        39.1
 2  1952 Americas   345152446        53.3
 3  1952 Asia      1395357351        46.3
 4  1952 Europe     418120846        64.4
 5  1952 Oceania     10686006        69.3
 6  1957 Africa     264837738        41.3
 7  1957 Americas   386953916        56.0
 8  1957 Asia      1562780599        49.3
 9  1957 Europe     437890351        66.7
10  1957 Oceania     11941976        70.3
# ℹ 50 more rows

Visualizing population by year and continent

Since you now have data over time within each continent, you need a way to separate it in a visualization. To do that, you can use the color aesthetic you learned about in chapter two. By setting color = continent, you can show five separate trends on the same graph.

ggplot(by_year_continent, aes(x = year, y = totalPop, color = continent)) +
  geom_point() +
  expand_limits(y = 0)

This lets us see that Asia was always the most populated continent and has been growing the most rapidly, that Europe has a slower rate of growth, and that Africa has grown to surpass both Europe and the Americas in terms of population.
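
The ordering read off the graph can also be checked numerically. The sketch below ranks continents by total population in 1952 and in 2007. It assumes the gapminder and dplyr packages are installed and uses arrange and desc from dplyr for sorting. The helper rank_continents is hypothetical, and as.numeric() is a defensive choice in case pop is stored as an integer column, where a sum in the billions would overflow.

```r
library(gapminder)
library(dplyr)

# Hypothetical helper: continents in one year, largest population first
rank_continents <- function(y) {
  gapminder %>%
    filter(year == y) %>%
    group_by(continent) %>%
    # as.numeric() avoids integer overflow if pop is an integer column
    summarize(totalPop = sum(as.numeric(pop))) %>%
    arrange(desc(totalPop))
}

rank_1952 <- rank_continents(1952)
rank_2007 <- rank_continents(2007)

rank_1952
rank_2007
```

Comparing the two tables shows Africa moving from fourth place in 1952 to second place in 2007, behind only Asia.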

Exercise: Visualizing median life expectancy over time

In the last chapter, you summarized the gapminder data to calculate the median life expectancy within each year. This code is provided for you, and the result is saved as the by_year dataset. Now you can use the ggplot2 package to turn this into a visualization of changing life expectancy over time.

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

by_year
# A tibble: 12 × 3
    year medianLifeExp maxGdpPercap
   <int>         <dbl>        <dbl>
 1  1952          45.1      108382.
 2  1957          48.4      113523.
 3  1962          50.9       95458.
 4  1967          53.8       80895.
 5  1972          56.5      109348.
 6  1977          59.7       59265.
 7  1982          62.4       33693.
 8  1987          65.8       31541.
 9  1992          67.7       34933.
10  1997          69.4       41283.
11  2002          70.8       44684.
12  2007          71.9       49357.
Instructions
  • Use the by_year dataset to create a scatter plot showing the change of median life expectancy over time, with year on the x-axis and medianLifeExp on the y-axis. Be sure to add expand_limits(y = 0) to make sure the plot’s y-axis includes zero.

library(gapminder)
library(dplyr)
library(ggplot2)

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)

Exercise: Visualizing median GDP per capita per continent over time

In the last exercise you were able to see how the median life expectancy of countries changed over time. Now you’ll examine the median GDP per capita instead, and see how the trend differs among continents.

Instructions
  • Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap.
  • Use the assignment operator = to save this summarized data as by_year_continent.
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 60 × 3
# Groups:   continent [5]
   continent  year medianGdpPercap
   <fct>     <int>           <dbl>
 1 Africa     1952            987.
 2 Africa     1957           1024.
 3 Africa     1962           1134.
 4 Africa     1967           1210.
 5 Africa     1972           1443.
 6 Africa     1977           1400.
 7 Africa     1982           1324.
 8 Africa     1987           1220.
 9 Africa     1992           1162.
10 Africa     1997           1180.
# ℹ 50 more rows
  • Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.

library(gapminder)
library(dplyr)
library(ggplot2)

# Summarize medianGdpPercap within each continent within each year: by_year_continent

by_year_continent = gapminder %>%
  group_by(continent, year) %>%
  summarize(medianGdpPercap = median(gdpPercap))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)

Exercise: Comparing median life expectancy and median GDP per continent in 2007

In these exercises you’ve generally created plots that show change over time. But as another way of exploring your data visually, you can also use ggplot2 to plot summarized data to compare continents within a single year.

Instructions
  • Filter the gapminder dataset for the year 2007, then summarize the median life expectancy and the median GDP per capita within each continent, into columns called medianLifeExp and medianGdpPercap. Save this as by_continent_2007.
# A tibble: 5 × 3
  continent medianLifeExp medianGdpPercap
  <fct>             <dbl>           <dbl>
1 Africa             52.9           1452.
2 Americas           72.9           8948.
3 Asia               72.4           4471.
4 Europe             78.6          28054.
5 Oceania            80.7          29810.
  • Use the by_continent_2007 data to create a scatter plot comparing these summary statistics for continents in 2007, putting the median GDP per capita on the x-axis and the median life expectancy on the y-axis. Color the scatter plot by continent. You don’t need to add expand_limits(y = 0) for this plot.

library(gapminder)
library(dplyr)
library(ggplot2)

# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 = gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            medianGdpPercap = median(gdpPercap))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point()