4  Lab 3 - 10/10/2024

In this lecture we will learn another R programming approach based on the tidyverse package. This is alternative to the base R code we learnt in the first lectures.

4.1 The pipe operator

Let’s consider a general R function named f with argument x. We usually use the following approach when we need to apply f:

f(x)

An alternative is given by the pipe operator %>% which is part of the dplyr package (see here for more details). It works as follows

x %>% f() 
#this is equivalent to f(x)

Basically, the pipe tells R to pass x as the first argument of the function f. The shortcut to type the pipe operator in RStudio is given by CTRL/CMD Shift M.

We simulate a sample of data in order to run some simple examples with the pipe operator. By using the function sample we draw randomly (without replacement) 5 numbers between 1 and 20 (1:20).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(4)
# sample 5 random values from a sequence of integers between 1 to 20
x = sample(1:20, 5)
x
[1] 11 19  3  7 12

We are now interested in computing the log transformation of the vector x. By adopting the standard R programming we would use:

log(x)
[1] 2.397895 2.944439 1.098612 1.945910 2.484907

while with the pipe operator we have

x %>% log()
[1] 2.397895 2.944439 1.098612 1.945910 2.484907
#it's also possible to omit the parentheses given that there is no input
x %>% log
[1] 2.397895 2.944439 1.098612 1.945910 2.484907

where x is taken as the first argument of the function log. It is also possible to include other arguments, such as for example the base of the logarithm (in this case equal to 5). In this case note that x %>% f(y) is equivalent to f(x,y).

#standard programming
log(x, base=5)
[1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593
#pipe based programming
x %>% log(base=5)
[1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593
x %>% log(5) 
[1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593

We want now to apply the log transformation and then round the corresponding output to 2 digits. This requires the use of two functions (log and round). In general, when we apply 3 functions (f and then g and finally h), we have that x %>% f %>% g %>% h is equivalent to h(g(f(x))).

# standard programming
round(log(x), 2)
[1] 2.40 2.94 1.10 1.95 2.48
# pipe based programming
x %>%  
  log %>% 
  round(2)
[1] 2.40 2.94 1.10 1.95 2.48

We now add a new function, after rounding the log output we compute the sum of the 5 numbers

# standard programming
sum(round(log(x),2))
[1] 10.87
# pipe based programming
x %>% 
  log %>% 
  round(2) %>% 
  sum 
[1] 10.87

Now, we want now to use the sum result as the base of the log transformation of the number 5.

# standard programming
log(5, base = sum(round(log(x),2)))
[1] 0.674532
# pipe based programming
x %>% 
  log %>% 
  round(2) %>% 
  sum %>% 
  log(5,base=.) 
[1] 0.674532

The symbol . is the placeholder and is used when the output of the previous pipe should not be used as the first input of the following function. In general, x %>% f(y, z = .) is equivalent to f(y, z = x).

When it is not convenient to use the pipe:

  • when the pipes are longer than 10 steps. In this case the suggestion is to create intermediate objects with meaningful names (that can help understanding what the code does);
  • when you have multiple inputs or outputs (e.g. when there is no primary object being transformed but two or more objects being combined together).

4.2 dyplyr verbs

dplyr (a package in the tidyverse collection) is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • select: pick variables (columns) based on their names
  • filter pick observations (rows) based on their values
  • summarise: reduce multiple values down to a single summary (e.g. mean)
  • mutate: add new variables that are functions of existing variables
  • arrange: change the ordering of the rows

All verbs work similarly:

  • the first argument is a data frame;
  • the subsequent arguments describe what to do with the data frame using the variable names (without quotes);
  • the result is a new data frame.

In the following we will take into account all the dplyr verbs by considering the diamonds data set which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds). If we use the function class to understand the nature of diamonds we get the following output:

class(diamonds)
[1] "tbl_df"     "tbl"        "data.frame"

The term tbl (tibble) is the tidyverse version of a classical R data frame. Tiblles are very similar to data frame (they just contain/display more information) and are designed to be used with the tidyverse syntax style.

To get the list and the type of variables included in diamonds we can use the standard str or the corresponding tidyverse function named glimpse:

str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

4.2.1 Verb 1: select

This verb is used to select some of the columns by name. For example, to select the variable named carat we use the following code:

diamonds %>% 
  select(carat) 
# A tibble: 53,940 × 1
   carat
   <dbl>
 1  0.23
 2  0.21
 3  0.23
 4  0.29
 5  0.31
 6  0.24
 7  0.24
 8  0.26
 9  0.22
10  0.23
# ℹ 53,930 more rows
#same as
diamonds[,1]
# A tibble: 53,940 × 1
   carat
   <dbl>
 1  0.23
 2  0.21
 3  0.23
 4  0.29
 5  0.31
 6  0.24
 7  0.24
 8  0.26
 9  0.22
10  0.23
# ℹ 53,930 more rows

Selecting more than one column is very simple:

diamonds %>% 
  select(carat, cut, color, price)
# A tibble: 53,940 × 4
   carat cut       color price
   <dbl> <ord>     <ord> <int>
 1  0.23 Ideal     E       326
 2  0.21 Premium   E       326
 3  0.23 Good      E       327
 4  0.29 Premium   I       334
 5  0.31 Good      J       335
 6  0.24 Very Good J       336
 7  0.24 Very Good I       336
 8  0.26 Very Good H       337
 9  0.22 Fair      E       337
10  0.23 Very Good H       338
# ℹ 53,930 more rows

Given that carat, cut and color are consecutive, the following code is also possible:

view(diamonds) #see the data frame
diamonds %>% 
  select(carat : color, price)
# A tibble: 53,940 × 4
   carat cut       color price
   <dbl> <ord>     <ord> <int>
 1  0.23 Ideal     E       326
 2  0.21 Premium   E       326
 3  0.23 Good      E       327
 4  0.29 Premium   I       334
 5  0.31 Good      J       335
 6  0.24 Very Good J       336
 7  0.24 Very Good I       336
 8  0.26 Very Good H       337
 9  0.22 Fair      E       337
10  0.23 Very Good H       338
# ℹ 53,930 more rows

Moreover, it is also possible to select all the columns whose name starts with the letter “c” by using starts_with combined with the select function:

diamonds %>% 
  select(starts_with("c"))
# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows

It is also possible to specify a criterion which excludes from the selection some variables. For example, to select all the columns but carat we use the - symbol:

diamonds %>% 
  select(-carat)
# A tibble: 53,940 × 9
   cut       color clarity depth table price     x     y     z
   <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

We proceed similarly to select all the columns but not the ones with a name starting with “c”:

diamonds %>% 
  select(- starts_with("c"))
# A tibble: 53,940 × 6
   depth table price     x     y     z
   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  61.5    55   326  3.95  3.98  2.43
 2  59.8    61   326  3.89  3.84  2.31
 3  56.9    65   327  4.05  4.07  2.31
 4  62.4    58   334  4.2   4.23  2.63
 5  63.3    58   335  4.34  4.35  2.75
 6  62.8    57   336  3.94  3.96  2.48
 7  62.3    57   336  3.95  3.98  2.47
 8  61.9    55   337  4.07  4.11  2.53
 9  65.1    61   337  3.87  3.78  2.49
10  59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Another useful application is the selection of only the numerical variables. This can be performed by using the select_if function combined with is.numeric (the latter is a test of an object being interpretable as numbers):

diamonds %>% 
  select_if(is.numeric)
# A tibble: 53,940 × 7
   carat depth table price     x     y     z
   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23  61.5    55   326  3.95  3.98  2.43
 2  0.21  59.8    61   326  3.89  3.84  2.31
 3  0.23  56.9    65   327  4.05  4.07  2.31
 4  0.29  62.4    58   334  4.2   4.23  2.63
 5  0.31  63.3    58   335  4.34  4.35  2.75
 6  0.24  62.8    57   336  3.94  3.96  2.48
 7  0.24  62.3    57   336  3.95  3.98  2.47
 8  0.26  61.9    55   337  4.07  4.11  2.53
 9  0.22  65.1    61   337  3.87  3.78  2.49
10  0.23  59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

The returned output can be then used to compute jointly all the averages of the numerical variables by using the apply function):

diamonds %>% 
  select_if(is.numeric) %>% 
  apply(2, mean)# remember 2 means per column
       carat        depth        table        price            x            y 
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
           z 
   3.5387338 

4.2.2 Verb 2: filter

The verb filter can be used to select the observations satisfying some criterion. Consider the example the selection of the diamonds with variable cut equal to the category “Premium”:

diamonds %>% 
  filter(cut == "Premium")
# A tibble: 13,791 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
 3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
 4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
 7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
 8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
 9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
# ℹ 13,781 more rows

It is also possible to include more conditions. Select for example the diamonds with cut equal to “Premium” AND color equal to “D” (the AND can be specified by using & or comma):

diamonds %>% 
  filter(cut == "Premium" & color == "D") 
# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows
diamonds %>% 
  filter(cut == "Premium" , color == "D") 
# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows

If we are interested in filtering th diamonds with a price between 500 and 600 dollars the following two alternative codes can be adopted:

diamonds %>% 
  filter(price > 500 & price < 600) #excluded 
# A tibble: 2,360 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.35 Ideal     I     VS1      60.9  57     552  4.54  4.59  2.78
 2  0.3  Premium   D     SI1      62.6  59     552  4.23  4.27  2.66
 3  0.3  Ideal     D     SI1      62.5  57     552  4.29  4.32  2.69
 4  0.3  Ideal     D     SI1      62.1  56     552  4.3   4.33  2.68
 5  0.42 Premium   I     SI2      61.5  59     552  4.78  4.84  2.96
 6  0.28 Ideal     G     VVS2     61.4  56     553  4.19  4.22  2.58
 7  0.32 Ideal     I     VVS1     62    55.3   553  4.39  4.42  2.73
 8  0.31 Very Good G     SI1      63.3  57     553  4.33  4.3   2.73
 9  0.31 Premium   G     SI1      61.8  58     553  4.35  4.32  2.68
10  0.24 Premium   E     VVS1     60.7  58     553  4.01  4.03  2.44
# ℹ 2,350 more rows
#?between
diamonds %>% 
  filter(between(price, 500,600)) #included
# A tibble: 2,407 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.35 Ideal     I     VS1      60.9  57     552  4.54  4.59  2.78
 2  0.3  Premium   D     SI1      62.6  59     552  4.23  4.27  2.66
 3  0.3  Ideal     D     SI1      62.5  57     552  4.29  4.32  2.69
 4  0.3  Ideal     D     SI1      62.1  56     552  4.3   4.33  2.68
 5  0.42 Premium   I     SI2      61.5  59     552  4.78  4.84  2.96
 6  0.28 Ideal     G     VVS2     61.4  56     553  4.19  4.22  2.58
 7  0.32 Ideal     I     VVS1     62    55.3   553  4.39  4.42  2.73
 8  0.31 Very Good G     SI1      63.3  57     553  4.33  4.3   2.73
 9  0.31 Premium   G     SI1      61.8  58     553  4.35  4.32  2.68
10  0.24 Premium   E     VVS1     60.7  58     553  4.01  4.03  2.44
# ℹ 2,397 more rows

The output of a selection can always be saved in a new data frame as follows:

diamonds_filtered = diamonds %>% 
  filter(between(price, 500,600))

To select diamonds with cut equal to fair OR good we can use the | operator or the %in% function:

diamonds %>% 
  filter(cut=="Fair" | cut=="Good")
# A tibble: 6,516 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 6,506 more rows
diamonds %>% 
  filter(cut %in% c("Fair","Good"))
# A tibble: 6,516 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 6,506 more rows

The code diamonds %>% filter(cut == c("Fair","Good")) does not perform what we mean to do. For an interesting discussion about the difference between == and %in% see here.

4.2.3 Verb 3: summarise

This verbs can be used to compute summary statistics. For example the following code computes the mean and median of price.

diamonds %>% 
  summarise(mean(price),
            median(price))
# A tibble: 1 × 2
  `mean(price)` `median(price)`
          <dbl>           <dbl>
1         3933.            2401

The output is automatically labelled but it is also possible to specify different labels as follows:

diamonds %>% 
  summarise(mean_price = mean(price),
            median_price = median(price))
# A tibble: 1 × 2
  mean_price median_price
       <dbl>        <dbl>
1      3933.         2401

Similarly, it is possible to compute some summary statistics on a new variable which is a transformation of the variables originally contained in the data frame. For example, we are interested in the mean of a new variable given by the price per unit given by the ratio between price and carat:

diamonds %>% 
  summarise(mean_price = mean(price),
            mean_carat = mean(carat),
            mean_unitprice = mean(price/carat))
# A tibble: 1 × 3
  mean_price mean_carat mean_unitprice
       <dbl>      <dbl>          <dbl>
1      3933.      0.798          4008.

Another transformation of the original variables consists in considering if a price is bigger than 10000$ (T or F) and then computing how many observations (absolute frequency, proportion or percentage) satisfy the condition. This is computed by using the sum or mean functions:

diamonds %>% 
  summarise(veryexp = sum(price > 10000), #absolute frequencies
            veryexp_prop = mean(price>10000), #proportion 
            veryexp_perc = mean(price>10000)*100) #percentage
# A tibble: 1 × 3
  veryexp veryexp_prop veryexp_perc
    <int>        <dbl>        <dbl>
1    5222       0.0968         9.68

It is also interesting to compute summary statistics conditionally on the categories of a qualitative variable (i.e. a factor). This can be done by combining the group_by function with the summarise function. Let’s compute for example the mean, min and max price conditionally on the categories of cut:

diamonds %>% 
  group_by(cut) %>% 
  summarise(price_mean = mean(price),
            price_min = min(price),
            price_max = max(price))
# A tibble: 5 × 4
  cut       price_mean price_min price_max
  <ord>          <dbl>     <int>     <int>
1 Fair           4359.       337     18574
2 Good           3929.       327     18788
3 Very Good      3982.       336     18818
4 Premium        4584.       326     18823
5 Ideal          3458.       326     18806

In particular, group_by splits the original data set into different groups (according to the category of the group_by variable) and for each of them the requested summary statistics are computed. In this case we obtain 5 values of the mean, min and max price according to the number of categories of the variable cut. It is also possible to condition on two different factors, such as for example cut and color:

diamonds %>% 
  group_by(cut,color) %>% 
  summarise(price_mean =mean(price)) 
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 3
# Groups:   cut [5]
   cut   color price_mean
   <ord> <ord>      <dbl>
 1 Fair  D          4291.
 2 Fair  E          3682.
 3 Fair  F          3827.
 4 Fair  G          4239.
 5 Fair  H          5136.
 6 Fair  I          4685.
 7 Fair  J          4976.
 8 Good  D          3405.
 9 Good  E          3424.
10 Good  F          3496.
# ℹ 25 more rows

In this case each category of cut is combined with each category of color and then the mean price is computed.

The function group_by can by combined with summarise also to compute frequency distribution, by means of the n() function which gives the current group size:

diamonds %>% 
  group_by(cut) %>% 
  summarise(AbsFreq = n())
# A tibble: 5 × 2
  cut       AbsFreq
  <ord>       <int>
1 Fair         1610
2 Good         4906
3 Very Good   12082
4 Premium     13791
5 Ideal       21551

This is an alternative to the table standard function:

table(diamonds$cut)

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 

Similarly, percentages can be easily computed by using AbsFreq represents a new variable available for new computations:

diamonds %>% 
  group_by(cut) %>% 
  summarise(AbsFreq = n(),
            Perc =AbsFreq/nrow(diamonds)*100)
# A tibble: 5 × 3
  cut       AbsFreq  Perc
  <ord>       <int> <dbl>
1 Fair         1610  2.98
2 Good         4906  9.10
3 Very Good   12082 22.4 
4 Premium     13791 25.6 
5 Ideal       21551 40.0 

A shorter alternative for computing the frequency table is given by the count function which let you quickly count the unique values of one or more variables. Basically, df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n()). Here below we have an example:

diamonds %>% 
  count(cut)
# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

4.3 Verb 4: mutate

The verb mutate can be used to create new column in the data frame. For example, let’s create a new column containing the price per unit defined previously. Moreover, we want this variable to be saved in the data frame (for this reason we create a new object named diamonds which substitutes its previous version):

diamonds = diamonds %>% 
  mutate(priceunit = price/carat)

We now want to create a new categorical variables with two categories: “Yes” if the price is < 1000$, “No” otherwise. For doing this we will make use of the ifelse function, already introduced in Section @ref(conditionalstatement). We will call the new column as pricecat and will save the new data frame (with one more column) in a new object named newdiamonds:

newdiamonds = diamonds %>% 
  mutate(pricecat = ifelse(price < 1000, "Yes", "No"))
glimpse(newdiamonds)
Rows: 53,940
Columns: 12
$ carat     <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, …
$ cut       <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, V…
$ color     <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, …
$ clarity   <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1…
$ depth     <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, …
$ table     <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, …
$ price     <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, …
$ x         <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, …
$ y         <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, …
$ z         <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, …
$ priceunit <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, …
$ pricecat  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…

To derive the frequency distribution of pricecat we proceed as described above, computing also percentages by means of mutate:

newdiamonds %>% 
  count(pricecat) %>%  # Absolute Frequencies
  #we want to add two new columns: Proportion andPercentages.
  mutate(Proportions=n/nrow(newdiamonds),
         Percentages=n/nrow(newdiamonds)*100)
# A tibble: 2 × 4
  pricecat     n Proportions Percentages
  <chr>    <int>       <dbl>       <dbl>
1 No       39441       0.731        73.1
2 Yes      14499       0.269        26.9
#same as
newdiamonds %>% 
  group_by(pricecat) %>% 
  summarise(AbsFreq = n(),
            Proportions=AbsFreq/nrow(newdiamonds),
            Perc =AbsFreq/nrow(newdiamonds)*100)
# A tibble: 2 × 4
  pricecat AbsFreq Proportions  Perc
  <chr>      <int>       <dbl> <dbl>
1 No         39441       0.731  73.1
2 Yes        14499       0.269  26.9

4.4 Verb 5: arrange

The verb arrange can be used to sort observations with respect to the values of a given variable. For example, we can sort diamonds according to price (by default the ascending order is adopted):

diamonds %>% 
  arrange(price) %>% 
  tail
# A tibble: 6 × 11
  carat cut       color clarity depth table price     x     y     z priceunit
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
1  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24     8208.
2  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01     9402.
3  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11     9084.
4  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56    12454.
5  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04     9409 
6  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16     8220.

With the function tail we have a preview of the 6 bottom lines which contains, in this case, the diamonds with the highest price.

If we need to use a descending ordering we will use the - inside arrange:

diamonds %>% 
  arrange(- price)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z priceunit
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16     8220.
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04     9409 
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56    12454.
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11     9084.
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01     9402.
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24     8208.
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84     9213.
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91     9398.
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7     10989.
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21     8740 
# ℹ 53,930 more rows

Let’s now use the arrange function in order to sort the categories of a factor (cut) according to the corresponding frequencies:

diamonds %>% 
  group_by(cut) %>% 
  summarise(freq = n()) %>% 
  arrange(desc(freq))
# A tibble: 5 × 2
  cut        freq
  <ord>     <int>
1 Ideal     21551
2 Premium   13791
3 Very Good 12082
4 Good       4906
5 Fair       1610
#or
diamonds %>% 
  count(cut) %>% 
  arrange(-n)
# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Ideal     21551
2 Premium   13791
3 Very Good 12082
4 Good       4906
5 Fair       1610
# if arrange is not specified we get as result the ascending order based of frequencies.
diamonds %>% 
  group_by(cut) %>% 
  summarise(freq = n()) 
# A tibble: 5 × 2
  cut        freq
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

It can be also useful to visualize a given number or proportion observations with the highest/lowest values of a given variable. The following code for example extract first the 10 observations with the highest values of carat (using slice_max) and among them the 50% of the observations with the lowest price (with slice_min):

diamonds %>% 
  slice_max(carat, n=10) 
# A tibble: 10 × 11
   carat cut       color clarity depth table price     x     y     z priceunit
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98     3596.
 2  4.5  Fair      J     I1       65.8    58 18531 10.2  10.2   6.72     4118 
 3  4.13 Fair      H     I1       64.8    61 17329 10     9.85  6.43     4196.
 4  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17     3796.
 5  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24     3796.
 6  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31     3996 
 7  3.67 Premium   I     I1       62.4    56 16193  9.86  9.81  6.13     4412.
 8  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38     3197.
 9  3.51 Premium   J     VS2      62.5    59 18701  9.66  9.63  6.03     5328.
10  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03     3596.
diamonds %>% 
  slice_max(carat, n=10) %>% 
  #select half of the observations (10), the lowest ones.
  slice_min(price, prop=0.5) 
# A tibble: 5 × 11
  carat cut       color clarity depth table price     x     y     z priceunit
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
1  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38     3197.
2  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03     3596.
3  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17     3796.
4  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24     3796.
5  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31     3996 

4.5 Exercises Lecture 4

4.5.1 Exercise 1

Write the code for the following operations:

  1. Simulate a vector of 15 values from the continuous Uniform distribution defined between 0 and 10 (see ?runif). Set the seed equal to 99.
  2. Round the numbers in the vector to two digits.
  3. Sort the values in the vector in descending order (see ?sort)
  4. Display a few of the largest values with head().

Use both the standard R programming approach and the more modern approach based on the use of the pipe %>% (remember to load the tidyverse library).

4.5.2 Exercise 2

Consider the mtcars data set which is available in R. Type the following code to explore the variables included in the data set (for an explanation of the variables see ?mtcars:

Use the following code to create a new variable named car_model that contains the names of the cars, now available as row names.

  1. The variables cyl and am are considered as numerical variable (dbl). Transform them into two factors (decide if you want to set labels):

  2. How many observations and variables are available?

  3. Print (on the screen) the hp variable using the select() function. Try also to use the pull() function. Which is the difference?

  4. Print out all but the hp column using the select() function.

  5. Print out the following variables: mpg, hp, vs, am, gear. Suggestion: use : if necessary.

  6. Provide the frequency distribution (absolute, relative and percentage) for the am variable (transmission).

  7. Select all the observations which have mpg>20 and hp>100 and compute the average for the mpg variable.

  8. Compute the distribution (absolute frequencies) of the car by gear. Moreover, compute the mean consumption (mpg) conditionally on the gear.

4.5.3 Exercise 3

Consider the iris dataset available in R. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for some flowers from each of 3 species of iris. See ?iris.

  1. Select the variables Sepal.Length, Sepal.Width and Petal.Length

  2. Select all the rows with Sepal.Length bigger than 5.

  3. Compute the frequency distribution (absolute, relative and percentage frequencies) of Species. Which are the available species?

  4. Compute the average and the standard deviation for all the numeric column of iris. Then compute the same statistics by conditioning on species. Suggestion: use the function summarise_if (see ?summarise_if, in this case test if the column is.numeric):

  5. Compute the average of Sepal.Length and Petal.Length by Species. Which is the species with the highest average sepal/petal length?

4.5.4 Solutions

Write the code for the following operations:

  1. Simulate a vector of 15 values from the continuous Uniform distribution defined between 0 and 10 (see ?runif). Set the seed equal to 99.
library(tidyverse)
set.seed(99)
a =runif(15, 0,10)
a
 [1] 5.847119 1.137817 6.842647 9.925088 5.349936 9.666141 6.714276 2.945777
 [9] 3.583630 1.753148 5.488174 5.054517 1.938365 6.369041 6.878001

Use both the standard R programming approach and the more modern approach based on the use of the pipe %>% (remember to load the tidyverse library).

  1. Round the numbers in the vector to two digits.
#normal
round(a,2)
 [1] 5.85 1.14 6.84 9.93 5.35 9.67 6.71 2.95 3.58 1.75 5.49 5.05 1.94 6.37 6.88
#tidyverse
a %>% 
  round(2)
 [1] 5.85 1.14 6.84 9.93 5.35 9.67 6.71 2.95 3.58 1.75 5.49 5.05 1.94 6.37 6.88
  1. Sort the values in the vector in descending order (see ?sort)
#normal
sort(round(a,2), TRUE)
 [1] 9.93 9.67 6.88 6.84 6.71 6.37 5.85 5.49 5.35 5.05 3.58 2.95 1.94 1.75 1.14
#tidyverse
a %>% 
  round(2) %>% 
  sort(TRUE)
 [1] 9.93 9.67 6.88 6.84 6.71 6.37 5.85 5.49 5.35 5.05 3.58 2.95 1.94 1.75 1.14
  1. Display a few of the largest values with head().
#normal
head(sort(round(a,2), TRUE))
[1] 9.93 9.67 6.88 6.84 6.71 6.37
#tidyverse
a %>% 
  round(2) %>% 
  sort(TRUE) %>% 
  head
[1] 9.93 9.67 6.88 6.84 6.71 6.37

4.5.5 Exercise 2

Consider the mtcars data set which is available in R. Type the following code to explore the variables included in the data set (for an explanation of the variables see ?mtcars:

library(tidyverse)
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Use the following code to create a new variable named car_model that contains the names of the cars, now available as row names.

mtcars = mtcars %>%
        rownames_to_column("car_model")
glimpse(mtcars)
Rows: 32
Columns: 12
$ car_model <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"…
$ mpg       <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, …
$ cyl       <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, …
$ disp      <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.…
$ hp        <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180…
$ drat      <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, …
$ wt        <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.15…
$ qsec      <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.9…
$ vs        <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
$ am        <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
$ gear      <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, …
$ carb      <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, …
  1. The variables cyl and am are considered as numerical variable (dbl). Transform them into two factors (decide if you want to set labels):
mtcars$cyl = factor(mtcars$cyl)
mtcars$am = factor(mtcars$am,labels=c("automatic","manual"))
  1. How many observations and variables are available?
dim(mtcars) # 32 observation and 12 variables.
[1] 32 12
  1. Print (on the screen) the hp variable using the select() function. Try also to use the pull() function. Which is the difference?
mtcars %>%  
  select(hp)
    hp
1  110
2  110
3   93
4  110
5  175
6  105
7  245
8   62
9   95
10 123
11 123
12 180
13 180
14 180
15 205
16 215
17 230
18  66
19  52
20  65
21  97
22 150
23 150
24 245
25 175
26  66
27  91
28 113
29 264
30 175
31 335
32 109
mtcars %>%  pull(hp) #we get only the numerical values
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109
  1. Print out all but the hp column using the select() function.
mtcars %>%  
  select(-hp)
             car_model  mpg cyl  disp drat    wt  qsec vs        am gear carb
1            Mazda RX4 21.0   6 160.0 3.90 2.620 16.46  0    manual    4    4
2        Mazda RX4 Wag 21.0   6 160.0 3.90 2.875 17.02  0    manual    4    4
3           Datsun 710 22.8   4 108.0 3.85 2.320 18.61  1    manual    4    1
4       Hornet 4 Drive 21.4   6 258.0 3.08 3.215 19.44  1 automatic    3    1
5    Hornet Sportabout 18.7   8 360.0 3.15 3.440 17.02  0 automatic    3    2
6              Valiant 18.1   6 225.0 2.76 3.460 20.22  1 automatic    3    1
7           Duster 360 14.3   8 360.0 3.21 3.570 15.84  0 automatic    3    4
8            Merc 240D 24.4   4 146.7 3.69 3.190 20.00  1 automatic    4    2
9             Merc 230 22.8   4 140.8 3.92 3.150 22.90  1 automatic    4    2
10            Merc 280 19.2   6 167.6 3.92 3.440 18.30  1 automatic    4    4
11           Merc 280C 17.8   6 167.6 3.92 3.440 18.90  1 automatic    4    4
12          Merc 450SE 16.4   8 275.8 3.07 4.070 17.40  0 automatic    3    3
13          Merc 450SL 17.3   8 275.8 3.07 3.730 17.60  0 automatic    3    3
14         Merc 450SLC 15.2   8 275.8 3.07 3.780 18.00  0 automatic    3    3
15  Cadillac Fleetwood 10.4   8 472.0 2.93 5.250 17.98  0 automatic    3    4
16 Lincoln Continental 10.4   8 460.0 3.00 5.424 17.82  0 automatic    3    4
17   Chrysler Imperial 14.7   8 440.0 3.23 5.345 17.42  0 automatic    3    4
18            Fiat 128 32.4   4  78.7 4.08 2.200 19.47  1    manual    4    1
19         Honda Civic 30.4   4  75.7 4.93 1.615 18.52  1    manual    4    2
20      Toyota Corolla 33.9   4  71.1 4.22 1.835 19.90  1    manual    4    1
21       Toyota Corona 21.5   4 120.1 3.70 2.465 20.01  1 automatic    3    1
22    Dodge Challenger 15.5   8 318.0 2.76 3.520 16.87  0 automatic    3    2
23         AMC Javelin 15.2   8 304.0 3.15 3.435 17.30  0 automatic    3    2
24          Camaro Z28 13.3   8 350.0 3.73 3.840 15.41  0 automatic    3    4
25    Pontiac Firebird 19.2   8 400.0 3.08 3.845 17.05  0 automatic    3    2
26           Fiat X1-9 27.3   4  79.0 4.08 1.935 18.90  1    manual    4    1
27       Porsche 914-2 26.0   4 120.3 4.43 2.140 16.70  0    manual    5    2
28        Lotus Europa 30.4   4  95.1 3.77 1.513 16.90  1    manual    5    2
29      Ford Pantera L 15.8   8 351.0 4.22 3.170 14.50  0    manual    5    4
30        Ferrari Dino 19.7   6 145.0 3.62 2.770 15.50  0    manual    5    6
31       Maserati Bora 15.0   8 301.0 3.54 3.570 14.60  0    manual    5    8
32          Volvo 142E 21.4   4 121.0 4.11 2.780 18.60  1    manual    4    2
  1. Print out the following variables: mpg, hp, vs, am, gear. Suggestion: use : if necessary.
mtcars %>%  
  select(mpg, hp, vs:gear)
    mpg  hp vs        am gear
1  21.0 110  0    manual    4
2  21.0 110  0    manual    4
3  22.8  93  1    manual    4
4  21.4 110  1 automatic    3
5  18.7 175  0 automatic    3
6  18.1 105  1 automatic    3
7  14.3 245  0 automatic    3
8  24.4  62  1 automatic    4
9  22.8  95  1 automatic    4
10 19.2 123  1 automatic    4
11 17.8 123  1 automatic    4
12 16.4 180  0 automatic    3
13 17.3 180  0 automatic    3
14 15.2 180  0 automatic    3
15 10.4 205  0 automatic    3
16 10.4 215  0 automatic    3
17 14.7 230  0 automatic    3
18 32.4  66  1    manual    4
19 30.4  52  1    manual    4
20 33.9  65  1    manual    4
21 21.5  97  1 automatic    3
22 15.5 150  0 automatic    3
23 15.2 150  0 automatic    3
24 13.3 245  0 automatic    3
25 19.2 175  0 automatic    3
26 27.3  66  1    manual    4
27 26.0  91  0    manual    5
28 30.4 113  1    manual    5
29 15.8 264  0    manual    5
30 19.7 175  0    manual    5
31 15.0 335  0    manual    5
32 21.4 109  1    manual    4
  1. Provide the frequency distribution (absolute, relative and percentage) for the am variable (transmission).
mtcars %>% 
  group_by(am) %>% 
  summarise(Absfreq =n(),
            Relfreq =Absfreq/nrow(mtcars),
            Perc = Absfreq/nrow(mtcars)*100)
# A tibble: 2 × 4
  am        Absfreq Relfreq  Perc
  <fct>       <int>   <dbl> <dbl>
1 automatic      19   0.594  59.4
2 manual         13   0.406  40.6
  1. Select all the observations which have mpg>20 and hp>100 and compute the average for the mpg variable.
mtcars %>% 
  filter(mpg > 20, hp>100) %>% 
  summarise(average_mpg = mean(mpg))
  average_mpg
1       23.04
  1. Compute the distribution (absolute frequencies) of the car by gear. Moreover, compute the mean consumption (mpg) conditionally on the gear.
mtcars %>% 
  group_by (gear) %>% 
  summarise(absfre = n())
# A tibble: 3 × 2
   gear absfre
  <dbl>  <int>
1     3     15
2     4     12
3     5      5
#or
mtcars %>% 
  count(gear)
  gear  n
1    3 15
2    4 12
3    5  5
mtcars %>% 
  group_by(gear) %>% 
  summarise (meanmpg = mean(mpg))
# A tibble: 3 × 2
   gear meanmpg
  <dbl>   <dbl>
1     3    16.1
2     4    24.5
3     5    21.4

4.5.6 Exercise 3

Consider the iris dataset available in R. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for some flowers from each of 3 species of iris. See ?iris.

glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
  1. Select the variables Sepal.Length, Sepal.Width and Petal.Length
iris %>% 
  select(Sepal.Length:Petal.Length) %>% 
  head
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7
  1. Select all the row with Sepal.Length bigger than 5.
iris %>% 
  filter(Sepal.Length > 5) %>% 
  head
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          5.4         3.9          1.7         0.4  setosa
3          5.4         3.7          1.5         0.2  setosa
4          5.8         4.0          1.2         0.2  setosa
5          5.7         4.4          1.5         0.4  setosa
6          5.4         3.9          1.3         0.4  setosa
  1. Compute the frequency distribution (absolute, relative and percentage frequencies) of Species. Which are the available species?
iris %>% 
  group_by(Species) %>% 
  summarise(Absfreq =n(),
            Relfreq =Absfreq/nrow(iris),
            Perc = Absfreq/nrow(iris)*100)
# A tibble: 3 × 4
  Species    Absfreq Relfreq  Perc
  <fct>        <int>   <dbl> <dbl>
1 setosa          50   0.333  33.3
2 versicolor      50   0.333  33.3
3 virginica       50   0.333  33.3
  1. Compute the average and the standard deviation for all the numeric column of iris. Then compute the same statistics by conditioning on species. Suggestion: use the function summarise_if (see ?summarise_if, in this case test if the column is.numeric):
iris %>% 
  select_if(is.numeric) %>% 
  summarise(across(Sepal.Length:Petal.Width, mean))
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.843333    3.057333        3.758    1.199333
iris %>% 
  select_if(is.numeric) %>% 
  summarise(across(Sepal.Length:Petal.Width, sd))
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1    0.8280661   0.4358663     1.765298   0.7622377
  1. Compute the average of Sepal.Length and Petal.Length by Species. Which is the species with the highest average sepal/petal length?
iris %>% 
  group_by(Species) %>%
  summarise(mean(Sepal.Length), mean(Petal.Length))
# A tibble: 3 × 3
  Species    `mean(Sepal.Length)` `mean(Petal.Length)`
  <fct>                     <dbl>                <dbl>
1 setosa                     5.01                 1.46
2 versicolor                 5.94                 4.26
3 virginica                  6.59                 5.55