f(x)
4 Lab 3 - 10/10/2024
In this lecture we will learn another R programming approach based on the tidyverse
package. This is alternative to the base R code we learnt in the first lectures.
4.1 The pipe operator
Let’s consider a general R function named f
with argument x
. We usually use the following approach when we need to apply f
:
An alternative is given by the pipe operator %>%
which is part of the dplyr
package (see here for more details). It works as follows
%>% f()
x #this is equivalent to f(x)
Basically, the pipe tells R to pass x
as the first argument of the function f
. The shortcut to type the pipe operator in RStudio is given by CTRL/CMD Shift M.
We simulate a sample of data in order to run some simple examples with the pipe operator. By using the function sample
we draw randomly (without replacement) 5 numbers between 1 and 20 (1:20).
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(4)
# sample 5 random values from a sequence of integers between 1 to 20
= sample(1:20, 5)
x x
[1] 11 19 3 7 12
We are now interested in computing the log transformation of the vector x
. By adopting the standard R programming we would use:
log(x)
[1] 2.397895 2.944439 1.098612 1.945910 2.484907
while with the pipe operator we have
%>% log() x
[1] 2.397895 2.944439 1.098612 1.945910 2.484907
#it's also possible to omit the parentheses given that there is no input
%>% log x
[1] 2.397895 2.944439 1.098612 1.945910 2.484907
where x
is taken as the first argument of the function log
. It is also possible to include other arguments, such as for example the base
of the logarithm (in this case equal to 5). In this case note that x %>% f(y)
is equivalent to f(x,y)
.
#standard programming
log(x, base=5)
[1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593
#pipe based programming
%>% log(base=5) x
[1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593
%>% log(5) x
[1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593
We want now to apply the log transformation and then round the corresponding output to 2 digits. This requires the use of two functions (log
and round
). In general, when we apply 3 functions (f
and then g
and finally h
), we have that x %>% f %>% g %>% h
is equivalent to h(g(f(x)))
.
# standard programming
round(log(x), 2)
[1] 2.40 2.94 1.10 1.95 2.48
# pipe based programming
%>%
x %>%
log round(2)
[1] 2.40 2.94 1.10 1.95 2.48
We now add a new function, after rounding the log output we compute the sum of the 5 numbers
# standard programming
sum(round(log(x),2))
[1] 10.87
# pipe based programming
%>%
x %>%
log round(2) %>%
sum
[1] 10.87
Now, we want now to use the sum result as the base of the log transformation of the number 5.
# standard programming
log(5, base = sum(round(log(x),2)))
[1] 0.674532
# pipe based programming
%>%
x %>%
log round(2) %>%
%>%
sum log(5,base=.)
[1] 0.674532
The symbol .
is the placeholder and is used when the output of the previous pipe should not be used as the first input of the following function. In general, x %>% f(y, z = .)
is equivalent to f(y, z = x)
.
When it is not convenient to use the pipe:
- when the pipes are longer than 10 steps. In this case the suggestion is to create intermediate objects with meaningful names (that can help understanding what the code does);
- when you have multiple inputs or outputs (e.g. when there is no primary object being transformed but two or more objects being combined together).
4.2 dyplyr verbs
dplyr
(a package in the tidyverse
collection) is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
select
: pick variables (columns) based on their namesfilter
pick observations (rows) based on their valuessummarise
: reduce multiple values down to a single summary (e.g. mean)mutate
: add new variables that are functions of existing variablesarrange
: change the ordering of the rows
All verbs work similarly:
- the first argument is a data frame;
- the subsequent arguments describe what to do with the data frame using the variable names (without quotes);
- the result is a new data frame.
In the following we will take into account all the dplyr
verbs by considering the diamonds
data set which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds
). If we use the function class
to understand the nature of diamonds
we get the following output:
class(diamonds)
[1] "tbl_df" "tbl" "data.frame"
The term tbl
(tibble) is the tidyverse version of a classical R data frame. Tiblles are very similar to data frame (they just contain/display more information) and are designed to be used with the tidyverse syntax style.
To get the list and the type of variables included in diamonds
we can use the standard str
or the corresponding tidyverse
function named glimpse
:
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
4.2.1 Verb 1: select
This verb is used to select some of the columns by name. For example, to select the variable named carat
we use the following code:
%>%
diamonds select(carat)
# A tibble: 53,940 × 1
carat
<dbl>
1 0.23
2 0.21
3 0.23
4 0.29
5 0.31
6 0.24
7 0.24
8 0.26
9 0.22
10 0.23
# ℹ 53,930 more rows
#same as
1] diamonds[,
# A tibble: 53,940 × 1
carat
<dbl>
1 0.23
2 0.21
3 0.23
4 0.29
5 0.31
6 0.24
7 0.24
8 0.26
9 0.22
10 0.23
# ℹ 53,930 more rows
Selecting more than one column is very simple:
%>%
diamonds select(carat, cut, color, price)
# A tibble: 53,940 × 4
carat cut color price
<dbl> <ord> <ord> <int>
1 0.23 Ideal E 326
2 0.21 Premium E 326
3 0.23 Good E 327
4 0.29 Premium I 334
5 0.31 Good J 335
6 0.24 Very Good J 336
7 0.24 Very Good I 336
8 0.26 Very Good H 337
9 0.22 Fair E 337
10 0.23 Very Good H 338
# ℹ 53,930 more rows
Given that carat
, cut
and color
are consecutive, the following code is also possible:
view(diamonds) #see the data frame
%>%
diamonds select(carat : color, price)
# A tibble: 53,940 × 4
carat cut color price
<dbl> <ord> <ord> <int>
1 0.23 Ideal E 326
2 0.21 Premium E 326
3 0.23 Good E 327
4 0.29 Premium I 334
5 0.31 Good J 335
6 0.24 Very Good J 336
7 0.24 Very Good I 336
8 0.26 Very Good H 337
9 0.22 Fair E 337
10 0.23 Very Good H 338
# ℹ 53,930 more rows
Moreover, it is also possible to select all the columns whose name starts with the letter “c” by using starts_with
combined with the select
function:
%>%
diamonds select(starts_with("c"))
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
It is also possible to specify a criterion which excludes from the selection some variables. For example, to select all the columns but carat
we use the -
symbol:
%>%
diamonds select(-carat)
# A tibble: 53,940 × 9
cut color clarity depth table price x y z
<ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
We proceed similarly to select all the columns but not the ones with a name starting with “c”:
%>%
diamonds select(- starts_with("c"))
# A tibble: 53,940 × 6
depth table price x y z
<dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 61.5 55 326 3.95 3.98 2.43
2 59.8 61 326 3.89 3.84 2.31
3 56.9 65 327 4.05 4.07 2.31
4 62.4 58 334 4.2 4.23 2.63
5 63.3 58 335 4.34 4.35 2.75
6 62.8 57 336 3.94 3.96 2.48
7 62.3 57 336 3.95 3.98 2.47
8 61.9 55 337 4.07 4.11 2.53
9 65.1 61 337 3.87 3.78 2.49
10 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Another useful application is the selection of only the numerical variables. This can be performed by using the select_if
function combined with is.numeric
(the latter is a test of an object being interpretable as numbers):
%>%
diamonds select_if(is.numeric)
# A tibble: 53,940 × 7
carat depth table price x y z
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 61.5 55 326 3.95 3.98 2.43
2 0.21 59.8 61 326 3.89 3.84 2.31
3 0.23 56.9 65 327 4.05 4.07 2.31
4 0.29 62.4 58 334 4.2 4.23 2.63
5 0.31 63.3 58 335 4.34 4.35 2.75
6 0.24 62.8 57 336 3.94 3.96 2.48
7 0.24 62.3 57 336 3.95 3.98 2.47
8 0.26 61.9 55 337 4.07 4.11 2.53
9 0.22 65.1 61 337 3.87 3.78 2.49
10 0.23 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
The returned output can be then used to compute jointly all the averages of the numerical variables by using the apply
function):
%>%
diamonds select_if(is.numeric) %>%
apply(2, mean)# remember 2 means per column
carat depth table price x y
0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572 5.7345260
z
3.5387338
4.2.2 Verb 2: filter
The verb filter
can be used to select the observations satisfying some criterion. Consider the example the selection of the diamonds with variable cut
equal to the category “Premium”:
%>%
diamonds filter(cut == "Premium")
# A tibble: 13,791 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
3 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
4 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
5 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
6 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
7 0.29 Premium F SI1 62.4 58 403 4.24 4.26 2.65
8 0.22 Premium E VS2 61.6 58 404 3.93 3.89 2.41
9 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
10 0.3 Premium J SI2 59.3 61 405 4.43 4.38 2.61
# ℹ 13,781 more rows
It is also possible to include more conditions. Select for example the diamonds with cut
equal to “Premium” AND color
equal to “D” (the AND can be specified by using &
or comma
):
%>%
diamonds filter(cut == "Premium" & color == "D")
# A tibble: 1,603 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
3 0.71 Premium D SI2 61.7 59 2768 5.71 5.67 3.51
4 0.71 Premium D VS2 62.5 60 2770 5.65 5.61 3.52
5 0.7 Premium D VS2 58 62 2773 5.87 5.78 3.38
6 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
7 0.7 Premium D SI1 62.8 60 2782 5.68 5.66 3.56
8 0.72 Premium D SI2 62 60 2795 5.73 5.69 3.54
9 0.71 Premium D SI1 62.7 60 2797 5.67 5.71 3.57
10 0.71 Premium D SI1 61.3 58 2797 5.73 5.75 3.52
# ℹ 1,593 more rows
%>%
diamonds filter(cut == "Premium" , color == "D")
# A tibble: 1,603 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
3 0.71 Premium D SI2 61.7 59 2768 5.71 5.67 3.51
4 0.71 Premium D VS2 62.5 60 2770 5.65 5.61 3.52
5 0.7 Premium D VS2 58 62 2773 5.87 5.78 3.38
6 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
7 0.7 Premium D SI1 62.8 60 2782 5.68 5.66 3.56
8 0.72 Premium D SI2 62 60 2795 5.73 5.69 3.54
9 0.71 Premium D SI1 62.7 60 2797 5.67 5.71 3.57
10 0.71 Premium D SI1 61.3 58 2797 5.73 5.75 3.52
# ℹ 1,593 more rows
If we are interested in filtering th diamonds with a price between 500 and 600 dollars the following two alternative codes can be adopted:
%>%
diamonds filter(price > 500 & price < 600) #excluded
# A tibble: 2,360 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.35 Ideal I VS1 60.9 57 552 4.54 4.59 2.78
2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
3 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
4 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
5 0.42 Premium I SI2 61.5 59 552 4.78 4.84 2.96
6 0.28 Ideal G VVS2 61.4 56 553 4.19 4.22 2.58
7 0.32 Ideal I VVS1 62 55.3 553 4.39 4.42 2.73
8 0.31 Very Good G SI1 63.3 57 553 4.33 4.3 2.73
9 0.31 Premium G SI1 61.8 58 553 4.35 4.32 2.68
10 0.24 Premium E VVS1 60.7 58 553 4.01 4.03 2.44
# ℹ 2,350 more rows
#?between
%>%
diamonds filter(between(price, 500,600)) #included
# A tibble: 2,407 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.35 Ideal I VS1 60.9 57 552 4.54 4.59 2.78
2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
3 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
4 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
5 0.42 Premium I SI2 61.5 59 552 4.78 4.84 2.96
6 0.28 Ideal G VVS2 61.4 56 553 4.19 4.22 2.58
7 0.32 Ideal I VVS1 62 55.3 553 4.39 4.42 2.73
8 0.31 Very Good G SI1 63.3 57 553 4.33 4.3 2.73
9 0.31 Premium G SI1 61.8 58 553 4.35 4.32 2.68
10 0.24 Premium E VVS1 60.7 58 553 4.01 4.03 2.44
# ℹ 2,397 more rows
The output of a selection can always be saved in a new data frame as follows:
= diamonds %>%
diamonds_filtered filter(between(price, 500,600))
To select diamonds with cut
equal to fair OR good we can use the |
operator or the %in%
function:
%>%
diamonds filter(cut=="Fair" | cut=="Good")
# A tibble: 6,516 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 6,506 more rows
%>%
diamonds filter(cut %in% c("Fair","Good"))
# A tibble: 6,516 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 6,506 more rows
The code diamonds %>% filter(cut == c("Fair","Good"))
does not perform what we mean to do. For an interesting discussion about the difference between ==
and %in%
see here.
4.2.3 Verb 3: summarise
This verbs can be used to compute summary statistics. For example the following code computes the mean and median of price
.
%>%
diamonds summarise(mean(price),
median(price))
# A tibble: 1 × 2
`mean(price)` `median(price)`
<dbl> <dbl>
1 3933. 2401
The output is automatically labelled but it is also possible to specify different labels as follows:
%>%
diamonds summarise(mean_price = mean(price),
median_price = median(price))
# A tibble: 1 × 2
mean_price median_price
<dbl> <dbl>
1 3933. 2401
Similarly, it is possible to compute some summary statistics on a new variable which is a transformation of the variables originally contained in the data frame. For example, we are interested in the mean of a new variable given by the price per unit given by the ratio between price
and carat
:
%>%
diamonds summarise(mean_price = mean(price),
mean_carat = mean(carat),
mean_unitprice = mean(price/carat))
# A tibble: 1 × 3
mean_price mean_carat mean_unitprice
<dbl> <dbl> <dbl>
1 3933. 0.798 4008.
Another transformation of the original variables consists in considering if a price is bigger than 10000$ (T
or F
) and then computing how many observations (absolute frequency, proportion or percentage) satisfy the condition. This is computed by using the sum
or mean
functions:
%>%
diamonds summarise(veryexp = sum(price > 10000), #absolute frequencies
veryexp_prop = mean(price>10000), #proportion
veryexp_perc = mean(price>10000)*100) #percentage
# A tibble: 1 × 3
veryexp veryexp_prop veryexp_perc
<int> <dbl> <dbl>
1 5222 0.0968 9.68
It is also interesting to compute summary statistics conditionally on the categories of a qualitative variable (i.e. a factor). This can be done by combining the group_by
function with the summarise
function. Let’s compute for example the mean, min and max price conditionally on the categories of cut
:
%>%
diamonds group_by(cut) %>%
summarise(price_mean = mean(price),
price_min = min(price),
price_max = max(price))
# A tibble: 5 × 4
cut price_mean price_min price_max
<ord> <dbl> <int> <int>
1 Fair 4359. 337 18574
2 Good 3929. 327 18788
3 Very Good 3982. 336 18818
4 Premium 4584. 326 18823
5 Ideal 3458. 326 18806
In particular, group_by
splits the original data set into different groups (according to the category of the group_by
variable) and for each of them the requested summary statistics are computed. In this case we obtain 5 values of the mean, min and max price according to the number of categories of the variable cut
. It is also possible to condition on two different factors, such as for example cut
and color
:
%>%
diamonds group_by(cut,color) %>%
summarise(price_mean =mean(price))
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 3
# Groups: cut [5]
cut color price_mean
<ord> <ord> <dbl>
1 Fair D 4291.
2 Fair E 3682.
3 Fair F 3827.
4 Fair G 4239.
5 Fair H 5136.
6 Fair I 4685.
7 Fair J 4976.
8 Good D 3405.
9 Good E 3424.
10 Good F 3496.
# ℹ 25 more rows
In this case each category of cut
is combined with each category of color
and then the mean price is computed.
The function group_by
can by combined with summarise
also to compute frequency distribution, by means of the n()
function which gives the current group size:
%>%
diamonds group_by(cut) %>%
summarise(AbsFreq = n())
# A tibble: 5 × 2
cut AbsFreq
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
This is an alternative to the table
standard function:
table(diamonds$cut)
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
Similarly, percentages can be easily computed by using AbsFreq
represents a new variable available for new computations:
%>%
diamonds group_by(cut) %>%
summarise(AbsFreq = n(),
Perc =AbsFreq/nrow(diamonds)*100)
# A tibble: 5 × 3
cut AbsFreq Perc
<ord> <int> <dbl>
1 Fair 1610 2.98
2 Good 4906 9.10
3 Very Good 12082 22.4
4 Premium 13791 25.6
5 Ideal 21551 40.0
A shorter alternative for computing the frequency table is given by the count
function which let you quickly count the unique values of one or more variables. Basically, df %>% count(a, b)
is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())
. Here below we have an example:
%>%
diamonds count(cut)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
4.3 Verb 4: mutate
The verb mutate
can be used to create new column in the data frame. For example, let’s create a new column containing the price per unit defined previously. Moreover, we want this variable to be saved in the data frame (for this reason we create a new object named diamonds
which substitutes its previous version):
= diamonds %>%
diamonds mutate(priceunit = price/carat)
We now want to create a new categorical variables with two categories: “Yes” if the price is < 1000$, “No” otherwise. For doing this we will make use of the ifelse
function, already introduced in Section @ref(conditionalstatement). We will call the new column as pricecat
and will save the new data frame (with one more column) in a new object named newdiamonds
:
= diamonds %>%
newdiamonds mutate(pricecat = ifelse(price < 1000, "Yes", "No"))
glimpse(newdiamonds)
Rows: 53,940
Columns: 12
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, …
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, V…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, …
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1…
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, …
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, …
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, …
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, …
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, …
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, …
$ priceunit <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, …
$ pricecat <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…
To derive the frequency distribution of pricecat
we proceed as described above, computing also percentages by means of mutate
:
%>%
newdiamonds count(pricecat) %>% # Absolute Frequencies
#we want to add two new columns: Proportion andPercentages.
mutate(Proportions=n/nrow(newdiamonds),
Percentages=n/nrow(newdiamonds)*100)
# A tibble: 2 × 4
pricecat n Proportions Percentages
<chr> <int> <dbl> <dbl>
1 No 39441 0.731 73.1
2 Yes 14499 0.269 26.9
#same as
%>%
newdiamonds group_by(pricecat) %>%
summarise(AbsFreq = n(),
Proportions=AbsFreq/nrow(newdiamonds),
Perc =AbsFreq/nrow(newdiamonds)*100)
# A tibble: 2 × 4
pricecat AbsFreq Proportions Perc
<chr> <int> <dbl> <dbl>
1 No 39441 0.731 73.1
2 Yes 14499 0.269 26.9
4.4 Verb 5: arrange
The verb arrange
can be used to sort observations with respect to the values of a given variable. For example, we can sort diamonds according to price
(by default the ascending order is adopted):
%>%
diamonds arrange(price) %>%
tail
# A tibble: 6 × 11
carat cut color clarity depth table price x y z priceunit
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24 8208.
2 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01 9402.
3 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11 9084.
4 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56 12454.
5 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04 9409
6 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16 8220.
With the function tail
we have a preview of the 6 bottom lines which contains, in this case, the diamonds with the highest price.
If we need to use a descending ordering we will use the -
inside arrange
:
%>%
diamonds arrange(- price)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z priceunit
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16 8220.
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04 9409
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56 12454.
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11 9084.
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01 9402.
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24 8208.
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84 9213.
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91 9398.
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7 10989.
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21 8740
# ℹ 53,930 more rows
Let’s now use the arrange
function in order to sort the categories of a factor (cut
) according to the corresponding frequencies:
%>%
diamonds group_by(cut) %>%
summarise(freq = n()) %>%
arrange(desc(freq))
# A tibble: 5 × 2
cut freq
<ord> <int>
1 Ideal 21551
2 Premium 13791
3 Very Good 12082
4 Good 4906
5 Fair 1610
#or
%>%
diamonds count(cut) %>%
arrange(-n)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Ideal 21551
2 Premium 13791
3 Very Good 12082
4 Good 4906
5 Fair 1610
# if arrange is not specified we get as result the ascending order based of frequencies.
%>%
diamonds group_by(cut) %>%
summarise(freq = n())
# A tibble: 5 × 2
cut freq
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
It can be also useful to visualize a given number or proportion observations with the highest/lowest values of a given variable. The following code for example extract first the 10 observations with the highest values of carat
(using slice_max
) and among them the 50% of the observations with the lowest price (with slice_min
):
%>%
diamonds slice_max(carat, n=10)
# A tibble: 10 × 11
carat cut color clarity depth table price x y z priceunit
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98 3596.
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72 4118
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43 4196.
4 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17 3796.
5 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24 3796.
6 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31 3996
7 3.67 Premium I I1 62.4 56 16193 9.86 9.81 6.13 4412.
8 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38 3197.
9 3.51 Premium J VS2 62.5 59 18701 9.66 9.63 6.03 5328.
10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03 3596.
%>%
diamonds slice_max(carat, n=10) %>%
#select half of the observations (10), the lowest ones.
slice_min(price, prop=0.5)
# A tibble: 5 × 11
carat cut color clarity depth table price x y z priceunit
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38 3197.
2 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03 3596.
3 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17 3796.
4 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24 3796.
5 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31 3996
4.5 Exercises Lecture 4
4.5.1 Exercise 1
Write the code for the following operations:
- Simulate a vector of 15 values from the continuous Uniform distribution defined between 0 and 10 (see
?runif
). Set the seed equal to 99. - Round the numbers in the vector to two digits.
- Sort the values in the vector in descending order (see
?sort
) - Display a few of the largest values with
head()
.
Use both the standard R
programming approach and the more modern approach based on the use of the pipe %>%
(remember to load the tidyverse
library).
4.5.2 Exercise 2
Consider the mtcars
data set which is available in R
. Type the following code to explore the variables included in the data set (for an explanation of the variables see ?mtcars
:
Use the following code to create a new variable named car_model
that contains the names of the cars, now available as row names.
The variables
cyl
andam
are considered as numerical variable (dbl
). Transform them into two factors (decide if you want to set labels):How many observations and variables are available?
Print (on the screen) the
hp
variable using theselect()
function. Try also to use thepull()
function. Which is the difference?Print out all but the
hp
column using theselect()
function.Print out the following variables:
mpg
,hp
,vs
,am
,gear
. Suggestion: use:
if necessary.Provide the frequency distribution (absolute, relative and percentage) for the
am
variable (transmission).Select all the observations which have
mpg>20
andhp>100
and compute the average for thempg
variable.Compute the distribution (absolute frequencies) of the car by
gear
. Moreover, compute the mean consumption (mpg
) conditionally on thegear
.
4.5.3 Exercise 3
Consider the iris
dataset available in R. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for some flowers from each of 3 species of iris. See ?iris
.
Select the variables Sepal.Length, Sepal.Width and Petal.Length
Select all the rows with Sepal.Length bigger than 5.
Compute the frequency distribution (absolute, relative and percentage frequencies) of
Species
. Which are the available species?Compute the average and the standard deviation for all the numeric column of
iris
. Then compute the same statistics by conditioning on species. Suggestion: use the functionsummarise_if
(see?summarise_if
, in this case test if the columnis.numeric
):Compute the average of
Sepal.Length
andPetal.Length
bySpecies
. Which is the species with the highest average sepal/petal length?
4.5.4 Solutions
Write the code for the following operations:
- Simulate a vector of 15 values from the continuous Uniform distribution defined between 0 and 10 (see
?runif
). Set the seed equal to 99.
library(tidyverse)
set.seed(99)
=runif(15, 0,10)
a a
[1] 5.847119 1.137817 6.842647 9.925088 5.349936 9.666141 6.714276 2.945777
[9] 3.583630 1.753148 5.488174 5.054517 1.938365 6.369041 6.878001
Use both the standard R
programming approach and the more modern approach based on the use of the pipe %>%
(remember to load the tidyverse
library).
- Round the numbers in the vector to two digits.
#normal
round(a,2)
[1] 5.85 1.14 6.84 9.93 5.35 9.67 6.71 2.95 3.58 1.75 5.49 5.05 1.94 6.37 6.88
#tidyverse
%>%
a round(2)
[1] 5.85 1.14 6.84 9.93 5.35 9.67 6.71 2.95 3.58 1.75 5.49 5.05 1.94 6.37 6.88
- Sort the values in the vector in descending order (see
?sort
)
#normal
sort(round(a,2), TRUE)
[1] 9.93 9.67 6.88 6.84 6.71 6.37 5.85 5.49 5.35 5.05 3.58 2.95 1.94 1.75 1.14
#tidyverse
%>%
a round(2) %>%
sort(TRUE)
[1] 9.93 9.67 6.88 6.84 6.71 6.37 5.85 5.49 5.35 5.05 3.58 2.95 1.94 1.75 1.14
- Display a few of the largest values with
head()
.
#normal
head(sort(round(a,2), TRUE))
[1] 9.93 9.67 6.88 6.84 6.71 6.37
#tidyverse
%>%
a round(2) %>%
sort(TRUE) %>%
head
[1] 9.93 9.67 6.88 6.84 6.71 6.37
4.5.5 Exercise 2
Consider the mtcars
data set which is available in R
. Type the following code to explore the variables included in the data set (for an explanation of the variables see ?mtcars
:
library(tidyverse)
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
Use the following code to create a new variable named car_model
that contains the names of the cars, now available as row names.
= mtcars %>%
mtcars rownames_to_column("car_model")
glimpse(mtcars)
Rows: 32
Columns: 12
$ car_model <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"…
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, …
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, …
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, …
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.15…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.9…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, …
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, …
- The variables
cyl
andam
are considered as numerical variable (dbl
). Transform them into two factors (decide if you want to set labels):
$cyl = factor(mtcars$cyl)
mtcars$am = factor(mtcars$am,labels=c("automatic","manual")) mtcars
- How many observations and variables are available?
dim(mtcars) # 32 observation and 12 variables.
[1] 32 12
- Print (on the screen) the
hp
variable using theselect()
function. Try also to use thepull()
function. Which is the difference?
%>%
mtcars select(hp)
hp
1 110
2 110
3 93
4 110
5 175
6 105
7 245
8 62
9 95
10 123
11 123
12 180
13 180
14 180
15 205
16 215
17 230
18 66
19 52
20 65
21 97
22 150
23 150
24 245
25 175
26 66
27 91
28 113
29 264
30 175
31 335
32 109
%>% pull(hp) #we get only the numerical values mtcars
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
- Print out all but the
hp
column using theselect()
function.
%>%
mtcars select(-hp)
car_model mpg cyl disp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 3.90 2.620 16.46 0 manual 4 4
2 Mazda RX4 Wag 21.0 6 160.0 3.90 2.875 17.02 0 manual 4 4
3 Datsun 710 22.8 4 108.0 3.85 2.320 18.61 1 manual 4 1
4 Hornet 4 Drive 21.4 6 258.0 3.08 3.215 19.44 1 automatic 3 1
5 Hornet Sportabout 18.7 8 360.0 3.15 3.440 17.02 0 automatic 3 2
6 Valiant 18.1 6 225.0 2.76 3.460 20.22 1 automatic 3 1
7 Duster 360 14.3 8 360.0 3.21 3.570 15.84 0 automatic 3 4
8 Merc 240D 24.4 4 146.7 3.69 3.190 20.00 1 automatic 4 2
9 Merc 230 22.8 4 140.8 3.92 3.150 22.90 1 automatic 4 2
10 Merc 280 19.2 6 167.6 3.92 3.440 18.30 1 automatic 4 4
11 Merc 280C 17.8 6 167.6 3.92 3.440 18.90 1 automatic 4 4
12 Merc 450SE 16.4 8 275.8 3.07 4.070 17.40 0 automatic 3 3
13 Merc 450SL 17.3 8 275.8 3.07 3.730 17.60 0 automatic 3 3
14 Merc 450SLC 15.2 8 275.8 3.07 3.780 18.00 0 automatic 3 3
15 Cadillac Fleetwood 10.4 8 472.0 2.93 5.250 17.98 0 automatic 3 4
16 Lincoln Continental 10.4 8 460.0 3.00 5.424 17.82 0 automatic 3 4
17 Chrysler Imperial 14.7 8 440.0 3.23 5.345 17.42 0 automatic 3 4
18 Fiat 128 32.4 4 78.7 4.08 2.200 19.47 1 manual 4 1
19 Honda Civic 30.4 4 75.7 4.93 1.615 18.52 1 manual 4 2
20 Toyota Corolla 33.9 4 71.1 4.22 1.835 19.90 1 manual 4 1
21 Toyota Corona 21.5 4 120.1 3.70 2.465 20.01 1 automatic 3 1
22 Dodge Challenger 15.5 8 318.0 2.76 3.520 16.87 0 automatic 3 2
23 AMC Javelin 15.2 8 304.0 3.15 3.435 17.30 0 automatic 3 2
24 Camaro Z28 13.3 8 350.0 3.73 3.840 15.41 0 automatic 3 4
25 Pontiac Firebird 19.2 8 400.0 3.08 3.845 17.05 0 automatic 3 2
26 Fiat X1-9 27.3 4 79.0 4.08 1.935 18.90 1 manual 4 1
27 Porsche 914-2 26.0 4 120.3 4.43 2.140 16.70 0 manual 5 2
28 Lotus Europa 30.4 4 95.1 3.77 1.513 16.90 1 manual 5 2
29 Ford Pantera L 15.8 8 351.0 4.22 3.170 14.50 0 manual 5 4
30 Ferrari Dino 19.7 6 145.0 3.62 2.770 15.50 0 manual 5 6
31 Maserati Bora 15.0 8 301.0 3.54 3.570 14.60 0 manual 5 8
32 Volvo 142E 21.4 4 121.0 4.11 2.780 18.60 1 manual 4 2
- Print out the following variables:
mpg
,hp
,vs
,am
,gear
. Suggestion: use:
if necessary.
%>%
mtcars select(mpg, hp, vs:gear)
mpg hp vs am gear
1 21.0 110 0 manual 4
2 21.0 110 0 manual 4
3 22.8 93 1 manual 4
4 21.4 110 1 automatic 3
5 18.7 175 0 automatic 3
6 18.1 105 1 automatic 3
7 14.3 245 0 automatic 3
8 24.4 62 1 automatic 4
9 22.8 95 1 automatic 4
10 19.2 123 1 automatic 4
11 17.8 123 1 automatic 4
12 16.4 180 0 automatic 3
13 17.3 180 0 automatic 3
14 15.2 180 0 automatic 3
15 10.4 205 0 automatic 3
16 10.4 215 0 automatic 3
17 14.7 230 0 automatic 3
18 32.4 66 1 manual 4
19 30.4 52 1 manual 4
20 33.9 65 1 manual 4
21 21.5 97 1 automatic 3
22 15.5 150 0 automatic 3
23 15.2 150 0 automatic 3
24 13.3 245 0 automatic 3
25 19.2 175 0 automatic 3
26 27.3 66 1 manual 4
27 26.0 91 0 manual 5
28 30.4 113 1 manual 5
29 15.8 264 0 manual 5
30 19.7 175 0 manual 5
31 15.0 335 0 manual 5
32 21.4 109 1 manual 4
- Provide the frequency distribution (absolute, relative and percentage) for the
am
variable (transmission).
%>%
mtcars group_by(am) %>%
summarise(Absfreq =n(),
Relfreq =Absfreq/nrow(mtcars),
Perc = Absfreq/nrow(mtcars)*100)
# A tibble: 2 × 4
am Absfreq Relfreq Perc
<fct> <int> <dbl> <dbl>
1 automatic 19 0.594 59.4
2 manual 13 0.406 40.6
- Select all the observations which have
mpg>20
andhp>100
and compute the average for thempg
variable.
%>%
mtcars filter(mpg > 20, hp>100) %>%
summarise(average_mpg = mean(mpg))
average_mpg
1 23.04
- Compute the distribution (absolute frequencies) of the car by
gear
. Moreover, compute the mean consumption (mpg
) conditionally on thegear
.
%>%
mtcars group_by (gear) %>%
summarise(absfre = n())
# A tibble: 3 × 2
gear absfre
<dbl> <int>
1 3 15
2 4 12
3 5 5
#or
%>%
mtcars count(gear)
gear n
1 3 15
2 4 12
3 5 5
%>%
mtcars group_by(gear) %>%
summarise (meanmpg = mean(mpg))
# A tibble: 3 × 2
gear meanmpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
4.5.6 Exercise 3
Consider the iris
dataset available in R. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for some flowers from each of 3 species of iris. See ?iris
.
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
- Select the variables Sepal.Length, Sepal.Width and Petal.Length
%>%
iris select(Sepal.Length:Petal.Length) %>%
head
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
- Select all the row with Sepal.Length bigger than 5.
%>%
iris filter(Sepal.Length > 5) %>%
head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 5.4 3.9 1.7 0.4 setosa
3 5.4 3.7 1.5 0.2 setosa
4 5.8 4.0 1.2 0.2 setosa
5 5.7 4.4 1.5 0.4 setosa
6 5.4 3.9 1.3 0.4 setosa
- Compute the frequency distribution (absolute, relative and percentage frequencies) of
Species
. Which are the available species?
%>%
iris group_by(Species) %>%
summarise(Absfreq =n(),
Relfreq =Absfreq/nrow(iris),
Perc = Absfreq/nrow(iris)*100)
# A tibble: 3 × 4
Species Absfreq Relfreq Perc
<fct> <int> <dbl> <dbl>
1 setosa 50 0.333 33.3
2 versicolor 50 0.333 33.3
3 virginica 50 0.333 33.3
- Compute the average and the standard deviation for all the numeric column of
iris
. Then compute the same statistics by conditioning on species. Suggestion: use the functionsummarise_if
(see?summarise_if
, in this case test if the columnis.numeric
):
%>%
iris select_if(is.numeric) %>%
summarise(across(Sepal.Length:Petal.Width, mean))
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.843333 3.057333 3.758 1.199333
%>%
iris select_if(is.numeric) %>%
summarise(across(Sepal.Length:Petal.Width, sd))
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 0.8280661 0.4358663 1.765298 0.7622377
- Compute the average of
Sepal.Length
andPetal.Length
bySpecies
. Which is the species with the highest average sepal/petal length?
%>%
iris group_by(Species) %>%
summarise(mean(Sepal.Length), mean(Petal.Length))
# A tibble: 3 × 3
Species `mean(Sepal.Length)` `mean(Petal.Length)`
<fct> <dbl> <dbl>
1 setosa 5.01 1.46
2 versicolor 5.94 4.26
3 virginica 6.59 5.55