3  Lab 2 - 03/10/2024

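In this lecture we will learn about data frames, lists, user-defined functions, conditional statements, and how to install and load packages such as the tidyverse.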

3.1 Data frames

In the previous lesson, we saw that if we concatenate a character vector with a numeric vector, the result is a vector that contains only characters.

x = sample(x = c("head", "tail"), size = 10, replace = TRUE)
class(x) #character
[1] "character"
set.seed(55)
y = runif(10)
class(y)#numeric
[1] "numeric"
z = round(y, 2)
class(z)#numeric
[1] "numeric"
w = c(x, y)
class(w) #character
[1] "character"
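
To see the coercion concretely, the following minimal sketch uses two made-up vectors (letters_vec and numbers_vec are hypothetical names, not part of the lab data). It shows that, after concatenation, the numbers are stored as text and have to be converted back before any arithmetic can be done.

```r
# Hypothetical toy vectors, used only for illustration
letters_vec = c("a", "b")
numbers_vec = c(1.5, 2)

# Concatenation coerces everything to character
mixed = c(letters_vec, numbers_vec)
mixed          # the numbers are now the strings "1.5" and "2"
class(mixed)   # "character"

# The numeric part can be recovered with as.numeric()
recovered = as.numeric(mixed[3:4])
sum(recovered) # 3.5
```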

Using data frames it is possible to combine vectors of different types, e.g. containing text or numbers. For example, we can combine x, y and z - which have the same length - into a data frame named mydf:

mydf = data.frame(coin = x, score = y, z)
class(mydf)
[1] "data.frame"
head(mydf) #first 6 rows
  coin      score    z
1 head 0.54781352 0.55
2 tail 0.21815968 0.22
3 head 0.03496399 0.03
4 head 0.79154929 0.79
5 tail 0.56024208 0.56
6 head 0.07422517 0.07

Note that it is possible to specify column names (in this case coin and score) instead of using the default name (as for z). The structure of a data frame is very similar to that of a matrix (a two-dimensional object). In particular, the number of rows corresponds to the number of observations and the number of columns to the number of variables.

dim(mydf) #vector of dimensions (rows, columns)
[1] 10  3
nrow(mydf)
[1] 10
ncol(mydf)
[1] 3

Another important function is str, which describes the data frame and the variables it contains:

str(mydf)
'data.frame':   10 obs. of  3 variables:
 $ coin : chr  "head" "tail" "head" "head" ...
 $ score: num  0.548 0.218 0.035 0.792 0.56 ...
 $ z    : num  0.55 0.22 0.03 0.79 0.56 0.07 0.13 0.29 0.5 0.09
mydf
   coin      score    z
1  head 0.54781352 0.55
2  tail 0.21815968 0.22
3  head 0.03496399 0.03
4  head 0.79154929 0.79
5  tail 0.56024208 0.56
6  head 0.07422517 0.07
7  head 0.13152294 0.13
8  tail 0.29412388 0.29
9  tail 0.50076126 0.50
10 head 0.08832446 0.09

In data frames, data selection can be performed using square brackets. See for example:

mydf[1,1] #first row and first column
[1] "head"
mydf[1:4, 1:2] #first 4 rows and first 2 columns
  coin      score
1 head 0.54781352
2 tail 0.21815968
3 head 0.03496399
4 head 0.79154929
mydf[1,] #first row (all the columns)
  coin     score    z
1 head 0.5478135 0.55

If we are interested in selecting all the values in a particular column, it is also possible to use the $ operator followed by the column name:

#we select the first column 
mydf[,1]
 [1] "head" "tail" "head" "head" "tail" "head" "head" "tail" "tail" "head"
#or
mydf$coin
 [1] "head" "tail" "head" "head" "tail" "head" "head" "tail" "tail" "head"

Let’s assume now that we want to select only the rows where the first variable is equal to head. In this case we need to perform a selection by condition (the value must be equal to "head") on the rows (the first index inside the square brackets):

# condition
mydf$coin == "head" #logical vector
 [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
# selection
mydf2 = mydf[mydf$coin == "head" , ]
mydf2 
   coin      score    z
1  head 0.54781352 0.55
3  head 0.03496399 0.03
4  head 0.79154929 0.79
6  head 0.07422517 0.07
7  head 0.13152294 0.13
10 head 0.08832446 0.09
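
Conditions can also be combined. The sketch below is only an illustration on a small made-up data frame (toy is a hypothetical object, not the lab's mydf): it uses & (and) and | (or) inside the square brackets, and the base R function subset as an alternative way of writing the first selection.

```r
# Hypothetical toy data frame (not the lab's mydf)
toy = data.frame(coin = c("head", "tail", "head", "tail"),
                 score = c(0.9, 0.2, 0.4, 0.7))

# Rows where coin is "head" AND score is above 0.5
sel_and = toy[toy$coin == "head" & toy$score > 0.5, ]
sel_and

# Rows where coin is "tail" OR score is below 0.5, keeping only the score column
sel_or = toy[toy$coin == "tail" | toy$score < 0.5, "score"]
sel_or

# subset() is a base R alternative that avoids repeating toy$
sel_subset = subset(toy, coin == "head" & score > 0.5)
sel_subset
```

Note that selecting a single column by name returns a vector, while selecting whole rows returns a data frame.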

It is also possible to add a new column to the data frame by using the object assignment approach (the new variable will be named newcol and will contain 10 values simulated from a standard normal distribution):

set.seed(1234)
mydf$newcol = rnorm(nrow(mydf))
str(mydf)
'data.frame':   10 obs. of  4 variables:
 $ coin  : chr  "head" "tail" "head" "head" ...
 $ score : num  0.548 0.218 0.035 0.792 0.56 ...
 $ z     : num  0.55 0.22 0.03 0.79 0.56 0.07 0.13 0.29 0.5 0.09
 $ newcol: num  -1.207 0.277 1.084 -2.346 0.429 ...
head(mydf) #top 6 lines of the dataframe
  coin      score    z     newcol
1 head 0.54781352 0.55 -1.2070657
2 tail 0.21815968 0.22  0.2774292
3 head 0.03496399 0.03  1.0844412
4 head 0.79154929 0.79 -2.3456977
5 tail 0.56024208 0.56  0.4291247
6 head 0.07422517 0.07  0.5060559
tail(mydf) #bottom 6 lines of the dataframe
   coin      score    z     newcol
5  tail 0.56024208 0.56  0.4291247
6  head 0.07422517 0.07  0.5060559
7  head 0.13152294 0.13 -0.5747400
8  tail 0.29412388 0.29 -0.5466319
9  tail 0.50076126 0.50 -0.5644520
10 head 0.08832446 0.09 -0.8900378
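
The same assignment approach can be used to build a column from existing ones, or to drop a column. The following is a small sketch on a made-up data frame (toy, score_pct and high are hypothetical names chosen for illustration).

```r
# Hypothetical toy data frame (not the lab's mydf)
toy = data.frame(coin = c("head", "tail", "head"),
                 score = c(0.2, 0.5, 0.9))

# New column computed from an existing one
toy$score_pct = toy$score * 100

# New logical column obtained from a condition
toy$high = toy$score > 0.4

# A column is removed by assigning NULL to it
toy$score_pct = NULL

names(toy)
```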

Summary statistics functions can be computed for data frames. For example, the following code returns the sum of all 30 numeric values contained in mydf (the output is a single value). Note that the first column has been removed (with [, -1], i.e. all the columns except the first one) because it is a text variable, for which the sum or the mean cannot be computed.

sum(mydf[,-1])
[1] 2.640112

Sometimes it is necessary to compute the summary statistics marginally by row or by column. For the columns, this could be done one column at a time as follows:

sum(mydf$score)
[1] 3.241686
sum(mydf$z)
[1] 3.23
sum(mydf$newcol)
[1] -3.831574

This approach is far from optimal, as it requires as many lines of code as there are columns (guess what happens when you have a lot of columns!). A fast and convenient alternative consists in using the function apply (see ?apply). The function definition is apply(X, MARGIN, FUN, ...), where MARGIN=1 means by row and MARGIN=2 means by column. For example, the first command below applies the sum function by row and returns a vector.

apply(mydf[,-1], 1, sum) # sum by row over all the columns except the first one
 [1] -0.10925223  0.71558892  1.14940517 -0.76414841  1.54936676  0.65028107
 [7] -0.31321702  0.03749202  0.43630926 -0.71171337
apply(mydf[,-1], 2, sum) # sum by column
    score         z    newcol 
 3.241686  3.230000 -3.831574 

Similarly, the second command applies the function sum separately to each column and returns a vector. Instead of the sum, it is possible to apply other summary statistics:

apply(mydf[,-1], 2, min)
      score           z      newcol 
 0.03496399  0.03000000 -2.34569770 
apply(mydf[,-1], 2, mean)
     score          z     newcol 
 0.3241686  0.3230000 -0.3831574 
apply(mydf[,-1], 2, var)
     score          z     newcol 
0.06737376 0.06780111 0.99159283 
apply(mydf[,-1], 2, summary)  
             score      z     newcol
Min.    0.03496399 0.0300 -2.3456977
1st Qu. 0.09912408 0.1000 -0.8112134
Median  0.25614178 0.2550 -0.5555419
Mean    0.32416863 0.3230 -0.3831574
3rd Qu. 0.53605045 0.5375  0.3912008
Max.    0.79154929 0.7900  1.0844412
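
For sums and means specifically, base R also provides the shortcuts colSums, rowSums, colMeans and rowMeans, and sapply can apply any function to each column of a data frame. The sketch below compares them with apply on a made-up data frame (toy is a hypothetical object). One reason for dropping the text column before calling apply is that apply first converts the data frame to a matrix, and a matrix with a text column becomes entirely character.

```r
# Hypothetical toy data frame with numeric columns only
toy = data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Column sums with apply and with the dedicated shortcut
col_sums_apply = apply(toy, 2, sum)
col_sums_fast = colSums(toy)

# Row sums with the dedicated shortcut
row_sums_fast = rowSums(toy)

# sapply applies a (possibly user-defined) function to each column
col_ranges = sapply(toy, function(v) max(v) - min(v))

col_sums_apply
col_sums_fast
row_sums_fast
col_ranges
```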

3.2 Lists

A list is like a box which can contain different kinds of objects (with different dimensions). In the following example we will create a list containing:

  • the character vector x
  • the data frame mydf
  • a string of text
mylist = list(x, mydf, "hi!")

See the particular structure of the list mylist:

mylist
[[1]]
 [1] "head" "tail" "head" "head" "tail" "head" "head" "tail" "tail" "head"

[[2]]
   coin      score    z     newcol
1  head 0.54781352 0.55 -1.2070657
2  tail 0.21815968 0.22  0.2774292
3  head 0.03496399 0.03  1.0844412
4  head 0.79154929 0.79 -2.3456977
5  tail 0.56024208 0.56  0.4291247
6  head 0.07422517 0.07  0.5060559
7  head 0.13152294 0.13 -0.5747400
8  tail 0.29412388 0.29 -0.5466319
9  tail 0.50076126 0.50 -0.5644520
10 head 0.08832446 0.09 -0.8900378

[[3]]
[1] "hi!"

Note that its length (number of objects in the list) is given by

length(mylist)
[1] 3

Two kinds of selection can be performed on a list:

  • with single square brackets (which returns a smaller list):
# we want to select the first element of the list, kept as a list
mylist[1] 
[[1]]
 [1] "head" "tail" "head" "head" "tail" "head" "head" "tail" "tail" "head"
class(mylist[1])
[1] "list"
  • with double square brackets (which returns the element itself, whose class may differ):
mylist[[1]]
 [1] "head" "tail" "head" "head" "tail" "head" "head" "tail" "tail" "head"
class(mylist[[1]])
[1] "character"

The difference is that in the first case ([1]) the output is another (smaller) list, while in the second case ([[1]]) it is a vector. It is also possible to combine single and double brackets:

#first element of the first object in the list
mylist[[1]][1] 
[1] "head"
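
The practical consequence of this difference can be sketched as follows, on a made-up list (toy_list is a hypothetical object): functions that expect a vector work on the output of [[ ]] but not on the output of [ ], and double brackets can be chained with the usual data frame selection.

```r
# Hypothetical toy list: a numeric vector, a data frame and a string
toy_list = list(c(1, 2, 3),
                data.frame(id = 1:2, value = c(10, 20)),
                "hi!")

sub_list = toy_list[1]    # still a list (of length 1)
vec = toy_list[[1]]       # the numeric vector itself

class(sub_list)           # "list"
class(vec)                # "numeric"
mean(vec)                 # 2
# mean(sub_list) would return NA with a warning, because sub_list is a list

# Chained selection: second row of the column value in the data frame
val_dollar = toy_list[[2]]$value[2]
val_index = toy_list[[2]][2, "value"]
val_dollar                # 20
val_index                 # 20
```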

For a visual description of these concepts read this interesting link: click here.

It is also possible to specify object names when the list is created:

mylist = list(x=x, df=mydf, mystring="hi!")
mylist
$x
 [1] "head" "tail" "head" "head" "tail" "head" "head" "tail" "tail" "head"

$df
   coin      score    z     newcol
1  head 0.54781352 0.55 -1.2070657
2  tail 0.21815968 0.22  0.2774292
3  head 0.03496399 0.03  1.0844412
4  head 0.79154929 0.79 -2.3456977
5  tail 0.56024208 0.56  0.4291247
6  head 0.07422517 0.07  0.5060559
7  head 0.13152294 0.13 -0.5747400
8  tail 0.29412388 0.29 -0.5466319
9  tail 0.50076126 0.50 -0.5644520
10 head 0.08832446 0.09 -0.8900378

$mystring
[1] "hi!"
names(mylist)
[1] "x"        "df"       "mystring"

In this case it is also possible to use the $ to perform element selection:

#equivalent commands:
mylist$df
   coin      score    z     newcol
1  head 0.54781352 0.55 -1.2070657
2  tail 0.21815968 0.22  0.2774292
3  head 0.03496399 0.03  1.0844412
4  head 0.79154929 0.79 -2.3456977
5  tail 0.56024208 0.56  0.4291247
6  head 0.07422517 0.07  0.5060559
7  head 0.13152294 0.13 -0.5747400
8  tail 0.29412388 0.29 -0.5466319
9  tail 0.50076126 0.50 -0.5644520
10 head 0.08832446 0.09 -0.8900378
mylist[[2]]
   coin      score    z     newcol
1  head 0.54781352 0.55 -1.2070657
2  tail 0.21815968 0.22  0.2774292
3  head 0.03496399 0.03  1.0844412
4  head 0.79154929 0.79 -2.3456977
5  tail 0.56024208 0.56  0.4291247
6  head 0.07422517 0.07  0.5060559
7  head 0.13152294 0.13 -0.5747400
8  tail 0.29412388 0.29 -0.5466319
9  tail 0.50076126 0.50 -0.5644520
10 head 0.08832446 0.09 -0.8900378

Finally, there is an equivalent version of apply which is suitable for lists: the lapply function (see ?lapply). For example, the following code applies the function class to each element in the list (avoiding several copy-paste commands):

lapply(mylist, class) 
$x
[1] "character"

$df
[1] "data.frame"

$mystring
[1] "character"

It is possible to transform the list into a vector by using the unlist function:

unlist(lapply(mylist, class))
           x           df     mystring 
 "character" "data.frame"  "character" 

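The combination of unlist and lapply is common enough that base R offers sapply, which tries to simplify the result to a vector directly. A minimal sketch on a made-up list (toy_list is a hypothetical object):

```r
# Hypothetical toy list with named elements
toy_list = list(x = c("a", "b"), n = 1:5, s = "hi!")

# lapply returns a list; unlist turns it into a named vector
classes_long = unlist(lapply(toy_list, class))

# sapply gives the same named vector in one step
classes_short = sapply(toy_list, class)
lengths_short = sapply(toy_list, length)

classes_short
lengths_short
```
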
3.3 Write a new function in R

The Do not Repeat Yourself (DRY) principle suggests avoiding repetition in the code, such as copying and pasting the same code to compute the same quantities for different data. The more repetition you have in your code, the more places you need to remember to update when things change, and the more likely you are to introduce errors.

Automating common tasks through functions is a powerful alternative to copy-and-paste. When the input changes, with functions you only need to update the code in one place instead of many (this reduces the chance of accidental mistakes).

To define a new function we use the following structure:

name.of.function = function(argument1, argument2){ 
  statements
  return(something)
}

We would like to create a function that converts temperatures from Celsius to Fahrenheit. Thus, we define a new function named temp_cov which takes as input the vector named c (which stands for Celsius). The function is defined as:

temp_cov = function(c){
  # from Celsius to Fahrenheit
  f=9/5*c +32
  #returns the temperature in Fahrenheit
  return(f) 
}

After running the code, you will see a new object of type function in the Environment panel of RStudio (top right). We can now use this function:

#degrees conversion from Celsius to Fahrenheit when the temperature is 0.
temp_cov(0)
[1] 32
#conversion of a vector of temperatures.
temp_cov(c(10, 3, 35, -10))
[1] 50.0 37.4 95.0 14.0
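
Arguments can also be given default values, which are used when the caller does not specify them. As a minimal sketch, the hypothetical function fahr_to_celsius below (not part of the lab) performs the inverse conversion and rounds the result to a number of digits that defaults to 2.

```r
# Hypothetical helper: Fahrenheit to Celsius, with an optional rounding argument
fahr_to_celsius = function(f, digits = 2){
  # inverse of the Celsius-to-Fahrenheit formula
  celsius = 5/9 * (f - 32)
  # round to the requested number of digits (2 unless specified otherwise)
  return(round(celsius, digits))
}

fahr_to_celsius(32)              # 0
fahr_to_celsius(c(50, 95))       # 10 35
fahr_to_celsius(17, digits = 1)  # -8.3
```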

We can now expand the function so that it converts temperatures in Celsius (c) into both the Fahrenheit (f) and the Kelvin (k) scales. We can do this in three different ways: the first returns a single vector, the second a data frame and the third a list.

#1st Method: VECTOR
temp_cov2 = function(c){
  # from Celsius to Fahrenheit
  f=9/5*c +32
  # from Celsius to Kelvin
  k= c+ 273.15
  #returns the temperature in Fahrenheit and Kelvin 
  return(c(f,k))
}

temp_cov2(10)
[1]  50.00 283.15
#2nd Method: DATA FRAME
temp_cov3 = function(c){
  # from Celsius to Fahrenheit
  f=9/5*c +32
  # from Celsius to Kelvin
  k= c+ 273.15
  #returns the temperature in Fahrenheit and Kelvin 
  return(data.frame(Fahrenheit=f,Kelvin =k))
}
temp_cov3(10)
  Fahrenheit Kelvin
1         50 283.15
#3rd Method:LIST
temp_cov4 = function(c){
  # from Celsius to Fahrenheit
  f=9/5*c +32
  # from Celsius to Kelvin
  k= c+ 273.15
  #returns the temperature in Fahrenheit and Kelvin 
  return(list(Fahrenheit=f,Kelvin =k))
}
temp_cov4(10)
$Fahrenheit
[1] 50

$Kelvin
[1] 283.15
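
The three return types differ mainly in how the results are accessed afterwards, which matters when the input is a vector of several temperatures. The sketch below illustrates this with two hypothetical functions written for this comparison (temps_as_vector and temps_as_list); the argument is named celsius only to avoid reusing the name of the base function c.

```r
# Hypothetical variants of the conversion function, for illustration only
temps_as_vector = function(celsius){
  return(c(9/5 * celsius + 32, celsius + 273.15))
}
temps_as_list = function(celsius){
  return(list(Fahrenheit = 9/5 * celsius + 32, Kelvin = celsius + 273.15))
}

input = c(0, 10)

# Vector output: Fahrenheit and Kelvin values are concatenated without labels
res_vec = temps_as_vector(input)
res_vec              # 32.00 50.00 273.15 283.15

# List output: each scale is kept separate and can be selected by name
res_list = temps_as_list(input)
res_list$Fahrenheit  # 32 50
res_list$Kelvin[2]   # 283.15
```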

3.4 Conditional statements

The general definition for a conditional statement is the following:

if (condition1){
  expr1
  #code executed when condition1 is TRUE
} else if (condition2){ 
  expr2
  #code executed when condition2 is TRUE
} else if (condition3) {
  expr3
  #code executed when condition3 is TRUE
} else { 
  expr4
  #code executed otherwise
}

Let’s say that we now want to expand the temp_cov function to take two inputs: the temperature and its unit of measurement, which can be Celsius or Fahrenheit. The function converts the temperature (value) from the given unit of measurement (unit) to the other one. In this specific case we have:

temp_cov_unit =function(value, unit){
  if (unit=="c"){
    f=9/5* value +32
    return(f)
  } else if (unit=="f"){
    c=5/9*(value - 32)
    return(c)
  } else {
    print("wrong measurement unit")
  }
}

The function can be used as follows:

#from Fahrenheit to Celsius
temp_cov_unit(value=17, unit= "f")
[1] -8.333333
#from Celsius to Fahrenheit
temp_cov_unit(value=8, unit= "c")
[1] 46.4
#wrong unit 
temp_cov_unit(value=25, unit= "k")
[1] "wrong measurement unit"
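
When the unit is wrong, the function above prints a message and returns that text as its value, so a later computation could receive a string instead of a number. A possible alternative, sketched here under the assumption that an explicit error is preferred, is to use stop. The name temp_conv_strict is hypothetical, and tryCatch is used only to show how the error can be handled by the caller.

```r
# Hypothetical stricter variant: an unknown unit raises an error
temp_conv_strict = function(value, unit){
  if (unit == "c"){
    # from Celsius to Fahrenheit
    return(9/5 * value + 32)
  } else if (unit == "f"){
    # from Fahrenheit to Celsius
    return(5/9 * (value - 32))
  } else {
    # stop() interrupts the function and signals an error
    stop("wrong measurement unit: ", unit)
  }
}

ok_result = temp_conv_strict(8, "c")
ok_result   # 46.4

# The caller can decide what to do when an error occurs, here returning NA
bad_result = tryCatch(temp_conv_strict(25, "k"), error = function(e) NA)
bad_result  # NA
```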

3.5 Tidyverse

Tidyverse is a collection of R packages designed for data science (see the figure captioned "Packages included in tidyverse"). All the packages share an underlying design philosophy, grammar, and data structures. See the tidyverse website for more details.

Packages included in tidyverse

Tidyverse functions are often, though not always, faster than their base R counterparts, because many of them are written in a computationally efficient manner. They also have a more consistent syntax and are designed to work primarily with data frames rather than vectors.
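
To give a first idea of the syntax, the sketch below performs the same row selection in base R and with the filter function of dplyr (one of the core tidyverse packages). The data frame toy is made up for illustration, and the dplyr part runs only if the package is already installed.

```r
# Hypothetical toy data frame
toy = data.frame(coin = c("head", "tail", "head"),
                 score = c(0.2, 0.5, 0.9))

# Base R: logical condition inside the square brackets
base_result = toy[toy$coin == "head", ]
base_result

# dplyr: the column is referred to directly by its name
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result = dplyr::filter(toy, coin == "head")
  print(dplyr_result)
}
```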

3.6 Install and load a package

Before using a package, two steps are necessary:

  1. install the package: this has to be done only once (unless you re-install R, or change or reset your computer). It is like buying a light bulb and installing it in the lamp, as illustrated in the figure on installing and loading a package: you do this only once, not every time you need some light in your room. This step can be performed from the RStudio menu, through Tools - Install Packages (see the corresponding figure). Behind this menu shortcut, RStudio uses the install.packages function.

  2. load the package: this is like switching on the light once you have an installed light bulb, something that can be done every time you need some light in the room (see the same figure). Similarly, each package can be loaded whenever you need to use some of the functions it includes. To load the tidyverse package we proceed as follows:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Install and load an R package
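
The two steps can also be written as code. The following is a minimal sketch, not the only way of doing it: it checks whether the package is already installed with requireNamespace, and the installation line is left commented out because it needs an internet connection and has to be run only once.

```r
# Name of the package we want to use
pkg = "tidyverse"

# Step 1: check whether the package is installed
is_installed = requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # install.packages(pkg)   # uncomment to install (to be done only once)
  message("Package '", pkg, "' is not installed yet.")
} else {
  message("Package '", pkg, "' is already installed.")
}

# Step 2 (in every new R session): library(tidyverse)
```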

3.7 Exercises Lecture 2

3.7.1 Exercise 1

Consider the following three vectors:

name = c("Milan","Inter","Napoli","Atalanta","Juventus")
points = c(38, 37, 36, 34, 27)
lastwon = c(TRUE, TRUE, FALSE, TRUE, TRUE)
  1. Create a data frame combining the available information. Name it Teams. Check the structure and dimensions of the data frame.

  2. Create another data frame (Teams2) by selecting only the teams (and all the variables) which won the last match (see lastwon).

  3. Apply, when possible, the function sum and mean to the columns of Teams2.

3.7.2 Exercise 2

Consider the following list:

a = list (x=5, y=10, z=15, w=rnorm(10))
a
$x
[1] 5

$y
[1] 10

$z
[1] 15

$w
 [1] -0.47719270 -0.99838644 -0.77625389  0.06445882  0.95949406 -0.11028549
 [7] -0.51100951 -0.91119542 -0.83717168  2.41583518
  1. Compute the sum of all the elements in a.

  2. Extract the second element of w.

  3. Compute how many elements of w are positive. Compute also the percentage.

3.7.3 Exercise 3

  1. Define a function named myf which takes a single argument \(x\) and returns the value of the function \(y\) which is defined as follows:
  • \(y=x^2+2x+3\) if \(x<0\);
  • \(y=x+3\) if \(0\leq x <2\);
  • \(y=x^2+4x-7\) if \(x\geq 2\).
  2. Evaluate the function at the following values of \(x\): -4.5, 5.90, 122.

3.8 Solutions

3.8.1 Exercise 1

Consider the following three vectors:

name = c("Milan","Inter","Napoli","Atalanta","Juventus")
points = c(38, 37, 36, 34, 27)
lastwon = c(TRUE, TRUE, FALSE, TRUE, TRUE)
  1. Create a data frame combining the available information. Name it Teams. Check the structure and dimensions of the data frame.
Teams= data.frame(name, points, lastwon)
  2. Create another data frame (Teams2) by selecting only the teams (and all the variables) which won the last match (see lastwon).
Teams2= Teams[Teams$lastwon, ] #lastwon is already a logical vector
Teams2
      name points lastwon
1    Milan     38    TRUE
2    Inter     37    TRUE
4 Atalanta     34    TRUE
5 Juventus     27    TRUE
  3. Apply, when possible, the function sum and mean to the columns of Teams2.
apply(Teams2[,-1],2, sum)
 points lastwon 
    136       4 
apply(Teams2[,-1],2, mean)
 points lastwon 
     34       1 

3.8.2 Exercise 2

Consider the following list:

a = list (x=5, y=10, z=15, w=rnorm(10))
a
$x
[1] 5

$y
[1] 10

$z
[1] 15

$w
 [1]  0.1340882 -0.4906859 -0.4405479  0.4595894 -0.6937202 -1.4482049
 [7]  0.5747557 -1.0236557 -0.0151383 -0.9359486
  1. Compute the sum of all the elements in a.
sum(unlist(a))
[1] 26.12053
  2. Extract the second element of w.
a[[4]][2]
[1] -0.4906859

  3. Compute how many elements of w are positive. Compute also the percentage.

sum(a[[4]]>0)
[1] 3
mean(a[[4]]>0)*100
[1] 30

3.8.3 Exercise 3

  1. Define a function named myf which takes a single argument \(x\) and returns the value of the function \(y\) which is defined as follows:
  • \(y=x^2+2x+3\) if \(x<0\);
  • \(y=x+3\) if \(0\leq x <2\);
  • \(y=x^2+4x-7\) if \(x\geq 2\).
myf= function(x){
  if (x<0){
    y= x^2+2*x +3
  } else if (x>=0 && x<2){
    y= x+3
  } else {
    y=x^2 +4*x-7
  }
  return(y)
}
  2. Evaluate the function at the following values of \(x\): -4.5, 5.90, 122.
myf(-4.5)
[1] 14.25
myf(5.90)
[1] 51.41
myf(122)
[1] 15365