5 Tidyverse
5.1 dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
select()picks variables based on their names.rename()renames variables names.filter()picks cases based on their values.arrange()changes the ordering of the rows.
slice()picks cases based on their indices.mutate()adds new variables that are functions of existing variables.relocate()changes variable positions.
summarise()reduces multiple values down to a single summary.group_by()groups tibble in order to calculate grouped statistics withsummarise().pull()pulls the vector value from a summarize tibble.
5.1.1 Pipe
With dplyr we can perform a series of operations, for example select and then filter, by sending the results of one function to another using what is called the pipe operator: %>%. In dplyr we can write code that looks more like a description of what we want to do without intermediate objects.
starwars %>% # Pass on starwars dataset...
select(name, height, mass) %>% # ... select name, height, mass variables and pass on...
rename(kilograms = mass) %>% # ... rename mass to kilograms and pass on...
filter(kilograms > 50) # filter cases bigger than 50 kilograms.In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. Here is a very simple example:
16 %>% sqrt()
#> [1] 4We can continue to pipe values along:
16 %>% sqrt() %>% log2()
#> [1] 2The above statement is equivalent to log2(sqrt(16)).
Remember that the pipe sends values to the first argument, so we can define other arguments as if the first argument is already defined:
16 %>% sqrt() %>% log(base = 2)
#> [1] 2Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument since the dplyr functions we have described all take the data as the first argument.
5.1.2 Select
https://dplyr.tidyverse.org/reference/select.html
Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset.
starwars %>% select(name, birth_year, homeworld)
# A tibble: 3 x 3
# name birth_year homeworld
# <chr> <dbl> <chr>
# 1 Luke Skywalker 19 Tatooine
# 2 C-3PO 112 Tatooine
# 3 R2-D2 33 Naboo Equivalent to:
starwars[,c("name","birth_year","homeworld")]5.1.3 Rename
https://dplyr.tidyverse.org/reference/rename.html
You could also rename variables in select(), but because it drops all the variables not explicitly mentioned, it’s not that useful. Instead, use rename():
starwars %>% rename(planet = homeworld) # new_name = variable_name
# A tibble: 3 x 3
# name birth_year planet
# <chr> <dbl> <chr>
# 1 Luke Skywalker 19 Tatooine
# 2 C-3PO 112 Tatooine
# 3 R2-D2 33 Naboo Equivalent to:
colnames(starwars)[10] = "planet" # You'd have to find out the column number.5.1.4 Filter
https://dplyr.tidyverse.org/reference/filter.html
starwars %>% filter(eye_color == "blue", mass < 77)
# A tibble: 3 x 3
# name height mass hair_color skin_color eye_color
# <chr> <int> <dbl> <chr> <chr> <chr>
# 1 Beru Whi… 165 75 brown light blue
# 2 Adi Gall… 184 50 none dark blue
# 3 Luminara… 170 56.2 black yellow blueRoughly the equivalent to:
starwars[starwars$eye_color == "blue" & starwars$mass < 77, ]5.1.5 Arrange
https://dplyr.tidyverse.org/reference/arrange.html
This function arranges tables by variables. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
starwars %>% arrange(mass)
# A tibble: 3 x 3
# name height mass
# <chr> <int> <dbl>
# 1 Ratts… 79 15
# 2 Yoda 66 17
# 3 Wicke… 88 20Use desc()to order a column in descending order:
starwars %>% arrange(desc(mass))
# A tibble: 3 x 3
# name height mass
# <chr> <int> <dbl>
# 1 Jabba… 175 1358
# 2 Griev… 216 159
# 3 IG-88 200 1405.1.6 Slice
https://dplyr.tidyverse.org/reference/slice.html
slice() lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows.
starwars %>% slice(1:3)
# A tibble: 3 x 5
# name height mass hair_color skin_color
# <chr> <int> <dbl> <chr> <chr>
# 1 Luke … 172 77 blond fair
# 2 C-3PO 167 75 NA gold
# 3 R2-D2 96 32 NA white, blueOther Slices
Slice is accompanied by a number of helpers for common use cases: * slice_head() and slice_tail() select the first or last rows. * slice_sample() randomly selects rows. * slice_min() and slice_max() select rows with highest or lowest values of a variable.
If .data is a grouped_df, the operation will be performed on each group, so that (e.g.) slice_head(df, n = 5) will select the first five rows in each group.
Examples: https://dplyr.tidyverse.org/reference/slice.html
5.1.7 Mutate
https://dplyr.tidyverse.org/reference/mutate.html
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns.
This creates a new column with height in meters instead of centimeters:
starwars %>%
mutate(height_m = height / 100) %>%
relocate(height_m, .before = height) # Move it before variable height, to the beginning.
# A tibble: 87 x 15
# name height_m height
# <chr> <dbl> <int>
# 1 Luke Sky… 1.72 172
# 2 C-3PO 1.67 167
# 3 R2-D2 0.96 96If you only want to keep the new variables, use transmute().
5.1.8 Relocate
https://dplyr.tidyverse.org/reference/relocate.html
Use a similar syntax as select() to move blocks of columns at once.
starwars %>% relocate(sex:gender, .before = height) # Move all variables between sex and gender before variable height.
# A tibble: 3 x 4
# name sex gender height
# <chr> <chr> <chr> <int>
# 1 Luke … male mascu… 172
# 2 C-3PO none mascu… 167
# 3 R2-D2 none mascu… 96Another syntax that relocates the new variable up front would be:
starwars %>%
mutate(height_m = height / 100) %>%
select(height_m, everything()) # everything() return all other variables.5.1.9 Summarise
https://dplyr.tidyverse.org/reference/summarise.html
It collapses a data frame to a single row. It’s not that useful until we learn the group_by() verb below. This summarise syntax returns the mean and standard deviation for all cases, across species, genders, sex etc. :
starwars %>%
summarise(mean_height = mean(height, na.rm = T),
mean_mass = mean(mass, na.rm = T))
# A tibble: 1 x 2
# mean_height mean_mass
# <dbl> <dbl>
# 1 174. 97.3Summarize with group by
Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group.” ungroup() removes grouping.
starwars %>%
group_by(sex) %>% # Creates a tibble with a group: sex [5] - 5 levels
summarise( # Create summary statistics for height and mass, grouped by sex variable.
height = mean(height, na.rm = TRUE),
mass = mean(mass, na.rm = TRUE))
# A tibble: 5 x 3
# sex height mass
# <chr> <dbl> <dbl>
# 1 female 169. 54.7
# 2 hermaphroditic 175 1358
# 3 male 179. 81.0
# 4 none 131. 69.8
# 5 NA 181. 48 More examples of grouped summary statistics:
starwars %>%
group_by(sex) %>%
summarise(
n = n(),
min_mass = min(mass, na.rm = T),
mean_mass = mean(mass, na.rm = T),
sd_mass = sd(mass, na.rm = T),
max_mass = max(mass, na.rm = T)
)
# A tibble: 5 x 6
# sex n min_mass mean_mass sd_mass max_mass
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 male 60 15 81.0 28.2 159
# 2 female 16 45 54.7 8.59 75
# 3 none 6 32 69.8 51.0 140
# 4 NA 4 48 48 NA 48
# 5 hermaphroditic 1 1358 1358 NA 1358 5.1.10 Pull
https://dplyr.tidyverse.org/reference/pull.html
A useful trick for accessing values stored in data when using pipes: when a data object is piped that object and its columns can be accessed using the pull function.
starwars %>%
group_by(sex) %>%
summarise(height = mean(height, na.rm = TRUE)) %>%
pull(height)
# [1] 169.2667 175.0000 179.1053 131.2000 181.33335.1.11 Distinct values
https://dplyr.tidyverse.org/reference/n_distinct.html
Function returns the number of distinct values in a column. Similar to levels in factors.
n_distinct(starwars$sex)
#[1] 55.1.12 Counts and proportions
https://dplyr.tidyverse.org/reference/count.html
Function lets you quickly count the unique values of one or more variables.
starwars %>%
count(eye_color, sort = T)
# A tibble: 15 × 2
# eye_color n
# <chr> <int>
# 1 brown 21
# 2 blue 19
# 3 yellow 11
# 4 black 10
# 5 orange 8With two variables, count returns the number of times each combination of possible values has happened.
starwars %>%
count(eye_color, hair_color, sort = T, name = "count")
# A tibble: 35 × 3
# eye_color hair_color count
# <chr> <chr> <int>
# 1 black none 9
# 2 brown black 9
# 3 brown brown 9
# 4 blue brown 7
# 5 orange none 7
# 6 yellow none 6
# 7 blue blond 3
# 8 blue none 3
# 9 red none 3
# 10 blue black 2
# … with 25 more rowsCount and Prop Tables
Simply mutate a frequency and percentage column on a counted table.
starwars %>%
count(sex) %>%
mutate(freq = n / sum(n)) %>%
mutate(perc = freq * 100)
# A tibble: 5 × 4
# sex n freq perc
# <chr> <int> <dbl> <dbl>
# 1 female 16 0.184 18.4
# 2 hermaphroditic 1 0.0115 1.15
# 3 male 60 0.690 69.0
# 4 none 6 0.0690 6.90
# 5 NA 4 0.0460 4.605.1.13 Case when
https://dplyr.tidyverse.org/reference/case_when.html
The case_when function is useful for vectorizing conditional statements. It is similar to ifelse but can output any number of values, as opposed to just TRUE or FALSE. Here is an example splitting numbers into negative, positive, and 0:
x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative",
x > 0 ~ "Positive",
TRUE ~ "Zero")
#> [1] "Negative" "Negative" "Zero" "Positive" "Positive"A common use for this function is to define categorical variables based on existing variables. For example, suppose we want to compare the murder rates in four groups of states: New England, West Coast, South, and other. For each state, we need to ask if it is in New England, if it is not we ask if it is in the West Coast, if not we ask if it is in the South, and if not we assign other. Here is how we use case_when to do this:
murders %>%
mutate(group = case_when(
abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
abb %in% c("WA", "OR", "CA") ~ "West Coast",
region == "South" ~ "South",
TRUE ~ "Other")) %>%
group_by(group) %>%
summarize(rate = sum(total) / sum(population) * 10^5)
#> # A tibble: 4 x 2
#> group rate
#> <chr> <dbl>
#> 1 New England 1.72
#> 2 Other 2.71
#> 3 South 3.63
#> 4 West Coast 2.905.1.14 Between
https://dplyr.tidyverse.org/reference/between.html
A common operation in data analysis is to determine if a value falls inside an interval. We can check this using conditionals. For example, to check if the elements of a vector x are between a and b we can type
x >= a & x <= bHowever, this can become cumbersome, especially within the tidyverse approach. The between function performs the same operation.
between(x, a, b)Or on a tibble using filter.
filter(starwars, between(height, 100, 150))
#> # A tibble: 5 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Leia Or… 150 49 brown light brown 19 fema… femini…
#> 2 Mon Mot… 150 NA auburn fair blue 48 fema… femini…
#> 3 Watto 137 NA black blue, grey yellow NA male mascul…
#> 4 Sebulba 112 40 none grey, red orange NA male mascul…
#> 5 Gasgano 122 NA none white, bl… black NA male mascul…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>5.2 forcats
The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors. Factors are useful when you have categorical data, variables that have a fixed and known set of values, and when you want to display character vectors in non-alphabetical order.
5.2.1 Count factors
fct_count() https://forcats.tidyverse.org/reference/fct_count.html
Calculates a count and prop table.
starwars$sex %>%
factor() %>%
fct_count(sort = T, prop = T)
# A tibble: 5 × 3
# f n p
# <fct> <int> <dbl>
# 1 male 60 0.690
# 2 female 16 0.184
# 3 none 6 0.0690
# 4 NA 4 0.0460
# 5 hermaphroditic 1 0.0115fct_match() https://forcats.tidyverse.org/reference/fct_match.html
Test for presence of levels in a factor. Do any of lvls occur in f?
table(fct_match(gss_cat$marital, c("Married", "Divorced"))) # Do those levels exist in marital column? Return cases where it's TRUE and where it's FALSE.FALSE TRUE 7983 13500
5.2.2 Reorder levels
fct_inorder()
When using factor(), by default, it returns levels in ascending order.
f <- factor(c("b", "b", "a", "c", "c", "c"))
f\[1\] b b a c c c Levels: a b c Explicitly order levels in the order the observations appear.
fct_inorder(f)
#> [1] b b a c c c
#> Levels: b a cfct_infreq()
Order the levels by their number of observations.
fct_infreq(f)
#> [1] b b a c c c
#> Levels: c b afct_inseq()
Changes the levels to be in order, meaning in ascending sequence. Useful if you have a factor with unordered levels.
f = factor(1:3, levels = c(3,2,1))
f
# [1] 1 2 3
# Levels: 3 2 1
fct_inseq(f)
# [1] 1 2 3
# Levels: 1 2 3fct_relevel()
https://forcats.tidyverse.org/reference/fct_relevel.html
Move any number of levels to any location. In its simplest form, it pushes each consecutive level entered to the front.
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
fct_relevel(f, "a") # Push "a" to the first place
#> [1] a b c dMove level to a specific position.
fct_relevel(f, "a", after = 2)
#> [1] a b c d
#> Levels: b c a dMove level to the end.
fct_relevel(f, "a", after = Inf)
#> [1] a b c d
#> Levels: b c d afct_rev()
https://forcats.tidyverse.org/reference/fct_rev.html
Reverse order of levels. Often useful when plotting.
f <- factor(c("a", "b", "c"))
fct_rev(f)
#> [1] a b c
#> Levels: c b afct_shift()
Shift factor levels to left or right, wrapping around at end.
x <- factor(
c("Mon", "Tue", "Wed"),
levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
fct_shift(x,1) # Shift levels 1 to the left, i.e. Sun is at last position.
# [1] Mon Tue Wed
# Levels: Mon Tue Wed Thu Fri Sat Sun
fct_shift(x,-2) # Shift levels 2 to the right, i.e. Fri comes out at first position.
# [1] Mon Tue Wed
# Levels: Fri Sat Sun Mon Tue Wed Thufct_shuffle()
Randomly shuffle order of levels.
f <- factor(c("a", "b", "c"))
fct_shuffle(f)
#> [1] a b c
#> Levels: b c afct_reorder
https://forcats.tidyverse.org/reference/fct_reorder.html
Reorder factor levels by sorting along another variable.
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
Reorders the first by the second argument.
df$color <- factor(df$color) # color is our example factor
fct_reorder(df$color, df$a, min)\[1\] blue green purple red yellow Levels: blue red purple yellow green
fct_reorder2() Reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.
fct_reorder2(df$color, df$a, df$b)
#> [1] blue green purple red yellow
#> Levels: purple red blue green yellowThis is very useful when plotting.
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width), data = iris)
# Descending order:
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)5.3 purrr
https://purrr.tidyverse.org
Applying the same function or procedure to elements of an object, is quite common in data analysis. The purrr package includes functions similar to sapply but that better interact with other tidyverse functions.
5.3.1 Map
https://purrr.tidyverse.org/reference/map.html
The first purrr function we will learn is map, which works very similar to sapply but always, without exception, returns a list:
x = 1:5
add_one = function(value) {
value + 1
}
map(x, add_one)
# [[1]]
# [1] 2
# [[2]]
# [1] 3
# [[3]]
# [1] 4
# [[4]]
# [1] 5
# [[5]]
# [1] 65.3.2 Map doubles
If we want a numeric vector, we can instead use map_dbl which always returns a vector of numeric values.
x = 1:5
map_dbl(x, add_one)
# [1] 2 3 4 5 65.3.3 Map dataframes
5.3.4 Map functions/closures
Instead of creating and storing a separate function which we’ll apply via mapin the next step like this:
x = 1:5
add_one = function(value) {
value + 1
}
map_dbl(x, add_one)
# [1] 2 3 4 5 6
`
We can simply create an "anonymous function" inside the `map` function.
`r
map_dbl(x, function(x) x + 1)
# [1] 2 3 4 5 6