12 Frequent Problems

12.1 First exploration of new dataset

The skim()shows how many missing and unique values each variable has. It uses appropriate measures to describe each variable based on its type: character, numeric or list.

skimr::skim(starwars)
Table 12.1: Data summary
Name starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 12 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 17 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.36 34.77 66 167.0 180 191.0 264 ▁▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁

The glimpse function, on the other hand, gives us a good peak at the first raw values each variable has.

glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

12.2 A Count and prop table

First way with forcats::fct_count() Calculates a count and prop table.

starwars$sex %>%
 factor() %>% 
  fct_count(sort = T, prop = T)
## # A tibble: 5 × 3
##   f                  n      p
##   <fct>          <int>  <dbl>
## 1 male              60 0.690 
## 2 female            16 0.184 
## 3 none               6 0.0690
## 4 <NA>               4 0.0460
## 5 hermaphroditic     1 0.0115

Second way with deplyr::count() Simply mutate a frequency and percentage column on a counted table.

starwars %>% 
 count(sex) %>% 
  mutate(freq = n / sum(n)) %>% 
  mutate(perc = freq * 100)
## # A tibble: 5 × 4
##   sex                n   freq  perc
##   <chr>          <int>  <dbl> <dbl>
## 1 female            16 0.184  18.4 
## 2 hermaphroditic     1 0.0115  1.15
## 3 male              60 0.690  69.0 
## 4 none               6 0.0690  6.90
## 5 <NA>               4 0.0460  4.60

12.3 Bar graph with count data

Here is a situation where we calculated a count table for hair color - we summarized all values. If we then want to plot a bar graph based on that count table we run into problems, because ggplot2 is expecting a non-summarized or normal data frame.

hair_color_table = starwars %>% 
  mutate(hair_color = fct_lump_min(hair_color, 2)) %>% 
  group_by(hair_color) %>% 
  summarise(n = n())
hair_color_table
## # A tibble: 7 × 2
##   hair_color     n
##   <fct>      <int>
## 1 black         13
## 2 blond          3
## 3 brown         18
## 4 none          37
## 5 white          4
## 6 Other          7
## 7 <NA>           5

To tell the function that we have already summarized data, we add the argument stat = "identity" to the geom_bar() function.

hair_color_table %>% 
  ggplot(aes(x = reorder(hair_color, n), y = n, fill = hair_color)) + 
  geom_bar(stat = "identity") + 
  theme(legend.position = "none")

12.4 Bar graph with percentage labels

First we create a table with counts and percentages:

d = starwars %>% 
  group_by(gender) %>% 
  summarise(count = n()) %>% 
  mutate(percentage = count/sum(count))
d
## # A tibble: 3 × 3
##   gender    count percentage
##   <chr>     <int>      <dbl>
## 1 feminine     17     0.195 
## 2 masculine    66     0.759 
## 3 <NA>          4     0.0460

Then we plot a graph with bar and with percentage labels.

d %>% 
  ggplot(aes(gender, percentage, label = round(percentage, 2), fill = gender)) + 
  geom_bar(stat = "identity") + 
  geom_label(aes(fill = NA), fill = "white") + 
  theme(legend.position = "none")

12.5 Collapse factors to „Other”

This syntax mutates the categorical variable homeworld into eight of its most frequent values. The other values are being collapsed into the categorical value „other”.

starwars %>% 
  mutate(homeworld = fct_lump_n(homeworld, n = 8)) %>% 
  group_by(homeworld) %>% 
  summarise(mean(height, na.rm =T), mean(mass, na.rm = T), n())
## # A tibble: 11 × 4
##    homeworld `mean(height, na.rm = T)` `mean(mass, na.rm = T)` `n()`
##    <fct>                         <dbl>                   <dbl> <int>
##  1 Alderaan                       176.                    64       3
##  2 Corellia                       175                     78.5     2
##  3 Coruscant                      174.                    50       3
##  4 Kamino                         208.                    83.1     3
##  5 Kashyyyk                       231                    124       2
##  6 Mirial                         168                     53.1     2
##  7 Naboo                          175.                    64.2    11
##  8 Ryloth                         179                     55       2
##  9 Tatooine                       170.                    85.4    10
## 10 Other                          173.                   117.     39
## 11 <NA>                           139.                    82      10

12.6 Filter for specific values

We can easily filter out cases with certain column values, like for example the states of Hawai and Alaska. We use filter(), the operator ! and %in%.

starwars %>% 
  filter(!homeworld%in%c("Tatooine","Naboo")) %>% 
  select(name, homeworld)
## # A tibble: 66 × 2
##    name                  homeworld 
##    <chr>                 <chr>     
##  1 Leia Organa           Alderaan  
##  2 Obi-Wan Kenobi        Stewjon   
##  3 Wilhuff Tarkin        Eriadu    
##  4 Chewbacca             Kashyyyk  
##  5 Han Solo              Corellia  
##  6 Greedo                Rodia     
##  7 Jabba Desilijic Tiure Nal Hutta 
##  8 Wedge Antilles        Corellia  
##  9 Jek Tono Porkins      Bestine IV
## 10 Yoda                  <NA>      
## # … with 56 more rows

12.7 Change bar colors in barplot

You can manually pick the colors with fill and a vector containing the color values. Either in String, written out.

starwars %>% 
  mutate(sex = fct_infreq(sex)) %>% 
  ggplot(aes(sex)) + 
  geom_bar(fill = c("red","blue","green","black","grey")) 

Or with RGB Color Codes.

starwars %>% 
    mutate(sex = fct_infreq(sex)) %>%
    ggplot(aes(sex)) +
    geom_bar(fill = c("#003f5c","#58508d","#bc5090","#ff6361","#ffa600")) 

12.8 Hide aes(color) mapping legend

Here is an example where we want the bar colored based on the variable itself, but without the mapping legend.

starwars %>% 
  mutate(sex = fct_infreq(sex)) %>% 
  ggplot(aes(sex, fill = sex)) + 
  geom_bar()

Hide the geom_bar legend.

starwars %>% 
  mutate(sex = fct_infreq(sex)) %>% 
  ggplot(aes(sex, fill = sex)) + 
  geom_bar(show.legend = F)

Remove just the legend title:

starwars %>% 
  mutate(sex = fct_infreq(sex)) %>% 
  ggplot(aes(sex, fill = sex)) + 
  geom_bar() +
  theme(legend.title = element_blank())

Hide all legends created:

starwars %>% 
  mutate(sex = fct_infreq(sex)) %>% 
  ggplot(aes(sex, fill = sex)) + 
  geom_bar() +
  theme(legend.position = "none")

12.9 Re-code values of categorical variables

First way We can use fct_collapse()to create a new column with the new recoded values in it.

Second way By using mutate, to create a new column with our own values and case_when, to run through our observations looking for defined cases, together with “variable” %in%, we can create our own groups.

gapminder %>% 
 mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe","Southern Europe","Northern America", "Australia and New Zealand") ~ "West", # If region is one of values -> assign it "West" in new group column.
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & 
      region != "Northern Africa" ~ "Sub-Saharan",
    TRUE ~ "Others")) %>%  # If nothing above applies -> assign it "Others" in group column
  head(10)
##                country year infant_mortality life_expectancy fertility
## 1              Albania 1960           115.40           62.87      6.19
## 2              Algeria 1960           148.20           47.50      7.65
## 3               Angola 1960           208.00           35.98      7.32
## 4  Antigua and Barbuda 1960               NA           62.97      4.43
## 5            Argentina 1960            59.87           65.39      3.11
## 6              Armenia 1960               NA           66.86      4.55
## 7                Aruba 1960               NA           65.66      4.82
## 8            Australia 1960            20.30           70.87      3.45
## 9              Austria 1960            37.30           68.75      2.70
## 10          Azerbaijan 1960               NA           61.33      5.57
##    population          gdp continent                    region         group
## 1     1636054           NA    Europe           Southern Europe          West
## 2    11124892  13828152297    Africa           Northern Africa        Others
## 3     5270844           NA    Africa             Middle Africa   Sub-Saharan
## 4       54681           NA  Americas                 Caribbean Latin America
## 5    20619075 108322326649  Americas             South America Latin America
## 6     1867396           NA      Asia              Western Asia        Others
## 7       54208           NA  Americas                 Caribbean Latin America
## 8    10292328  96677859364   Oceania Australia and New Zealand          West
## 9     7065525  52392699681    Europe            Western Europe          West
## 10    3897889           NA      Asia              Western Asia        Others

We turn this group variable into a factor to control the order of the levels:

12.10 Order color legend

Order color legend by a variable’s values.

12.11 Show unique values

Display all unique values of variable.

distinct(starwars, species) # dplyr function
## # A tibble: 38 × 1
##    species       
##    <chr>         
##  1 Human         
##  2 Droid         
##  3 Wookiee       
##  4 Rodian        
##  5 Hutt          
##  6 Yoda's species
##  7 Trandoshan    
##  8 Mon Calamari  
##  9 Ewok          
## 10 Sullustan     
## # … with 28 more rows

Note: distinct(dat$countries) doesn’t work.

12.12 Slice rows by maximum or minimum values

Note: parameter n must be explicitly written, otherwise it throws an error.

starwars %>% 
  slice_max(height, n = 5)
## # A tibble: 5 × 14
##   name     height  mass hair_color skin_color eye_color birth_year sex   gender 
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  
## 1 Yarael …    264    NA none       white      yellow            NA male  mascul…
## 2 Tarfful     234   136 brown      brown      blue              NA male  mascul…
## 3 Lama Su     229    88 none       grey       black             NA male  mascul…
## 4 Chewbac…    228   112 brown      unknown    blue             200 male  mascul…
## 5 Roos Ta…    224    82 none       grey       orange            NA male  mascul…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Show me 5% of the lowest height rows.

starwars %>% 
  slice_min(height, prop = 0.05)
## # A tibble: 4 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Yoda          66    17 white      green      brown            896 male  mascu…
## 2 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
## 3 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
## 4 Dud Bolt      94    45 none       blue, grey yellow            NA male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

12.13 Show Number of NAs

For a quick check of how many missing values there are in a single column:

sum(is.na(starwars$height))
## [1] 6

And how many are not NAs.

sum(!is.na(starwars$height))
## [1] 81

For a more detailed overview of the whole dataset use skim(). It shows a very useful complete_rate which tells us how much of the column is disturbed by missing values.

skimr::skim(starwars)
Table 12.2: Data summary
Name starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 12 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 17 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.36 34.77 66 167.0 180 191.0 264 ▁▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁

12.14 Drop rows with missing values

Drop rows that have NAvalues in a specific column, here in height.

starwars %>% 
  drop_na(height)
## # A tibble: 81 × 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S…    172    77 blond       fair       blue            19   male  mascu…
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu…
##  3 R2-D2       96    32 <NA>        white, bl… red             33   none  mascu…
##  4 Darth …    202   136 none        white      yellow          41.9 male  mascu…
##  5 Leia O…    150    49 brown       light      brown           19   fema… femin…
##  6 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
##  7 Beru W…    165    75 brown       light      blue            47   fema… femin…
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu…
##  9 Biggs …    183    84 black       light      brown           24   male  mascu…
## 10 Obi-Wa…    182    77 auburn, wh… fair       blue-gray       57   male  mascu…
## # … with 71 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Drop all rows that contain NA in any column.

starwars %>% 
  drop_na()
## # A tibble: 6 × 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk…    172    77 blond       fair       blue            19   male  mascu…
## 2 Obi-Wan…    182    77 auburn, wh… fair       blue-gray       57   male  mascu…
## 3 Anakin …    188    84 blond       fair       blue            41.9 male  mascu…
## 4 Chewbac…    228   112 brown       unknown    blue           200   male  mascu…
## 5 Wedge A…    170    77 brown       fair       hazel           21   male  mascu…
## 6 Darth M…    175    80 none        red        yellow          54   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Filter out any NA containing rows.

starwars %>% 
 na.exclude()
## # A tibble: 29 × 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S…    172    77 blond       fair       blue            19   male  mascu…
##  2 Darth …    202   136 none        white      yellow          41.9 male  mascu…
##  3 Leia O…    150    49 brown       light      brown           19   fema… femin…
##  4 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
##  5 Beru W…    165    75 brown       light      blue            47   fema… femin…
##  6 Biggs …    183    84 black       light      brown           24   male  mascu…
##  7 Obi-Wa…    182    77 auburn, wh… fair       blue-gray       57   male  mascu…
##  8 Anakin…    188    84 blond       fair       blue            41.9 male  mascu…
##  9 Chewba…    228   112 brown       unknown    blue           200   male  mascu…
## 10 Han So…    180    80 brown       fair       brown           29   male  mascu…
## # … with 19 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

12.15 Replace NAs

Replace 0 with value you want as a replacement.

data(na_example)
sum(is.na(na_example))
## [1] 145
no_nas <- ifelse(is.na(na_example), 0, na_example) # "if is NA is true, change value to 0, else keep the value (i.e. na_example)"

sum(is.na(no_nas))
## [1] 0

12.16 The factor variable trap

The FVT is about what happens when you try to return factorized vectors into numeric values. Let’s look at this with this code.

z <-factor(c("12", "13", "14", "15", "12")) # We create an object by directly factorizing a vector. 
z
## [1] 12 13 14 15 12
## Levels: 12 13 14 15
y <- as.numeric(z) # Now we want to convert them into numeric values. 
y # What?
## [1] 1 2 3 4 1

This happened, because we picked up the on the factorization result. factor() assigns every element, based on its value, an integer number.

typeof(z) # 1=12, 13=2, 14=3, 15=4, 12=1
## [1] "integer"

To fix this problem, first convert the object back to character and then to numeric.

y <- as.numeric(as.character(z))
y
## [1] 12 13 14 15 12