dplyr Examples

Author:	Mitch Richling
Updated:	2022-11-29 18:02:56

1. Metadata
2. First Steps
3. Selecting Rows
4. Sorting
5. Selecting Columns
6. New columns
- 6.1. Compute new columns based on other columns
- 6.2. Compute new columns based on other columns and throw away all the old columns
7. Aggregation
8. Join
9. Long & Wide
- 9.1. Long to Wide
- 9.2. Wide to Long

1. Metadata

The home for this HTML file is: https://richmit.github.io/ex-R/dplyr.html

Files related to this document may be found on github: https://github.com/richmit/ex-R

Directory contents:

`src`	-	The org-mode file that generated this HTML document
`docs`	-	This html document
`data`	-	Data files
`tangled`	-	Tangled R code from this document

2. First Steps

2.1. Load the library

library(dplyr)
library(data.table)
library(tidyr)

2.2. Example Data

mtcars

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

2.3. Row Names

mtcarsR <-
  mtcars %>%
  tibble::rownames_to_column('car_name') %>%
  data.table::data.table()
mtcarsR %>%
  head()

            car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1:         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2:     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3:        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4:    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5: Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6:           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2.4. Chaining/Pipes vs Nesting vs Assignments

Using the Tidyverse tends to lead to deeply nested function calls, which can get a bit messy using traditional infix notation. Ofcource the LISPers would tell our syntax is wrong, and the smalltalk peeps would say "just use dots". The Tidyverse has taken a diffrent appraoch: the pipe. In UNIX shells the pipe has long been used as a way to manage a flow of text transformations – the essence of many shell scripts. In R we frequently deal with a flow of data taransformations, so the pipe concept seems liek a good fit. After this section, I'll pretty much stick to the "pipe way".

2.4.1. Chaining (Pipes)

Chaining is a technique using the %>% operator that can be used to "chain" together dplyr calls into sequential steps. Note modern R versions contain native pipe operators that have slightly different semantics. In this document I'll stick to the Tidyverse pipe.

mtcarsR %>%
  dplyr::group_by(cyl, carb) %>%
  dplyr::select(mpg, hp, cyl, carb) %>%
  dplyr::summarise(meanMPG=mean(mpg, na.rm=TRUE),
                   meanCYL=mean(hp,  na.rm=TRUE),
                   .groups = 'drop')  %>%
  filter(meanMPG > 20) %>%
  data.table::data.table()

   cyl carb meanMPG meanCYL
1:   4    1   27.58    77.4
2:   4    2   25.90    87.0

2.4.2. Temporary Variables

A favorite for FORTRAN programmers!

tmp <- dplyr::group_by(mtcarsR, cyl, carb)
tmp <- dplyr::select(tmp, mpg, hp, cyl, carb)
tmp <- dplyr::summarise(tmp,
                        meanMPG=mean(mpg, na.rm=TRUE),
                        meanCYL=mean(hp,  na.rm=TRUE),
                        .groups = 'drop')
tmp <- dplyr::filter(tmp, meanMPG > 20)
tmp <- data.table::data.table(tmp)
tmp

   cyl carb meanMPG meanCYL
1:   4    1   27.58    77.4
2:   4    2   25.90    87.0

2.4.3. Composition

How the LISPers do it, but with worse syntax. ;)

data.table::data.table(dplyr::filter(dplyr::summarise(dplyr::select(dplyr::group_by(mtcarsR,
                                                                                    cyl,
                                                                                    carb),
                                                                    mpg,
                                                                    hp,
                                                                    cyl,
                                                                    carb),
                                                      meanMPG=mean(mpg, na.rm=TRUE),
                                                      meanCYL=mean(hp,  na.rm=TRUE),
                                                      .groups = 'drop'),
                                     meanMPG > 20))

   cyl carb meanMPG meanCYL
1:   4    1   27.58    77.4
2:   4    2   25.90    87.0

3. Selecting Rows

3.1. With a boolean expression

You can actually replace that & with a comma, but I'm usually explicit about this sort of thing

mtcarsR %>%
  dplyr::filter(cyl>4 & carb==4)

               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 3:          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
 4:            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
 5:           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
 6:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
 7: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
 8:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
 9:          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
10:      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4

3.2. With an index

mtcarsR %>%
  dplyr::slice(15:20)

              car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
2: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
3:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
4:            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
5:         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
6:      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

3.3. Pull out a sample row for each distinct record

More useful if dplyr::select is used to limit input to the column(s)

mtcarsR %>%
  dplyr::distinct(cyl, carb)

   cyl carb
1:   6    4
2:   4    1
3:   6    1
4:   8    2
5:   8    4
6:   4    2
7:   8    3
8:   6    6
9:   8    8

4. Sorting

Could have used '-gear' here as 'gear' is numeric, but 'desc' works on strings as well.

mtcarsR %>%
  dplyr::arrange(cyl, desc(gear))

               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
 2:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 4:           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 5:            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 6:            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
 7:         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
 8:      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
 9:           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
10:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
11:       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
12:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
13:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
14:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
15:            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
16:           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
17:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
18:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
19:      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
20:       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
21:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
22:          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
23:          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
24:          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
25:         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
26:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
27: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
28:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
29:    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
30:         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
31:          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
32:    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb

5. Selecting Columns

5.1. With a range

Love how you can use column names with the range operator. Numbers work too, but that is boring.

mtcarsR %>%
  dplyr::select(cyl:drat) %>%
  head()

   cyl disp  hp drat
1:   6  160 110 3.90
2:   6  160 110 3.90
3:   4  108  93 3.85
4:   6  258 110 3.08
5:   8  360 175 3.15
6:   6  225 105 2.76

5.2. Select and rename

mtcarsR %>%
  dplyr::select(displacement=disp, cyl) %>%
  head()

   displacement cyl
1:          160   6
2:          160   6
3:          108   4
4:          258   6
5:          360   8
6:          225   6

5.3. Just rename (but keep other columns)

mtcarsR %>%
  dplyr::rename(displacement=disp, weight=wt) %>%
  head()

            car_name  mpg cyl displacement  hp drat weight  qsec vs am gear carb
1:         Mazda RX4 21.0   6          160 110 3.90  2.620 16.46  0  1    4    4
2:     Mazda RX4 Wag 21.0   6          160 110 3.90  2.875 17.02  0  1    4    4
3:        Datsun 710 22.8   4          108  93 3.85  2.320 18.61  1  1    4    1
4:    Hornet 4 Drive 21.4   6          258 110 3.08  3.215 19.44  1  0    3    1
5: Hornet Sportabout 18.7   8          360 175 3.15  3.440 17.02  0  0    3    2
6:           Valiant 18.1   6          225 105 2.76  3.460 20.22  1  0    3    1

6. New columns

6.1. Compute new columns based on other columns

mtcarsR %>%
  dplyr::mutate(mpc=mpg/cyl, impc=1/mpc) %>%
  head()

            car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb      mpc      impc
1:         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.500000 0.2857143
2:     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.500000 0.2857143
3:        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 5.700000 0.1754386
4:    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.566667 0.2803738
5: Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.337500 0.4278075
6:           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.016667 0.3314917

6.2. Compute new columns based on other columns and throw away all the old columns

mtcarsR %>%
  dplyr::transmute(mpc=mpg/cyl, mpd=mpg/disp) %>%
  head()

        mpc        mpd
1: 3.500000 0.13125000
2: 3.500000 0.13125000
3: 5.700000 0.21111111
4: 3.566667 0.08294574
5: 2.337500 0.05194444
6: 3.016667 0.08044444

7. Aggregation

7.1. Global Aggregation

mtcarsR %>%
  dplyr::summarize(mean_disp=mean(disp), sd_disp=sd(disp), mean_wt=mean(wt)) %>%
  data.table::data.table()

   mean_disp  sd_disp mean_wt
1:  230.7219 123.9387 3.21725

7.2. Aggregate by factor level

mtcarsR %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarize(mean_disp_by_cyl=mean(disp), .groups = 'drop') %>%
  data.table::data.table()

   cyl mean_disp_by_cyl
1:   4         105.1364
2:   6         183.3143
3:   8         353.1000

7.3. Aggregate by multiple factors

mtcarsR %>%
  dplyr::group_by(cyl, gear) %>%
  dplyr::summarize(, mean_disp_by_cyl_and_gear=mean(disp), .groups = 'drop') %>%
  data.table::data.table()

   cyl gear mean_disp_by_cyl_and_gear
1:   4    3                  120.1000
2:   4    4                  102.6250
3:   4    5                  107.7000
4:   6    3                  241.5000
5:   6    4                  163.8000
6:   6    5                  145.0000
7:   8    3                  357.6167
8:   8    5                  326.0000

7.4. Summarize by group and put results back in data frame

mtcarsR %>%
  dplyr::group_by(cyl, gear) %>%
  dplyr::mutate(mean_disp_by_cyl_and_gear=mean(disp)) %>%
  data.table::data.table()

               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb mean_disp_by_cyl_and_gear
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4                  163.8000
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4                  163.8000
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1                  102.6250
 4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1                  241.5000
 5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2                  357.6167
 6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1                  241.5000
 7:          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4                  357.6167
 8:           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2                  102.6250
 9:            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2                  102.6250
10:            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4                  163.8000
11:           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4                  163.8000
12:          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3                  357.6167
13:          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3                  357.6167
14:         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3                  357.6167
15:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4                  357.6167
16: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4                  357.6167
17:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4                  357.6167
18:            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1                  102.6250
19:         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2                  102.6250
20:      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1                  102.6250
21:       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1                  120.1000
22:    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2                  357.6167
23:         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2                  357.6167
24:          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4                  357.6167
25:    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2                  357.6167
26:           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1                  102.6250
27:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2                  107.7000
28:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2                  107.7000
29:      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4                  326.0000
30:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6                  145.0000
31:       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8                  326.0000
32:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2                  102.6250
               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb mean_disp_by_cyl_and_gear

8. Join

8.1. Example Data For Joins

carSurvey <- data.table::fread(header=T, text='
               car_name, GEARS, word
              Mazda RX4,     4, ZoomZoom
          Mazda RX4 Wag,     4, ZoomZoom
     Cadillac Fleetwood,     3, RollingCouch
    Lincoln Continental,     3, RollingCouch
               Delorean,    16, TimeWarp
          Porsche 914-2,     5, SuperCar
           Lotus Europa,     5, SuperCar
           Ferrari Dino,     5, SuperCar
           Ferrari Dino,     5, SuperRedCar
             Volvo 142E,     4, BoxeyButGood
   ')
carSurvey

               car_name GEARS         word
 1:           Mazda RX4     4     ZoomZoom
 2:       Mazda RX4 Wag     4     ZoomZoom
 3:  Cadillac Fleetwood     3 RollingCouch
 4: Lincoln Continental     3 RollingCouch
 5:            Delorean    16     TimeWarp
 6:       Porsche 914-2     5     SuperCar
 7:        Lotus Europa     5     SuperCar
 8:        Ferrari Dino     5     SuperCar
 9:        Ferrari Dino     5  SuperRedCar
10:          Volvo 142E     4 BoxeyButGood

Notes:

The 'gear' column in mtcars and the 'GEARS' column in carSurvey are logically the equivlant; however, they have different names!
The 'word' column is not very descriptive outside of the carSurvey container.

8.2. Inner Join

Here we only get records that match on both the left (x) and right (y) sides.

mtcarsR %>%
  dplyr::inner_join(carSurvey, by=c('car_name', 'gear'='GEARS'))

              car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb         word
1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     ZoomZoom
2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     ZoomZoom
3:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 RollingCouch
4: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 RollingCouch
5:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     SuperCar
6:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     SuperCar
7:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     SuperCar
8:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  SuperRedCar
9:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 BoxeyButGood

8.3. Left Join

Now we get a record for every record on the left (x), and any records on the right (y) that match one on the left (x). For the "extra" records with no matching data on the right (y), NA values are introduced.

mtcarsR %>%
  dplyr::left_join(carSurvey, by=c('car_name', 'gear'='GEARS'))

               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb         word
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     ZoomZoom
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     ZoomZoom
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1         <NA>
 4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1         <NA>
 5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2         <NA>
 6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1         <NA>
 7:          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4         <NA>
 8:           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2         <NA>
 9:            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2         <NA>
10:            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4         <NA>
11:           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4         <NA>
12:          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3         <NA>
13:          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3         <NA>
14:         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3         <NA>
15:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 RollingCouch
16: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 RollingCouch
17:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4         <NA>
18:            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1         <NA>
19:         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2         <NA>
20:      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1         <NA>
21:       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1         <NA>
22:    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2         <NA>
23:         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2         <NA>
24:          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4         <NA>
25:    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2         <NA>
26:           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1         <NA>
27:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     SuperCar
28:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     SuperCar
29:      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4         <NA>
30:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     SuperCar
31:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  SuperRedCar
32:       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8         <NA>
33:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 BoxeyButGood
               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb         word

8.4. Right Join

Now we get a record for every record on the right (y), and any records on the left (x) that match one on the right (y). For the "extra" records with no matching data on the left (x), NA values are introduced.

mtcarsR %>%
  dplyr::right_join(carSurvey, by=c('car_name', 'gear'='GEARS'));

               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb         word
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     ZoomZoom
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     ZoomZoom
 3:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 RollingCouch
 4: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 RollingCouch
 5:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     SuperCar
 6:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     SuperCar
 7:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     SuperCar
 8:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  SuperRedCar
 9:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 BoxeyButGood
10:            Delorean   NA  NA    NA  NA   NA    NA    NA NA NA   16   NA     TimeWarp

8.5. full outer Join

Now we get a record for every record on the left (x) and right (y). Any extra records that don't match the other side get NA values.

mtcarsR %>%
  dplyr::full_join(carSurvey, by=c('car_name', 'gear'='GEARS'));

               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb         word
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     ZoomZoom
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     ZoomZoom
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1         <NA>
 4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1         <NA>
 5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2         <NA>
 6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1         <NA>
 7:          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4         <NA>
 8:           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2         <NA>
 9:            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2         <NA>
10:            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4         <NA>
11:           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4         <NA>
12:          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3         <NA>
13:          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3         <NA>
14:         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3         <NA>
15:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 RollingCouch
16: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 RollingCouch
17:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4         <NA>
18:            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1         <NA>
19:         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2         <NA>
20:      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1         <NA>
21:       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1         <NA>
22:    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2         <NA>
23:         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2         <NA>
24:          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4         <NA>
25:    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2         <NA>
26:           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1         <NA>
27:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     SuperCar
28:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     SuperCar
29:      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4         <NA>
30:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     SuperCar
31:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  SuperRedCar
32:       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8         <NA>
33:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 BoxeyButGood
34:            Delorean   NA  NA    NA  NA   NA    NA    NA NA NA   16   NA     TimeWarp
               car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb         word

8.6. Find left matches

Sometimes you just want to filter a table by keeping only records that match some other table. Note semi_join is not just a left_join followed by the removal of the right columns: 1) The column titles are from the left side only, and 2) records are not duplicated when multiple matches exist on the right.

mtcarsR %>%
  dplyr::semi_join(carSurvey, by=c('car_name', 'gear'='GEARS'));

              car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
4: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
5:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
6:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
7:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
8:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

8.7. Fixing column names

mtcarsR %>%
  dplyr::inner_join(rename(carSurvey, gear=GEARS, survey_word=word), by=c('car_name', 'gear'));

              car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb  survey_word
1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     ZoomZoom
2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     ZoomZoom
3:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 RollingCouch
4: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 RollingCouch
5:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     SuperCar
6:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     SuperCar
7:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     SuperCar
8:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  SuperRedCar
9:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 BoxeyButGood

9. Long & Wide

In the past I had an entire page with examples of how to transform data between wide and long formats using various R tools. Today I almost always use tidyr…

First we need some long data:

longData <-
  mtcarsR %>%
  dplyr::select(gear, carb, mpg) %>%
  dplyr::group_by(gear, carb) %>%
  dplyr::summarise(mpg=mean(mpg), .groups = 'drop') %>%
  data.table::data.table()
longData

    gear carb      mpg
 1:    3    1 20.33333
 2:    3    2 17.15000
 3:    3    3 16.30000
 4:    3    4 12.62000
 5:    4    1 29.10000
 6:    4    2 24.75000
 7:    4    4 19.75000
 8:    5    2 28.20000
 9:    5    4 15.80000
10:    5    6 19.70000
11:    5    8 15.00000

9.1. Long to Wide

wideData <-
  longData %>%
  tidyr::pivot_wider(names_from=carb,
                     names_prefix='carb_',            ### This prepends "carb_" to the column titles created from the  carb column
                     values_from=mpg) %>%
  data.table::data.table()
wideData

   gear   carb_1 carb_2 carb_3 carb_4 carb_6 carb_8
1:    3 20.33333  17.15   16.3  12.62     NA     NA
2:    4 29.10000  24.75     NA  19.75     NA     NA
3:    5       NA  28.20     NA  15.80   19.7     15

9.2. Wide to Long

wideData %>%
  tidyr::pivot_longer(cols=2:7,
                      names_to="carb",
                      names_prefix='carb_',                     ### Wack the "carb_" we added to the column names.
                      names_transform = list(carb=as.integer),  #### Previous line gets rid of the "carb_", but left a string.  Convert it to integers.
                      values_drop_na=TRUE,                      ### Get rid of the NA values we created when we made teh wide dta
                      values_to='mpg') %>%
  data.table::data.table()

    gear carb      mpg
 1:    3    1 20.33333
 2:    3    2 17.15000
 3:    3    3 16.30000
 4:    3    4 12.62000
 5:    4    1 29.10000
 6:    4    2 24.75000
 7:    4    4 19.75000
 8:    5    2 28.20000
 9:    5    4 15.80000
10:    5    6 19.70000
11:    5    8 15.00000