dplyr Examples
Author: | Mitch Richling |
Updated: | 2022-11-29 18:02:56 |
Copyright 2020-2021 Mitch Richling. All rights reserved.
Table of Contents
1. Metadata
The home for this HTML file is: https://richmit.github.io/ex-R/dplyr.html
Files related to this document may be found on github: https://github.com/richmit/ex-R
Directory contents:
src |
- | The org-mode file that generated this HTML document |
docs |
- | This html document |
data |
- | Data files |
tangled |
- | Tangled R code from this document |
2. First Steps
2.1. Load the library
library(dplyr) library(data.table) library(tidyr)
2.2. Example Data
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
2.3. Row Names
mtcarsR <- mtcars %>% tibble::rownames_to_column('car_name') %>% data.table::data.table() mtcarsR %>% head()
car_name mpg cyl disp hp drat wt qsec vs am gear carb 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
2.4. Chaining/Pipes vs Nesting vs Assignments
Using the Tidyverse tends to lead to deeply nested function calls, which can get a bit messy using traditional infix notation. Ofcource the LISPers would tell our syntax is wrong, and the smalltalk peeps would say "just use dots". The Tidyverse has taken a diffrent appraoch: the pipe. In UNIX shells the pipe has long been used as a way to manage a flow of text transformations – the essence of many shell scripts. In R we frequently deal with a flow of data taransformations, so the pipe concept seems liek a good fit. After this section, I'll pretty much stick to the "pipe way".
2.4.1. Chaining (Pipes)
Chaining is a technique using the %>% operator that can be used to "chain" together dplyr calls into sequential steps. Note modern R versions contain native pipe operators that have slightly different semantics. In this document I'll stick to the Tidyverse pipe.
mtcarsR %>% dplyr::group_by(cyl, carb) %>% dplyr::select(mpg, hp, cyl, carb) %>% dplyr::summarise(meanMPG=mean(mpg, na.rm=TRUE), meanCYL=mean(hp, na.rm=TRUE), .groups = 'drop') %>% filter(meanMPG > 20) %>% data.table::data.table()
cyl carb meanMPG meanCYL 1: 4 1 27.58 77.4 2: 4 2 25.90 87.0
2.4.2. Temporary Variables
A favorite for FORTRAN programmers!
tmp <- dplyr::group_by(mtcarsR, cyl, carb) tmp <- dplyr::select(tmp, mpg, hp, cyl, carb) tmp <- dplyr::summarise(tmp, meanMPG=mean(mpg, na.rm=TRUE), meanCYL=mean(hp, na.rm=TRUE), .groups = 'drop') tmp <- dplyr::filter(tmp, meanMPG > 20) tmp <- data.table::data.table(tmp) tmp
cyl carb meanMPG meanCYL 1: 4 1 27.58 77.4 2: 4 2 25.90 87.0
2.4.3. Composition
How the LISPers do it, but with worse syntax. ;)
data.table::data.table(dplyr::filter(dplyr::summarise(dplyr::select(dplyr::group_by(mtcarsR, cyl, carb), mpg, hp, cyl, carb), meanMPG=mean(mpg, na.rm=TRUE), meanCYL=mean(hp, na.rm=TRUE), .groups = 'drop'), meanMPG > 20))
cyl carb meanMPG meanCYL 1: 4 1 27.58 77.4 2: 4 2 25.90 87.0
3. Selecting Rows
3.1. With a boolean expression
You can actually replace that & with a comma, but I'm usually explicit about this sort of thing
mtcarsR %>%
dplyr::filter(cyl>4 & carb==4)
car_name mpg cyl disp hp drat wt qsec vs am gear carb 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3: Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 4: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 5: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 6: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 8: Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 9: Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 10: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
3.2. With an index
mtcarsR %>%
dplyr::slice(15:20)
car_name mpg cyl disp hp drat wt qsec vs am gear carb 1: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 3: Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 4: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 5: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 6: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
3.3. Pull out a sample row for each distinct record
More useful if dplyr::select
is used to limit input to the column(s)
mtcarsR %>%
dplyr::distinct(cyl, carb)
cyl carb 1: 6 4 2: 4 1 3: 6 1 4: 8 2 5: 8 4 6: 4 2 7: 8 3 8: 6 6 9: 8 8
4. Sorting
Could have used '-gear' here as 'gear' is numeric, but 'desc' works on strings as well.
mtcarsR %>%
dplyr::arrange(cyl, desc(gear))
car_name mpg cyl disp hp drat wt qsec vs am gear carb 1: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 2: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4: Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 5: Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 6: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 8: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 9: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 10: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 11: Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 12: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 13: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 14: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 15: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 16: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 17: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 18: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 19: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 20: Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 21: Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 22: Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 23: Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 24: Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 25: Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 26: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 27: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 28: Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 29: Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 30: AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 31: Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 32: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 car_name mpg cyl disp hp drat wt qsec vs am gear carb
5. Selecting Columns
5.1. With a range
Love how you can use column names with the range operator. Numbers work too, but that is boring.
mtcarsR %>% dplyr::select(cyl:drat) %>% head()
cyl disp hp drat 1: 6 160 110 3.90 2: 6 160 110 3.90 3: 4 108 93 3.85 4: 6 258 110 3.08 5: 8 360 175 3.15 6: 6 225 105 2.76
5.2. Select and rename
mtcarsR %>% dplyr::select(displacement=disp, cyl) %>% head()
displacement cyl 1: 160 6 2: 160 6 3: 108 4 4: 258 6 5: 360 8 6: 225 6
5.3. Just rename (but keep other columns)
mtcarsR %>% dplyr::rename(displacement=disp, weight=wt) %>% head()
car_name mpg cyl displacement hp drat weight qsec vs am gear carb 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6. New columns
6.1. Compute new columns based on other columns
mtcarsR %>% dplyr::mutate(mpc=mpg/cyl, impc=1/mpc) %>% head()
car_name mpg cyl disp hp drat wt qsec vs am gear carb mpc impc 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.500000 0.2857143 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.500000 0.2857143 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5.700000 0.1754386 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.566667 0.2803738 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.337500 0.4278075 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.016667 0.3314917
6.2. Compute new columns based on other columns and throw away all the old columns
mtcarsR %>% dplyr::transmute(mpc=mpg/cyl, mpd=mpg/disp) %>% head()
mpc mpd 1: 3.500000 0.13125000 2: 3.500000 0.13125000 3: 5.700000 0.21111111 4: 3.566667 0.08294574 5: 2.337500 0.05194444 6: 3.016667 0.08044444
7. Aggregation
7.1. Global Aggregation
mtcarsR %>% dplyr::summarize(mean_disp=mean(disp), sd_disp=sd(disp), mean_wt=mean(wt)) %>% data.table::data.table()
mean_disp sd_disp mean_wt 1: 230.7219 123.9387 3.21725
7.2. Aggregate by factor level
mtcarsR %>% dplyr::group_by(cyl) %>% dplyr::summarize(mean_disp_by_cyl=mean(disp), .groups = 'drop') %>% data.table::data.table()
cyl mean_disp_by_cyl 1: 4 105.1364 2: 6 183.3143 3: 8 353.1000
7.3. Aggregate by multiple factors
mtcarsR %>% dplyr::group_by(cyl, gear) %>% dplyr::summarize(, mean_disp_by_cyl_and_gear=mean(disp), .groups = 'drop') %>% data.table::data.table()
cyl gear mean_disp_by_cyl_and_gear 1: 4 3 120.1000 2: 4 4 102.6250 3: 4 5 107.7000 4: 6 3 241.5000 5: 6 4 163.8000 6: 6 5 145.0000 7: 8 3 357.6167 8: 8 5 326.0000
7.4. Summarize by group and put results back in data frame
mtcarsR %>% dplyr::group_by(cyl, gear) %>% dplyr::mutate(mean_disp_by_cyl_and_gear=mean(disp)) %>% data.table::data.table()
car_name mpg cyl disp hp drat wt qsec vs am gear carb mean_disp_by_cyl_and_gear 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 163.8000 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 163.8000 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 102.6250 4: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 241.5000 5: Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 357.6167 6: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 241.5000 7: Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 357.6167 8: Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 102.6250 9: Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 102.6250 10: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 163.8000 11: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 163.8000 12: Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 357.6167 13: Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 357.6167 14: Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 357.6167 15: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 357.6167 16: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 357.6167 17: Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 357.6167 18: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 102.6250 19: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 102.6250 20: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 102.6250 21: Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 120.1000 22: Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 357.6167 23: AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 357.6167 24: Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 357.6167 25: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 357.6167 26: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 102.6250 27: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 107.7000 28: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 107.7000 29: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 326.0000 30: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 145.0000 31: Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 326.0000 32: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 102.6250 car_name mpg cyl disp hp drat wt qsec vs am gear carb mean_disp_by_cyl_and_gear
8. Join
8.1. Example Data For Joins
carSurvey <- data.table::fread(header=T, text=' car_name, GEARS, word Mazda RX4, 4, ZoomZoom Mazda RX4 Wag, 4, ZoomZoom Cadillac Fleetwood, 3, RollingCouch Lincoln Continental, 3, RollingCouch Delorean, 16, TimeWarp Porsche 914-2, 5, SuperCar Lotus Europa, 5, SuperCar Ferrari Dino, 5, SuperCar Ferrari Dino, 5, SuperRedCar Volvo 142E, 4, BoxeyButGood ') carSurvey
car_name GEARS word 1: Mazda RX4 4 ZoomZoom 2: Mazda RX4 Wag 4 ZoomZoom 3: Cadillac Fleetwood 3 RollingCouch 4: Lincoln Continental 3 RollingCouch 5: Delorean 16 TimeWarp 6: Porsche 914-2 5 SuperCar 7: Lotus Europa 5 SuperCar 8: Ferrari Dino 5 SuperCar 9: Ferrari Dino 5 SuperRedCar 10: Volvo 142E 4 BoxeyButGood
Notes:
- The '
gear
' column inmtcars
and the 'GEARS
' column incarSurvey
are logically the equivlant; however, they have different names! - The '
word
' column is not very descriptive outside of thecarSurvey
container.
8.2. Inner Join
Here we only get records that match on both the left (x) and right (y) sides.
mtcarsR %>% dplyr::inner_join(carSurvey, by=c('car_name', 'gear'='GEARS'))
car_name mpg cyl disp hp drat wt qsec vs am gear carb word 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ZoomZoom 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ZoomZoom 3: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 RollingCouch 4: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 RollingCouch 5: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 SuperCar 6: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 SuperCar 7: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperCar 8: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperRedCar 9: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 BoxeyButGood
8.3. Left Join
Now we get a record for every record on the left (x), and any records on the right (y) that match one on the left (x). For the "extra" records with no
matching data on the right (y), NA
values are introduced.
mtcarsR %>% dplyr::left_join(carSurvey, by=c('car_name', 'gear'='GEARS'))
car_name mpg cyl disp hp drat wt qsec vs am gear carb word 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ZoomZoom 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ZoomZoom 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 <NA> 4: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 <NA> 5: Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 <NA> 6: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 <NA> 7: Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 <NA> 8: Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 <NA> 9: Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 <NA> 10: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 <NA> 11: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 <NA> 12: Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 <NA> 13: Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 <NA> 14: Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 <NA> 15: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 RollingCouch 16: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 RollingCouch 17: Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 <NA> 18: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 <NA> 19: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 <NA> 20: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 <NA> 21: Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 <NA> 22: Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 <NA> 23: AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 <NA> 24: Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 <NA> 25: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 <NA> 26: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 <NA> 27: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 SuperCar 28: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 SuperCar 29: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 <NA> 30: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperCar 31: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperRedCar 32: Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 <NA> 33: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 BoxeyButGood car_name mpg cyl disp hp drat wt qsec vs am gear carb word
8.4. Right Join
Now we get a record for every record on the right (y), and any records on the left (x) that match one on the right (y). For the "extra" records with no
matching data on the left (x), NA
values are introduced.
mtcarsR %>% dplyr::right_join(carSurvey, by=c('car_name', 'gear'='GEARS'));
car_name mpg cyl disp hp drat wt qsec vs am gear carb word 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ZoomZoom 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ZoomZoom 3: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 RollingCouch 4: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 RollingCouch 5: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 SuperCar 6: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 SuperCar 7: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperCar 8: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperRedCar 9: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 BoxeyButGood 10: Delorean NA NA NA NA NA NA NA NA NA 16 NA TimeWarp
8.5. full outer Join
Now we get a record for every record on the left (x) and right (y). Any extra records that don't match the other side get NA
values.
mtcarsR %>% dplyr::full_join(carSurvey, by=c('car_name', 'gear'='GEARS'));
car_name mpg cyl disp hp drat wt qsec vs am gear carb word 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ZoomZoom 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ZoomZoom 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 <NA> 4: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 <NA> 5: Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 <NA> 6: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 <NA> 7: Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 <NA> 8: Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 <NA> 9: Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 <NA> 10: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 <NA> 11: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 <NA> 12: Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 <NA> 13: Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 <NA> 14: Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 <NA> 15: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 RollingCouch 16: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 RollingCouch 17: Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 <NA> 18: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 <NA> 19: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 <NA> 20: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 <NA> 21: Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 <NA> 22: Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 <NA> 23: AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 <NA> 24: Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 <NA> 25: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 <NA> 26: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 <NA> 27: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 SuperCar 28: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 SuperCar 29: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 <NA> 30: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperCar 31: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperRedCar 32: Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 <NA> 33: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 BoxeyButGood 34: Delorean NA NA NA NA NA NA NA NA NA 16 NA TimeWarp car_name mpg cyl disp hp drat wt qsec vs am gear carb word
8.6. Find left matches
Sometimes you just want to filter a table by keeping only records that match some other table. Note semi_join
is not just a left_join
followed by the
removal of the right columns: 1) The column titles are from the left side only, and 2) records are not duplicated when multiple matches exist on the right.
mtcarsR %>% dplyr::semi_join(carSurvey, by=c('car_name', 'gear'='GEARS'));
car_name mpg cyl disp hp drat wt qsec vs am gear carb 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 4: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 5: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 6: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 8: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
8.7. Fixing column names
mtcarsR %>% dplyr::inner_join(rename(carSurvey, gear=GEARS, survey_word=word), by=c('car_name', 'gear'));
car_name mpg cyl disp hp drat wt qsec vs am gear carb survey_word 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ZoomZoom 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ZoomZoom 3: Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 RollingCouch 4: Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 RollingCouch 5: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 SuperCar 6: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 SuperCar 7: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperCar 8: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 SuperRedCar 9: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 BoxeyButGood
9. Long & Wide
In the past I had an entire page with examples of how to transform data between wide and long formats using various R tools. Today I almost always use tidyr
…
First we need some long data:
longData <- mtcarsR %>% dplyr::select(gear, carb, mpg) %>% dplyr::group_by(gear, carb) %>% dplyr::summarise(mpg=mean(mpg), .groups = 'drop') %>% data.table::data.table() longData
gear carb mpg 1: 3 1 20.33333 2: 3 2 17.15000 3: 3 3 16.30000 4: 3 4 12.62000 5: 4 1 29.10000 6: 4 2 24.75000 7: 4 4 19.75000 8: 5 2 28.20000 9: 5 4 15.80000 10: 5 6 19.70000 11: 5 8 15.00000
9.1. Long to Wide
wideData <- longData %>% tidyr::pivot_wider(names_from=carb, names_prefix='carb_', ### This prepends "carb_" to the column titles created from the carb column values_from=mpg) %>% data.table::data.table() wideData
gear carb_1 carb_2 carb_3 carb_4 carb_6 carb_8 1: 3 20.33333 17.15 16.3 12.62 NA NA 2: 4 29.10000 24.75 NA 19.75 NA NA 3: 5 NA 28.20 NA 15.80 19.7 15
9.2. Wide to Long
wideData %>% tidyr::pivot_longer(cols=2:7, names_to="carb", names_prefix='carb_', ### Wack the "carb_" we added to the column names. names_transform = list(carb=as.integer), #### Previous line gets rid of the "carb_", but left a string. Convert it to integers. values_drop_na=TRUE, ### Get rid of the NA values we created when we made teh wide dta values_to='mpg') %>% data.table::data.table()
gear carb mpg 1: 3 1 20.33333 2: 3 2 17.15000 3: 3 3 16.30000 4: 3 4 12.62000 5: 4 1 29.10000 6: 4 2 24.75000 7: 4 4 19.75000 8: 5 2 28.20000 9: 5 4 15.80000 10: 5 6 19.70000 11: 5 8 15.00000