November 11, 2024
join
)airlines
databaseConsider a database of US flights between 2013 and 2015. The flights are downloaded from the Bureau of Transportation Statistics, US Department of Transportation. The database is a superset of the nycflights13
R package that tracks only flights in and out of airports serving New York City in 2013.
To set up a SQL connection, you need the location of the server (host
) as well as a user
name and password
.
Hadley Wickham discusses how to use Sys.getenv: https://cran.r-project.org/web/packages/httr/vignettes/secrets.html
airlines
databasetbl
[1] NA 2
# Source: SQL [6 x 2]
# Database: mysql [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:3306/airlines]
carrier name
<chr> <chr>
1 02Q Titan Airways
2 04Q Tradewind Aviation
3 05Q Comlux Aviation, AG
4 06Q Master Top Linhas Aereas Ltd.
5 07Q Flair Airlines Ltd.
6 09Q Swift Air, LLC
tibble
The function collect()
copies a SQL table from its server location on disk to your local memory location in R.
carriers
take up?The data frame in R takes up 2 orders of magnitude of memory more than the table which just points to the object in SQL.
We will write SQL code in three distinct ways:
r
chunk.sql
chunk, we can write actual SQL code inside a Quarto document.The function dbListTables()
in the DBI package will tell us what tables exist in the airlines database.
flights
data taken?Because flights
is not actually a data.frame
in R (but instead a tbl
in SQL), the work that was done above was actually performed in SQL. To see the SQL code, we can use the function show_query
.
LAX
and BOS
in 2015.dbplyr doesn’t translate every R command into SQL.
SQL is not a statistical software and doesn’t, for example, have a mechanism for creating data visualizations.
track which R commands are connected to SQL at the dbplyr reference sheet.
flights
data. year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013 10 1 2 10 -8 453 505
2 2013 10 1 4 2359 5 730 729
3 2013 10 1 11 15 -4 528 530
4 2013 10 1 14 2355 19 544 540
5 2013 10 1 16 17 -1 515 525
6 2013 10 1 22 20 2 552 554
7 2013 10 1 29 35 -6 808 816
8 2013 10 1 29 35 -6 449 458
arr_delay carrier tailnum flight origin dest air_time distance cancelled
1 -12 AA N201AA 2400 LAX DFW 149 1235 0
2 1 FL N344AT 710 SFO ATL 247 2139 0
3 -2 AA N3KMAA 1052 SFO DFW 182 1464 0
4 4 AA N3ENAA 2392 SEA ORD 191 1721 0
5 -10 UA N38473 1614 LAX IAH 157 1379 0
6 -2 UA N458UA 291 SFO IAH 188 1635 0
7 -8 US N551UW 436 LAX CLT 256 2125 0
8 -9 AS N402AS 108 ANC SEA 181 1448 0
diverted hour minute time_hour
1 0 0 10 2013-10-01 00:10:00
2 0 23 59 2013-10-01 23:59:00
3 0 0 15 2013-10-01 00:15:00
4 0 23 55 2013-10-01 23:55:00
5 0 0 17 2013-10-01 00:17:00
6 0 0 20 2013-10-01 00:20:00
7 0 0 35 2013-10-01 00:35:00
8 0 0 35 2013-10-01 00:35:00
flights
table?sql
chunksSQL queries can be written directly inside a sql
chunk in RStudio.
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | cancelled | diverted | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 10 | 1 | 2 | 10 | -8 | 453 | 505 | -12 | AA | N201AA | 2400 | LAX | DFW | 149 | 1235 | 0 | 0 | 0 | 10 | 2013-10-01 00:10:00 |
2013 | 10 | 1 | 4 | 2359 | 5 | 730 | 729 | 1 | FL | N344AT | 710 | SFO | ATL | 247 | 2139 | 0 | 0 | 23 | 59 | 2013-10-01 23:59:00 |
2013 | 10 | 1 | 11 | 15 | -4 | 528 | 530 | -2 | AA | N3KMAA | 1052 | SFO | DFW | 182 | 1464 | 0 | 0 | 0 | 15 | 2013-10-01 00:15:00 |
2013 | 10 | 1 | 14 | 2355 | 19 | 544 | 540 | 4 | AA | N3ENAA | 2392 | SEA | ORD | 191 | 1721 | 0 | 0 | 23 | 55 | 2013-10-01 23:55:00 |
2013 | 10 | 1 | 16 | 17 | -1 | 515 | 525 | -10 | UA | N38473 | 1614 | LAX | IAH | 157 | 1379 | 0 | 0 | 0 | 17 | 2013-10-01 00:17:00 |
2013 | 10 | 1 | 22 | 20 | 2 | 552 | 554 | -2 | UA | N458UA | 291 | SFO | IAH | 188 | 1635 | 0 | 0 | 0 | 20 | 2013-10-01 00:20:00 |
2013 | 10 | 1 | 29 | 35 | -6 | 808 | 816 | -8 | US | N551UW | 436 | LAX | CLT | 256 | 2125 | 0 | 0 | 0 | 35 | 2013-10-01 00:35:00 |
2013 | 10 | 1 | 29 | 35 | -6 | 449 | 458 | -9 | AS | N402AS | 108 | ANC | SEA | 181 | 1448 | 0 | 0 | 0 | 35 | 2013-10-01 00:35:00 |
sql
chunksSQL queries can be written directly inside a sql
chunk in RStudio.
Always a good idea to terminate the SQL connection when you are done with it.
DBeaver is a free SQL client that supports MySQL
writing SQL code in R has some benefits (e.g., piping results tables into ggplot2 for visualizations)
using a SQL client that is designed for SQL queries has benefits as well.
to use DBeaver, download the client onto your computer and open it from your Applications.