April 14, 2025
join
)You may have heard of some of the following SQL dialects
They are very similar, and you will be able to quickly pick up whichever dialect your institution uses. In our work, we will use MySQL.
airlines
databaseConsider a database of US flights between 2013 and 2015. The flights are downloaded from the Bureau of Transportation Statistics, US Department of Transportation. The database is a superset of the nycflights13
R package that tracks only flights in and out of airports serving New York City in 2013.
To set up a SQL connection, you need the location of the server (host
) as well as a user
name and password
.
Hadley Wickham discusses how to use Sys.getenv: https://cran.r-project.org/web/packages/httr/vignettes/secrets.html
airlines
databasetbl
[1] NA 2
# Source: SQL [6 x 2]
# Database: mysql [mdsr_public@mdsr.crcbo51tmesf.us-east-2.rds.amazonaws.com:3306/airlines]
carrier name
<chr> <chr>
1 02Q Titan Airways
2 04Q Tradewind Aviation
3 05Q Comlux Aviation, AG
4 06Q Master Top Linhas Aereas Ltd.
5 07Q Flair Airlines Ltd.
6 09Q Swift Air, LLC
tibble
The function collect()
copies a SQL table from its server location on disk to your local memory location in R.
carriers
take up?The data frame in R takes up 2 orders of magnitude of memory more than the table which just points to the object in SQL.
We will write SQL code in three distinct ways:
r
chunk.sql
chunk, we can write actual SQL code inside a Quarto document.The function dbListTables()
in the DBI package will tell us what tables exist in the airlines database.
flights
data taken?Because flights
is not actually a data.frame
in R (but instead a tbl
in SQL), the work that was done above was actually performed in SQL. To see the SQL code, we can use the function show_query
.
LAX
and BOS
in 2015.dbplyr doesn’t translate every R command into SQL.
SQL is not a statistical software and doesn’t, for example, have a mechanism for creating data visualizations.
track which R commands are connected to SQL at the dbplyr reference sheet.
flights
data. year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013 10 1 2 10 -8 453 505
2 2013 10 1 4 2359 5 730 729
3 2013 10 1 11 15 -4 528 530
4 2013 10 1 14 2355 19 544 540
5 2013 10 1 16 17 -1 515 525
6 2013 10 1 22 20 2 552 554
7 2013 10 1 29 35 -6 808 816
8 2013 10 1 29 35 -6 449 458
arr_delay carrier tailnum flight origin dest air_time distance cancelled
1 -12 AA N201AA 2400 LAX DFW 149 1235 0
2 1 FL N344AT 710 SFO ATL 247 2139 0
3 -2 AA N3KMAA 1052 SFO DFW 182 1464 0
4 4 AA N3ENAA 2392 SEA ORD 191 1721 0
5 -10 UA N38473 1614 LAX IAH 157 1379 0
6 -2 UA N458UA 291 SFO IAH 188 1635 0
7 -8 US N551UW 436 LAX CLT 256 2125 0
8 -9 AS N402AS 108 ANC SEA 181 1448 0
diverted hour minute time_hour
1 0 0 10 2013-10-01 00:10:00
2 0 23 59 2013-10-01 23:59:00
3 0 0 15 2013-10-01 00:15:00
4 0 23 55 2013-10-01 23:55:00
5 0 0 17 2013-10-01 00:17:00
6 0 0 20 2013-10-01 00:20:00
7 0 0 35 2013-10-01 00:35:00
8 0 0 35 2013-10-01 00:35:00
flights
table?sql
chunksSQL queries can be written directly inside a sql
chunk in Quarto.
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | cancelled | diverted | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 10 | 1 | 2 | 10 | -8 | 453 | 505 | -12 | AA | N201AA | 2400 | LAX | DFW | 149 | 1235 | 0 | 0 | 0 | 10 | 2013-10-01 00:10:00 |
2013 | 10 | 1 | 4 | 2359 | 5 | 730 | 729 | 1 | FL | N344AT | 710 | SFO | ATL | 247 | 2139 | 0 | 0 | 23 | 59 | 2013-10-01 23:59:00 |
2013 | 10 | 1 | 11 | 15 | -4 | 528 | 530 | -2 | AA | N3KMAA | 1052 | SFO | DFW | 182 | 1464 | 0 | 0 | 0 | 15 | 2013-10-01 00:15:00 |
2013 | 10 | 1 | 14 | 2355 | 19 | 544 | 540 | 4 | AA | N3ENAA | 2392 | SEA | ORD | 191 | 1721 | 0 | 0 | 23 | 55 | 2013-10-01 23:55:00 |
2013 | 10 | 1 | 16 | 17 | -1 | 515 | 525 | -10 | UA | N38473 | 1614 | LAX | IAH | 157 | 1379 | 0 | 0 | 0 | 17 | 2013-10-01 00:17:00 |
2013 | 10 | 1 | 22 | 20 | 2 | 552 | 554 | -2 | UA | N458UA | 291 | SFO | IAH | 188 | 1635 | 0 | 0 | 0 | 20 | 2013-10-01 00:20:00 |
2013 | 10 | 1 | 29 | 35 | -6 | 808 | 816 | -8 | US | N551UW | 436 | LAX | CLT | 256 | 2125 | 0 | 0 | 0 | 35 | 2013-10-01 00:35:00 |
2013 | 10 | 1 | 29 | 35 | -6 | 449 | 458 | -9 | AS | N402AS | 108 | ANC | SEA | 181 | 1448 | 0 | 0 | 0 | 35 | 2013-10-01 00:35:00 |
sql
chunksSQL queries can be written directly inside a sql
chunk in Quarto.
Always a good idea to terminate the SQL connection when you are done with it.
DBeaver is a free SQL client that supports MySQL
A “client” means that the software is just for SQL (and not for running models or making figures, like R)
writing SQL code in R has some benefits (e.g., piping results tables into ggplot2 for visualizations)
using a SQL client that is designed for SQL queries has benefits as well.
to use DBeaver, download the client onto your computer and open it from your Applications.