Databases and dbplyr and SQL, oh my!

November 11, 2024

Jo Hardin

Agenda 11/11/24

  1. Database
  2. SQL connections
  3. SQL in R (3 ways)

SQL:Structured Query Language

What is a database?

  • structured collection of data organized with
    • efficient storage
    • easy retrieval
    • consistent management
  • data stored in tables which are linked to one another via keys (called a relational database, think join)

Tidy data

  • data frame (R) or table (SQL)
  • columns of variables
  • rows of observational units

Differences between R and SQL

  • tables in SQL databases can be arbitrarily large
    • live in storage, computer’s hard drive (usually remote)
  • data frames in R
    • live in memory (RAM) on your personal computer
  • tables in a database are linked via a key

Today’s example

The airlines database

Consider a database of US flights between 2013 and 2015. The flights are downloaded from the Bureau of Transportation Statistics, US Department of Transportation. The database is a superset of the nycflights13 R package that tracks only flights in and out of airports serving New York City in 2013.

SQL connection

To set up a SQL connection, you need the location of the server (host) as well as a username and password.

con_air <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "airlines",
  host = Sys.getenv("MDSR_HOST"),
  user = Sys.getenv("MDSR_USER"),
  password = Sys.getenv("MDSR_PWD")
)

Hadley Wickham discusses how to use Sys.getenv: https://cran.r-project.org/web/packages/httr/vignettes/secrets.html

Tables in airlines database

DBI::dbListTables(con_air)
[1] "planes"          "carriers"        "airports"        "flights_summary"
[5] "flights"        

SQL tables as tbl

carriers <- dplyr::tbl(con_air, "carriers")
dim(carriers)
[1] NA  2
head(carriers)
# Source:   SQL [6 x 2]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:3306/airlines]
  carrier name                         
  <chr>   <chr>                        
1 02Q     Titan Airways                
2 04Q     Tradewind Aviation           
3 05Q     Comlux Aviation, AG          
4 06Q     Master Top Linhas Aereas Ltd.
5 07Q     Flair Airlines Ltd.          
6 09Q     Swift Air, LLC               

SQL tables as tibble

The function collect() copies a SQL table from its server location on disk to your local memory location in R.

carriers_tibble <- carriers |>
  dplyr::collect()

dim(carriers_tibble)
[1] 1610    2
head(carriers_tibble)
# A tibble: 6 × 2
  carrier name                         
  <chr>   <chr>                        
1 02Q     Titan Airways                
2 04Q     Tradewind Aviation           
3 05Q     Comlux Aviation, AG          
4 06Q     Master Top Linhas Aereas Ltd.
5 07Q     Flair Airlines Ltd.          
6 09Q     Swift Air, LLC               

How much space does carriers take up?

The data frame in R takes up 2 orders of magnitude of memory more than the table which just points to the object in SQL.

carriers |>
  object.size() |>
  print(units = "Kb")
5.3 Kb
carriers_tibble |>
  object.size() |>
  print(units = "Kb")
234.8 Kb

What is SQL?

  • SQL is a programming language for working with relational databases.
  • SQL has been around since the 1970s, but has, unfortunately, many different dialects.
  • To connect to the mdsr database (via R and DBeaver), use MySQL.
  • To connect to DuckDB, use the dialect native to DuckDB.

Using SQL in RStudio

We will write SQL code in three distinct ways:

  1. Using the package dbplyr R will directly translate dplyr code into SQL.
  2. Using the DBI package, we can send SQL queries through an r chunk.
  3. Using a sql chunk, we can write actual SQL code inside a Quarto document.

1. Translating dplyr code into SQL

The function dbListTables() in the DBI package will tell us what tables exist in the airlines database.

DBI::dbListTables(con_air)
[1] "planes"          "carriers"        "airports"        "flights_summary"
[5] "flights"        
flights <- dplyr::tbl(con_air, "flights")
carriers <- dplyr::tbl(con_air, "carriers")

1. Translating dplyr code into SQL

  • Over what years is the flights data taken?
yrs <- flights |>
  summarize(min_year = min(year), max_year = max(year))

yrs
# Source:   SQL [1 x 2]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:3306/airlines]
  min_year max_year
     <int>    <int>
1     2013     2015

1. Translating dplyr code into SQL

Because flights is not actually a data.frame in R (but instead a tbl in SQL), the work that was done above was actually performed in SQL. To see the SQL code, we can use the function show_query.

dplyr::show_query(yrs)
<SQL>
SELECT MIN(`year`) AS `min_year`, MAX(`year`) AS `max_year`
FROM `flights`

1. Translating dplyr code into SQL

  • Create a data set containing only flights between LAX and BOS in 2015.
la_bos <- flights |>
  filter(year == 2015 & ((origin == "LAX" & dest == "BOS") | 
           (origin == "BOS" & dest == "LAX"))) 

dplyr::show_query(la_bos)
<SQL>
SELECT `flights`.*
FROM `flights`
WHERE (`year` = 2015.0 AND ((`origin` = 'LAX' AND `dest` = 'BOS') OR (`origin` = 'BOS' AND `dest` = 'LAX')))

1. Translating dplyr code into SQL

  • dbplyr doesn’t translate every R command into SQL.

  • SQL is not a statistical software and doesn’t, for example, have a mechanism for creating data visualizations.

  • track which R commands are connected to SQL at the dbplyr reference sheet.

2. SQL queries via the DBI package

  • Look at the first few rows of the flights data.
DBI::dbGetQuery(con_air,
                "SELECT * FROM flights LIMIT 8;")
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013    10   1        2             10        -8      453            505
2 2013    10   1        4           2359         5      730            729
3 2013    10   1       11             15        -4      528            530
4 2013    10   1       14           2355        19      544            540
5 2013    10   1       16             17        -1      515            525
6 2013    10   1       22             20         2      552            554
7 2013    10   1       29             35        -6      808            816
8 2013    10   1       29             35        -6      449            458
  arr_delay carrier tailnum flight origin dest air_time distance cancelled
1       -12      AA  N201AA   2400    LAX  DFW      149     1235         0
2         1      FL  N344AT    710    SFO  ATL      247     2139         0
3        -2      AA  N3KMAA   1052    SFO  DFW      182     1464         0
4         4      AA  N3ENAA   2392    SEA  ORD      191     1721         0
5       -10      UA  N38473   1614    LAX  IAH      157     1379         0
6        -2      UA  N458UA    291    SFO  IAH      188     1635         0
7        -8      US  N551UW    436    LAX  CLT      256     2125         0
8        -9      AS  N402AS    108    ANC  SEA      181     1448         0
  diverted hour minute           time_hour
1        0    0     10 2013-10-01 00:10:00
2        0   23     59 2013-10-01 23:59:00
3        0    0     15 2013-10-01 00:15:00
4        0   23     55 2013-10-01 23:55:00
5        0    0     17 2013-10-01 00:17:00
6        0    0     20 2013-10-01 00:20:00
7        0    0     35 2013-10-01 00:35:00
8        0    0     35 2013-10-01 00:35:00

2. SQL queries via the DBI package

  • How many flights per year are in the flights table?
DBI::dbGetQuery(con_air, 
  "SELECT year, count(*) AS num_flights FROM flights GROUP BY year ORDER BY num_flights;")
  year num_flights
1 2015     5819079
2 2014     5819811
3 2013     6369482

3. Direct SQL queries via sql chunks

SQL queries can be written directly inside a sql chunk in RStudio.

```{sql}
#| connection: con_air

SELECT * FROM flights LIMIT 8;
```
8 records
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier tailnum flight origin dest air_time distance cancelled diverted hour minute time_hour
2013 10 1 2 10 -8 453 505 -12 AA N201AA 2400 LAX DFW 149 1235 0 0 0 10 2013-10-01 00:10:00
2013 10 1 4 2359 5 730 729 1 FL N344AT 710 SFO ATL 247 2139 0 0 23 59 2013-10-01 23:59:00
2013 10 1 11 15 -4 528 530 -2 AA N3KMAA 1052 SFO DFW 182 1464 0 0 0 15 2013-10-01 00:15:00
2013 10 1 14 2355 19 544 540 4 AA N3ENAA 2392 SEA ORD 191 1721 0 0 23 55 2013-10-01 23:55:00
2013 10 1 16 17 -1 515 525 -10 UA N38473 1614 LAX IAH 157 1379 0 0 0 17 2013-10-01 00:17:00
2013 10 1 22 20 2 552 554 -2 UA N458UA 291 SFO IAH 188 1635 0 0 0 20 2013-10-01 00:20:00
2013 10 1 29 35 -6 808 816 -8 US N551UW 436 LAX CLT 256 2125 0 0 0 35 2013-10-01 00:35:00
2013 10 1 29 35 -6 449 458 -9 AS N402AS 108 ANC SEA 181 1448 0 0 0 35 2013-10-01 00:35:00

3. Direct SQL queries via sql chunks

SQL queries can be written directly inside a sql chunk in RStudio.

```{sql}
#| connection: con_air

SELECT year, count(*) AS num_flights 
       FROM flights 
       GROUP BY year 
       ORDER BY num_flights;
```
3 records
year num_flights
2015 5819079
2014 5819811
2013 6369482

Good practice

Always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_air, shutdown = TRUE)

Using SQL in DBeaver

  • DBeaver is a free SQL client that supports MySQL

  • writing SQL code in R has some benefits (e.g., piping results tables into ggplot2 for visualizations)

  • using a SQL client that is designed for SQL queries has benefits as well.

  • to use DBeaver, download the client onto your computer and open it from your Applications.