SQL clauses

November 13, 2024

Jo Hardin

Agenda 11/13/24 + 11/17/24

  1. Order of clauses
  2. Similarities & differences with tidyverse
  3. SQL queries to R objects

Today’s example

Traffic Stops

Consider a database of information from Stanford Open Policing Project on over 200 million traffic stops in the US.

Establishing a SQL connection

con_traffic <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "traffic",
  host = Sys.getenv("TRAFFIC_HOST"),
  user = Sys.getenv("TRAFFIC_USER"),
  password = Sys.getenv("TRAFFIC_PWD")
)

Traffic Stops

88 tables in the traffic database! Lots of details on traffic stops across cities and highway patrols.

SHOW TABLES;
Displaying records 1 - 10
Tables_in_traffic
ar_little_rock_2020_04_01
az_gilbert_2020_04_01
az_mesa_2023_01_26
az_statewide_2020_04_01
ca_anaheim_2020_04_01
ca_bakersfield_2020_04_01
ca_long_beach_2020_04_01
ca_los_angeles_2020_04_01
ca_oakland_2020_04_01
ca_san_bernardino_2020_04_01

LAPD - glance through tables

SELECT * FROM ca_los_angeles_2020_04_01
LIMIT 0, 10;
Displaying records 1 - 10
raw_row_number date time district region subject_race subject_sex officer_id_hash type raw_descent_description
5692933 13:59:00 0753 WILSHIRE hispanic male 15ecd81b00 pedestrian HISPANIC
240731 2010-01-01 00:05:00 665 WEST TRAFFIC other male b707de41e0 vehicular OTHER
240592|240593 2010-01-01 00:10:00 1258 SEVENTY-SEVENTH hispanic male b49ae7078c pedestrian HISPANIC
241116 2010-01-01 00:10:00 1635 FOOTHILL hispanic male e70ce46248 vehicular HISPANIC
240681 2010-01-01 00:15:00 882 WEST LA other male 8587bd743c vehicular OTHER
240602 2010-01-01 00:20:00 559 HARBOR hispanic male 601bf1fdff vehicular HISPANIC
240594 2010-01-01 00:20:00 667 HOLLYWOOD hispanic male fb2e6d8ab3 pedestrian HISPANIC
241434 2010-01-01 00:25:00 1658 FOOTHILL hispanic female 5100c8c759 vehicular HISPANIC
240582 2010-01-01 00:30:00 1041 WEST VALLEY black female d54d90a4d0 vehicular BLACK
240642 2010-01-01 00:30:00 1162 NORTH EAST hispanic male 5d6fbef1d3 pedestrian HISPANIC

DESCRIBE tables

Still using a {sql} chunk. The DESCRIBE command shows the 10 field names (variables) in the ca_los_angeles_2020_04_01 table. Some of the variables are characters (text) and some are date or time.

DESCRIBE ca_los_angeles_2020_04_01;
Displaying records 1 - 10
Field Type Null Key Default Extra
raw_row_number text YES
date date YES
time time YES
district text YES
region text YES
subject_race text YES
subject_sex text YES
officer_id_hash text YES
type text YES
raw_descent_description text YES

Parts of a SQL table

  • Field is the name of the variable
  • Type is the type of the variable (numeric, character, etc.)
  • Null indicates whether or not NULL values are acceptable. NULL values are never allowed for a Primary Key.
  • Key designates whether the variable is a Primary Key. PRI is a primary key, all unique, no NULL values. UNI is a unique key, but NULL values are allowed. MUL is a non-unique (“Multiple”) key, allowing for duplicates, can speed up queries.
  • Default provides the default value of the variable if no value is provided when a row is added to the database.

SQL clauses

Order of clauses

Queries in SQL start with the SELECT keyword and consist of several clauses, which must be written in the following order:

  • SELECT
  • FROM
  • JOIN
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY
  • LIMIT

The clauses are similar to data wrangling verbs in R, but the order in SQL is super important!

SELECT … FROM

  • start with a SELECT, include a corresponding FROM
  • columns may be specified or * returns every column

The shortest SQL query is the following SELECT command. Do not run this command!!! The ca_los_angeles_2020_04_01 table has 5 million rows, and we do not want to look at them (print to screen) simultaneously.

DO NOT RUN:  SELECT * FROM ca_los_angeles_2020_04_01;

SELECT … FROM

Much better for big tables:

SELECT * FROM ca_los_angeles_2020_04_01 LIMIT 0, 10;
Displaying records 1 - 10
raw_row_number date time district region subject_race subject_sex officer_id_hash type raw_descent_description
5692933 13:59:00 0753 WILSHIRE hispanic male 15ecd81b00 pedestrian HISPANIC
240731 2010-01-01 00:05:00 665 WEST TRAFFIC other male b707de41e0 vehicular OTHER
240592|240593 2010-01-01 00:10:00 1258 SEVENTY-SEVENTH hispanic male b49ae7078c pedestrian HISPANIC
241116 2010-01-01 00:10:00 1635 FOOTHILL hispanic male e70ce46248 vehicular HISPANIC
240681 2010-01-01 00:15:00 882 WEST LA other male 8587bd743c vehicular OTHER
240602 2010-01-01 00:20:00 559 HARBOR hispanic male 601bf1fdff vehicular HISPANIC
240594 2010-01-01 00:20:00 667 HOLLYWOOD hispanic male fb2e6d8ab3 pedestrian HISPANIC
241434 2010-01-01 00:25:00 1658 FOOTHILL hispanic female 5100c8c759 vehicular HISPANIC
240582 2010-01-01 00:30:00 1041 WEST VALLEY black female d54d90a4d0 vehicular BLACK
240642 2010-01-01 00:30:00 1162 NORTH EAST hispanic male 5d6fbef1d3 pedestrian HISPANIC

SELECT … FROM

How do we know how many traffic stops are in the database? Two different ways of counting the rows (that produce the same number):

SELECT COUNT(*), SUM(1) FROM ca_los_angeles_2020_04_01 LIMIT 0, 10;
1 records
COUNT(*) SUM(1)
5418402 5418402

More than 5.4 million LAPD traffic stops.

SELECT … FROM

Let’s look at New Orleans, LA

DESCRIBE la_new_orleans_2020_04_01;
Displaying records 1 - 10
Field Type Null Key Default Extra
raw_row_number text YES
date date YES
time time YES
location text YES
lat double YES
lng double YES
district double YES
zone text YES
subject_age bigint(20) YES
subject_race text YES

SELECT … FROM

How old was the car being driven?

Unfortunately, date is saved as a date variable, and vehicle_year is saved as a double. Fortunately, we can convert vehicle_year to a date variable.

SELECT
      date, vehicle_year,
      STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
   FROM la_new_orleans_2020_04_01
   LIMIT 0, 10;
Displaying records 1 - 10
date vehicle_year vehicle_date
2010-01-01 2005 2005-01-01
2010-01-01 2005 2005-01-01
2010-01-01 2005 2005-01-01
2010-01-01 2003 2003-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01

SELECT … FROM

Why can’t we find the difference between the two time variables?

SELECT
      date, vehicle_year,
      STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
      TIMESTAMPDIFF(YEAR, vehicle_date, date) AS car_age
   FROM la_new_orleans_2020_04_01
   LIMIT 0, 10;
Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'TIMESTAMPDIFF(YEAR, vehicle_date, date) AS car_age
   FROM la_new_orleans_2020_0' at line 4 [1064]

SELECT … FROM

Solution 1: two layers of SELECT

  • first SELECT (i.e., inside) layer creates the new variables
  • second SELECT (i.e., outside) layer subtracts the two times
SELECT 
   date,
   vehicle_date, 
   TIMESTAMPDIFF(YEAR, vehicle_date, date) AS car_age 
FROM (
   SELECT
      date,
      STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
   FROM la_new_orleans_2020_04_01)
   AS subquery_table
LIMIT 0, 10;
Displaying records 1 - 10
date vehicle_date car_age
2010-01-01 2005-01-01 5
2010-01-01 2005-01-01 5
2010-01-01 2005-01-01 5
2010-01-01 2003-01-01 7
2010-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01

SELECT … FROM

Solution 2: apply the STR_TO_DATE() function inside the TIMESTAMPDIFF() function

SELECT 
   date,
   vehicle_year,
   TIMESTAMPDIFF(YEAR, 
                 STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d'),
                 date) AS car_age 
FROM la_new_orleans_2020_04_01
LIMIT 0, 10;
Displaying records 1 - 10
date vehicle_year car_age
2010-01-01 2005 5
2010-01-01 2005 5
2010-01-01 2005 5
2010-01-01 2003 7
2010-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01
2010-01-01

Important note:

There is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set.

date and vehicle_year are columns in the original table - they are written to disk on the SQL server.

car_age exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) in your computer’s memory and is not written to disk.

SELECT DISTINCT

Returns only unique rows.

SELECT DISTINCT raw_actions_taken
FROM la_new_orleans_2020_04_01
LIMIT 0, 200;
Displaying records 1 - 10
raw_actions_taken
Stop Results: Citation Issued;Subject Type: Driver
Stop Results: Verbal Warning;Search Occurred: Yes;Search Types: Vehicle;Search Types: Pat-down;Search Types: Driver;Legal Basises: Consent to search;Legal Basises: Probable cause
Stop Results: Verbal Warning;Subject Type: Driver;Search Occurred: No;Evidence Seized: No
Stop Results: Verbal Warning;Subject Type: Driver
Stop Results: Verbal Warning;Subject Type: Driver;Search Occurred: No
Stop Results: No action taken;Search Occurred: No
Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: No
Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: No;Evidence Seized: No
Stop Results: No action taken;Search Occurred: No;Search Types: Pat-down
SELECT DISTINCT outcome
FROM la_new_orleans_2020_04_01
LIMIT 0, 10;
4 records
outcome
citation
warning
arrest
SELECT DISTINCT vehicle_make, vehicle_color
FROM la_new_orleans_2020_04_01
LIMIT 0, 10;
Displaying records 1 - 10
vehicle_make vehicle_color
DODGE BLACK
NISSAN BLUE
JEEP GRAY
GMC - GENERAL MOTORS CORP. BLUE
CHEVROLET YELLOW
OLDSMOBILE SILVER
TOYOTA GREEN
PONTIAC BRONZE
CHEVROLET WHITE

(note that we haven’t yet arranged the distinct values!)

WHERE

The WHERE clause is analogous to the filter() function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter() function. WHERE operates on the original data in the table and HAVING operates on the result set.

WHERE

What was the time of day for those traffic stops search_conducted was true and vehicle_make was Subaru?

Note that in SQL the equality logical is = and in R the equality logical is ==. Note also that the WHERE command is case-insensitive!

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE search_conducted = 1
   AND vehicle_make = "Subaru"
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
21:05:00 1 1996 SUBARU
21:05:00 1 1996 SUBARU
03:27:00 1 1998 SUBARU
19:28:00 1 2012 SUBARU
21:40:00 1 1998 SUBARU
10:38:00 1 2014 SUBARU
05:01:00 1 1998 SUBARU
22:00:00 1 1998 SUBARU
13:46:00 1 2007 SUBARU
22:50:00 1 1999 SUBARU

WHERE

BETWEEN can be used to specify a range of values for a numeric value. BETWEEN is inclusive.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
   AND vehicle_make = "Subaru"
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
20:16:00 0 1991 SUBARU
20:54:00 0 1998 SUBARU
17:09:00 0 1999 SUBARU
13:20:00 0 1996 SUBARU
20:44:00 0 1998 SUBARU
21:05:00 1 1996 SUBARU
21:05:00 1 1996 SUBARU
18:52:00 0 1995 SUBARU
17:36:00 0 1996 SUBARU
08:40:00 0 1999 SUBARU

WHERE

IN is similar to the dplyr %in% function which specifies distinct values for the variable.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make IN ("Subaru", "Dodge")
   AND search_conducted = 1
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
03:45:00 1 2010 DODGE
20:29:00 1 2012 DODGE
21:12:00 1 2011 DODGE
21:12:00 1 2011 DODGE
02:00:00 1 2014 DODGE
20:46:00 1 2008 DODGE
20:46:00 1 2008 DODGE
04:47:00 1 2014 DODGE
19:16:00 1 1995 DODGE
09:43:00 1 2009 DODGE

WHERE

AND takes precedent over OR in the order of operations, when there are no parentheses.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999 OR
      vehicle_make IN ("Subaru", "Dodge") AND 
      search_conducted = 1
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
16:03:00 0 1997 GMC - GENERAL MOTORS CORP.
16:03:00 0 1997 GMC - GENERAL MOTORS CORP.
07:54:00 0 1995 TOYOTA
21:01:00 0 1997 GMC - GENERAL MOTORS CORP.
21:06:00 0 1996 DODGE
21:16:00 0 1996 DODGE
10:20:00 0 1994 MERCEDES-BENZ
10:35:00 0 1998 PONTIAC
12:46:00 0 1996 ACURA
15:17:00 0 1997 OLDSMOBILE

same as:

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999 OR
      (vehicle_make IN ("Subaru", "Dodge") AND 
      search_conducted = 1)
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
16:03:00 0 1997 GMC - GENERAL MOTORS CORP.
16:03:00 0 1997 GMC - GENERAL MOTORS CORP.
07:54:00 0 1995 TOYOTA
21:01:00 0 1997 GMC - GENERAL MOTORS CORP.
21:06:00 0 1996 DODGE
21:16:00 0 1996 DODGE
10:20:00 0 1994 MERCEDES-BENZ
10:35:00 0 1998 PONTIAC
12:46:00 0 1996 ACURA
15:17:00 0 1997 OLDSMOBILE

WHERE

Parentheses take precedent over AND and OR.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE (vehicle_year BETWEEN 1970 and 1999 OR
      vehicle_make IN ("Subaru", "Dodge") ) AND 
      search_conducted = 1
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
03:45:00 1 2010 DODGE
07:11:00 1 1999 CADILLAC
20:21:00 1 1999 MERCURY
20:29:00 1 2012 DODGE
20:31:00 1 1993 INFINITY
21:12:00 1 2011 DODGE
21:12:00 1 2011 DODGE
02:00:00 1 2014 DODGE
14:31:00 1 1994 JEEP
04:39:00 1 1995 CHEVROLET

WHERE

IS NULL not = NULL (because NULL indicates unknown). (Just like how we need to use is.na() in R!)

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make IS NULL
LIMIT 0, 10;
Displaying records 1 - 10
time search_conducted vehicle_year vehicle_make
02:06:00 0
02:06:00 0
02:06:00 0
02:06:00 0
02:06:00 0
02:06:00 0
02:06:00 0
08:02:00 0
20:05:00 0
20:05:00 0
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make = NULL
LIMIT 0, 10;

A NULL example

The logic of NULL:1

  • If you do anything with NULL, you’ll just get NULL. For instance if \(x\) is NULL, then \(x > 3\), \(1 = x\), and \(x + 4\) all evaluate to NULL. Even \(x =\) NULL evaluates to NULL! if you want to check whether \(x\) is NULL, use x IS NULL or x IS NOT NULL.
  • NULL short-circuits with boolean operators. That means a boolean expression involving NULL will evaluate to:
    • TRUE, if it’d evaluate to TRUE regardless of whether the NULL value is really TRUE or FALSE.
    • FALSE, if it’d evaluate to FALSE regardless of whether the NULL value is really TRUE or FALSE.
    • Or NULL, if it depends on the NULL value.

A NULL example

Consider the following table and SQL query to evaluate WHERE age <= 20 OR num_dogs = 3:

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable;
4 records
name age num_dogs
Ace 20 4
Ada 3
Ben
Cho 27

A NULL example

Where does the WHERE clause do? It tells us that we only want to keep the rows satisfying the age <= 20 OR num_dogs = 3. Let’s consider each row one at a time:

  • For Ace, age <= 20 evaluates to TRUE so the claim is satisfied.
  • For Ada, age <= 20 evaluates to NULL but num_dogs = 3 evaluates to TRUE so the claim is satisfied.
  • For Ben, age <= 20 evaluates to NULL and num_dogs = 3 evaluates to NULL so the overall expression is NULL which has a FALSE value.
  • For Cho, age <= 20 evaluates to FALSE and num_dogs = 3 evaluates to NULL so the overall expression evaluates to NULL (because it depends on the value of the NULL).

Thus we keep only Ace and Ada.

A NULL example

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable
WHERE age <= 20 OR num_dogs = 3;
2 records
name age num_dogs
Ace 20 4
Ada 3

GROUP BY

The GROUP BY clause will direct SQL to carry out the query separately for each category in the grouped variable.

  • aggregate functions include COUNT(), SUM(), MAX(), MIN(), and AVG().
SELECT subject_race,
       COUNT(*) AS num_stops, 
       SUM(1) AS num_stops_also,
       SUM(2) AS double_stops
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
GROUP BY subject_race;
7 records
subject_race num_stops num_stops_also double_stops
2217 2217 4434
asian/pacific islander 315 315 630
black 31586 31586 63172
hispanic 2347 2347 4694
other 23 23 46
unknown 211 211 422
white 11475 11475 22950

GROUP BY

For those people whose cars are between 1970 and 1999, how old is the youngest and oldest person for each subject_race?

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
GROUP BY subject_race;
7 records
subject_race num_stops min_age max_age
2217
asian/pacific islander 315 17 78
black 31586 10 100
hispanic 2347 13 87
other 23 19 68
unknown 211 17 81
white 11475 12 99

GROUP BY

GROUP BY will work applied to multiple columns.

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday;
Displaying records 1 - 10
subject_race num_stops min_age max_age wday
1800 Friday
1590 Monday
1332 Saturday
1172 Sunday
1953 Thursday
1877 Tuesday
2006 Wednesday
asian/pacific islander 551 16 78 Friday
asian/pacific islander 484 13 78 Monday
asian/pacific islander 475 15 78 Saturday

ORDER BY

ORDER BY allows us to look at interesting aspects of the data by sorting the data.

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY min_age ASC;
Displaying records 1 - 10
subject_race num_stops min_age max_age wday
1332 Saturday
1172 Sunday
1877 Tuesday
2006 Wednesday
1953 Thursday
1800 Friday
1590 Monday
black 49945 10 110 Friday
black 42593 10 106 Saturday
black 39252 10 92 Sunday

WHAT?!?!! How are 10 year olds getting pulled over for traffic stops?

ORDER BY

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY max_age DESC;
Displaying records 1 - 10
subject_race num_stops min_age max_age wday
black 49945 10 110 Friday
black 56459 10 110 Tuesday
black 57153 10 110 Wednesday
black 54858 10 110 Thursday
hispanic 2007 11 110 Thursday
black 49557 10 110 Monday
white 18494 11 107 Friday
black 42593 10 106 Saturday
white 20764 11 105 Wednesday
white 15778 11 104 Sunday

So many people over 100 years old… the data seem fishy!

ORDER BY

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY wday, subject_race;
Displaying records 1 - 10
subject_race num_stops min_age max_age wday
black 2 17 24
white 2 23 57
1800 Friday
asian/pacific islander 551 16 78 Friday
black 49945 10 110 Friday
hispanic 1851 11 83 Friday
other 61 19 66 Friday
unknown 510 17 75 Friday
white 18494 11 107 Friday
1590 Monday

ORDER BY

Note that both GROUP BY and ORDER BY evaluate the data after it has been retrieved. Therefore, the functions operate on the results set, not the original rows of the data.

We are able to GROUP BY and ORDER BY on the new variables we had created, wday.

HAVING

Recall that WHERE acts only on the original data. If we are interested in traffic stops that happened on Friday, we need to use the derived variable wday instead of the raw variable date. Fortunately, HAVING works on the results set.

HAVING

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
HAVING wday = 'Friday';
7 records
subject_race num_stops min_age max_age wday
1800 Friday
asian/pacific islander 551 16 78 Friday
black 49945 10 110 Friday
hispanic 1851 11 83 Friday
other 61 19 66 Friday
unknown 510 17 75 Friday
white 18494 11 107 Friday

HAVING

While it worked out quite well for us that HAVING was able to filter the data based on the results set, the use of HAVING was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE, the query will be much faster and more efficient.

Note: HAVING requires a GROUP BY clause. And the variable(s) used in HAVING must also be part of the GROUP BY clause.

LIMIT

LIMIT truncates the query to specified rows. The first number is the offset (i.e., the number of rows to skip), the second number is the (maximum) number of rows to return. Here, we return rows 154219 through 154228.

The first number is optional.

SELECT * FROM la_new_orleans_2020_04_01 LIMIT 154218, 10;
Displaying records 1 - 10
raw_row_number date time location lat lng district zone subject_age subject_race subject_sex officer_assignment type arrest_made citation_issued warning_issued outcome contraband_found contraband_drugs contraband_weapons frisk_performed search_conducted search_person search_vehicle search_basis reason_for_stop vehicle_color vehicle_make vehicle_model vehicle_year raw_actions_taken raw_subject_race
442170 2017-04-20 13:15:00 Gentilly Blvd & Franklin Ave 30.0 -90.0 3 Y 31 black male 3rd District vehicular 0 0 1 warning 0 0 0 0 TRAFFIC VIOLATION BLACK DODGE CHARGER 2008 Stop Results: Verbal Warning;Subject Type: Driver;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No BLACK
442169 2017-04-20 13:15:00 Gentilly Blvd & Franklin Ave 30.0 -90.0 3 Y 32 black male 3rd District vehicular 1 0 0 arrest 0 0 0 0 TRAFFIC VIOLATION BLACK DODGE CHARGER 2008 Stop Results: Physical Arrest;Subject Type: Passenger;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No BLACK
442094 2017-04-20 01:16:00 Broadway & Freret St 29.9 -90.1 2 H 41 black male 2nd District pedestrian 0 0 0 0 0 0 1 1 1 1 consent SUSPECT PERSON Stop Results: No action taken;Subject Type: Pedestrian;Search Occurred: Yes;Search Types: Pat-down;Evidence Seized: No;Legal Basises: Consent to search;Consent To Search: Yes;Exit Vehicle: No BLACK
442171 2017-04-20 13:17:00 Franklin Ave & Gentilly Blvd 30.0 -90.0 3 Y 27 black male 3rd District vehicular 0 1 0 citation 0 0 0 0 TRAFFIC VIOLATION GRAY PONTIAC GRAND PRIX 2008 Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: Yes BLACK
442093 2017-04-20 01:22:00 S Claiborne Ave & Thalia St 29.9 -90.1 6 J 28 white female 6th District vehicular 0 0 1 warning 0 0 0 0 TRAFFIC VIOLATION BLUE JEEP GRAND CHEROKEE 2002 Stop Results: Verbal Warning;Subject Type: Driver;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No WHITE
442162 2017-04-20 13:27:00 007XX Saint Charles Ave 8 H 30 white male 8th District vehicular 0 1 0 citation 0 0 0 0 TRAFFIC VIOLATION YELLOW DODGE GRAND CARAVAN 2013 Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No WHITE
442160 2017-04-20 13:27:00 Tulane Ave & S Jefferson Davis Pkwy 30.0 -90.1 1 L 22 black female 1st District vehicular 0 1 1 citation 0 0 0 0 TRAFFIC VIOLATION WHITE CHEVROLET OTHER 2007 Stop Results: Citation Issued;Stop Results: Verbal Warning;Subject Type: Driver;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No BLACK
442159 2017-04-20 13:31:00 Diana St & Wagner St 29.9 -90.0 4 H 20 black female 4th District vehicular 0 0 1 warning 0 0 0 0 TRAFFIC VIOLATION WHITE NISSAN ALTIMA 2003 Stop Results: Verbal Warning;Subject Type: Driver;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No BLACK
442161 2017-04-20 13:35:00 S Jefferson Davis Pkwy & Tulane Ave 30.0 -90.1 1 L 28 black male 1st District vehicular 0 1 0 citation 0 0 0 0 1 1 1 consent TRAFFIC VIOLATION BLACK HONDA ACCORD 1991 Stop Results: Citation Issued;Subject Type: Driver;Search Occurred: Yes;Search Types: Vehicle;Evidence Seized: No;Legal Basises: Consent to search;Consent To Search: Yes;Exit Vehicle: Yes BLACK
442650 2017-04-20 13:37:00 Saint Louis St & N Peters St 30.0 -90.1 8 D 57 black male 8th District 0 1 0 citation 0 0 0 0 CRIMINAL VIOLATION Stop Results: Citation Issued;Subject Type: Pedestrian;Search Occurred: No;Evidence Seized: No;Consent To Search: No;Exit Vehicle: No BLACK

Saving SQL queries as R objects

If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations.

  • use #| output.var: "name_of_variable" inside the {sql} chunk.
  • name_of_variable will then be available to be used in the R environment.
```{sql}
#| connection: con_traffic
#| output.var: "new_table"

SELECT *, DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
LIMIT 0, 1000;
```

Saving SQL queries as R objects

```{r}
new_table |>
  drop_na(wday) |>
  ggplot(aes(x = vehicle_year, y = subject_age, color = wday)) + 
  geom_point() +
  xlim(1985, 2025)
```

Good practice

Always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_traffic, shutdown = TRUE)