February 24 + 26, 2025
A regular expression … is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.
Just to scratch the surface, here are a few special characters that cannot be typed directly into a string. Instead, they are escaped with a backslash, \.
\' : single quote.
\" : double quote.
\n : new line.
\r : carriage return.
\t : tab character.
Quantifiers specify how many repetitions of the pattern to match.
* : matches at least 0 times.
+ : matches at least 1 time.
? : matches at most 1 time.
{n} : matches exactly n times.
{n,} : matches at least n times.
{n,m} : matches between n and m times.
[1] "ab" "acb" "accb" "acccb" "accccb"
[1] "acb" "accb" "acccb" "accccb"
[1] "a"
[1] "ab" "acb"
[1] "accb"
[1] "accb" "acccb" "accccb"
[1] "accb" "acccb"
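The calls that produced the output above are not shown. As a minimal sketch, assuming a test vector like the one below (the vector and the patterns are assumptions chosen to illustrate each quantifier), the quantifiers behave as follows:

```r
library(stringr)

# Hypothetical test vector: "a" alone, then "a", zero or more "c", and "b"
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

zero_or_more <- str_subset(strings, "ac*b")     # "ab" "acb" "accb" "acccb" "accccb"
one_or_more  <- str_subset(strings, "ac+b")     # "acb" "accb" "acccb" "accccb"
at_most_one  <- str_subset(strings, "ac?b")     # "ab" "acb"
exactly_two  <- str_subset(strings, "ac{2}b")   # "accb"
two_or_more  <- str_subset(strings, "ac{2,}b")  # "accb" "acccb" "accccb"
two_to_three <- str_subset(strings, "ac{2,3}b") # "accb" "acccb"
```

Each quantifier applies only to the item immediately before it, here the single character c.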
^ : matches the start of the string.
$ : matches the end of the string.
\b : matches the boundary of a word. Don’t confuse it with ^ and $, which mark the edges of a string.
library(stringr)
strings <- c("apple", "applet", "pineapple", "apple pie",
             "I love apple pie")
str_subset(strings, "\\bapple\\b")
[1] "apple" "apple pie" "I love apple pie"
[1] "apple"
[1] "apple pie" "I love apple pie"
[1] "apple pie"
[1] "apple" NA NA "apple" "apple"
[1] "apple" NA NA NA NA
[1] NA NA NA "apple pie" "apple pie"
[1] NA NA NA "apple pie" NA
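Only the first call is shown alongside the output above. The sketch below contrasts word boundaries with string anchors on the same vector; the patterns other than "\\bapple\\b" are assumptions chosen for illustration, not necessarily the ones used originally.

```r
library(stringr)

strings <- c("apple", "applet", "pineapple", "apple pie",
             "I love apple pie")

# \b: "apple" must be a whole word, anywhere in the string
whole_word <- str_subset(strings, "\\bapple\\b")
# "apple" "apple pie" "I love apple pie"

# ^ and $: the whole string must be exactly "apple"
whole_string <- str_subset(strings, "^apple$")
# "apple"

# str_extract() returns NA where there is no match
extracted <- str_extract(strings, "^apple pie$")
# NA NA NA "apple pie" NA
```

Note that str_subset() drops the non-matching strings, while str_extract() keeps one result per input string.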
. : matches any single character.
[...] : a character list; matches any one of the characters inside the square brackets. A - inside the brackets specifies a range of characters.
[^...] : an inverted character list, similar to [...], but matches any character except those inside the square brackets.
\ : suppresses the special meaning of metacharacters in a regular expression, i.e. $ * + . ? [ ] ^ { } | ( ) \. Since \ itself needs to be escaped in R, we need to escape metacharacters with a double backslash, like \\$.
| : an “or” operator; matches patterns on either side of the |.
(...) : grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string.
(ab|cde) or ab|cde match either the string ab or the string cde. However, ab | cde matches ab followed by a space, or a space followed by cde (and matches neither ab nor cde on its own), because the whitespace on either side of | is part of the pattern.
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "a|b")
str_subset(strings, "ab.")
str_subset(strings, "ab[c-e]")
str_subset(strings, "ab[^c]")
str_subset(strings, "^ab")
str_subset(strings, "\\^ab")
str_subset(strings, "abc|abd")
str_subset(strings, "ab|c")
str_subset(strings, "ab | c")
str_subset(strings, "(ab)|c")
str_subset(strings, "(ab|c)")
str_subset(strings, "a(b|c)")
str_subset(strings, "a[b|c]")
str_extract(strings, "a[b|c]")
[1] "abc" "abd" "abe" "ab 12"
[1] "abc" "abd" "abe"
[1] "abd" "abe" "ab 12"
[1] "ab" "abc" "abd" "abe" "ab 12"
[1] "^ab"
[1] "abc" "abd"
[1] "^ab" "ab" "abc" "abd" "abe" "ab 12"
[1] "ab 12"
[1] "^ab" "ab" "abc" "abd" "abe" "ab 12"
[1] "^ab" "ab" "abc" "abd" "abe" "ab 12"
[1] "^ab" "ab" "abc" "abd" "abe" "ab 12"
[1] "^ab" "ab" "abc" "abd" "abe" "ab 12" "a|b"
[1] "ab" "ab" "ab" "ab" "ab" "ab" "a|"
Character classes allow specifying entire classes of characters, such as numbers, letters, etc. There are two flavors of character classes: one uses [: and :] around a predefined name inside square brackets, and the other uses \ and a special character. They are sometimes interchangeable. The bracketed equivalents below hold for ASCII text.
[:digit:] or \d : digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
\D : non-digits, equivalent to [^0-9].
[:lower:] : lower-case letters, equivalent to [a-z].
[:upper:] : upper-case letters, equivalent to [A-Z].
[:alpha:] : alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z].
[:alnum:] : alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].
\w : word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_] (letter, number, or underscore).
\W : not word, equivalent to [^A-Za-z0-9_].
[:blank:] : blank characters, i.e. space and tab.
[:space:] : space characters: tab, new line, vertical tab, form feed, carriage return, space.
\s : whitespace.
\S : not whitespace.
[:punct:] : punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:graph:] : graphical (human readable) characters, equivalent to [[:alnum:][:punct:]].
[:print:] : printable characters, equivalent to [[:alnum:][:punct:] ] (graphical characters plus the space).
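As a small illustration of the two flavors, the sketch below extracts digits and letters from a made-up string (the string is an assumption, not from the original slides). It also shows why a range such as [A-z] is not a safe shorthand for letters: in ASCII order that range also covers the characters [ \ ] ^ _ and `.

```r
library(stringr)

x <- "Room 12B, floor 3"  # hypothetical example string

# POSIX-style names go inside an outer pair of square brackets
digits_posix <- str_extract_all(x, "[[:digit:]]+")[[1]]  # "12" "3"
# Backslash classes need a doubled backslash in an R string
digits_slash <- str_extract_all(x, "\\d+")[[1]]          # "12" "3"

letter_runs <- str_extract_all(x, "[[:alpha:]]+")[[1]]   # "Room" "B" "floor"

# [A-z] spans more than the letters: it also matches an underscore
underscore_in_A_z   <- str_detect("_", "[A-z]")     # TRUE
underscore_in_AZ_az <- str_detect("_", "[A-Za-z]")  # FALSE
```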
. matches any single character except a newline \n.
. does match whitespace (e.g., a space or tab).
\s matches any whitespace, including spaces, tabs, new lines, and carriage returns.
[ \t] matches spaces and tabs only (not new lines or carriage returns).
[^\s] matches any character that is not whitespace (i.e., not a space, tab, or new line).
[^\s] and [\S] are functionally equivalent.
[\s\S] matches any character, including newlines and tabs.
\w matches any single word character (including letters, digits, and the underscore character _).
\B : matches the empty string provided it is not at an edge of a word.
More examples for practice!
Match only the word meter in “The cemetery is 1 meter from the stop sign.”
Also match Meter in “The cemetery is 1 Meter from the stop sign.”
Also match Meter and meTer …
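One possible set of answers to the practice questions above, offered as a sketch rather than the intended solution. The difficulty is that "cemetery" contains the letters "meter", so a bare pattern finds two matches.

```r
library(stringr)

s1 <- "The cemetery is 1 meter from the stop sign."
s2 <- "The cemetery is 1 Meter from the stop sign."

# A bare pattern also matches inside "cemetery"
n_bare <- str_count(s1, "meter")                       # 2

# Word boundaries restrict the match to the standalone word
only_word <- str_extract_all(s1, "\\bmeter\\b")[[1]]   # "meter"

# A character list accepts either case for the first letter
either_case <- str_extract(s2, "\\b[Mm]eter\\b")       # "Meter"

# regex(ignore_case = TRUE) accepts any mix of cases
any_case <- str_extract("1 meTer", regex("\\bmeter\\b", ignore_case = TRUE))  # "meTer"
```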
^(1[012]|[1-9]):[0-5][0-9] (am|pm)$
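The pattern above describes a 12-hour clock time: an hour from 1 to 12, a colon, minutes from 00 to 59, a space, and a lowercase am or pm. A quick way to check that reading is to test it against a few candidate strings (the candidates are made up for illustration):

```r
library(stringr)

time_pattern <- "^(1[012]|[1-9]):[0-5][0-9] (am|pm)$"

# Hypothetical inputs: two valid times, then four that should fail
times <- c("1:30 pm", "12:59 am", "13:00 pm", "0:15 am", "9:60 am", "10:05 PM")

is_time <- str_detect(times, time_pattern)
# TRUE TRUE FALSE FALSE FALSE FALSE
```

The anchors ^ and $ force the whole string to match, and the last candidate fails only because the pattern is case sensitive.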
The “or” operator, |, has the lowest precedence and parentheses have the highest precedence, which means that parentheses get evaluated before “or”.
What is the difference between \bMary|Jane|Sue\b and \b(Mary|Jane|Sue)\b?
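To see the effect of precedence on these two patterns, the sketch below runs both against a few made-up strings (the test strings are assumptions). Without parentheses the alternatives are \bMary, Jane, and Sue\b, so each word boundary applies to only one name.

```r
library(stringr)

# Hypothetical test strings
names <- c("Mary", "Maryland", "Janet", "BlueSue", "Sue Ellen")

# Parsed as: (\bMary) or (Jane) or (Sue\b)
ungrouped <- str_detect(names, "\\bMary|Jane|Sue\\b")
# TRUE TRUE TRUE TRUE TRUE

# Both boundaries apply to whichever name matches
grouped <- str_detect(names, "\\b(Mary|Jane|Sue)\\b")
# TRUE FALSE FALSE FALSE TRUE
```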
str_*() functions with regular expressions
A lookaround specifies a place in the regular expression that will anchor the string you’d like to match.
Figure 1: Image credit: Stefan Judis https://www.stefanjudis.com/blog/a-regular-expression-lookahead-lookbehind-cheat-sheet/
Data scraped from the wiki site for the TV series, Taskmaster.
Figure 2: Taskmaster Wiki https://taskmaster.fandom.com/wiki/Series_11
Goal: to scrape the Taskmaster wiki into a data frame including task, description, episode, episode name, air date, contestant, score, and series.
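library(rvest)      # read_html(), html_element(), html_table()
library(tidyverse)  # mutate(), fill(), filter(), pivot_longer()

results <- read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_element(".tmtable") |>
  html_table() |>
  # episode header rows have a Task value that starts with "Episode"
  mutate(episode = ifelse(startsWith(Task, "Episode"), Task, NA)) |>
  fill(episode, .direction = "down") |>
  # drop the episode header rows and the total rows
  filter(!startsWith(Task, "Episode"),
         !(Task %in% c("Total", "Grand Total"))) |>
  # one row per task and contestant
  pivot_longer(cols = -c(Task, Description, episode),
               names_to = "contestant",
               values_to = "score") |>
  mutate(series = 11)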
# A tibble: 10 × 6
   Task  Description                   episode contestant score series
   <chr> <chr>                         <chr>   <chr>      <chr>  <dbl>
 1 1     Prize: Best thing you can ca… Episod… Charlotte… 1         11
 2 1     Prize: Best thing you can ca… Episod… Jamali Ma… 2         11
 3 1     Prize: Best thing you can ca… Episod… Lee Mack   4         11
 4 1     Prize: Best thing you can ca… Episod… Mike Wozn… 5         11
 5 1     Prize: Best thing you can ca… Episod… Sarah Ken… 3         11
 6 2     Do the most impressive thing… Episod… Charlotte… 2         11
 7 2     Do the most impressive thing… Episod… Jamali Ma… 3[1]      11
 8 2     Do the most impressive thing… Episod… Lee Mack   3         11
 9 2     Do the most impressive thing… Episod… Mike Wozn… 5         11
10 2     Do the most impressive thing… Episod… Sarah Ken… 4         11
Task Description episode contestant score series
1 Prize: Best thing… Episode 1… Charlotte… 1 11
1 Prize: Best thing… Episode 1… Jamali Ma… 2 11
1 Prize: Best thing… Episode 1… Lee Mack 4 11
1 Prize: Best thing… Episode 1… Mike Wozn… 5 11
1 Prize: Best thing… Episode 1… Sarah Ken… 3 11
2 Do the most… Episode 1… Charlotte… 2 11
2 Do the most… Episode 1… Jamali Ma… 3[1] 11
2 Do the most… Episode 1… Lee Mack 3 11
2 Do the most… Episode 1… Mike Wozn… 5 11
2 Do the most… Episode 1… Sarah Ken… 4 11
Currently, the episode column contains entries like
score column
score
   –    ✔    ✘    0    1    2    3 3[1] 3[2] 3[3]    4 4[2] 4[3]    5   DQ
   7    1    1   11   37   42   47    1    3    1   49    1    1   55   13
How should the scores be stored? What is the cleaning task?
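One possible answer, sketched under assumptions: keep the leading number, drop footnote markers such as [1], and treat the non-numeric entries (–, ✔, ✘, DQ) as missing. Whether DQ and the symbols should instead be stored in a separate column is a modelling decision this sketch does not settle.

```r
library(stringr)

# A sample of the distinct score values listed in the table above
scores <- c("–", "✔", "✘", "0", "3[1]", "4[2]", "5", "DQ")

# Keep only digits at the start of the entry; anything else becomes NA
score_chr <- str_extract(scores, "^\\d+")
# NA NA NA "0" "3" "4" "5" NA

score_int <- as.integer(score_chr)
# NA NA NA 0 3 4 5 NA
```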
Figure 3: Taskmaster Wiki https://taskmaster.fandom.com/wiki/Series_11
Suppose we have the following string:
And we want to extract just the number “3”:
What if we don’t know which number to extract?
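The code for these steps is not shown. A minimal sketch, assuming the string is "My cat is 3 years old" (inferred from the output that appears later in these slides):

```r
library(stringr)

string <- "My cat is 3 years old"  # assumed example string

# If we know the number, we can match it literally
literal_match <- str_extract(string, "3")    # "3"

# If we do not, \d matches any single digit
digit_match <- str_extract(string, "\\d")    # "3"
```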
str_extract()
str_extract() is an R function in the stringr package which finds the text in a string that matches a regular expression.
str_extract() returns the first match; str_extract_all() allows more than one match.
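The difference between the two functions shows up as soon as a string contains more than one match. A short sketch with a made-up string (an assumption for illustration):

```r
library(stringr)

x <- "My cat is 3 years old and my dog is 10 years old"  # hypothetical

# First match only, returned as a character vector
first_number <- str_extract(x, "\\d+")       # "3"

# Every match, returned as a list with one element per input string
all_numbers <- str_extract_all(x, "\\d+")    # list(c("3", "10"))
```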
What if I want to extract a number?
What will the result be for the following code?
The + symbol in a regular expression means “repeated one or more times”.
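The effect of + is easiest to see on a number with more than one digit. A minimal sketch, assuming the string "My dog is 10 years old" (inferred from the output shown later in these slides):

```r
library(stringr)

string <- "My dog is 10 years old"  # assumed example string

# \d alone matches exactly one digit, so only the first digit is returned
one_digit <- str_extract(string, "\\d")     # "1"

# \d+ matches a run of one or more digits
whole_number <- str_extract(string, "\\d+") # "10"
```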
What if we have multiple instances across multiple strings? We need to be careful working with lists (instead of vectors).
[1] "M" "M"
[[1]]
[1] "M" "y" "c" "a" "t" "i" "s" "3" "y" "e" "a" "r" "s" "o" "l" "d"
[[2]]
[1] "M" "y" "d" "o" "g" "i" "s" "1" "0" "y" "e" "a" "r" "s" "o" "l" "d"
[1] "My" "My"
[[1]]
[1] "My" "cat" "is" "3" "years" "old"
[[2]]
[1] "My" "dog" "is" "10" "years" "old"
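The calls behind the output above are not shown; the pattern \w+ used below is an assumption that reproduces the word-level output. The sketch focuses on handling the list that str_extract_all() returns.

```r
library(stringr)

strings <- c("My cat is 3 years old", "My dog is 10 years old")

# One list element per input string, each holding all of its matches
words <- str_extract_all(strings, "\\w+")
n_words <- lengths(words)                    # 6 6

# Pull the fourth word out of each element to get back to a vector
ages <- vapply(words, function(w) w[4], character(1))  # "3" "10"

# simplify = TRUE returns a character matrix instead of a list
number_matrix <- str_extract_all(strings, "\\d+", simplify = TRUE)
```

A list is needed because different strings can have different numbers of matches, which a plain vector cannot represent.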
Currently, the episode column contains entries like:
How would I extract just the episode number?
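A minimal sketch of one way to do this, using the example entry that appears elsewhere in these slides. Because str_extract() returns the first match, the digits of the episode number are found before the digits in the date.

```r
library(stringr)

episode <- "Episode 2: The pie whisperer. (4 August 2015)"

# First run of digits in the string
episode_chr <- str_extract(episode, "\\d+")   # "2"

# Convert to a number if it will be used for sorting or arithmetic
episode_num <- as.integer(episode_chr)        # 2
```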
Currently, the episode column contains entries like:
How would I extract the episode name?
Goal: find a pattern to match: anything that starts with a : and ends with a .
Let’s break down that task into pieces.
How can we find the period at the end of the sentence? What does each of these lines of code return?
[1] "Episode 2: The pie whisperer. (4 August 2015)"
We use an escape character when we actually want to choose a period:
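The lines of code being asked about are not shown. The sketch below uses patterns that are assumptions, chosen to contrast an unescaped period with an escaped one:

```r
library(stringr)

episode <- "Episode 2: The pie whisperer. (4 August 2015)"

# An unescaped . matches any single character, so the first one is "E"
any_char <- str_extract(episode, ".")       # "E"

# .+ matches one or more of any character: the whole string
any_run <- str_extract(episode, ".+")

# \\. (a doubled backslash in R) matches a literal period
literal_period <- str_extract(episode, "\\.")  # "."
```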
Figure 4: Image credit: Stefan Judis https://www.stefanjudis.com/blog/a-regular-expression-lookahead-lookbehind-cheat-sheet/
(?<=y)x – positive lookbehind (matches ‘x’ when it is preceded by ‘y’)
[1] "The pie whisperer. (4 August 2015)"
x(?=y) – positive lookahead (matches ‘x’ when it is followed by ‘y’)
[1] "Episode 2: The pie whisperer"
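The calls that produced the two outputs above are not shown. The patterns in this sketch are assumptions that reproduce them, with each lookaround used on its own:

```r
library(stringr)

episode <- "Episode 2: The pie whisperer. (4 August 2015)"

# Lookbehind: everything after ": ", without including the ": " itself
after_colon <- str_extract(episode, "(?<=: ).+")
# "The pie whisperer. (4 August 2015)"

# Lookahead: everything before the period, without including the period
before_period <- str_extract(episode, ".+(?=\\.)")
# "Episode 2: The pie whisperer"
```

The text inside a lookaround must be present for the match to succeed, but it is not part of what is returned.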
Getting everything between the : and the .
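Combining a lookbehind and a lookahead in one pattern keeps only the text between the two markers. A minimal sketch on the example entry:

```r
library(stringr)

episode <- "Episode 2: The pie whisperer. (4 August 2015)"

# Preceded by ": " and followed by a literal period
episode_name <- str_extract(episode, "(?<=: ).+(?=\\.)")
# "The pie whisperer"
```

Because .+ is greedy, the match runs up to the last period in the string; an episode name that itself contained a period followed by more text would need a closer look.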
I want to extract just the air date. What pattern do I want to match?
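Two candidate patterns, sketched on the example entry. The first uses the surrounding parentheses, which must be escaped because ( and ) are metacharacters. The second describes the shape of the date itself and is an assumption about how dates are written on the wiki.

```r
library(stringr)

episode <- "Episode 2: The pie whisperer. (4 August 2015)"

# Everything between a literal "(" and a literal ")"
date_by_parens <- str_extract(episode, "(?<=\\().+(?=\\))")
# "4 August 2015"

# One or two digits, a month name, and a four-digit year
date_by_shape <- str_extract(episode, "\\d{1,2} [A-Za-z]+ \\d{4}")
# "4 August 2015"
```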
Currently:
# A tibble: 270 × 1
episode
<chr>
1 Episode 1: It's not your fault. (18 March 2021)
2 Episode 1: It's not your fault. (18 March 2021)
3 Episode 1: It's not your fault. (18 March 2021)
4 Episode 1: It's not your fault. (18 March 2021)
5 Episode 1: It's not your fault. (18 March 2021)
6 Episode 1: It's not your fault. (18 March 2021)
7 Episode 1: It's not your fault. (18 March 2021)
8 Episode 1: It's not your fault. (18 March 2021)
9 Episode 1: It's not your fault. (18 March 2021)
10 Episode 1: It's not your fault. (18 March 2021)
# ℹ 260 more rows
One option:
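results |>
  select(episode) |>
  mutate(
    # text between ": " and the final period
    episode_name = str_extract(episode, "(?<=: ).+(?=\\.)"),
    # text between the parentheses
    air_date = str_extract(episode, "(?<=\\().+(?=\\))"),
    # first run of digits; done last because it overwrites episode
    episode = str_extract(episode, "\\d+")) |>
  # dmy() is from the lubridate package
  mutate(air_date2 = dmy(air_date))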
# A tibble: 270 × 4
episode episode_name air_date air_date2
<chr> <chr> <chr> <date>
1 1 It's not your fault 18 March 2021 2021-03-18
2 1 It's not your fault 18 March 2021 2021-03-18
3 1 It's not your fault 18 March 2021 2021-03-18
4 1 It's not your fault 18 March 2021 2021-03-18
5 1 It's not your fault 18 March 2021 2021-03-18
6 1 It's not your fault 18 March 2021 2021-03-18
7 1 It's not your fault 18 March 2021 2021-03-18
8 1 It's not your fault 18 March 2021 2021-03-18
9 1 It's not your fault 18 March 2021 2021-03-18
10 1 It's not your fault 18 March 2021 2021-03-18
# ℹ 260 more rows
Another option:
# A tibble: 270 × 4
episode episode_name air_date air_date2
<chr> <chr> <chr> <date>
1 1 It's not your fault 18 March 2021 2021-03-18
2 1 It's not your fault 18 March 2021 2021-03-18
3 1 It's not your fault 18 March 2021 2021-03-18
4 1 It's not your fault 18 March 2021 2021-03-18
5 1 It's not your fault 18 March 2021 2021-03-18
6 1 It's not your fault 18 March 2021 2021-03-18
7 1 It's not your fault 18 March 2021 2021-03-18
8 1 It's not your fault 18 March 2021 2021-03-18
9 1 It's not your fault 18 March 2021 2021-03-18
10 1 It's not your fault 18 March 2021 2021-03-18
# ℹ 260 more rows