Permutation Tests

October 28 + 30, 2024

Jo Hardin

Agenda 10/28/24

  1. Hypothesis tests
  2. Permutation tests

Hypothesis testing

Whether carried out as a permutation test or as a non-computational hypothesis test (like the t-test), hypothesis testing has the same structure.

See class notes on hypothesis testing.

Example: helper or hinderer

In a study reported in the November 2007 issue of Nature, researchers investigated whether infants take into account an individual’s actions towards others in evaluating that individual as appealing or aversive, perhaps laying the foundation for social interaction (Hamlin, Wynn, and Bloom 2007). In other words, do children who aren’t even yet talking still form impressions as to someone’s friendliness based on their actions? In one component of the study, 10-month-old infants were shown a “climber” character (a piece of wood with “googly” eyes glued onto it) that could not make it up a hill in two tries. Then the infants were shown two scenarios for the climber’s next try, one where the climber was pushed to the top of the hill by another character (the “helper” toy) and one where the climber was pushed back down the hill by another character (the “hinderer” toy). The infant was alternately shown these two scenarios several times. Then the child was presented with both pieces of wood (the helper and the hinderer characters) and asked to pick one to play with.

Parts of a hypothesis test

  • What are the observational units?
    • infants
  • What is the variable? What type of variable?
    • choice of helper or hinderer: categorical
  • What is the statistic?
    • \(\hat{p}\) = proportion of infants who chose helper = 14/16 = 0.875
  • What is the parameter?
    • p = proportion of all infants who might choose helper (not measurable!)

Hypotheses

\(H_0\): Null hypothesis. Babies (or rather, the population of babies under consideration) have no inherent preference for the helper or the hinderer shape.

\(H_A\): Alternative hypothesis. Babies (or rather, the population of babies under consideration) are more likely to prefer the helper shape over the hinderer shape.

p-value

The p-value is the probability of observing our data, or something more extreme, if nothing interesting is going on (that is, if the null hypothesis is true).

completely arbitrary cutoff \(\rightarrow\) generally accepted conclusion
p-value \(>\) 0.10 \(\rightarrow\) no evidence against the null model
0.05 \(<\) p-value \(<\) 0.10 \(\rightarrow\) moderate evidence against the null model
0.01 \(<\) p-value \(<\) 0.05 \(\rightarrow\) strong evidence against the null model
p-value \(<\) 0.01 \(\rightarrow\) very strong evidence against the null model
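
The verbal scale above can be encoded as a small helper function; this is purely illustrative and not part of the slides:

```r
# map a p-value to the conventional (and completely arbitrary) evidence scale
evidence <- function(p_val) {
  if (p_val > 0.10) {
    "no evidence against the null model"
  } else if (p_val > 0.05) {
    "moderate evidence against the null model"
  } else if (p_val > 0.01) {
    "strong evidence against the null model"
  } else {
    "very strong evidence against the null model"
  }
}

evidence(0.0022)
# [1] "very strong evidence against the null model"
```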

Computation

First find the statistic

# load the tidyverse (for the dplyr, purrr, and ggplot2 functions used below)
library(tidyverse)

# to control the randomness
set.seed(47)

# first create a data frame with the Infant data
Infants <- read.delim("http://www.rossmanchance.com/iscam3/data/InfantData.txt")

# find the observed number of babies who chose the helper
help_obs <- Infants |> 
  summarize(prop_help = mean(choice == "helper")) |> 
  pull()
help_obs
[1] 0.875

Computation

Find the sampling distribution under the condition that the null hypothesis is true.

# write a function to simulate a set of infants who are
# equally likely to choose the helper or the hinderer
# (the rep argument is unused; it lets map_dbl iterate the function)

random_choice <- function(rep, num_babies){
  choice = sample(c("helper", "hinderer"), size = num_babies,
                  replace = TRUE, prob = c(0.5, 0.5))
  return(mean(choice == "helper"))
}
# repeat the function many times
map_dbl(1:10, random_choice, num_babies = 16)
 [1] 0.6875 0.3750 0.4375 0.3750 0.5000 0.5000 0.6250 0.4375 0.6875 0.6250
num_exper <- 5000
help_random <- map_dbl(1:num_exper, random_choice, 
                            num_babies = 16)

# visualize null sampling distribution
help_random |> 
  data.frame() |> 
  ggplot(aes(x = help_random)) + 
  geom_histogram() + 
  labs(x = "proportion of babies who chose the helper",
       title = "sampling distribution when null hypothesis is true",
       subtitle = "that is, no inherent preference for helper or hinderer")

Computation

Are the null values consistent with the observed value?

# the p-value!
sum(help_random >= help_obs) / num_exper
[1] 0.0022
# visualize null sampling distribution
help_random |> 
  data.frame() |> 
  ggplot(aes(x = help_random)) + 
  geom_histogram() + 
  geom_vline(xintercept = help_obs, color = "red") + 
  labs(x = "proportion of babies who chose the helper",
       title = "sampling distribution when null hypothesis is true",
       subtitle = "that is, no inherent preference for helper or hinderer")
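
As a sanity check (an addition to the slides, not part of them), the same one-sided p-value can be computed exactly, since under the null hypothesis each of the 16 infants independently chooses the helper with probability 0.5:

```r
# P(X >= 14) when X ~ Binomial(16, 0.5): the exact counterpart
# of the simulated p-value above
sum(dbinom(14:16, size = 16, prob = 0.5))
# [1] 0.002090454
```

The simulated p-value (0.0022) agrees with the exact value up to Monte Carlo error.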

All together: structure of a hypothesis test

  • decide on a research question (which will determine the test)
  • collect data, specify the variables of interest
  • state the null (and alternative) hypothesis values (often statements about parameters)
    • the null claim is the science we want to reject
    • the alternative claim is the science we want to demonstrate
  • generate a (null) sampling distribution to describe the variability of the statistic that was calculated along the way
  • visualize the distribution of the statistics under the null model
  • get p-value to measure the consistency of the observed statistic and the possible values of the statistic under the null model
  • make a conclusion using words that describe the research setting

Hypotheses

  • Hypothesis Testing compares data to the expectation of a specific null hypothesis. If the data are unusual, assuming that the null hypothesis is true, then the null hypothesis is rejected.

  • The Null Hypothesis, \(H_0\), is a specific statement about a population made for the purposes of argument. A good null hypothesis is a statement that would be interesting to reject.

  • The Alternative Hypothesis, \(H_A\), is a specific statement about a population that is in the researcher’s interest to demonstrate. Typically, the alternative hypothesis contains all the values of the population that are not included in the null hypothesis.

  • In a two-sided (or two-tailed) test, the alternative hypothesis includes values on both sides of the value specified by the null hypothesis.

  • In a one-sided (or one-tailed) test, the alternative hypothesis includes parameter values on only one side of the value specified by the null hypothesis. \(H_0\) is rejected only if the data depart from it in the direction stated by \(H_A\).

Agenda 10/30/24

  1. Two variable permutation tests

Statistics Without the Agonizing Pain

John Rauser of Pinterest (now Amazon), speaking at Strata + Hadoop 2014. https://blog.revolutionanalytics.com/2014/10/statistics-doesnt-have-to-be-that-hard.html

Logic of hypothesis tests

  1. Choose a statistic that measures the effect.

  2. Construct the sampling distribution under \(H_0\).

  3. Locate the observed statistic in the null sampling distribution.

  4. p-value is the probability of the observed data or more extreme if the null hypothesis is true

Logic of permutation tests

  1. Choose a test statistic.

  2. Shuffle the data (force the null hypothesis to be true). Using the shuffled statistics, create a null sampling distribution of the test statistic (under \(H_0\)).

  3. Find the observed test statistic on the null sampling distribution.

  4. Compute the p-value (observed data or more extreme). The p-value can be one or two-sided.
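
The four steps can be sketched in a few lines of base R; the data values and group labels below are made up purely for illustration:

```r
set.seed(47)

# hypothetical data: a numeric response measured in two groups
values <- c(5.1, 6.3, 4.8, 7.0, 6.5, 5.9, 4.2, 6.8)
groups <- rep(c("A", "B"), each = 4)

# 1. test statistic: difference in group means (B minus A)
obs_stat <- mean(values[groups == "B"]) - mean(values[groups == "A"])

# 2. shuffle the group labels to force the null hypothesis to be true,
#    recomputing the statistic on each shuffle
perm_stats <- replicate(5000, {
  shuffled <- sample(groups)
  mean(values[shuffled == "B"]) - mean(values[shuffled == "A"])
})

# 3. & 4. one-sided p-value: how often is the shuffled statistic
#         at least as large as the observed one?
mean(perm_stats >= obs_stat)
```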

Applet for two sample permutation tests

High School & Beyond survey

Data: 200 randomly selected observations from the High School and Beyond survey, conducted on high school seniors by the National Center for Education Statistics.

Research Question: in the population, do private school kids have a higher math score on average?

\[H_0: \mu_{private} = \mu_{public}\] \[H_A: \mu_{private} > \mu_{public}\]

\(\mu_{private}\) and \(\mu_{public}\) are the average math scores in the respective populations of private and public school students.

library(tidyverse)
library(openintro)
hsb2
# A tibble: 200 × 11
      id gender race          ses   schtyp prog   read write  math science socst
   <int> <chr>  <chr>         <fct> <fct>  <fct> <int> <int> <int>   <int> <int>
 1    70 male   white         low   public gene…    57    52    41      47    57
 2   121 female white         midd… public voca…    68    59    53      63    61
 3    86 male   white         high  public gene…    44    33    54      58    31
 4   141 male   white         high  public voca…    63    44    47      53    56
 5   172 male   white         midd… public acad…    47    52    57      53    61
 6   113 male   white         midd… public acad…    44    52    51      63    61
 7    50 male   african amer… midd… public gene…    50    59    42      53    61
 8    11 male   hispanic      midd… public acad…    34    46    45      39    36
 9    84 male   white         midd… public gene…    63    57    54      58    51
10    48 male   african amer… midd… public acad…    57    55    52      50    51
# ℹ 190 more rows

Summary of the variables

hsb2 |> 
  group_by(schtyp) |> 
  summarize(ave_math = mean(math),
            med_math = median(math))
# A tibble: 2 × 3
  schtyp  ave_math med_math
  <fct>      <dbl>    <dbl>
1 public      52.2     52  
2 private     54.8     53.5

Visualize the relationship of interest

hsb2 |> 
  ggplot(aes(x = schtyp, y = math)) + 
  geom_boxplot()

Calculate the observed statistic(s)

For fun, we are calculating both the difference in averages as well as the difference in medians. That is, we have two different observed summary statistics to work with.

hsb2 |> 
  group_by(schtyp) |> 
  summarize(ave_math = mean(math),
            med_math = median(math))
# A tibble: 2 × 3
  schtyp  ave_math med_math
  <fct>      <dbl>    <dbl>
1 public      52.2     52  
2 private     54.8     53.5
hsb2 |> 
  group_by(schtyp) |> 
  summarize(ave_math = mean(math),
            med_math = median(math)) |> 
  summarize(ave_diff = diff(ave_math),
            med_diff = diff(med_math))
# A tibble: 1 × 2
  ave_diff med_diff
     <dbl>    <dbl>
1     2.51      1.5

Generate a null sampling distribution

perm_data <- function(rep, data){
  data |> 
    select(schtyp, math) |> 
    mutate(math_perm = sample(math, replace = FALSE)) |> 
    group_by(schtyp) |> 
    summarize(obs_ave = mean(math),
              obs_med = median(math),
              perm_ave = mean(math_perm),
              perm_med = median(math_perm)) |> 
    summarize(obs_ave_diff = diff(obs_ave),
              obs_med_diff = diff(obs_med),
              perm_ave_diff = diff(perm_ave),
              perm_med_diff = diff(perm_med),
              rep = rep)
}

map(1:10, perm_data, data = hsb2) |> 
  list_rbind()
# A tibble: 10 × 5
   obs_ave_diff obs_med_diff perm_ave_diff perm_med_diff   rep
          <dbl>        <dbl>         <dbl>         <dbl> <int>
 1         2.51          1.5         2.62            2.5     1
 2         2.51          1.5         0.757          -1       2
 3         2.51          1.5         1.65            2       3
 4         2.51          1.5        -0.805          -0.5     4
 5         2.51          1.5         1.80            2.5     5
 6         2.51          1.5         2.92            3       6
 7         2.51          1.5         1.91            2.5     7
 8         2.51          1.5        -2.55           -3.5     8
 9         2.51          1.5        -1.21           -3       9
10         2.51          1.5         1.32            1      10

Visualize the null sampling distribution (average)

set.seed(47)
perm_stats <- 
  map(1:500, perm_data, data = hsb2) |> 
  list_rbind() 

perm_stats |> 
  ggplot(aes(x = perm_ave_diff)) + 
  geom_histogram() + 
  geom_vline(aes(xintercept = obs_ave_diff), color = "red")

Visualize the null sampling distribution (median)

perm_stats |> 
  ggplot(aes(x = perm_med_diff)) + 
  geom_histogram() + 
  geom_vline(aes(xintercept = obs_med_diff), color = "red")

p-value

perm_stats |> 
  summarize(p_val_ave = mean(perm_ave_diff > obs_ave_diff),
            p_val_med = mean(perm_med_diff > obs_med_diff))
# A tibble: 1 × 2
  p_val_ave p_val_med
      <dbl>     <dbl>
1     0.086      0.27
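
For comparison (an addition, not part of the slides), the classical two-sample t-test addresses the same one-sided question about the averages; unlike the permutation test, it relies on distributional assumptions:

```r
library(openintro)

# schtyp has levels public, private, so t.test computes public minus private;
# H_A: mu_private > mu_public therefore corresponds to alternative = "less"
t.test(math ~ schtyp, data = hsb2, alternative = "less")
```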

Conclusion

From these data, the observed differences seem to be consistent with the distribution of differences in the null sampling distribution.

There is no evidence to reject the null hypothesis.

We cannot claim that in the population the average math score for private school kids is larger than the average math score for public school kids (p-value = 0.086).

We cannot claim that in the population the median math score for private school kids is larger than the median math score for public school kids (p-value = 0.27).

Two-sided test:

\(H_0: \mu_{private} = \mu_{public}\) and \(H_A: \mu_{private} \ne \mu_{public}\)

Two-sided p-value

Now a permuted difference counts as "more extreme" if it falls beyond the observed difference in either direction (using the fact that the null sampling distribution is centered at, and roughly symmetric around, zero).

perm_stats |> 
    summarize(p_val_ave = 
                mean(perm_ave_diff > obs_ave_diff | 
                       perm_ave_diff < -obs_ave_diff),
              p_val_med = 
              mean(perm_med_diff > obs_med_diff | 
                     perm_med_diff < -obs_med_diff))
# A tibble: 1 × 2
  p_val_ave p_val_med
      <dbl>     <dbl>
1     0.154     0.534

Two-sided conclusion

From these data, the observed differences seem to be consistent with the distribution of differences in the null sampling distribution.

There is no evidence to reject the null hypothesis.

We cannot claim that there is a difference in average math scores in the population (p-value = 0.154).

We cannot claim that there is a difference in median math scores in the population (p-value = 0.534).

References

Hamlin, J. Kiley, Karen Wynn, and Paul Bloom. 2007. “Social Evaluation by Preverbal Infants.” Nature 450: 557–59.