Different criteria for replication success

Whether a replication is successful can be assessed in many different ways, each answering a different question. Many criteria have been proposed, and several of them have been implemented in assess_replication_outcome(). We briefly introduce them here and then compare their results.

Repeated significance

Potentially the most basic question is whether a replication again yields a statistically significant result. This has been used to generate many of the headline findings about replicability, and is easily interpretable. However, there are some limitations:

  • It ignores the evidence provided by the original study entirely, and thus implicitly treats a difference in significance as if it were itself statistically significant.
  • It does not consider the size of the effect, so that large-scale replications that find tiny effects may be considered successful.
  • It does not consider the quality of the replication, so that a replication with a tiny sample size may be reported as a failed replication even though it tells us little about the original effect (see the sketch below).

To assess replication success in that way, use criterion = "significance_r" in assess_replication_outcome().

assess_replication_outcome(es_o = .5, n_o = 50, es_r = .1, n_r = 1000,
                           criterion = "significance_r")
#>   outcome                    outcome_detailed
#> 1 success + replication effect is significant
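
To illustrate the last limitation, consider a drastically underpowered replication that happens to estimate the same effect size as the original (the parameter values below are chosen purely for illustration):

# With only 10 participants, even a point estimate as large as the original
# is unlikely to reach significance, so this will typically be classified as
# a replication failure despite telling us little about the original effect.
assess_replication_outcome(es_o = .5, n_o = 50, es_r = .5, n_r = 10,
                           criterion = "significance_r")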

Aggregated Significance

Another approach is to aggregate the effect sizes from the original and replication studies using a meta-analytic approach and assess whether the combined effect is significantly different from zero. This method considers the evidence from both studies together, offering a more integrated perspective.

However, it assumes that the quality of evidence from both studies is comparable - which may not be the case if the original study is likely to have been subject to selective reporting (i.e. publication bias) or other Questionable Research Practices. In that case, the reported effect sizes are likely inflated, so that more trust should be placed in the replication study.

To use this criterion, set criterion = "significance_agg" in assess_replication_outcome():

assess_replication_outcome(es_o = .5, n_o = 50, es_r = .1, n_r = 50,
                           criterion = "significance_agg")
#>   outcome                     outcome_detailed
#> 1 success + aggregated effect is larger than 0
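
To build intuition for what happens under the hood, here is a minimal sketch of a fixed-effect combination of two correlations on the Fisher-z scale. aggregate_significance() is a hypothetical helper written here for illustration, not part of the package, and the package's exact weighting and test (e.g. one- vs two-sided) may differ.

# Hypothetical helper: pool two correlations with inverse-variance weights
aggregate_significance <- function(es_o, n_o, es_r, n_r) {
  z <- atanh(c(es_o, es_r))        # Fisher's z transformation
  w <- c(n_o, n_r) - 3             # inverse of the approximate sampling variances
  z_pooled <- sum(w * z) / sum(w)  # fixed-effect pooled estimate
  se_pooled <- sqrt(1 / sum(w))
  p <- 2 * pnorm(abs(z_pooled / se_pooled), lower.tail = FALSE)  # two-sided
  c(es_pooled = tanh(z_pooled), p = p)
}

aggregate_significance(es_o = .5, n_o = 50, es_r = .1, n_r = 50)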

Consistency with Replication Confidence Interval

This criterion checks whether the original effect size falls within the confidence interval of the replication effect size. If it does, it suggests that the replication is consistent with the original finding, taking into account the uncertainty of the replication study.

This method is beneficial because it directly compares the two studies, emphasizing whether the original result is plausible given the replication data. However, it makes lower-powered replications more likely to be “successful”, even if their point estimate is far from the original effect size. Conversely, it makes large (and thus more precise) replications more likely to “fail”, even if their point estimate is close to the original effect size.

To assess this, use criterion = "consistency_ci".

assess_replication_outcome(es_o = .5, n_o = 50, es_r = .3, n_r = 50,
                           criterion = "consistency_ci")
#>   outcome                                  outcome_detailed
#> 1 success + Original effect size is within replication's CI
assess_replication_outcome(es_o = .5, n_o = 50, es_r = .4, n_r = 500,
                           criterion = "consistency_ci")
#>   outcome                                      outcome_detailed
#> 1 failure - Original effect size is not within replication's CI
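
For intuition, the interval in question can be approximated on the Fisher-z scale as sketched below. replication_ci() is a hypothetical helper for illustration; the package may compute the interval differently.

# Hypothetical helper: 95% CI for the replication effect via Fisher's z
replication_ci <- function(es_r, n_r, conf = .95) {
  z <- atanh(es_r)
  se <- 1 / sqrt(n_r - 3)
  crit <- qnorm(1 - (1 - conf) / 2)
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

replication_ci(es_r = .3, n_r = 50)   # wide: the original r = .5 falls inside
replication_ci(es_r = .4, n_r = 500)  # narrow: the original r = .5 falls outside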

Prediction Intervals

Prediction intervals offer another way to assess consistency by checking whether the replication effect size falls within the prediction interval derived from the original study. This interval accounts for the expected variability in replication results and provides a broader test than confidence intervals.

This method is beneficial because it considers the uncertainty in both the original and replication studies. However, particularly for small replication studies, the prediction interval may be very wide, making it very difficult for a replication to contradict the original finding.

To use this method, set criterion = "consistency_pi":

assess_replication_outcome(es_o = .5, n_o = 50, es_r = .3, n_r = 500,
                           criterion = "consistency_pi")
#>   outcome                                            outcome_detailed
#> 1 success + Replication effect size is within the prediction interval
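
For intuition, a common way to construct such an interval combines the sampling error of both studies on the Fisher-z scale, as sketched below. prediction_interval() is a hypothetical helper for illustration; the package's exact formula may differ.

# Hypothetical helper: 95% prediction interval for the replication effect,
# centred on the original estimate and widened by both standard errors
prediction_interval <- function(es_o, n_o, n_r, conf = .95) {
  z_o <- atanh(es_o)
  se <- sqrt(1 / (n_o - 3) + 1 / (n_r - 3))
  crit <- qnorm(1 - (1 - conf) / 2)
  tanh(c(lower = z_o - crit * se, upper = z_o + crit * se))
}

prediction_interval(es_o = .5, n_o = 50, n_r = 500)  # contains the replication estimate of .3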

Homogeneity of Effects

This criterion tests whether the effects from the original and replication studies are statistically different from each other, using a heterogeneity test (i.e., a Q-test). This is the most rigorous test of whether a replication contradicts an original study (if the original study’s claim is taken to encompass its full statistical uncertainty).

However, this test generally has low power, again making replication failures for small original studies rather unlikely. Also, note that effect sizes can be homogeneous even if the joint effect is not significant, so that a replication might be considered successful even if the aggregate evidence does not support the existence of an effect.

To assess replication success using this approach, use criterion = "homogeneity":

assess_replication_outcome(es_o = .2, n_o = 100, es_r = 0, n_r = 500,
                           criterion = "homogeneity")
#>   outcome                outcome_detailed
#> 1 success + effects are not heterogeneous
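
For intuition, a Q-test for two Fisher-z transformed correlations can be sketched as follows. q_test() is a hypothetical helper for illustration; the package's implementation may differ in its details.

# Hypothetical helper: Cochran's Q test for heterogeneity of two correlations
q_test <- function(es_o, n_o, es_r, n_r) {
  z <- atanh(c(es_o, es_r))
  w <- c(n_o, n_r) - 3             # inverse-variance weights
  z_bar <- sum(w * z) / sum(w)     # fixed-effect pooled estimate
  Q <- sum(w * (z - z_bar)^2)      # with two studies, Q has 1 degree of freedom
  c(Q = Q, p = pchisq(Q, df = 1, lower.tail = FALSE))
}

q_test(es_o = .2, n_o = 100, es_r = 0, n_r = 500)  # p > .05: no significant heterogeneity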

Homogeneity and Significance

This criterion combines the assessment of homogeneity with the significance of the effect sizes. It considers a replication as successful if the replication effect is not statistically different from the original effect and the combined effect is significantly different from zero. This method provides a balanced assessment that considers both consistency and evidence strength - yet it still has low power to identify replication failures driven by an exaggeration of the reported effect size in small original studies.

To apply this, set criterion = "homogeneity_significance":

assess_replication_outcome(es_o = .2, n_o = 100, es_r = 0, n_r = 500,
                           criterion = "homogeneity_significance")
#>   outcome                                   outcome_detailed
#> 1 failure - effect sizes are homogeneous but not significant
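
Conceptually, this criterion is the conjunction of the two previous ideas, as the following sketch shows (reusing the hypothetical helpers aggregate_significance() and q_test() defined above; the package's thresholds and tests may differ).

agg <- aggregate_significance(es_o = .2, n_o = 100, es_r = 0, n_r = 500)
het <- q_test(es_o = .2, n_o = 100, es_r = 0, n_r = 500)
# Success requires homogeneous effects AND a significant pooled effect;
# here the effects are homogeneous, but the pooled effect is not significant.
het["p"] >= .05 && agg["p"] < .05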

Small Telescopes

Derived from Simonsohn (2015), the small telescopes approach tests whether the replication effect size is larger than the effect size that would have given the original study a power of 33%. This criterion is based on the idea that a successful replication should indicate that the original study provided some evidence (rather than being a lucky guess). However, this can label replications with substantial effect sizes as failures if the original study was very small, so it is not an appropriate criterion if the focus is on the truth of claims, rather than the quality of evidence.

To use this method, set criterion = "small_telescopes":

assess_replication_outcome(es_o = .5, n_o = 50, es_r = .2, n_r = 100,
                           criterion = "small_telescopes")
#>   outcome                                               outcome_detailed
#> 1 success + original study had >= 33% power to detect replication effect

assess_replication_outcome(es_o = .5, n_o = 30, es_r = .2, n_r = 100,
                           criterion = "small_telescopes")
#>   outcome                                              outcome_detailed
#> 1 failure - original study had < 33% power to detect replication effect
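
For intuition, the “small effect” at the heart of this criterion can be computed as sketched below. r33() is a hypothetical helper for illustration; the package's power calculation and decision rule (e.g. one- vs two-sided tests) may differ.

# Hypothetical helper: the correlation that would have given the original
# study 33% power, using the Fisher-z approximation to the power function
r33 <- function(n_o, alpha = .05, target = 1/3) {
  power <- function(r) {
    z <- atanh(r) * sqrt(n_o - 3)
    crit <- qnorm(1 - alpha / 2)
    pnorm(z - crit) + pnorm(-z - crit)  # two-sided power
  }
  uniroot(function(r) power(r) - target, interval = c(.001, .999))$root
}

# A smaller original study implies a larger r33, i.e. a higher bar for the replication
r33(n_o = 50)
r33(n_o = 30)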

Comparing different criteria for replication success

To compare the different criteria, we plot the aggregate outcomes based on the FReD dataset. For this, we exclude studies for which the original or replication effect size has not been coded.

df <- load_fred_data() %>% 
  filter(!is.na(es_replication) & !is.na(es_original)) 
#> 72 effect sizes could not be converted to a standardised metric.
#> 48 effect sizes could not be converted to a standardised metric.

criteria <- tribble(
  ~criterion,                  ~name,                             
   "significance_r",            "Significance\n(replication)",               
   "significance_agg",          "Significance\n(aggregated)",             
   "consistency_ci",            "Consistency\n(confidence interval)", 
   "consistency_pi",            "Consistency\n(prediction interval)",   
   "homogeneity",               "Homogeneity",                    
   "homogeneity_significance",  "Homogeneity +\nSignificance",
   "small_telescopes",          "Small Telescopes"
)

results <- lapply(criteria$criterion, function(criterion) {
  df %>% 
    mutate(assess_replication_outcome(es_original, n_original, es_replication, n_replication, criterion = criterion)) %>% 
    count(outcome, outcome_detailed) %>% 
    mutate(criterion = criterion)
})

results <- results %>%
  bind_rows() %>%
  left_join(criteria, by = "criterion") %>%
  arrange(outcome != "success", -n) %>%
  mutate(criterion = factor(name, levels = unique(name)))

success_shares <- results %>%
  group_by(criterion) %>% 
  summarise(success_share = sum(n[!is.na(outcome) & outcome == "success"]) / sum(n),
            success_count = sum(n[!is.na(outcome) & outcome == "success"])) %>% 
  mutate(outcome = "success")

p <- results %>%
  ggplot(aes(criterion, n, fill = outcome, text = outcome_detailed)) +
  geom_col(position = "stack", color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = paste(round(success_share*100, 1), "%"), text = NULL, y = success_count/2), data = success_shares) +
  labs(y = "k", x = "Success criterion") +
  scale_fill_manual(values = c(
    "success" = "#66C2A5",        
    "failure" = "#FC8D62",        
    "OS not significant" = "#BFBFBF", 
    "NA" = "#808080"              
  ))


p %>% 
  plotly::ggplotly(tooltip = c("text", "n"))