7  Discussion

7.0.0.1 Defining and Determining Replication Success

There is no strong consensus yet on what constitutes replication success, and some approaches can be biased (e.g., Schauer & Hedges, 2021) or imprecise (Patil et al., 2016b). As in classical null hypothesis significance testing (NHST), replication researchers face a trade-off between dichotomizing something that is not dichotomous (success vs. failure) and making a clear decision about the outcome. On the one hand, this is a question about statistical choices and their interpretation, namely how to compare original and replication effect sizes (or p-values) and how to interpret differences. On the other hand, it is a more complex question about how to interpret a mixed pattern of results, where some results are consistent across original and replication while others are not. Here, it is important for replication researchers to specify in their pre-registration which effects are of primary interest and how they will aggregate results, noting that requiring multiple effects to yield the same result will reduce statistical power.
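To make the power cost of such an aggregation rule concrete, the following minimal sketch (with hypothetical power values, and assuming the replication tests are statistically independent) shows how the probability of overall success shrinks when every pre-registered effect must replicate:

```python
import math

# Assumed (hypothetical) power to detect each of three pre-registered effects
per_effect_power = [0.80, 0.90, 0.80]

# If overall "success" requires every effect to be significant, and the tests are
# independent, the aggregate power is the product of the individual powers.
joint_power = math.prod(per_effect_power)

print(f"Power per effect: {per_effect_power}")
print(f"Power to replicate all effects: {joint_power:.2f}")  # 0.58
```

Under these assumptions, three effects each powered at 80-90% leave only about a 58% chance that all of them turn out significant, which is worth anticipating when specifying the aggregation rule in the pre-registration.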

Below, we present different approaches to assessing replication success as summarized by Heyard et al. (2025; see also Muradchanian et al., 2021; Röseler & Wallrich, 2024; Errington et al., 2021, Table 1).

Table 3

Criteria to operationalize replication success (excerpt from Heyard et al., 2025)

| Name | Question answered | Type of reproducibility investigated |
|------|-------------------|---------------------------------------|
| Bayes Factor: Independent Jeffreys-Zellner-Siow BF test | “What is the evidence for the effect being present or absent in light of a replication attempt, given that we know relatively little about the expected effect size beforehand?” | Different data - same analysis |
| Bayes Factor: Equality-of-effect-size BF test | “What is the evidence for the effect size in the replication attempt being equal vs. unequal to the effect size in the original study?” | Different data - same analysis |
| Bayes Factor: Fixed-effect meta-analysis BF test | “When pooling all data, what is the evidence for the effect being present vs. absent?” | Different data - same analysis |
| Replication Bayes factor | “What is the evidence for the effect from the replication attempt being comparable to what was found in the original study, or absent?” - “Are the replication results more consistent with the original study or with a null effect?” | Different data - same analysis |
| Significance criterion | “Do the original and replication study both find a statistically significant effect in the same direction?” | Different data - same analysis; Same data - different analysis; Different data - different analysis |
| Difference in effect size | “To which degree do the effects from a replication study mirror the original?” | Different data - same analysis; Same data - different analysis; Different data - different analysis |
| Confidence interval: original effect in replication 95% CI | “Given an original effect size, (what is the probability that) does a repetition of the experiment, with an independent sample of participants, produce(s) a CI that overlaps with the original effect?” | Different data - same analysis |
| Confidence interval: replication effect in original 95% CI | “Given an effect size and 95% CI, (what is the probability that) does a repetition of the experiment, with an independent sample of participants, give(s) an effect that falls within the original CI?” | Different data - same analysis |
| Prediction interval: replication effect in original 95% prediction interval | “Do the findings from the replication study align with a reasonable expectation, given the observed variation in the original study and replication study?” - “Are the replication estimates statistically consistent with the original estimates?” | Different data - same analysis |
| Small Telescopes | “Are the replication results consistent with an effect size big enough to have been detectable in the original?” | Different data - same analysis |
| Meta-analysis | “Given an original-replication study pair, does the pooled effect align with that of the original study?” - “Given a set of replications, is the effect size reproducible across studies?” | Different data - same analysis; Same data - different analysis; Different data - different analysis |
| Equivalence testing | “For the replication of an original null finding, does the replication study find an effect that is equally negligible?” - “Are the results from the replication statistically equivalent to the results of the original study?” | Different data - same analysis |
| Minimum effect testing | “Is the replication effect size significantly different from a minimal effect size of interest, required to support the original study?” | Different data - same analysis |
| Causal replication framework | “How can a replication failure be interpreted, from a causal perspective?” | Different data - same analysis; Different data - different analysis |
| Text-based machine learning model to estimate reproducibility | “Given the text of an original paper, what is the probability of replication success?” | Different data - same analysis |
| Prediction market | “What do the participants in a prediction market predict as the probability that the original findings will replicate?” | Different data - same analysis |
| Presence/absence of elements ensuring reproducibility, via proxies | “Do the design, methods and reporting of the original paper align with community standards of reproducible and transparent research?” | Same data - same analysis; Same data - different analysis; Different data - same analysis; Different data - different analysis |
| Quantified reproducibility assessment (QRA) | “After performing multiple measurements of an object, what is the precision of the measured quantity obtained?” | Same data - same analysis; Different data - same analysis; Same data - different analysis; Different data - different analysis |
| Jaccard similarity coefficient | “By what extent do the results of two (or more) fMRI experiments overlap?” | Same data - same analysis |
| Sceptical \(p\)-value | “To what extent are the results of a replication study in conflict with the beliefs of a sceptic of the original study?” | Different data - same analysis |
| Modified Brinley plot | “Given a pre-specified desired effect and multiple replications, what is the share of replications that, represented graphically, achieve the desired effect?” | Same data - same analysis; Same data - different analysis; Different data - same analysis; Different data - different analysis |
| Likelihood-based approach for reproducibility | “Given a theoretically interesting effect size derived from the original study, what is the evidence for or against replicating this effect?” | Different data - same analysis |
| Bayesian mixture model for reproducibility rate | “Given the results (\(p\)-values) from a set of original and replication studies, what is the rate of reproducibility, and how is it related to certain aspects of the experiments?” | Different data - same analysis |
| Unified framework for estimating the credibility of published research | “For a specific published research work, what is the evidence for its credibility measured on four different dimensions: method and data transparency, analytic reproducibility, analytic robustness and effect reproducibility?” | Same data - same analysis; Same data - different analysis; Different data - same analysis; Different data - different analysis |
| Reproducibility scale of workflow execution - Tonkaz | “Given a certain original research paper with results based on computation, can the workflow to generate the results be executed and verified?” | Same data - same analysis; Same data - different analysis; Different data - same analysis; Different data - different analysis |
| Mean relative effect size | “What is the average ratio of replication study effects to original study effects?” | Different data - same analysis; Same data - same analysis |
| Correlation between effects | “Do the replication studies and the original studies produce effects that are correlated?” | Different data - same analysis; Same data - same analysis |
| Fragility Index | “Given the results of an original study were significant, what is the smallest change in the original data that is needed to deem the results non-significant (and vice versa for original null results)?” - “How fragile are the original results to small changes in the underlying data?” | Same data - different analysis |
| Externally standardized residuals | “Is the original study consistent with the replication(s)?” - “Are all studies included in a meta-analysis replicable?” | Different data - same analysis; Same data - different analysis |
| Snapshot hybrid | “After replicating an original study, what is the evidence for a null, small, medium or large effect?” | Different data - same analysis |
| Bayesian Evidence Synthesis | “Given several conceptual replications with substantial diversity in data, design and methods but investigating the same theory, what is the evidence underlying a certain theory of interest?” | Different data - different analysis |
| Design analysis | “Given the results of an original study and an effect of a hypothetical replication study, what is the probability of the estimate being in the wrong direction, and what is the factor by which the magnitude of the effect is overestimated?” | Different data - same analysis |
| Reproducibility Maps | “For fMRI research, how many and which of the truly active voxels were strongly reproduced?” | Same data - same analysis; Same data - different analysis |
| Continuously cumulating meta-analytic approach | “Given subsequent replications that were performed to date, what is the current evidence for an effect?” | Different data - same analysis |
| Correspondence test | “To what extent does the effect size from the replication study differ from or is equivalent to that of the original study?” | Different data - same analysis |
| Z-curve | “Do all studies combined provide credible evidence for a phenomenon?” | Different data - same analysis |
| Cross-validation methods | “To what extent can the stability of a result be trusted, and to what extent can the result be generalized?” | Different data - same analysis |
| Network Comparison Test (NCT) | “Given two network structures, how similar are they to each other?” | Same data - same analysis; Different data - same analysis |
| Leave-one-out error | “Given a deep learning model, how generalizable are its results?” | Different data - same analysis |
| Subjective reproducibility assessment | “Does the replication team consider the replication as successful?” - “To what extent does the replication team trust in the reproducibility of a finding?” | Different data - same analysis |
| I squared, \(I^2\) | “Given a set of replications, to what extent is the total variation across study results due to heterogeneity?” - “How consistent are the results across replications?” | Different data - same analysis; Different data - different analysis |
| Credibility analysis | “How credible are the results of a study, in a Bayesian framework?” | Different data - same analysis |
| Consistency of original with replications, \(P_{\mbox{orig}}\) | “To what extent are the replication effect sizes consistent with the effect size of an original study?” | Different data - same analysis; Different data - different analysis |
| Proportion of population effects agreeing in direction with the original, \(\hat{P}_{>0}\) | “To what extent do the replication effect sizes agree with the sign found in the original study?” | Different data - same analysis; Different data - different analysis |
| RepliCATS | “How reliable do experts believe the claims from an original finding are?” | Different data - same analysis |
| RepeAT - Repeatability Assessment Tool | “Does the presented research align with community standards of reproducible biomedical research, using electronic health records?” | Same data - same analysis; Different data - same analysis |
| P interval | “Given the results of an original study, what is the range of \(p\)-values a replication (following the same design) would lie in with 80% probability?” | Different data - same analysis |
| RipetaScore | “Given certain trust in research, reproducibility and professionalism quality indicators, how high does a paper score?” | Same data - same analysis; Different data - same analysis |
| Bland-Altman Plot | “Do the effects estimated in several original-replication study pairs agree with each other?” - “How good is the agreement between repeated measures/studies?” | Same data - same analysis; Same data - different analysis; Different data - same analysis; Different data - different analysis |
| Sceptical Bayes Factor | “In light of the replication data, at which level of evidence can an advocate of the original study convince a sceptic?” | Different data - same analysis |

Note. For references and descriptions, see the full table by Heyard et al. (2025, Table 4), also available at http://rachelheyard.com/reproducibility_metrics/#table. Reused in accordance with the CC-BY license of the published article and the OSF project (https://osf.io/sbcy3).
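As an illustration of how one of these criteria can be operationalized, the sketch below implements a prediction-interval check in the spirit of the corresponding table entry (cf. Patil et al., 2016b): the replication estimate is compared against a 95% prediction interval constructed from the original estimate and the standard errors of both studies. All numbers are hypothetical and the code is only a minimal sketch under the assumption of approximately normal effect estimates, not a recommended or authoritative implementation.

```python
# Minimal sketch of a prediction-interval criterion (hypothetical numbers).
# Assumption: both effect estimates are approximately normal with known standard errors.
from statistics import NormalDist

def prediction_interval_success(orig_est, orig_se, rep_est, rep_se, level=0.95):
    """Check whether the replication estimate falls in the prediction interval
    implied by the original estimate and the standard errors of both studies."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)        # ~1.96 for a 95% interval
    half_width = z * (orig_se**2 + rep_se**2) ** 0.5      # combines both uncertainties
    lower, upper = orig_est - half_width, orig_est + half_width
    return (lower, upper), lower <= rep_est <= upper

# Hypothetical original-replication pair of standardized effect sizes:
interval, consistent = prediction_interval_success(orig_est=0.45, orig_se=0.15,
                                                   rep_est=0.05, rep_se=0.10)
print(f"95% prediction interval: ({interval[0]:.2f}, {interval[1]:.2f})")  # (0.10, 0.80)
print("Replication consistent with original:", consistent)                 # False here
```

Other entries in the table (e.g., the significance criterion or small telescopes) can be operationalized in similarly compact ways; whichever criterion is chosen should be specified in the pre-registration.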

7.0.0.2 Interpreting Divergent Results (Replication Failures)

When replications succeed, the original claim gains further credence (as long as the methods are sound). However, when replications fail, many explanations and interpretations can be advanced, which need to be carefully considered and discussed in a report. While replication failure can highlight issues with statistical conclusion validity in the original studies (John et al., 2012; Nelson et al., 2018; Simmons et al., 2011), other explanations need to be considered, including issues with internal, external, and construct validity in both original and replication studies (Fabrigar et al., 2020; Vazire et al., 2022). For example, internal validity is threatened when attrition rates differ between experimental conditions in original or replication studies, creating potential confounds in the interpretation of treatment effects (Zhou & Fishbach, 2016). Construct validity is threatened when original or replication studies use unvalidated ad-hoc measures, fail to employ validated manipulations of the target construct, or when differences in sample characteristics between original and replication studies mean that manipulations and measures do not work as intended (Fabrigar et al., 2020; Fiedler et al., 2021; Flake & Fried, 2020). External validity is threatened when original findings do not generalize to the specifics of the replication study due to person and context differences between studies that moderate the effect. Thus, before making statements about the original finding’s robustness and generalizability, replication researchers need to critically discuss potential methodological shortcomings in both original studies and replication attempts that limit statistical conclusion, internal, external, and construct validity.

7.0.0.2.1 Hidden Moderator Account

One challenge for replication researchers is the identification of hidden or unknown confounds that may influence or bias the phenomenon under study. Each study has a set of potential extraneous or unknown moderator variables that is unique to it. These may seem trivial, such as the brightness of an experimental laboratory, or important, such as a cultural difference between the replicating and original studies. Yet even seemingly trivial differences could potentially change results. Often, statistical and methodological choices are made to circumvent or attenuate these issues. However, for some paradigms, these variables may be unknown even to the original researcher (Fiedler, 2011). They are referred to in the literature as unknown moderators, background variables, hidden moderators, or fringe variables. While such variables can always be invoked to dismiss unwelcome replication results, they can also genuinely bias replications, which highlights that a single replication is never entirely conclusive (though it might raise enough doubts that researchers do not see the value in addressing the remaining uncertainty). It should be noted that the same argument could be applied to raise doubts about any original study, questioning whether the effect is really due to the hypothesized cause or due to some hidden moderator or background variable. Clearly, a skeptic who stops at that level would not be taken very seriously, so it is important to move conversations about replication failure beyond general suspicion of hidden moderators.

Bargh (2006) suggested that the evidence generated by empirical findings far outweighs the resources of (social) psychology to conceptualize and understand the mechanisms underlying their effects. Boundary conditions are therefore not easily specified, which can impact both direct and conceptual replication success. A replication failure indicates that the original claim does not generalize to the setting of the replication; whether that conclusion extends to the setting of the original study needs to be considered in light of theory and may be a legitimate matter of contention.

7.0.0.3 The Role of Differences for the Interpretation of Findings

Each replication outcome should be evaluated in light of its closeness to the original study, which is why all deviations, the reasons for them, and, where possible, their potential impact on the results should be discussed. Existing theories may help assess whether a deviation should affect the outcomes. For example, most psychological theories are agnostic towards age, so a different age distribution among participants will be unproblematic in most cases. Researchers may choose to evaluate replications from both phenomenon-focused/inductive and theory-focused/deductive views. The resulting types of interpretation are listed in Figure 5, integrating previous accounts by Borgstede and Scholz (2021) and Freese and Peterson (2017, Figure 3).

Figure 5

Interpretation of replication outcomes depends on the closeness of the replication and the similarity of results, as well as the view taken (inductive vs. deductive).

Note.

  • Inductive or phenomenon-oriented views assume minimal generalizability of the original finding. Under this view, a divergent result casts doubt on the original finding only if the replication is highly similar to the original study.
  • Deductive or theory-oriented views assume maximal generality of a theory. Under this view, divergent results (i.e., replication failures) cast doubt on the theory regardless of how closely the replication matches the original study.

7.0.0.4 Comments from the Original Study’s Authors

If the replication results do not converge with the original results, replication researchers can reach out to the original study’s authors and ask for a comment that can be published together with the replication report. A template for requesting such a comment is provided in the appendix. Note that some journals (e.g., Journal of Replications and Comments in Economics) require such statements at the time of submission.