
Reversals & Replications


[What I propose] is not a reform of significance testing as currently practiced in soft-psych. We are making a more heretical point… We are attacking the whole tradition of null-hypothesis refutation as a way of appraising theories… Most psychology uses conventional Ho refutation in appraising the weak theories of soft psychology… [is] living in a fantasy world of “testing” weak theories by feeble methods.

Paul Meehl (1990)

A medical reversal is when an existing treatment is found to be ineffective or harmful. Psychology, for example, has been racking up reversals. In recent years, scholarship showed that only 40-65% of some classic results replicated, in the weak sense of finding statistical significance for the same direction of effect (less than zero or greater than zero). Even among those that replicated, the average effect found was half the originally reported effect. We realise that replications in the social sciences are themselves intricate phenomena with analytical and researcher dependencies, and that such failures to replicate are far less costly to society than medical ones, but they still pollute science’s goal of accumulating knowledge.

Psychology is not alone: medicine, cancer biology, and economics all have their share of irreplicable results. It’d be wrong to write off psychology, or any other discipline for that matter, for two reasons. First, scientific subfields differ a lot in their replication rates and effect-size shrinkage, which makes field-wide generalizations uninformative. Second, one reason psychology’s reversals are so prominent is its unusual ‘openness’ in terms of code and data sharing. A less scientific field would never have caught its own bullshit.

Box 1. Reversals in the context of COVID-19.

A counterexample from the COVID-19 pandemic: the UK’s March 2020 policy was based on the idea of behavioural fatigue and Western resentment of restrictions: that a costly prohibition would only last a few weeks before the population revolted against it, and so it had to be delayed until the epidemic’s peak. Now, this policy was so politically toxic that we know it had to be based on some domain reasoning, and it is in a way heartening that the government tried to go beyond socially naive epidemiology. But it was strongly criticised by hundreds of other behavioural scientists, who noted that the evidence for these ideas was too weak to base policy on. Here’s a catalogue of bad psychological takes.

The following are empirical findings about empirical findings; they’re all open to re-reversal. Also it’s not that “we know these claims are false”: failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis. We’ve tried to ban ourselves from saying “successful” or “failed” replication, and to report the best-guess effect size rather than play the bad old Yes/No science game. Code for converting means to Cohen’s d and Hedges’ g here.
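As a rough illustration of what such a conversion involves, here is a minimal Python sketch that computes Cohen’s d and Hedges’ g from two groups’ means, standard deviations, and sample sizes. It is not the code referred to above: the function names and the example numbers are hypothetical, and the sketch assumes two independent groups, a pooled standard deviation, and the common approximate small-sample correction for g.

```python
import math


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


def hedges_g(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Cohen's d shrunk by the approximate small-sample correction 1 - 3 / (4 * df - 1)."""
    d = cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2)
    df = n_1 + n_2 - 2
    return d * (1 - 3 / (4 * df - 1))


# Hypothetical example: two groups of 20, means 10 and 9, both with sd 2.
example_d = cohens_d(10.0, 2.0, 20, 9.0, 2.0, 20)
example_g = hedges_g(10.0, 2.0, 20, 9.0, 2.0, 20)
print(f"d = {example_d:.3f}, g = {example_g:.3f}")
```

The correction matters most for small samples; with hundreds of participants per group, d and g are nearly identical.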

Andrew Gelman and others suggest deflating all single-study effect sizes you encounter in the social sciences, without waiting for the subsequent shrinkage from publication bias, measurement error, data-analytic degrees of freedom, and so on. There is no uniform factor, but it seems sensible to divide novel effect sizes by a number between 2 and 100 (depending on its sample size, method, measurement noise, maybe its p-value if it’s really tiny).
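To make the arithmetic of that heuristic concrete, the sketch below turns a reported effect size into the range implied by dividing it by 2 and by 100. The function name and the example value are hypothetical, and the choice of divisor within that range remains a judgment call that this code does not make.

```python
def deflated_range(effect_size, min_divisor=2.0, max_divisor=100.0):
    """Return (most deflated, least deflated) versions of a reported effect size."""
    if min_divisor <= 0 or max_divisor < min_divisor:
        raise ValueError("divisors must satisfy 0 < min_divisor <= max_divisor")
    return effect_size / max_divisor, effect_size / min_divisor


# Hypothetical example: a novel single-study effect reported as d = 0.8.
most_deflated, least_deflated = deflated_range(0.8)
print(f"plausible range after deflation: d = {most_deflated:.3f} to {least_deflated:.3f}")
```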

Selection Criteria

Claims are included if there was at least one of: several failed replications, several good meta-analyses with notably smaller d, very strong publication bias, clear fatal errors in the analysis, a formal retraction, or clear fraud. Cases like growth mindset are also included, where the eventual effect size, though positive, was a tiny fraction of the hyped original claim. To best interpret our list below, please compare each entry to the original paper’s effect size. We do not provide an average over high-quality supporting papers. This is because thousands of potentially non-replicable papers are published every year, and filtering, reading, and listing them all would be a full-time job even if they were all ripe fruit, already covered by systematic replication or reanalysis projects. The rule is that if a spurious effect is discussed, or our community or contributors see it in a book, or it could hurt someone, it’s noteworthy.

Why trust replications more than originals?

One systematic problem with older results is that they were not pre-registered; we have no assurance that the published analysis was the only one run, and so no assurance that the inferences presented are in fact valid.

Replication studies have very high rates of pre-registration, and higher rates of code and data sharing. For “direct” replications, the original target study has in effect pre-registered its hypotheses, methods, and analysis plan.

But don’t trust any of them, in the sense of accepting them uncritically. Look for 3+ failed replications from different labs, just to save lots of rewriting, as the garden of forking paths and the mystery of the lefty p-curve unfold.

Project’s Motivation

The purpose of collating these reversal effects in social science is to encourage educators to incorporate replications of these effects into their students’ projects (e.g., third-year, thesis, course work) to provide them the opportunity to experience the research process directly, assess their ability to perform and report scientific research, and to help evaluate the robustness of the original study, thereby also helping them become good consumers of research. The crowdsourced and community-curated resource below aims to satisfy three of FORRT’s Goals:

  • Support scholars in their efforts to learn and stay up-to-date on best practices regarding open and reproducible research;
  • Facilitate conversations about the ethics and social impact of teaching substantive topics with due regard to scientific openness, epistemic uncertainty and the credibility revolution;
  • Foster social justice through the democratization of scientific educational resources and its pedagogies.

and four of FORRT’s Mission:

  • Dismantling hierarchies surrounding research, teaching, and service;
  • Building community among educators and various non-academic communities working to improve scientific communication and literacy across academia and the general public;
  • Building capacity for advocacy; and
  • Advocacy for the creation and maintenance of educational resources.

How to contribute?

Anyone can add reversals or replications by joining our initiative and then following the instructions in our crowdsource g-doc.

Reversals (organized per field)

Social Psychology

No good evidence for many forms of priming, automatic behaviour change from ‘related’ (often only metaphorically related) stimuli. Semantic priming is still solid, but the effect lasts only seconds.

Elderly priming, that hearing about old age makes people walk slower. The p-curve alone argues against the first 20 years of studies.

No good evidence for Money priming, that “images or phrases related to money cause increased faith in capitalism, and the belief that victims deserve their fate”.

  • Original paper: ‘Mere exposure to money increases endorsement of free-market systems and social inequality’, Caruso 2013. n between 30 and 168 (~120 citations).
  • Critiques: Rohrer 2015, n=136. Lodder 2019, a meta-analysis of 246 experiments. (total citations: ~70)
  • Original effect size: system justification d=0.8, just world d=0.44, dominance d=0.51
  • Replication effect size: For 47 preregistered experiments in Lodder:
    g = 0.01 [-0.03, 0.05] for system justification,
    g = 0.11 [-0.08, 0.3] for belief in a just world,
    g = 0.07 [-0.02, 0.15] for fair market ideology.
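One quick way to read intervals like those in Lodder is to check whether each one includes zero. The sketch below does this for the three estimates listed above; the variable and function names are hypothetical, and treating "interval includes zero" as "no clear evidence of an effect" is a simplification rather than a full analysis.

```python
# (label, point estimate g, lower bound, upper bound), as listed in the entry above.
lodder_estimates = [
    ("system justification", 0.01, -0.03, 0.05),
    ("belief in a just world", 0.11, -0.08, 0.30),
    ("fair market ideology", 0.07, -0.02, 0.15),
]


def interval_includes_zero(lower, upper):
    """True when zero lies inside the closed interval [lower, upper]."""
    return lower <= 0.0 <= upper


for label, g, lower, upper in lodder_estimates:
    verdict = "includes zero" if interval_includes_zero(lower, upper) else "excludes zero"
    print(f"{label}: g = {g:.2f} [{lower:.2f}, {upper:.2f}] {verdict}")
```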

Questionable evidence for Commitment priming (recall), participants exposed to a high-commitment prime would exhibit greater forgiveness.

Hostility priming (unscrambled sentences), exposing participants to more hostility-related stimuli caused them subsequently to interpret ambiguous behaviors as more hostile.

Intelligence priming (contemplation), participants primed with a category associated with intelligence (e.g. “professor”) performed 13% better on a trivia test than participants primed with a category associated with a lack of intelligence (“soccer hooligans”).

Moral priming (contemplation), participants exposed to a moral-reminder prime would demonstrate reduced cheating.

Death priming (Mortality Salience/Terror Management Theory), participants not exposed to mortality primes would show higher fear of death.

Verbal framing (temporal tense), participants who read what a person was doing showed enhanced accessibility of intention-related concepts and attributed more intentionality to the person, relative to participants who read what the person did.

Gustatory Disgust on Moral Judgment, gustatory disgust triggers a heightened sense of moral wrongness.

No good evidence for the Macbeth effect, that moral aspersions induce literal physical hygiene.

A failed replication with opposite results of Social Class on Prosocial Behaviour such that people with high social class were more likely to be pro-social than those with low social class.

No good evidence of anything from the Stanford prison ‘experiment’. It was not an experiment; ‘demand characteristics’ and scripting of the abuse; constant experimenter intervention; faked reactions from participants; as Zimbardo concedes, they began with a complete “absence of specific hypotheses”.

  • Original paper: ‘Interpersonal dynamics in a simulated prison’, Zimbardo 1973 (1800 citations, but cited by books with hundreds of thousands of citations).
  • Critiques: convincing method & data inspection
    Le Texier 2019 (total citations: ~8)
    Reicher and Haslam 2011 (total citations: ~430)
  • Original effect size: Key claims were insinuation plus a battery of difference in means tests at up to 20% significance(!). n = 24, data analysis on 21.
  • Replication effect size: N/A

No good evidence from the famous Milgram experiment that 65% of people will inflict pain if ordered to. Experiment was riddled with researcher degrees of freedom, going off-script, implausible agreement between very different treatments, and “only half of the people who undertook the experiment fully believed it was real and of those, 66% disobeyed the experimenter.”

  • Original paper: Behavioral Study of obedience, Milgram 1963. n=40 (~6600 citations). (The full range of conditions was n=740.)
  • Critiques: Burger 2011, Perry 2012, Brannigan 2013; Griggs 2016 (total citations: ~240).
  • Original effect size: 65% of subjects said to administer maximum, dangerous voltage.
  • Replication effect size: Doliński 2017 is relatively careful, n=80, and found comparable effects to Milgram. Burger (n=70) also finds similar levels of compliance to Milgram, but the level didn’t scale with the strength of the experimenter prods (see Table 5: the only real order among the prompts led to universal disobedience), so whatever was going on, it’s not obedience. One selection of follow-up studies found average compliance of 63%, but these suffer from the usual publication bias and tiny samples. (Selection was by a student of Milgram.) The most you can say is that there’s weak evidence for compliance, rather than obedience. (“Milgram’s interpretation of his findings has been largely rejected.”)

No good evidence that tribalism arises spontaneously following arbitrary groupings and scarcity, within weeks, and leads to inter-group violence. The “spontaneous” conflict among children at Robbers Cave was orchestrated by experimenters; tiny sample (maybe 70?); an exploratory study taken as inferential; no control group; there were really three experimental groups - that is, the experimenters had full power to set expectations and endorse deviance; results from their two other studies, with negative results, were not reported.

  • Original paper: ‘Superordinate Goals in the Reduction of Intergroup Conflict’, Sherif 1958, n=22. (His books on the studies are more cited: ‘Groups in harmony and tension’ (1958) and ‘Intergroup Conflict and Co-operation’.)
    (~7000 total citations including the SciAm puff piece).
  • Critiques: Billig 1976 in passing (729 citations), Perry 2018 (citations: 9)
  • Original effect size: Not that kind of psychology. (“results obtained through observational methods were cross-checked with results obtained through sociometric technique, stereotype ratings of in-groups and outgroups, and through data obtained by techniques adapted from the laboratory. Unfortunately, these procedures cannot be elaborated here.")
  • Replication effect size: N/A
  • (Set aside the ethics: the total absence of consent - the boys and parents had no idea they were in an experiment - or the plan to set the forest on fire and leave the boys to it.)
  • Tavris claims that the underlying “realistic conflict theory” is otherwise confirmed. Who knows.

Screen time and wellbeing. Lots of screen-time is not strongly associated with low wellbeing; it explains about as much of teen sadness as eating potatoes, 0.35%.

  • Original paper: Media speculation? (millions of ‘citations’).
  • Critiques: Orben 2019, n=355,358
  • Original effect size: N/A
  • Replication effect size: median association of technology use with adolescent well-being was β=−0.035, s.e.=0.004

No good evidence that female-named hurricanes are more deadly than male-named ones. Original effect size was a 176% increase in deaths, driven entirely by four outliers; reanalysis using a greatly expanded historical dataset found a nonsignificant decrease in deaths from female named storms.

  • Original paper: ‘Female hurricanes are deadlier than male hurricanes’, Jung 2014, n=92 hurricanes discarding two important outliers.
    (~76 citations).
  • Critiques: Christensen 2014. Smith 2016, n=420 large storms.
    (total citations: ~15)
  • Original effect size: d=0.65: 176% increase in deaths from flipping names from relatively masculine to relatively feminine
  • Replication effect size: Smith: 264% decrease in deaths (Atlantic); 103% decrease (Pacific).

At most weak use in implicit bias testing for racism. Implicit bias scores poorly predict actual bias, r = 0.15. The operationalisations used to measure that predictive power are often unrelated to actual discrimination (e.g. ambiguous brain activations). Test-retest reliability of 0.44 for race, which is usually classed as “unacceptable”. This isn’t news; the original study also found very low test-criterion correlations.
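To put a correlation of r = 0.15 in perspective, one common reading is the share of variance it accounts for, which is simply r squared. The sketch below computes this; the function name is hypothetical, and the interpretation assumes a linear relationship between the two measures.

```python
def variance_explained(r):
    """Proportion of variance linearly accounted for by a correlation of r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2


share = variance_explained(0.15)
print(f"r = 0.15 accounts for about {share:.2%} of the variance")
```

On this reading, nearly 98% of the variation in the measured behaviour is left unaccounted for by the implicit bias score.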

The Pygmalion effect, that a teacher’s expectations about a student affect their performance, is at most small, temporary, and inconsistent, r<0.1 with a reset after weeks. Rosenthal’s original claims about massive IQ gains, persisting for years, are straightforwardly false (“The largest gain… 24.8 IQ points in excess of the gain shown by the controls.”), and used an invalid test battery. Jussim: “90%–95% of the time, students are unaffected by teacher expectations”.

At most weak evidence for stereotype threat suppressing girls’ maths scores, i.e. for the interaction between gender and stereotyping.

  • Original paper: ‘Stereotype Threat and Women’s Math Performance’, Spencer 1999, n=30 women (~3900 citations).
  • Critiques: Stoet & Geary 2012, meta-analysis of 23 studies. Ganley 2013, n=931. Flore 2015, meta-analysis of 47 measurements. Flore 2018, n=2064. (total citations: ~500)
  • Original effect size: Not reported properly; Fig.2 looks like control-group-women-mean-score = 17 with sd=20, and experiment-group-women-score = 5 with sd=15. Which might mean roughly d= −0.7.
  • Replication effect size:
    Stoet: d= −0.17 [−0.27, −0.07] for unadjusted scores.
    Ganley: various groups, d= -0.27 to -0.17.
    Flore 2015: g= −0.07 [−0.21; 0.06] after accounting for publication bias.
    Flore 2018: d= −0.05 [−0.18, 0.07]
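The rough original effect size quoted above can be reproduced from the values read off Fig. 2 (control mean 17 with sd 20, experimental mean 5 with sd 15). The sketch below assumes equal group sizes, since the per-group n is not given here, so the pooled standard deviation reduces to the root mean square of the two sds. The function name is hypothetical.

```python
import math


def cohens_d_equal_n(mean_treatment, sd_treatment, mean_control, sd_control):
    """Cohen's d when both groups are assumed to have the same size."""
    pooled_sd = math.sqrt((sd_treatment ** 2 + sd_control ** 2) / 2)
    return (mean_treatment - mean_control) / pooled_sd


# Values as read off Fig. 2 in the entry above (approximate, eyeballed from a plot).
d_original = cohens_d_equal_n(5, 15, 17, 20)
print(f"d is roughly {d_original:.2f}")
```

Because the inputs are eyeballed from a figure, the result should be treated as an order-of-magnitude check rather than a precise estimate.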

Questionable evidence for an increase in “narcissism” (leadership, vanity, entitlement) in young people over the last thirty years. The basic counterargument is that they’re misidentifying an age effect as a cohort effect: the narcissism construct apparently decreases by about a standard deviation between adolescence and retirement, so “every generation is Generation Me”. All such “generational” analyses are at best needlessly noisy approximations of social change, since generations are not discrete natural kinds, and since people at the supposed boundaries are indistinguishable.

Figure 1

  • Table 3 here shows a mix of effects in 30 related constructs between 1977 and 2006, up and down.
  • Wetzel: d = -0.27 (1990 - 2010)

Be very suspicious of anything by Diederik Stapel. 58 retractions here.

Positive Psychology

No good evidence that taking a “power pose” lowers cortisol, raises testosterone, risk tolerance.

Weak evidence for facial-feedback (that smiling causes good mood and pouting bad mood).

(~2200 citations).

Reason to be cautious about mindfulness for mental health. Most studies are low quality and use inconsistent designs, there’s higher heterogeneity than other mental health treatments, and there’s strong reason to suspect reporting bias. None of the 36 meta-analyses before 2016 mentioned publication bias. The hammer may fall.

  • Critiques: Coronado-Montoya 2016
  • Original effect size: prima facie, d=0.3 for anxiety or depression
  • Replication effect size: Not yet.

No good evidence for Blue Monday, that the third week in January is the peak of depression or low affect ‘as measured by a simple mathematical formula developed on behalf of Sky Travel’. You’d need a huge sample size, in the thousands, to detect the effect reliably and this has never been done.

Cognitive Psychology

Good and robust evidence against ego depletion, that willpower is limited in a muscle-like fashion.

  • Original paper: ‘Ego Depletion: Is the Active Self a Limited Resource?’, Baumeister 1998, n=67 (~5700 citations)
  • Critiques: Hagger 2016, 23 independent conceptual replications
    (total citations: ~6
  • Critique: Vohs et al. 2021, multisite project, n = 3,531 over 36 sites.
  • Original effect size: something like d = -1.96 between control and worst condition. (I hope I’m calculating that wrong.)
  • Replication effect size: d = 0.04 [−0.07, 0.14]. (NB: not testing the construct the same way.)
  • Replication effect size (Vohs et al. 2021): d = 0.06.

Mixed evidence for the Dunning-Kruger effect. No evidence for the “Mount Stupid” misinterpretation.

  • Original paper: ‘Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments.’, Dunning & Kruger 1999, n=334 undergrads. This contains claims (1), (2), and (5) but no hint of (3) or (4). (~5660 citations)
  • Critiques: Gignac 2020, n=929; Nuhfer 2016 and Nuhfer 2017, n=1154; Luu 2015; Greenberg 2018, n=534; Yarkoni 2010.
    (total citations: ~20)
  • Original effect size: No sds reported so I don’t know. 2 of the 4 experiments showed a positive relationship between score and perceived ability; 2 showed no strong relationship. And the best performers tended to underestimate their performance. This replicates: the correlation between your IQ and your assessment of it is around r ≃ 0.3. (3) and (4) are not at all warranted.
    (5) is much shakier than (1). The original paper concedes that there’s a purely statistical explanation for (1): just that it is much easier to overestimate a low number which has a lower bound! And the converse: if I am a perfect performer, I am unable to overestimate myself. D&K just think there’s something notable left when you subtract this.
    It’s also confounded by (2)
  • Replication effect size (for claim 1): 3 of the 4 original studies can be explained by noisy tests, bounded scales, and artefacts in the plotting procedure. (“the primary drivers of errors in judging relative standing are general inaccuracy and overall biases tied to task difficulty”.) Only about 5% of low-performance people were very overconfident (more than 30% off) in the Nuhfer data
  • Gignac & Zajenkowski use IQ rather than task performance, and run two less-confounded tests, finding r = −0.05 between P and errors, and r = 0.02 for a quadratic relationship between self-described performance and actual performance
  • Jansen (2021) find independent support for claim 1 (n=3500) (the “performance-dependent estimation model”) and also argue for (5), since they find less evidence for an alternative explanation, Bayesian reasoning towards a prior of “I am mediocre”. (Fig 5b follows the original DK plot style, and is very unclear as a result.)
  • Muller (2020) replicate claim (1) and add some EEG stuff.
  • Some suggestions that claim (2) is WEIRD only.

Questionable evidence for a tiny “depressive realism” effect, of increased predictive accuracy or decreased cognitive bias among the clinically depressed.

  • Original paper: ‘Judgment of contingency in depressed and nondepressed students: sadder but wiser?’, 1979 (2450 citations).
  • Critiques: Moore & Fresco 2012
    (211 total citations)
  • Original effect size: d= -0.32 for bias about ‘contingency’, how much the outcome actually depends on what you do,
    n=96 students, needlessly binarised into depressed and nondepressed based on Beck score > 9. (Why?)
  • Replication effect size: d = -0.07 with massive sd=0.46, n=7305, includes a trim-and-fill correction for publication bias. “Overall, however, both dysphoric/depressed individuals (d= .14) and nondysphoric/nondepressed individuals evidenced a substantial positive bias (d= .29)”

Questionable evidence for the “hungry judge” effect, of massively reduced acquittals (d=2) just before lunch. Case order isn’t independent of acquittal probability (“unrepresented prisoners usually go last and are less likely to be granted parole”); favourable cases may take predictably longer and so are pushed until after recess; effect size is implausible on priors; explanation involved ego depletion.

No good evidence for multiple intelligences (in the sense of statistically independent components of cognition). Gardner, the inventor: “Nor, indeed, have I carried out experiments designed to test the theory… I readily admit that the theory is no longer current. Several fields of knowledge have advanced significantly since the early 1980s.”

At most weak evidence for brain training (that is, “far transfer” from daily training games to fluid intelligence) in general, in particular from the Dual n-Back game.

  • Original paper: ‘Improving fluid intelligence with training on working memory’, Jaeggi 2008, n=70. (2200 citations).

  • Critiques: Melby-Lervåg 2013, meta-analysis of 23 studies.
    Gwern 2012, meta-analysis of 45 studies.

  • Original effect size: d=0.4 over control, 1-2 days after training

  • Replication effect size: Melby: d=0.19 [0.03, 0.37] nonverbal; d=0.13 [-0.09, 0.34] verbal. Gwern: d=0.1397 [-0.0292, 0.3085], among studies using active controls.

  • Maybe some effect on non-Gf skills of the elderly.
    A 2020 RCT on 572 first-graders finds an effect (d=0.2 to 0.4), but many of the apparent far-transfer effects come only 6-12 months later, i.e. well past the end of most prior studies.

  • In general, be highly suspicious of anything that claims a positive permanent effect on adult IQ. Even in children the absolute maximum is 4-15 points for a powerful single intervention (iodine supplementation during pregnancy in deficient populations)

  • See also the hydrocephaly claim under “Neuroscience”.

  • Good replication rate elsewhere.

Failed replications of automatic imitation claims.

Weak or no evidence for cross-domain congruency sequence effect.

Developmental Psychology

Some evidence for a tiny effect of growth mindset (thinking that skill is improvable) on attainment. Really we should distinguish the correlation of the mindset with attainment vs. the effect of a 1-hour class about the importance of growth-mindset on attainment. I cover the latter but check out Sisk for evidence against both.

“Expertise attained after 10,000 hours practice” (Gladwell). Disowned by the supposed proponents.

No good evidence that tailoring teaching to students’ preferred learning styles has any effect on objective measures of attainment. There are dozens of these inventories, and really you’d have to look at each. (I won’t.)

  • Original paper: Multiple origins. e.g. the ‘Learning style inventory: technical manual’ (Kolb), ~4200 citations. The VARK questionnaire (Fleming). But it is ubiquitous in Western educational practice.
  • Critiques: Willingham 2015; Pashler 2009; Knoll 2017 (n=54); Husmann 2019
    (total citations: ~2400 )
  • Original effect size: ???
  • Replication effect size: [??], n=???

Personality psychology

Links Between Personality Traits and Consequential Life Outcomes. Pretty good? One lab’s systematic replications found that effect sizes shrank by 20% though (see comments below by Oliver C. Schultheiss).

Anything by Hans Eysenck should be considered suspect, but in particular these 26 ‘unsafe’ papers (including the one which says that reading prevents cancer).

Behavioural science

The effect of “nudges” (clever design of defaults) may be exaggerated in general. One big review found average effects were six times smaller than billed. (Not saying there are no big effects.) Here are a few cautionary pieces on whether, aside from the pure question of reproducibility, behavioural science is ready to steer policy.

Moving the signature box to the top of forms does not decrease dishonest reporting in the rest of the form.

One comment mentioned we need to consider frequently studied phenomena such as differential reinforcement, extinction bursts, functional communication training, derived relational responding, schedules of R+.


Brian Wansink accidentally admitted gross malpractice; fatal errors were found in 50 of his lab’s papers. These include flashy results about increased portion size massively reducing satiety.


Neuroscience

No good evidence that brains contain one mind per hemisphere. The corpus callosotomy studies which purported to show “two consciousnesses” inhabiting the same brain were badly overinterpreted.

Very weak evidence for the existence of high-functioning (IQ ~ 100) hydrocephalic people. The hypothesis begins from extreme prior improbability; the effect of massive volume loss is claimed to be on average positive for cognition; the case studies are often questionable and involve little detailed study of the brains (e.g. 1970 scanners were not capable of the precision claimed).

  • Original paper: No paper; instead a documentary and a profile of the claimant, John Lorber. Also Forsdyke 2015 and the fraudulent de Oliveira 2012.
  • Critiques: Hawks 2007; Neuroskeptic 2015; Gwern 2019
  • Alex Maier writes in with a cool 2007 case study of a man who got to 44 years old before anyone realised his severe hydrocephaly, through marriage and employment. IQ 75 (i.e. d=-1.7), which is higher than I expected, but still far short of the original claim, d=0.
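The conversion from an IQ score to a standardised difference in that case study is a one-line calculation, sketched below. It assumes the conventional IQ scaling of mean 100 and standard deviation 15, which is what makes the original claim (IQ ~ 100) correspond to d = 0; the function name is hypothetical.

```python
def iq_to_d(iq, population_mean=100.0, population_sd=15.0):
    """Express an IQ score as a standardised distance from the population mean."""
    return (iq - population_mean) / population_sd


case_study_d = iq_to_d(75)
print(f"IQ 75 corresponds to d = {case_study_d:.1f}")
```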

Readiness potentials seem to be actually causal, not diagnostic. So Libet’s studies also do not show what they purport to. We still don’t have free will (since random circuit noise can tip us when the evidence is weak), but in a different way.

No good evidence for left/right hemisphere dominance correlating with personality differences. No clear hemisphere dominance at all in this study.

  • Original paper: Media speculation?
  • Original effect size: N/A?


At most extremely weak evidence that psychiatric hospitals (of the 1970s) could not detect sane patients in the absence of deception.


No good evidence for precognition, undergraduates improving memory test performance by studying after the test. This one is fun because Bem’s statistical methods were “impeccable” in the sense that they were what everyone else was using. He is Patient Zero in the replication crisis, and has done us all a great service. (Heavily reliant on a flat / frequentist prior; evidence of optional stopping; forking paths analysis.)

Evolutionary psychology

Weak evidence for romantic priming, that looking at attractive women increases men’s conspicuous consumption, time discounting, and risk-taking. Weak, despite there being 43 independent confirmatory studies: one of the strongest cases of publication bias / p-hacking ever found.

  • Original paper: ‘Do pretty women inspire men to discount the future?’, Wilson and Daly 2003. n=209 (but only n=52 for each cell in the 2x2) (~560 citations).
  • Critiques: Shanks et al (2015) show that the 43 previous studies have an unbelievably bad funnel plot. They also run 8 failed replications. (total citations: ~80)
  • Original effect size: d=0.55 [-0.04, 1.13] for the difference between men and women. Meta-analytic d= 0.57 [0.49, 0.65]
  • Replication effect size: 0.00 [-0.12, 0.11]

Questionable evidence for the menstrual cycle version of the dual-mating-strategy hypothesis (that “heterosexual women show stronger preferences for uncommitted sexual relationships [with more masculine men] during the high-fertility ovulatory phase of the menstrual cycle, while preferring long-term relationships at other points”). Studies are usually tiny (median n=34, mostly over one cycle). Funnel plot looks ok though.

  • Original paper: ‘Menstrual cycle variation in women’s preferences for the scent of symmetrical men’, Gangestad and Thornhill (1998). (602 citations).
  • Critiques: Jones et al (2018) (total citations: 32)
  • Original effect size: g = 0.15, SE = 0.04, n=5471 in the meta-analysis. Massive battery of preferences included (…)
  • Replication effect size: Not a meta-analysis, just a list of recent well-conducted “null” studies and a plausible alternative explanation.
  • Note from a professor friend: the idea of a dual-mating hypothesis itself is not in trouble: the specific menstrual cycle research doesn’t seem to replicate well. However, to my knowledge the basic pattern of short vs long term relationship goals predicting [women’s] masculinity preferences is still robust.

No good evidence that large parents have more sons (Kanazawa); original analysis makes several errors and reanalysis shows near-zero effect. (Original effect size: 8% more likely.)


At most weak evidence that men’s strength in particular predicts opposition to egalitarianism.

  • Original paper: Petersen et al (194 citations).
  • Critiques: Measurement was of arm circumference in students, and effect disappeared when participant age is included. (total citations: 605)
  • Original effect size: N/A, battery of F-tests.
  • Replication effect size: Gelman: none as in zero. The same lab later returned with 12 conceptual replications on a couple of measures of (anti-)egalitarianism. They are very focussed on statistical significance instead of effect size. Overall male effect was b = 0.17 and female effect was b = 0.11, with a nonsignificant difference between the two (p = 0.09). (They prefer to emphasise the lab studies over the online studies, which showed a stronger difference.) Interesting that strength or “formidability” has an effect in both genders, whether or not their main claim about gender difference holds up.


At most very weak evidence that sympathetic nervous system activity predicts political ideology in a simple fashion. In particular, subjects’ skin conductance reaction to threatening or disgusting visual prompts - a noisy and questionable measure.

  • Original paper: Oxley et al, n=46. p=0.05 on a falsely binarised measure of ideology.
  • Critiques: Six replications so far (Knoll et al; 3 from Bakker et al), five negative as in nonsignificant, one forking (“holds in US but not Denmark”)

Behavioural genetics

No good evidence that 5-HTTLPR is strongly linked to depression, insomnia, PTSD, anxiety, and more. See also COMT and APOE for intelligence, BDNF for schizophrenia, 5-HT2a for everything.

Be very suspicious of any such “candidate gene” finding (post-hoc data mining showing large >1% contributions from a single allele). 0/18 replications in candidate genes for depression. 73% of candidates failed to replicate in psychiatry in general. One big journal won’t publish them anymore without several accompanying replications. A huge GWAS, n=1 million: “We find no evidence of enrichment for genes previously hypothesized to relate to risk tolerance.”

Applied Linguistics

Critical period hypothesis: Hartshorne, Tenenbaum and Pinker’s 2018 study of two-thirds of a million English speakers concluded that there is one sharply defined critical age, at 17.4 years, for all language learners. A reanalysis of the data showed that this conclusion is based on artifactual results (van der Slik et al. 2021). There was no evidence for any critical age for language learning.

Educational Psychology

Findings regarding mindsets (aka implicit theories) have been mixed, with increasing failures to replicate that put the value of the theory and the derived interventions in question (Brez et al., 2020). According to the meta-analysis by Sisk and colleagues (2018), the relationship between mindsets and academic achievement is weak: of the 129 studies that they analyzed, only 37% found a positive relationship between mindset and academic outcomes. Furthermore, 58% of the studies found no relationship and 6% found a negative relationship between mindset and academic outcomes. Evidence on the efficacy of mindset interventions is not promising: of the 29 studies reviewed, only 12% had a positive effect, 86% found no effect of the intervention, and 2% found a negative effect of the intervention. It should be noted that interventions seemed to work for low SES populations.

Further literature

A review of 2500 social science papers, showing the lack of correlation between citations and replicability, between journal status and replicability, and the apparent lack of improvement since 2009.

Discussion on Everything Hertz, Hacker News, Andrew Gelman, some star data thugs comment.

See also the popular literature with uncritical treatments of the original studies: