3 Integrity: Challenging questionable research practices

3.1 Replicability

There may be cases where you don’t expect to get the same results if you conduct the same study again. For example, if a study is based around a specific political event, it may be difficult or even impossible to replicate. But in many other types of investigation, we would expect to be able to get the same results when we run the same study again.

In Week 2, we talked about reproducibility – being able to get the same results when conducting the same analyses on the same data as the original study. Replicability is similar, in that it’s about getting the same results as the original study when running the same analyses. The difference is that these analyses are now run on new data. So, replication means conducting the same study again, and seeing if you get the same results.

Replication studies are deliberate attempts to do this. But what does the ‘same study’ mean? There are always going to be differences between the original study and the replication study. Replication studies vary on a spectrum from ‘direct’ to ‘conceptual’. Direct replications try to stay as close to the original study as possible, whereas conceptual replications purposefully vary some aspects to better understand the underlying phenomenon. Here are some examples, from most direct (the first) to most conceptual (the last):

A researcher makes a surprising finding in their research. To test whether they should rely on this result, they conduct a replication immediately after, using all the same materials and the same participant pool.
A researcher wants to replicate a study they’ve read about. The study is much older (from the 1990s), when open materials were not common. They only have the methods described in the original short paper to refer to, so they interpret these as best they can.
A researcher wants to replicate a study they’ve read about. They don’t think the original study was well-designed, but they think the hypothesis is interesting so they design a new study testing the same hypothesis but in a different way.

Now let’s dig deeper into the process of designing a replication study.

3.2 Replication studies

In the next video, psychologist Priya Silverstein talks about their first forays into conducting a replication study, and lessons learned. As you watch the video, think about what Priya’s results tell us about the process of running a good replication study.

Video transcript

Hi everyone, my name’s Priya Silverstein and I’m a post-doctoral researcher for the Psychological Science Accelerator, and I’m also the author for this course. My pronouns are ‘they, them’.

As part of my PhD, I ran my first replication study. It wasn’t meant to be a big part of my PhD, but it ended up being one of the biggest parts!

I thought that before starting any of my own original research, it would make more sense to start with a replication study. However it wasn’t that simple, so when I ran the replication study, surprisingly, we got a null result, and I was a bit confused about why this might be. So the first thing that I did was I contacted the original researchers to ask them what they thought might be the problem.

They got back to me and they said that they thought it was because of some differences between the stimuli, so the things that I’d shown in my study versus the things that they showed in the original. And some of these differences were things I couldn’t have known, because they didn’t outline the specifics of that in their original paper.

I made some edits to the protocol to the way that I was going to run the study.

And then I thought, okay, now that I’ve had approval from the original authors this new version should be able to replicate the original study. So I ran it again and surprisingly I still wasn’t able to replicate the result.

Erm and so this was quite disappointing, both for me and the original authors, because it meant that I wasn’t able to find the same thing that they did.

So… This was my first experience of replications. And you might think that that was enough to put me off doing any more, but instead, quite the opposite. I ended up realising how important replications are.

So yeah, ever since starting with that first replication study as part of my PhD, I’ve now kind of made that my specialty.

My advice for anyone who would be conducting their own replication study comes from some of the mistakes that I made as part of that first replication study that I did.

So my first piece of advice would be to always contact the authors before you begin your replication study.

I think I was a bit naïve, and thought if I just follow what’s written in the paper then how can I go wrong? But papers don’t have enough space to include everything about a study that you would need to know in order to conduct a good replication.

So I’d recommend talking to the original authors, coming to an agreement with them, making sure that they agree that the protocol that you’ve proposed, they would agree that’s a good faith replication attempt of their study.

Another thing that I did wrong is that I only collected the same number of participants as in the original study, for my replication, because I thought that was more ‘replication-y’, because it was the same amount of participants as the original. But now, after learning more about both replication studies, but also sample size more generally, I would really recommend to go with a much larger sample than the original study that you’re replicating.

And this is just so that you can be a little bit more sure about what your findings mean. So, in my study, I wasn’t able to replicate the same result as the original authors, but this could just be because the true effect size that’s in the world for that effect that I was looking at might be smaller than what they kind of measured in the original study.

If I had used a much larger number of participants, if I still wasn’t able to replicate the study, we could be a bit more sure that it wasn’t just because of low sample size.

I ran my study just as a normal study where we finished the entire study and then submitted it to a journal for publication. And I was lucky that it was successful in getting published.

But it could have been a lot harder for me to publish, which would have been a bit disappointing and taken a lot of time. So what I would recommend instead is submitting any replication study as a registered report.

A registered report is essentially where your paper gets peer-reviewed before you have collected data. So the peer reviewers say whether your protocol makes sense, recommend any suggested changes, and then once they’ve accepted it, the journal agrees to accept your study, regardless of what the outcome is. So that would be my third piece of advice.

Video 2: Conducting a replication study

Write your comments on what Priya advises. Allow about 10 minutes. When you are ready, see our comments.

Advice from Priya
Priya suggests getting in touch with the authors of the original study and asking for more detail than a published paper provides. Using a larger sample size increases confidence that your findings do (or don’t) support those of the previous study. Priya also recommends submitting a registered report, to increase your chances of getting published.

3.2.1 Limits to replication

There are fields and methodologies where the value of replication is hotly debated. For instance:

Some argue that replication should be encouraged in qualitative research, whereas others argue that there are still open questions about whether replication is possible, desirable, or even aligned with the fundamental principles of qualitative research.
Economics has had a long history with replication studies, but not under this name. In economics, replication often takes place as ‘robustness checks’, where researchers test if their results hold when they use different datasets.
Research in the humanities is primarily interpretive and context-specific, focusing on understanding human experiences, cultures, texts, and historical events. This interpretive nature makes exact replication more challenging.

It is important to think carefully about whether replication makes sense for your field and methodology.

If you are working in a field where replication is important, and your replication study finds the same result as the original, you can be more confident in that result.

But what does it mean if, as in Priya’s first attempts, your study does not replicate the original? One explanation could be that the original result was a false positive, and so the failed replication is a ‘true negative’. Another explanation is that the replication result was a false negative, and that the original study was a ‘true positive’. It’s also possible that differences between the two studies are responsible for the different results.

Activity 1

Allow about 10 minutes.

This activity relates to our examples of typical direct and conceptual replication studies. By way of reminder:

Researcher A makes a surprising finding in their research. To test whether they should rely on this result, they conduct a replication immediately after, using all the same materials and the same participant pool. This is a direct replication.

Researcher B wants to replicate a study they’ve read about. They don’t think the original study was well-designed, but they think the hypothesis is interesting, so they design a new study testing the same hypothesis but in a different way. This is a conceptual replication.

Now imagine these two researchers both carry out their studies. List the reasons why each of these two researchers may not replicate the original result.

Why were the two researchers unable to replicate the original results?
You might have listed:

The original result was a false positive
The replication result is a false negative
There are important differences between the original study and the replication study:

a. These could be small changes that researchers didn’t think should be important but that turned out to be (e.g.: which brand of a specific chemical was used).
b. It could be that the replication researcher didn’t realise these were differences because there wasn’t enough detail in the original paper to be able to work out how everything had been done.
c. The replication researcher might know they’re making a change from the original protocol, but approve this change because theoretically it shouldn’t make a difference to the result.

3.3 The replication crisis

This section will highlight some of the issues around replication in quantitative research. Replication is possible in qualitative research, and many qualitative researchers see the value of replication. So if you are a qualitative researcher, this section is still relevant to you. It will allow you to explore key issues faced by quantitative colleagues, learn how to read quantitative research papers more critically, and think about whether these issues could also apply to qualitative research, albeit manifested differently.

If we consider relatively direct replications, using the same materials as the original authors but conducted by different researchers, what percentage of published results do you imagine would replicate? It would be tempting to think that most published research findings are true, and therefore that a replication of a published research finding would be pretty likely to find the same result. However, researchers have found that the following percentages of findings could not be replicated:

Psychology: up to 60%
Cancer biology: up to 55%
Economics: up to 40%
Philosophy: up to 30%

The number of studies that could not be replicated was much higher than expected in certain fields, which has led some to refer to this as a ‘replication crisis’.

Why is it that so many quantitative studies cannot be replicated? It’s complicated!

Previously, you learned three possible explanations for a failed replication: the original result was a false positive, the replication result was a false negative, or differences between the two studies could have been responsible for the different results. However, these three explanations are not all equally likely. There are ways to try to work out which of them is most likely.

The original result being a false positive is more likely than you would think. Researchers often do not publish all the research that they do. As a researcher, there is an incentive to publish papers in ‘high impact’ journals (journals that are regarded highly in the researcher’s discipline, and that publish papers that receive a high number of citations). Historically, it has been harder to publish negative (null) results than positive (statistically significant) results, as journals have prioritised headline-grabbing results that confirm popular or contemporary positions. This has been the case for all journals, but especially high-impact ones.

This means that researchers have an incentive to get positive results in their research and can feel disappointed, stressed, and even ashamed if they don’t get a significant result. This can entice them to turn to questionable research practices, to increase the likelihood of a false positive result.
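To see how a preference for positive results can distort the published literature, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration, not something taken from a real study: the function name simulate_literature, a true effect of 0.2 standard deviations, 25 participants per study, a known standard deviation of 1, and a rule that only significant results in the expected direction get published.

```python
import random
from statistics import NormalDist


def simulate_literature(true_effect=0.2, n_per_study=25, n_studies=5000, seed=7):
    """Run many small studies of one real effect; 'publish' only the significant ones."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96
    all_estimates, published = [], []
    for _ in range(n_studies):
        # Hypothetical scores: the effect is real but small.
        scores = [rng.gauss(true_effect, 1) for _ in range(n_per_study)]
        estimate = sum(scores) / n_per_study
        # z-statistic, with the standard deviation assumed known and equal to 1.
        z = estimate * n_per_study ** 0.5
        all_estimates.append(estimate)
        # Assumed publication filter: only significant positive results appear.
        if z > critical_z:
            published.append(estimate)
    return all_estimates, published


if __name__ == "__main__":
    everything, published = simulate_literature()
    print("Average effect, all studies:", round(sum(everything) / len(everything), 2))
    print("Average effect, published studies:", round(sum(published) / len(published), 2))
    print("Share of studies published:", round(len(published) / len(everything), 2))
```

Under these assumptions, every published estimate has to clear the significance threshold, so the average published effect comes out larger than the true effect. This is one way a well-run replication can find a smaller effect than the original study, even when the effect is real.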

3.4 Questionable research practices

Yes, you read the end of the previous section correctly. There are questionable research practices that researchers may feel pressurised to use. Here are some examples:

P-hacking: in quantitative research, p-hacking means exploiting techniques that increase the likelihood of obtaining a statistically significant result, for example by performing multiple analyses, or stopping data collection once a significant p-value is reached.
Selective reporting: when results from research are deliberately not fully or accurately reported, in order to suppress negative or undesirable findings. For example, researchers might run two analyses but only report the one with significant findings, or be selective in what results are included in a report aimed at particular audiences.
HARK-ing: short for ‘hypothesising after the results are known’. This is when researchers write their papers as if they had hypotheses that they then went on to test in their study, when really they made up the hypothesis after seeing their results, picking the one that best fit.
Post-hoc justifications: stating, after the fact, justifications for decisions made during the research project. For example, a researcher might only manage to recruit women for a study after trying to recruit all genders, but claim in the paper that this was intentional.
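One of the p-hacking examples above, stopping data collection once a significant p-value is reached, can be explored with a small simulation. The sketch below is illustrative only. The function names, the starting sample of 10, the checks after every 10 extra participants, the maximum of 100 and the known standard deviation of 1 are all assumptions chosen for the example. In every simulated study the true effect is zero, so any significant result is a false positive.

```python
import random
from statistics import NormalDist


def p_value_two_sided(sample):
    """Two-sided z-test of 'the mean is zero', assuming a known SD of 1."""
    z = (sum(sample) / len(sample)) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def study_is_significant(rng, peek, start_n=10, max_n=100, step=10, alpha=0.05):
    """Simulate one study in which there is no real effect."""
    sample = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        at_end = len(sample) >= max_n
        # An honest researcher tests once, at the planned sample size.
        # A 'peeking' researcher tests after every batch and stops on p < alpha.
        if (peek or at_end) and p_value_two_sided(sample) < alpha:
            return True
        if at_end:
            return False
        sample.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(peek, n_studies=2000, seed=1):
    """Share of null studies that end with a 'significant' result."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, peek) for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    print("Fixed sample size:", false_positive_rate(peek=False))
    print("Stop when significant:", false_positive_rate(peek=True))
```

With a fixed sample size, the false positive rate should stay close to the 5% threshold. With repeated peeking, it should come out noticeably higher, even though nothing else about the study has changed.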

Although pressures to publish can sometimes be seen as barriers to transparency, the benefits of writing transparently can also be seen as a positive incentive, as the next section shows.

3.5 Writing transparently

When writing manuscripts, researchers should aim to be as transparent as possible, being honest about what happened in the study, how it was conducted, and when and why decisions were made. By using questionable research practices, researchers make it more likely that they get a false positive result, which can partially explain low replicability rates.

In the video, Priya introduced another important consideration for evaluating replication results: sample size (the number of observations in your study, e.g. participants). Smaller sample sizes make results less trustworthy in both directions: a significant result is more likely to be a false positive, and a real effect is more likely to be missed (a false negative). This is because smaller sample sizes provide less information about the population you are studying, which increases the variability and uncertainty in your results. With a small sample, the random variation (or ‘noise’) can more easily overshadow the true effect you are trying to measure. This means you might detect an effect that isn’t really there (a false positive) or miss an effect that actually exists (a false negative).

For instance, imagine trying to judge the average height of a population by looking at just a few individuals. Your estimate is more likely to be off compared to measuring a larger group, because you may happen to have either a very tall or very short person in your sample. So, if you have an original study with a small sample size and a (well-designed) replication with a large sample size, you could be more confident in the result of the replication than the result of the original study.
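The height example can be tried out in a few lines of Python. This is a sketch under stated assumptions: the population (average height 170 cm, standard deviation 10 cm), the function name spread_of_estimates and the sample sizes are all hypothetical and chosen only for illustration. The code repeats the ‘study’ many times and measures how much the estimated average height varies from one study to the next.

```python
import random
from statistics import stdev


def spread_of_estimates(sample_size, n_repeats=1000, seed=3):
    """How much does the estimated average height vary between repeated studies?"""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        # Hypothetical population: mean 170 cm, standard deviation 10 cm.
        heights = [rng.gauss(170, 10) for _ in range(sample_size)]
        estimates.append(sum(heights) / sample_size)
    # Standard deviation of the estimates: larger means less dependable studies.
    return stdev(estimates)


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(f"Sample size {n}: estimates vary by about {spread_of_estimates(n):.2f} cm")
```

The spread shrinks as the sample grows, which is why a large, well-designed replication can deserve more confidence than a small original study.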

Activity 2: What not to do!

Allow about 10 minutes.

So far, you have considered good and bad writing practices. With these in mind, have a go at this ‘hack your way to scientific glory’ activity. First, choose a political party: Republican or Democrat (UK equivalents: Conservative or Labour). Then predict whether the party has a positive or negative impact on the economy. When you have done that, change aspects of the research (e.g. participant inclusion criteria and how you’re measuring your dependent variable) and see whether you can find a significant result (p < 0.05) in your predicted direction.

This is an example of ‘what not to do’ because when you first choose a political party and predict whether it will have a positive or negative impact on the economy, you are forming a hypothesis. But if you then play around with the data until you get the result that you wanted, and only stop when you do, then you are fixing the result.
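A little arithmetic shows why trying analysis after analysis is so effective at producing a ‘significant’ result. The sketch below assumes that there is no real effect and that each analysis is an independent test at the 0.05 threshold. That is a simplification, because analyses run on the same data are usually correlated, so treat the numbers as a rough guide. The function name is made up for this example.

```python
def chance_of_at_least_one_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent tests is significant by chance."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all n tests avoid one with probability (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 5, 14, 20):
        chance = chance_of_at_least_one_false_positive(n)
        print(f"{n} analyses: {chance:.0%} chance of at least one false positive")
```

Under these assumptions, a single analysis has a 5% chance of a false positive, but by 14 analyses the chance of at least one is already above 50%.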

The activity involves various questionable research practices, such as P-hacking, HARK-ing, and selective reporting. However, there is a way to do different analyses on the same data without any of these being a problem. If, instead of deciding on a hypothesis first and then confirming it, you were to conduct purely exploratory research (without a hypothesis), you could be transparent about all of the different ways you looked at the data and how the results differed when you tried different things. This could even lead to people conducting their own future studies to confirm your exploratory results!

When reading an academic paper, it’s important to read with a critical mindset and feel free to disagree with the methodological or analysis strategy, the interpretation of the results, or the conclusions drawn. Although we know that there are rare instances of outright fraud in science, we would expect that the researchers are truthfully describing what happened in the study, how it was conducted, and when and why decisions were made.

3.6 Generalisability

You have learned that replication studies vary on a spectrum from ‘direct’ to ‘conceptual’. However, most replication studies have some differences from the original study, even if these weren’t intentional. Consider one of the examples from before, where a researcher was replicating a paper from the 1990s. The materials they create will be different from the original materials, and if what they’re studying is context-dependent, a lot might have changed since then.

For example, a study on internet usage habits conducted in the 1990s would yield very different results if replicated today, due to the dramatic changes in technology and how people use the internet. Similarly, a study examining public attitudes toward mental health in the 1990s might produce different findings now because societal awareness and acceptance of mental health issues have evolved significantly over the past few decades.

For this reason, some consider that most replication studies are actually generalisability studies. Generalisability refers to whether a particular result generalises beyond the specific participants and conditions of the study to broader groups of samples, settings, methods, or measures. For example, if we’re interested in public attitudes to mental health, it wouldn’t make sense for us to only ask people aged 50-60, or only men, or only those living in cities. It’s possible that any of these characteristics could affect people’s opinions on mental health, meaning the results would be biased and not representative of the full population.

Without generalisability studies, we might not discover that the theoretical explanation for why a finding occurred is incorrect. There could even be a mistake in the design of the study that biased the results. For instance, imagine a biological study investigating the effects of a new drug using a specific strain of lab mice. If this particular strain has a unique genetic mutation that makes it respond differently to the drug compared to other strains, the study’s results might not generalise to other mice or to humans. This could lead to an incorrect conclusion about the drug’s overall effectiveness and safety.

Researchers wishing to be transparent when writing their papers should declare possible ‘Constraints on generality’ in the discussion section. This could take the form of a statement that identifies and justifies the target populations for the reported findings, and other considerations the authors think would be necessary for replicating their result. This could help other researchers to sample from the same populations when conducting a direct replication, or to test the boundaries of generalisability when conducting a conceptual replication.

3.7 Studying generalisability

So you think your research has the potential to do good in the world, but don’t know how widely it can be applied? There are lots of different ways to study generalisability:

Systematic reviews: these look at how an outcome varies in the published literature across samples, settings, measures and methods (meta-analyses do this statistically). This can be done without conducting any new studies. For example, UNICEF’s Evidence and Gap Map Research Briefs provide an overview of available evidence of the effectiveness of interventions to improve child well-being in low- and middle-income countries.
Comparative studies: comparing results from different populations using the same (adapted) materials can show where there may be similarities and differences. For example, Hofstede’s cultural dimensions theory identified and measured cultural differences across countries, particularly in the workplace context.
Big team science: when researchers from around the world conduct the same study and pool their results, they can look at various factors affecting the presence or size of the effect they’re interested in. For example, the first ManyGoats project is examining goat responses to different human attentional states, and will be testing a diverse range of goats in different living conditions.

Activity 3

Allow about 10 minutes.

Think about a study in your field that would or wouldn’t generalise. Consider why this might be the case.

Take some notes of your thoughts first, and see our discussion.

Why a study may or may not generalise

There are lots of reasons why a study may or may not generalise. Imagine a study evaluating a new therapy for depression in a university clinic with primarily urban-based participants.

While the therapy showed significant improvement in depressive symptoms over ten weeks among a diverse sample, including college students and middle-aged adults of various ethnicities recruited through local health centres and university channels, its applicability to other populations and settings may be limited. Factors such as regional differences in mental health resources, demographic diversity beyond the studied age groups, and recruitment biases could affect the therapy's effectiveness in rural or suburban areas and among older adults or adolescents.

3.8 Quiz

This self-test quiz tackles key ideas in replication and the principle of generalisability.

What is the difference between replicability and reproducibility? (Select one)

There isn’t a difference.
Replicability refers to getting the same results when conducting the same study again, while reproducibility refers to getting different results when conducting the same study again.
Replicability refers to getting the same results when running the same analyses on new data, while reproducibility refers to getting the same results when conducting the same analyses on the same data.
Replicability refers to getting different results when running the same analyses on new data, while reproducibility refers to getting the same results when conducting different analyses on the same data.

Answers:

There isn’t a difference. False
Replicability refers to getting the same results when conducting the same study again, while reproducibility refers to getting different results when conducting the same study again. False
Replicability refers to getting the same results when running the same analyses on new data, while reproducibility refers to getting the same results when conducting the same analyses on the same data. True
Replicability refers to getting different results when running the same analyses on new data, while reproducibility refers to getting the same results when conducting different analyses on the same data. False

Which of the following is an example of a conceptual replication? (Select one)

A researcher replicates a study from the 1990s using only the methods described in the original short paper.
A researcher designs a study to explore how mental concepts replicate themselves.
A researcher designs a new study testing the same hypothesis but in a different way because they believe the original study was not well-designed.
A researcher conducts a replication immediately after the original study, using all the same materials and the same participant pool.

Answers:

A researcher replicates a study from the 1990s using only the methods described in the original short paper. False
A researcher designs a study to explore how mental concepts replicate themselves. False
A researcher designs a new study testing the same hypothesis but in a different way because they believe the original study was not well-designed. True
A researcher conducts a replication immediately after the original study, using all the same materials and the same participant pool. False

Why might a replication study fail to replicate the original result? (Select one or more)

The original result was a false positive.
The replication study was a waste of time.
The differences between the two studies are responsible for the different results.
The replication result was a false negative.

Feedback: Three of these are possible causes of a failure to replicate. A replication study is not a waste of time: a failure to replicate can provide very useful information.

The original result was a false positive. True
The replication study was a waste of time. False
The differences between the two studies are responsible for the different results. True
The replication result was a false negative. True

Which of the following is the best way for researchers to study generalisability? (Select one)

By comparing results from different populations using the same materials.
By re-running the original analyses from a piece of research.
By using questionable research practices.
By conducting direct replications with the same materials and participant pool.

Answers:

By comparing results from different populations using the same materials. True
By re-running the original analyses from a piece of research. False
By using questionable research practices. False
By conducting direct replications with the same materials and participant pool. False

Which of the following explanations are most likely if the result of an original study which suggested an effect to be ‘true’, is not replicated in a replication study? (Select one or more)

The original researcher was fraudulent.
The original study cannot possibly have been a ‘true’ result, because it did not replicate.
The original effect might be ‘true’, but only under very narrow conditions.
The replication study might have had too small a sample size to find a result.

Feedback: The other two cannot be deduced from the information provided in the question.

The original researcher was fraudulent. False
The original study cannot possibly have been a ‘true’ result, because it did not replicate. False
The original effect might be ‘true’, but only under very narrow conditions. True
The replication study might have had too small a sample size to find a result. True

3.9 Summary

This week, you learned about an important aspect of integrity: replicability. Replicability relates to whether or not a study ‘replicates’, i.e. whether or not, when you repeat the study with new data, you get the same result. You learned some reasons why replicability may be low in many fields, and how differences between studies may sometimes contribute to this. You also learned about the importance of generalisability in research. Next week, you’ll learn techniques which can support both the integrity and the transparency of your research.