A friend posted a question to a group of research colleagues recently:

“Three weeks ago, I ran a 100-person, two-condition study on Mturk. Result: *t* = 2.95, *p* = .004. Today I ran another 100-person, two-condition study on Mturk, using the identical measure. No differences in what came before that measure. Result? *t* = 0.13, *p* = .89.”

The friend was exasperated and didn’t know what to do. What are the best practices for adjudicating conflicting study results like these? I wrote the friend a long response, but I realized that my advice might be of use to others too.

The group had several suggestions for courses of action. I list the options below and explain my preferred option.

1. **Drop the project.** This is an unsatisfactory choice because, as we will see below, the first two studies were likely underpowered, so we risk missing out on a true effect by abandoning the research question too soon (i.e., we risk a Type II error).
2. **Report the significant study and ignore the non-significant one.** Ok, no one actually recommended this choice. But I think this is what a mentor might have recommended back in the old days. We know now that file drawering the non-significant study substantially inflates the Type I error rate of the published literature, which would be dishonest and not cool.
3. **Look for a moderator.** Perhaps the first study was run on a Tuesday, and the effect only shows up on Tuesdays. Or perhaps, more interestingly, the first study had more women participants, and the effect is stronger for women. These post-hoc moderators *could* explain why the effect shows up in one study but not the other. However, there are an infinite number of potential moderators, and we have no way of knowing for sure which one is actually responsible. The most likely explanation is simple sampling error.
4. **Meta-analyze and use the meta-analytic confidence interval to test the significance of the effect.** This is not a terrible choice, and in the absence of more resources to conduct further research, it is probably a researcher’s best bet. But ultimately, without additional data, we can’t be very confident whether Study 1 was a false positive or Study 2 was a false negative.
5. **Use the meta-analytic effect size estimate to determine the needed sample size for a third study with 80% power.** This is my recommended best-practices option, for the reasons outlined in point 4. Note that this third study should not be viewed as a tiebreaker, but rather as a way to get a more precise estimate of the actual effect size in question.
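To see why option 1 is risky, a quick calculation with the *pwr* package (installed in Step 0 below) shows how little power each individual 50-per-cell study had. The *d* values here are illustrative benchmarks, not estimates from the data:

```
library(pwr)

# Power of a single two-sample study with 50 participants per cell
pwr.t.test(n = 50, d = 0.5, sig.level = .05, type = "two.sample")$power  # ~ .70
pwr.t.test(n = 50, d = 0.3, sig.level = .05, type = "two.sample")$power  # ~ .32
```

Even a medium effect (*d* = 0.5) would be missed about 30% of the time by a single study of this size, and a smallish effect would be missed most of the time.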

What follows is a step-by-step guide using the R statistics software package to conduct the meta-analysis and estimate the number of participants needed for Study 3.

**Step 0** – Install the *compute.es*, *metafor*, and *pwr* packages if you don’t have them already. This step only needs to be completed once per computer. You’ll need to remove the # first.

```
#install.packages("compute.es", repos='http://cran.us.r-project.org')
#install.packages("metafor", repos='http://cran.us.r-project.org')
#install.packages("pwr", repos='http://cran.us.r-project.org')
```

**Step 1** – Then load the packages:

```
library(compute.es)
library(metafor)
```

```
## Loading required package: Matrix
## Loading 'metafor' package (version 1.9-7). For an overview
## and introduction to the package please type: help(metafor).
```

`library(pwr)`

**Step 2** – Compute the effect sizes for your studies.

`study1<-tes(t=2.95,n.1=50,n.2=50)`

```
## Mean Differences ES:
##
## d [ 95 %CI] = 0.59 [ 0.18 , 1 ]
## var(d) = 0.04
## p-value(d) = 0
## U3(d) = 72.24 %
## CLES(d) = 66.17 %
## Cliff's Delta = 0.32
##
## g [ 95 %CI] = 0.59 [ 0.18 , 0.99 ]
## var(g) = 0.04
## p-value(g) = 0
## U3(g) = 72.09 %
## CLES(g) = 66.06 %
##
## Correlation ES:
##
## r [ 95 %CI] = 0.29 [ 0.09 , 0.46 ]
## var(r) = 0.01
## p-value(r) = 0
##
## z [ 95 %CI] = 0.29 [ 0.09 , 0.5 ]
## var(z) = 0.01
## p-value(z) = 0
##
## Odds Ratio ES:
##
## OR [ 95 %CI] = 2.92 [ 1.4 , 6.08 ]
## p-value(OR) = 0
##
## Log OR [ 95 %CI] = 1.07 [ 0.33 , 1.81 ]
## var(lOR) = 0.14
## p-value(Log OR) = 0
##
## Other:
##
## NNT = 4.98
## Total N = 100
```

`study2<-tes(t=0.13,n.1=50,n.2=50)`

```
## Mean Differences ES:
##
## d [ 95 %CI] = 0.03 [ -0.37 , 0.42 ]
## var(d) = 0.04
## p-value(d) = 0.9
## U3(d) = 51.04 %
## CLES(d) = 50.73 %
## Cliff's Delta = 0.01
##
## g [ 95 %CI] = 0.03 [ -0.37 , 0.42 ]
## var(g) = 0.04
## p-value(g) = 0.9
## U3(g) = 51.03 %
## CLES(g) = 50.73 %
##
## Correlation ES:
##
## r [ 95 %CI] = 0.01 [ -0.19 , 0.21 ]
## var(r) = 0.01
## p-value(r) = 0.9
##
## z [ 95 %CI] = 0.01 [ -0.19 , 0.21 ]
## var(z) = 0.01
## p-value(z) = 0.9
##
## Odds Ratio ES:
##
## OR [ 95 %CI] = 1.05 [ 0.51 , 2.15 ]
## p-value(OR) = 0.9
##
## Log OR [ 95 %CI] = 0.05 [ -0.67 , 0.77 ]
## var(lOR) = 0.13
## p-value(Log OR) = 0.9
##
## Other:
##
## NNT = 135.9
## Total N = 100
```

**Step 3** – Meta-analyze the studies (random effects meta-analysis), with effect sizes extracted from Step 2.

`rma(yi=c(study1$g,study2$g),vi=c(study1$var.g,study2$var.g))`

```
##
## Random-Effects Model (k = 2; tau^2 estimator: REML)
##
## tau^2 (estimated amount of total heterogeneity): 0.1168 (SE = 0.2217)
## tau (square root of estimated tau^2 value): 0.3418
## I^2 (total heterogeneity / total variability): 74.49%
## H^2 (total variability / sampling variability): 3.92
##
## Test for Heterogeneity:
## Q(df = 1) = 3.9200, p-val = 0.0477
##
## Model Results:
##
## estimate se zval pval ci.lb ci.ub
## 0.3100 0.2800 1.1071 0.2682 -0.2388 0.8588
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Step 4** – Look at the estimate from the random effects meta-analysis. In this case it is 0.31 (in standardized units), with a 95% CI of [-0.24, 0.86]. There is significant heterogeneity (*Q* = 3.92, *p* = .048), but who cares? In this case, it just means that the two estimates are pretty far apart.
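As a sanity check on where the 0.31 comes from: the meta-analytic estimate is an inverse-variance weighted average of the two *g*’s, and because the two studies have equal sampling variances here (the weights stay equal even after adding the tau² component), it reduces to a simple mean:

```
g  <- c(0.59, 0.03)   # Hedges' g from Studies 1 and 2 (Step 2)
vi <- c(0.04, 0.04)   # their sampling variances
w  <- 1 / vi          # inverse-variance weights
sum(w * g) / sum(w)   # 0.31, matching the rma() estimate
```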

**Step 5** – Run a post-hoc power analysis to see what the combined power of the first two studies was. The *n* is per cell, so pooling the two studies gives *n* = 100 per cell. The *d* is the estimate from the meta-analysis.

`pwr.t.test(n=100,d=.31,sig.level=.05,power=NULL,type="two.sample",alternative="two.sided")`

```
##
## Two-sample t test power calculation
##
## n = 100
## d = 0.31
## sig.level = 0.05
## power = 0.587637
## alternative = two.sided
##
## NOTE: n is number in *each* group
```

**Step 6** – The post-hoc power is .59 based on a true effect size of 0.31. This means that given a true effect size of 0.31, a study with 100 participants per cell would produce a statistically significant result 59% of the time. Now we’ll run an a priori power analysis to see how many participants a researcher needs to get 80% power based on *d* = 0.31.

`pwr.t.test(n=NULL,d=.31,sig.level=.05,power=.80,type="two.sample",alternative="two.sided")`

```
##
## Two-sample t test power calculation
##
## n = 164.3137
## d = 0.31
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
```

**Conclusion:** The test says my friend needs 165 participants per group to get 80% power for *d* = 0.31. Of course, if researchers want to be more efficient, they could also try out sequential analysis.
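For intuition about where that number comes from, the standard normal-approximation formula for a two-sample test, *n* per group ≈ 2(*z*₁₋α/2 + *z*₁₋β)² / *d*², lands within about a participant of the *pwr* answer (the *t* test needs slightly more):

```
# Normal approximation to the required per-group n for d = 0.31
2 * (qnorm(.975) + qnorm(.80))^2 / 0.31^2  # ~ 163.4
```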

I hope this guide is useful for researchers looking for a practical “what to do” guide in situations involving conflicting study results. I’m also interested in feedback – what would you do in a similar situation? Drop me a line on Twitter (@katiecorker), or leave a comment here.

Great post! I really like this approach when the two studies use the same manipulations and measures. However, when they are different, the two effect sizes may be (probably are) heterogenous and the meta-analytic estimate may not be useful in planning a replication. In that case, I would choose to do a large replication of the *significant* study, planning for a small effect size — say, d = .2, which requires about 400 participants per group — if the resources were available.
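(That ballpark is easy to check with the same *pwr* call from Step 6, swapping in *d* = .2:)

```
library(pwr)

# Per-group n for 80% power to detect d = .2
pwr.t.test(d = .2, sig.level = .05, power = .80,
           type = "two.sample", alternative = "two.sided")
# n ~ 393.4, i.e., roughly 400 per group
```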

Great point Erika – I agree!

Examine the many ways in which your p-value may be spurious.

That’s a very interesting post. As I said on Twitter, I think the problem of (nigh) infinite moderators applies most to social and personality psychology experiments. It’s not that other fields don’t have the problem, but they usually involve more controlled situations, even in other subfields of psychology. That makes it much harder to argue that a failed replication is due to some unknown factor. Sure, there are a lot of reasons even the most basic psychophysics experiment can produce bad-quality data, but that’s not the same as a moderator.

I had a thought last night that summarises my whole criticism of the social priming literature I discussed on my blog: the more unknown moderators you think there are, the smaller the effect size you should expect. Take elderly priming, for example: the stimulus material is claimed to make a big difference, the effect supposedly works differently in different countries/languages or even in different US states, and it should also depend strongly on variability between participants. Given all that, it just doesn’t seem credible to me that the effect size could be d = 0.8 to 1 or so.

The advice to collect a third study also isn’t applicable in many fields. In neuroimaging we can often be glad if there is the time, funding, and resources to replicate even once. You will also inevitably run into contaminated data as the subject pool is exhausted and you start scanning subjects who produce less-than-optimal data, which leads to a lower effect size estimate than the true population effect. I often wonder to what extent this problem also pertains to social psychology research. It is perhaps smaller there, especially on Mturk (although that probably introduces other sources of noise), but it doesn’t seem to be completely controlled either.

Hi, this is a very helpful post.

I have a technical question: which function should I choose to compute the ES if I used a within-subject design? For instance, I used a 3 (A1, A2, A3) × 2 (B1, B2) within-subject design and wish to meta-analyze two conditions, say, B1A1 vs. B1A2. Could I still use the `tes` function?

Hi there! The `tes()` function is for converting independent *t*’s to effect sizes. My guess is that you’d want to check out the *MBESS* package for help computing an effect size for a more complex design. In general, when you want to analyze a paired *t*, it helps to have the correlation between the variables.

Thanks a lot, Katie
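To make that last point concrete, here is a sketch of one common conversion for a paired *t*. The numbers are hypothetical, and the rescaling uses the standard formula relating the difference-score effect size to an original-units *d*, which requires the correlation *r* between the two conditions:

```
# Hypothetical paired-t example: t = 2.5 over n = 30 pairs,
# with r = .60 between the two conditions
t_val <- 2.5
n     <- 30
r     <- .60

d_z <- t_val / sqrt(n)           # d in difference-score units, ~ 0.46
d   <- d_z * sqrt(2 * (1 - r))   # rescaled to original-units d, ~ 0.41
```

Notice that without *r* you can only get as far as `d_z`, which is why the correlation matters for meta-analyzing within-subject effects alongside between-subject ones.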