When Study 1 & Study 2 Disagree: Practical Recommendations for Researchers

A friend posted a question to a group of research colleagues recently:

“Three weeks ago, I ran a 100-person, two-condition study on Mturk. Result: t = 2.95, p = .004. Today I ran another 100-person, two-condition study on Mturk, using the identical measure. No differences in what came before that measure. Result? t = 0.13, p = .89.”

The friend was exasperated and didn’t know what to do. What are the best practices for adjudicating conflicting study results like these? I wrote the friend a long response, but I realized that my advice might be of use to others too.

The group had several suggestions for courses of action. I list the options below and explain my preferred option.

  1. Drop the project. This is an unsatisfactory choice because, as we will see below, the first two studies were likely underpowered, so we risk missing out on a true effect by abandoning the research question too soon (i.e., we risk a Type II error).
  2. Report the significant study and ignore the non-significant one. OK, no one actually recommended this choice. But I think this is what a mentor might have recommended back in the old days. We know now that file-drawering the non-significant study substantially inflates the Type I error rate of the published literature, which would be dishonest and not cool.
  3. Look for a moderator. Perhaps the first study was run on a Tuesday, and the effect only shows up on Tuesdays. Or perhaps, more interestingly, the first study had more women participants, and the effect is stronger for women. Post-hoc moderators like these could explain why the effect shows up in one study but not the other. However, there are infinitely many potential moderators, and we have no way of knowing for sure which one is actually responsible. The most likely explanation is simple sampling error.
  4. Meta-analyze and use the meta-analytic confidence interval to test significance of the effect. This is not a terrible choice, and in the absence of more resources to conduct further research, this is probably a researcher’s best bet. But ultimately, without additional data, we can’t be very confident whether Study 1 was a false positive or Study 2 was a false negative.
  5. Use the meta-analytic effect size estimate to determine the sample size needed for a third study with 80% power. This is my recommended best-practice option, for the reasons outlined in point 4. Note that this third study should not be viewed as a tiebreaker, but rather as a way to get a more precise estimate of the actual effect size in question.

What follows is a step-by-step guide using the R statistics software package to conduct the meta-analysis and estimate the number of participants needed for Study 3.

Step 0 – Install the compute.es, metafor, and pwr packages if you don’t already have them. This step only needs to be completed once per computer. You’ll need to remove the # first.

#install.packages("compute.es", repos='http://cran.us.r-project.org')
#install.packages("metafor", repos='http://cran.us.r-project.org')
#install.packages("pwr", repos='http://cran.us.r-project.org')

Step 1 – Then load the packages:

library(compute.es)
library(metafor)
## Loading required package: Matrix
## Loading 'metafor' package (version 1.9-7). For an overview 
## and introduction to the package please type: help(metafor).
library(pwr)

Step 2 – Compute the effect sizes for your studies.

study1<-tes(t=2.95,n.1=50,n.2=50)
## Mean Differences ES: 
##  
##  d [ 95 %CI] = 0.59 [ 0.18 , 1 ] 
##   var(d) = 0.04 
##   p-value(d) = 0 
##   U3(d) = 72.24 % 
##   CLES(d) = 66.17 % 
##   Cliff's Delta = 0.32 
##  
##  g [ 95 %CI] = 0.59 [ 0.18 , 0.99 ] 
##   var(g) = 0.04 
##   p-value(g) = 0 
##   U3(g) = 72.09 % 
##   CLES(g) = 66.06 % 
##  
##  Correlation ES: 
##  
##  r [ 95 %CI] = 0.29 [ 0.09 , 0.46 ] 
##   var(r) = 0.01 
##   p-value(r) = 0 
##  
##  z [ 95 %CI] = 0.29 [ 0.09 , 0.5 ] 
##   var(z) = 0.01 
##   p-value(z) = 0 
##  
##  Odds Ratio ES: 
##  
##  OR [ 95 %CI] = 2.92 [ 1.4 , 6.08 ] 
##   p-value(OR) = 0 
##  
##  Log OR [ 95 %CI] = 1.07 [ 0.33 , 1.81 ] 
##   var(lOR) = 0.14 
##   p-value(Log OR) = 0 
##  
##  Other: 
##  
##  NNT = 4.98 
##  Total N = 100
study2<-tes(t=0.13,n.1=50,n.2=50)
## Mean Differences ES: 
##  
##  d [ 95 %CI] = 0.03 [ -0.37 , 0.42 ] 
##   var(d) = 0.04 
##   p-value(d) = 0.9 
##   U3(d) = 51.04 % 
##   CLES(d) = 50.73 % 
##   Cliff's Delta = 0.01 
##  
##  g [ 95 %CI] = 0.03 [ -0.37 , 0.42 ] 
##   var(g) = 0.04 
##   p-value(g) = 0.9 
##   U3(g) = 51.03 % 
##   CLES(g) = 50.73 % 
##  
##  Correlation ES: 
##  
##  r [ 95 %CI] = 0.01 [ -0.19 , 0.21 ] 
##   var(r) = 0.01 
##   p-value(r) = 0.9 
##  
##  z [ 95 %CI] = 0.01 [ -0.19 , 0.21 ] 
##   var(z) = 0.01 
##   p-value(z) = 0.9 
##  
##  Odds Ratio ES: 
##  
##  OR [ 95 %CI] = 1.05 [ 0.51 , 2.15 ] 
##   p-value(OR) = 0.9 
##  
##  Log OR [ 95 %CI] = 0.05 [ -0.67 , 0.77 ] 
##   var(lOR) = 0.13 
##   p-value(Log OR) = 0.9 
##  
##  Other: 
##  
##  NNT = 135.9 
##  Total N = 100
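
One aside before the next step (my addition, not part of the original walkthrough): tes() stores its results in an object whose pieces can be pulled out by name, which is what makes Step 3 possible. For example:

study1$g       # Hedges' g for Study 1 (0.59, per the output above)
study1$var.g   # its variance (0.04)
study2$g       # Hedges' g for Study 2 (0.03)
study2$var.g   # its variance (0.04)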

Step 3 – Meta-analyze the studies (random-effects meta-analysis), using the effect sizes computed in Step 2.

rma(yi=c(study1$g,study2$g),vi=c(study1$var.g,study2$var.g))
## 
## Random-Effects Model (k = 2; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of total heterogeneity): 0.1168 (SE = 0.2217)
## tau (square root of estimated tau^2 value):      0.3418
## I^2 (total heterogeneity / total variability):   74.49%
## H^2 (total variability / sampling variability):  3.92
## 
## Test for Heterogeneity: 
## Q(df = 1) = 3.9200, p-val = 0.0477
## 
## Model Results:
## 
## estimate       se     zval     pval    ci.lb    ci.ub          
##   0.3100   0.2800   1.1071   0.2682  -0.2388   0.8588          
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4 – Look at the estimate from the random-effects meta-analysis. In this case it is 0.31 (in standardized units, i.e., Hedges’ g). Its 95% CI is [-0.24, 0.86]. There is significant heterogeneity (Q = 3.92, p = .048), but who cares? In this case, it just means that the two estimates are pretty far apart.
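
If you would rather pull these numbers out of the model object than read them off the printed output, something like the following works (this is my addition; the name meta is just for illustration):

meta<-rma(yi=c(study1$g,study2$g),vi=c(study1$var.g,study2$var.g))
predict(meta)         # pooled estimate, its standard error, and 95% CI
round(meta$ci.lb, 2)  # lower bound of the 95% CI: -0.24
round(meta$ci.ub, 2)  # upper bound of the 95% CI: 0.86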

Step 5 – Run a post-hoc power analysis to see what the combined power of the first two studies was. The n in pwr.t.test is per cell, so combining the two studies gives n = 100 per cell. The d is the estimate from the meta-analysis.

pwr.t.test(n=100,d=.31,sig.level=.05,power=NULL,type="two.sample",alternative="two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.31
##       sig.level = 0.05
##           power = 0.587637
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
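
As an aside (my addition, not part of the original steps), you can see what that power value means by simulation: repeatedly draw two groups of 100 from populations that truly differ by d = 0.31 and count how often the t-test comes out significant.

set.seed(2015)                          # arbitrary seed, for reproducibility
sims <- replicate(10000, {
  g1 <- rnorm(100, mean = 0, sd = 1)    # "control" group, n = 100
  g2 <- rnorm(100, mean = 0.31, sd = 1) # "treatment" group shifted by d = 0.31
  t.test(g1, g2)$p.value < .05          # was this simulated study significant?
})
mean(sims)                              # proportion significant; lands near .59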

Step 6 – The post-hoc power is .59 based on a true ES of 0.31. This means that, if the true effect size is 0.31, we’d expect the combined estimate from the two studies to be statistically significant only 59% of the time. Now we’ll run an a priori power analysis to see how many participants a researcher needs per group to get 80% power based on d = 0.31.

pwr.t.test(n=NULL,d=.31,sig.level=.05,power=.80,type="two.sample",alternative="two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 164.3137
##               d = 0.31
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Conclusion: The power analysis says my friend needs 165 participants per group to get 80% power for d = 0.31. Of course, if researchers want to be more efficient, they could also try out sequential analysis (a rough sketch follows below).
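
For anyone curious about that last suggestion, here is a rough sketch of a two-look group sequential design. This is my addition, not part of the original workflow; I’m assuming the gsDesign package (other tools exist), and the particular choices (an O'Brien-Fleming-style boundary with one interim look at half the sample) are purely illustrative.

#install.packages("gsDesign", repos='http://cran.us.r-project.org')
library(gsDesign)

n_fixed<-2*164.3137               # total N required by the fixed design above
seq_design<-gsDesign(k=2,         # two analyses: one interim, one final
                     test.type=2, # two-sided symmetric boundaries
                     alpha=0.025, # one-sided alpha (= .05 two-sided)
                     beta=0.20,   # 80% power
                     sfu="OF",    # O'Brien-Fleming-type spending function
                     n.fix=n_fixed) # sample size the fixed design would need
ceiling(seq_design$n.I)           # total N at the interim and final looks

If the interim look crosses the boundary, data collection can stop there; otherwise the study continues to the final look, which requires only slightly more participants in total than the fixed design.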

I hope this guide is useful for researchers looking for a practical “what to do” guide in situations involving conflicting study results. I’m also interested in feedback – what would you do in a similar situation? Drop me a line on Twitter (@katiecorker), or leave a comment here.