When Study 1 & Study 2 Disagree: Practical Recommendations for Researchers

A friend posted a question to a group of research colleagues recently:

“Three weeks ago, I ran a 100 person two-condition study on Mturk. Result: t = 2.95, p = .004. Today I ran another 100 person two condition study on Mturk, using the identical measure. No differences in what came before that measure. Result? t = 0.13, p = .89.”

The friend was exasperated and didn’t know what to do – What are the best practices for how researchers should adjudicate conflicting study results like these? I wrote the friend a long response, but I realized that my advice might be of use to others too.

The group had several suggestions for courses of action. I list the options below and explain my preferred option.

1. Drop the project. This is an unsatisfactory choice because, as we will see below, the first two studies were likely underpowered, so we risk missing a true effect by abandoning the research question too soon (i.e., we risk a Type II error).
2. Report the significant study and ignore the non-significant one. OK, no one actually recommended this choice. But I think this is what a mentor might have recommended back in the old days. We now know that file-drawering the non-significant study substantially inflates the Type I error rate of the published literature; doing so would be dishonest and not cool.
3. Look for a moderator. Perhaps the first study was run on a Tuesday, and the effect only shows up on Tuesdays. Or perhaps, more interestingly, the first study had more women participants, and the effect is stronger for women. These post-hoc moderators could explain why the effect shows up in one study but not the other. However, there are an infinite number of potential moderators, and we have no way of knowing for sure which one is actually responsible. The most likely explanation is simple sampling error.
4. Meta-analyze and use the meta-analytic confidence interval to test the significance of the effect. This is not a terrible choice, and in the absence of resources to conduct further research, it is probably a researcher’s best bet. But ultimately, without additional data, we can’t be very confident whether Study 1 was a false positive or Study 2 was a false negative.
5. Use the meta-analytic effect size estimate to determine the sample size needed for a third study with 80% power. This is my recommended best-practices option, for the reasons outlined in point 4. Note that this third study should not be viewed as a tiebreaker, but rather as a way to get a more precise estimate of the actual effect size in question.

What follows is a step-by-step guide using the R statistics software package to conduct the meta-analysis and estimate the number of participants needed for Study 3.

Step 0 – Install the compute.es, metafor, and pwr packages if you don’t have them already. This step only needs to be completed once per computer. To run the lines below, remove the leading # first.

#install.packages("compute.es", repos='http://cran.us.r-project.org')
#install.packages("metafor", repos='http://cran.us.r-project.org')
#install.packages("pwr", repos='http://cran.us.r-project.org')

Step 1 – Then load the packages:

library(compute.es)
library(metafor)

## Loading required package: Matrix
## Loading 'metafor' package (version 1.9-7). For an overview 
## and introduction to the package please type: help(metafor).

library(pwr)

Step 2 – Compute the effect sizes for your studies.

study1<-tes(t=2.95,n.1=50,n.2=50)

## Mean Differences ES: 
##  
##  d [ 95 %CI] = 0.59 [ 0.18 , 1 ] 
##   var(d) = 0.04 
##   p-value(d) = 0 
##   U3(d) = 72.24 % 
##   CLES(d) = 66.17 % 
##   Cliff's Delta = 0.32 
##  
##  g [ 95 %CI] = 0.59 [ 0.18 , 0.99 ] 
##   var(g) = 0.04 
##   p-value(g) = 0 
##   U3(g) = 72.09 % 
##   CLES(g) = 66.06 % 
##  
##  Correlation ES: 
##  
##  r [ 95 %CI] = 0.29 [ 0.09 , 0.46 ] 
##   var(r) = 0.01 
##   p-value(r) = 0 
##  
##  z [ 95 %CI] = 0.29 [ 0.09 , 0.5 ] 
##   var(z) = 0.01 
##   p-value(z) = 0 
##  
##  Odds Ratio ES: 
##  
##  OR [ 95 %CI] = 2.92 [ 1.4 , 6.08 ] 
##   p-value(OR) = 0 
##  
##  Log OR [ 95 %CI] = 1.07 [ 0.33 , 1.81 ] 
##   var(lOR) = 0.14 
##   p-value(Log OR) = 0 
##  
##  Other: 
##  
##  NNT = 4.98 
##  Total N = 100

study2<-tes(t=0.13,n.1=50,n.2=50)

## Mean Differences ES: 
##  
##  d [ 95 %CI] = 0.03 [ -0.37 , 0.42 ] 
##   var(d) = 0.04 
##   p-value(d) = 0.9 
##   U3(d) = 51.04 % 
##   CLES(d) = 50.73 % 
##   Cliff's Delta = 0.01 
##  
##  g [ 95 %CI] = 0.03 [ -0.37 , 0.42 ] 
##   var(g) = 0.04 
##   p-value(g) = 0.9 
##   U3(g) = 51.03 % 
##   CLES(g) = 50.73 % 
##  
##  Correlation ES: 
##  
##  r [ 95 %CI] = 0.01 [ -0.19 , 0.21 ] 
##   var(r) = 0.01 
##   p-value(r) = 0.9 
##  
##  z [ 95 %CI] = 0.01 [ -0.19 , 0.21 ] 
##   var(z) = 0.01 
##   p-value(z) = 0.9 
##  
##  Odds Ratio ES: 
##  
##  OR [ 95 %CI] = 1.05 [ 0.51 , 2.15 ] 
##   p-value(OR) = 0.9 
##  
##  Log OR [ 95 %CI] = 0.05 [ -0.67 , 0.77 ] 
##   var(lOR) = 0.13 
##   p-value(Log OR) = 0.9 
##  
##  Other: 
##  
##  NNT = 135.9 
##  Total N = 100
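
To see where these numbers come from, here is a minimal base-R sketch that converts the two t statistics to d and g by hand. It assumes the standard textbook conversions for an independent-groups t test (d = t * sqrt(1/n1 + 1/n2), with Hedges' small-sample correction J = 1 - 3/(4*df - 1)). The helper name t_to_es is made up for illustration, and I am assuming, not asserting, that tes() uses the same formulas up to rounding.

```r
# Hypothetical helper (not part of compute.es): convert an independent-groups
# t statistic to Cohen's d and Hedges' g using standard textbook formulas.
t_to_es <- function(t, n1, n2) {
  df <- n1 + n2 - 2
  d <- t * sqrt(1 / n1 + 1 / n2)                          # Cohen's d
  var_d <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))  # approx. sampling variance of d
  J <- 1 - 3 / (4 * df - 1)                               # small-sample correction factor
  list(d = d, var_d = var_d, g = J * d, var_g = J^2 * var_d)
}

es1 <- t_to_es(t = 2.95, n1 = 50, n2 = 50)
es2 <- t_to_es(t = 0.13, n1 = 50, n2 = 50)

# Rounded to two decimals for comparison with the tes() output
round(unlist(es1), 2)
round(unlist(es2), 2)
```

With equal group sizes of 50, sqrt(1/50 + 1/50) is 0.2, so d is simply one fifth of t: 0.59 for Study 1 and about 0.03 for Study 2.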

Step 3 – Meta-analyze the studies (random effects meta-analysis), with effect sizes extracted from Step 2.

rma(yi=c(study1$g,study2$g),vi=c(study1$var.g,study2$var.g))

## 
## Random-Effects Model (k = 2; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of total heterogeneity): 0.1168 (SE = 0.2217)
## tau (square root of estimated tau^2 value):      0.3418
## I^2 (total heterogeneity / total variability):   74.49%
## H^2 (total variability / sampling variability):  3.92
## 
## Test for Heterogeneity: 
## Q(df = 1) = 3.9200, p-val = 0.0477
## 
## Model Results:
## 
## estimate       se     zval     pval    ci.lb    ci.ub          
##   0.3100   0.2800   1.1071   0.2682  -0.2388   0.8588          
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
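
For readers who want to see what the random-effects model is doing, here is a rough base-R sketch of the pooling done by hand. It plugs in the rounded g and var(g) values printed in Step 2 and uses the DerSimonian-Laird estimator for tau^2. That estimator is my substitution for illustration: rma() defaults to REML, as its output header shows, and the two estimators can disagree in general.

```r
# Two-study random-effects pooling by hand, using the rounded g and var(g)
# values printed by tes() in Step 2.
yi <- c(0.59, 0.03)   # Hedges' g for Study 1 and Study 2
vi <- c(0.04, 0.04)   # sampling variances of g

# Fixed-effect (inverse-variance) weights and the heterogeneity statistic Q
w  <- 1 / vi
fe <- sum(w * yi) / sum(w)
Q  <- sum(w * (yi - fe)^2)
df <- length(yi) - 1

# DerSimonian-Laird estimate of tau^2 (assumption: swapped in for rma()'s REML)
C    <- sum(w) - sum(w^2) / sum(w)
tau2 <- max(0, (Q - df) / C)

# Random-effects weights, pooled estimate, standard error, and 95% CI
w_re <- 1 / (vi + tau2)
est  <- sum(w_re * yi) / sum(w_re)
se   <- sqrt(1 / sum(w_re))
ci   <- est + c(-1, 1) * qnorm(0.975) * se

round(c(Q = Q, tau2 = tau2, estimate = est, se = se,
        ci.lb = ci[1], ci.ub = ci[2]), 4)
```

For these two equally sized studies the hand calculation lands on the same Q, tau^2, estimate, and standard error as the rma() output. Because both studies get equal weight, the pooled estimate is just the midpoint of 0.59 and 0.03.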

Step 4 – Look at the estimate from the random effects meta-analysis. In this case it is 0.31 (this is in standardized units). Its 95% CI is [-0.24, 0.86]. There is significant heterogeneity (Q = 3.92, p = .048), but who cares? In this case, it just means that the two estimates are pretty far apart.

Step 5 – Run a post-hoc power analysis to see what the combined power of the first two studies was. The n is per cell: each study had 50 participants per cell, so across the two studies we have n = 100 per cell. The d is the estimate from the meta-analysis.

pwr.t.test(n=100,d=.31,sig.level=.05,power=NULL,type="two.sample",alternative="two.sided")

## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.31
##       sig.level = 0.05
##           power = 0.587637
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
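
As a sanity check on that number, the same power can be computed directly from the noncentral t distribution in base R. This is a minimal sketch of the usual calculation for a two-sided, two-sample t test with equal group sizes; the variable names are mine.

```r
# Power of a two-sided, two-sample t test via the noncentral t distribution.
n     <- 100    # participants per group (Studies 1 and 2 combined)
d     <- 0.31   # meta-analytic effect size estimate
alpha <- 0.05

df     <- 2 * (n - 1)             # degrees of freedom
ncp    <- d * sqrt(n / 2)         # noncentrality parameter
t_crit <- qt(1 - alpha / 2, df)   # two-sided critical value

# Probability of landing in either rejection region when the true effect is d
power <- pt(t_crit, df, ncp = ncp, lower.tail = FALSE) +
         pt(-t_crit, df, ncp = ncp)
power
```

Nearly all of the power comes from the upper tail; the lower-tail term is the tiny chance of a significant result in the wrong direction.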

Step 6 – The post-hoc power is .59 based on a true ES of 0.31. This means that given a true ES of 0.31, 59% of the time, we’d expect the combined estimate from the two studies to be statistically significant. Now we’ll run an a priori power analysis to see how many participants a researcher needs to get 80% power based on d = 0.31.

pwr.t.test(n=NULL,d=.31,sig.level=.05,power=.80,type="two.sample",alternative="two.sided")

## 
##      Two-sample t test power calculation 
## 
##               n = 164.3137
##               d = 0.31
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Conclusion: The test says my friend needs 165 participants per group to get 80% power for d = 0.31. Of course, if researchers want to be more efficient, they could also try out sequential analysis.
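
Since you can't recruit a fraction of a participant, the n from the power analysis gets rounded up. Here is a small base-R sketch that cross-checks the calculation with power.t.test() from the built-in stats package and does the rounding. The wrapper name n_for_power is hypothetical, and the extra effect sizes of 0.20 and 0.50 are arbitrary values I picked to illustrate how sensitive the required n is to the assumed d. They are not estimates from these data.

```r
# Base-R cross-check of the a priori power analysis, rounding n up to whole people.
# n_for_power is a hypothetical wrapper around stats::power.t.test.
n_for_power <- function(d, power = 0.80, alpha = 0.05) {
  res <- power.t.test(delta = d, sd = 1, sig.level = alpha, power = power,
                      type = "two.sample", alternative = "two.sided")
  ceiling(res$n)   # participants needed per group
}

n_for_power(0.31)   # the meta-analytic estimate

# Illustrative sensitivity check; 0.20 and 0.50 are arbitrary example values
sapply(c(small = 0.20, estimate = 0.31, larger = 0.50), n_for_power)
```

Given how wide the meta-analytic confidence interval is, it is worth looking at the required n under a smaller plausible effect before settling on a sample size.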

I hope this guide is useful for researchers looking for a practical “what to do” guide in situations involving conflicting study results. I’m also interested in feedback – what would you do in a similar situation? Drop me a line on Twitter (@katiecorker), or leave a comment here.

7 thoughts on “When Study 1 & Study 2 Disagree: Practical Recommendations for Researchers”

Erika Salomon

August 7, 2015 at 3:51 pm

Great post! I really like this approach when the two studies use the same manipulations and measures. However, when they are different, the two effect sizes may be (probably are) heterogeneous and the meta-analytic estimate may not be useful in planning a replication. In that case, I would choose to do a large replication of the *significant* study, planning for a small effect size (say, d = .2, which requires about 400 participants per group) if the resources were available.

- Katie Corker
  
  August 7, 2015 at 3:56 pm
  
  Great point Erika – I agree!
  
Mayo

August 7, 2015 at 8:29 pm

Examine the many ways in which your p-value may be spurious.

Sam Schwarzkopf

August 8, 2015 at 6:35 am

That’s a very interesting post. As I said on Twitter I think the problem of (nigh) infinite moderators applies most to social and personality psychology experiments. Not that we don’t have the problem in other fields but there we usually have more controlled situations, even other subfields of psychology. This makes it much harder to argue that a failed replication is due to some unknown factor. Sure, there are a lot of reasons even the most basic psychophysics experiment can produce bad quality data but that’s not the same as a moderator.

I had a thought last night that summarises my whole criticisms of the social priming literature I discussed on my blog: The more unknown moderators you think there are, the smaller the effect size you can expect should be. For example elderly priming, where the stimulus material is claimed to make a big difference, an effect that supposedly works differently in different countries/languages or even in different states of the US, and that should also strongly depend on the variability between the participants, it just doesn’t seem credible to me that the effect size could be d=0.8-1ish.

The advice to collect a third additional study is also something that isn’t applicable to many fields. In neuroimaging we can often be glad if there was the time, funding, and the resources to even replicate it once. You will also inevitably run into contaminated data because the subject pool is exhausted and you start scanning subjects that produce less-than-optimal data. That leads to a lower effect size estimate than the true population effect. I often wonder in how far this problem also pertains to social psychology research. The problem is perhaps smaller there, especially on Mturk (although that one probably introduces other sources of noise) but it doesn’t seem to be completely controlled either.

hcp4175

August 13, 2015 at 9:24 am

Hi, this is a very helpful post.
I have a technical question: which function should I choose to compute the ES if I used a within-subject design? For instance, I used a 3 (A1, A2, A3) * 2 (B1, B2) within-subject design and wish to meta-analyze two conditions, say, B1A1 vs. B1A2. Could I still use the “tes” function?

- Katie Corker
  
  August 13, 2015 at 10:01 am
  
  Hi there! The tes() function is for converting independent t’s to effect sizes. My guess is that you’d want to check out the MBESS package for help computing an effect size for a more complex design. In general, when you want to analyze a paired t it helps to have the correlation between the variables.
  
  - hcp4175
    
    August 13, 2015 at 10:19 am
    
    Thanks a lot, Katie

science of psych

psych research methods, stats, pedagogy, and more
