When Study 1 & Study 2 Disagree: Practical Recommendations for Researchers

A friend posted a question to a group of research colleagues recently:

“Three weeks ago, I ran a 100-person, two-condition study on MTurk. Result: t = 2.95, p = .004. Today I ran another 100-person, two-condition study on MTurk, using the identical measure. No differences in what came before that measure. Result? t = 0.13, p = .89.”

The friend was exasperated and didn’t know what to do. What are the best practices for adjudicating conflicting study results like these? I wrote the friend a long response, but I realized that my advice might be useful to others too.

The group had several suggestions for courses of action. I list them below and explain which one I prefer.

  1. Drop the project. This is an unsatisfactory choice because, as we will see below, the first two studies were likely underpowered; abandoning the research question too soon risks missing a true effect (i.e., a Type II error).
  2. Report the significant study and ignore the non-significant one. Ok, no one actually recommended this choice. But I think this is what a mentor might have recommended back in the old days. We know now that file drawering the non-significant study substantially inflates the Type I error rate of the published literature, which would be dishonest and not cool.
  3. Look for a moderator. Perhaps the first study was run on a Tuesday, and the effect only shows up on Tuesday. Or perhaps, more interestingly, the first study had more women participants, and the effect is stronger for women participants. These post-hoc moderators could explain why the effect shows up in one study but not the other. However, there are an infinite number of these potential moderators, and we have no way of knowing for sure which one is actually responsible. The most likely explanation is simple sampling error.
  4. Meta-analyze and use the meta-analytic confidence interval to test significance of the effect. This is not a terrible choice, and in the absence of more resources to conduct further research, this is probably a researcher’s best bet. But ultimately, without additional data, we can’t be very confident whether Study 1 was a false positive or Study 2 was a false negative.
  5. Use the meta-analytic effect size estimate to determine the sample size needed for a third study with 80% power. This is my recommended best-practices option, for the reasons outlined in point 4. Note that this third study should not be viewed as a tiebreaker, but rather as a way to get a more precise estimate of the actual effect size in question.

What follows is a step-by-step guide, using the R statistical software, to conduct the meta-analysis and estimate the number of participants needed for Study 3.

Step 0 – Install the compute.es, metafor, and pwr packages if you don’t have them already. This step only needs to be completed once per computer. You’ll need to remove the # first.

#install.packages("compute.es", repos='http://cran.us.r-project.org')
#install.packages("metafor", repos='http://cran.us.r-project.org')
#install.packages("pwr", repos='http://cran.us.r-project.org')

Step 1 – Load the packages:

library(compute.es)
library(metafor)
## Loading required package: Matrix
## Loading 'metafor' package (version 1.9-7). For an overview 
## and introduction to the package please type: help(metafor).
library(pwr)

Step 2 – Compute the effect sizes for your studies.

study1<-tes(t=2.95,n.1=50,n.2=50)
## Mean Differences ES: 
##  
##  d [ 95 %CI] = 0.59 [ 0.18 , 1 ] 
##   var(d) = 0.04 
##   p-value(d) = 0 
##   U3(d) = 72.24 % 
##   CLES(d) = 66.17 % 
##   Cliff's Delta = 0.32 
##  
##  g [ 95 %CI] = 0.59 [ 0.18 , 0.99 ] 
##   var(g) = 0.04 
##   p-value(g) = 0 
##   U3(g) = 72.09 % 
##   CLES(g) = 66.06 % 
##  
##  Correlation ES: 
##  
##  r [ 95 %CI] = 0.29 [ 0.09 , 0.46 ] 
##   var(r) = 0.01 
##   p-value(r) = 0 
##  
##  z [ 95 %CI] = 0.29 [ 0.09 , 0.5 ] 
##   var(z) = 0.01 
##   p-value(z) = 0 
##  
##  Odds Ratio ES: 
##  
##  OR [ 95 %CI] = 2.92 [ 1.4 , 6.08 ] 
##   p-value(OR) = 0 
##  
##  Log OR [ 95 %CI] = 1.07 [ 0.33 , 1.81 ] 
##   var(lOR) = 0.14 
##   p-value(Log OR) = 0 
##  
##  Other: 
##  
##  NNT = 4.98 
##  Total N = 100
study2<-tes(t=0.13,n.1=50,n.2=50)
## Mean Differences ES: 
##  
##  d [ 95 %CI] = 0.03 [ -0.37 , 0.42 ] 
##   var(d) = 0.04 
##   p-value(d) = 0.9 
##   U3(d) = 51.04 % 
##   CLES(d) = 50.73 % 
##   Cliff's Delta = 0.01 
##  
##  g [ 95 %CI] = 0.03 [ -0.37 , 0.42 ] 
##   var(g) = 0.04 
##   p-value(g) = 0.9 
##   U3(g) = 51.03 % 
##   CLES(g) = 50.73 % 
##  
##  Correlation ES: 
##  
##  r [ 95 %CI] = 0.01 [ -0.19 , 0.21 ] 
##   var(r) = 0.01 
##   p-value(r) = 0.9 
##  
##  z [ 95 %CI] = 0.01 [ -0.19 , 0.21 ] 
##   var(z) = 0.01 
##   p-value(z) = 0.9 
##  
##  Odds Ratio ES: 
##  
##  OR [ 95 %CI] = 1.05 [ 0.51 , 2.15 ] 
##   p-value(OR) = 0.9 
##  
##  Log OR [ 95 %CI] = 0.05 [ -0.67 , 0.77 ] 
##   var(lOR) = 0.13 
##   p-value(Log OR) = 0.9 
##  
##  Other: 
##  
##  NNT = 135.9 
##  Total N = 100
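
As an aside, tes() also returns the values it prints (in a data frame), which is what makes Step 3 below possible: the Hedges’ g estimates and their sampling variances can be pulled out by name. The values in the comments are taken from the output above.

study1$g      # 0.59
study1$var.g  # 0.04
study2$g      # 0.03
study2$var.g  # 0.04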

Step 3 – Meta-analyze the studies (random effects meta-analysis), with effect sizes extracted from Step 2.

rma(yi=c(study1$g,study2$g),vi=c(study1$var.g,study2$var.g))
## 
## Random-Effects Model (k = 2; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of total heterogeneity): 0.1168 (SE = 0.2217)
## tau (square root of estimated tau^2 value):      0.3418
## I^2 (total heterogeneity / total variability):   74.49%
## H^2 (total variability / sampling variability):  3.92
## 
## Test for Heterogeneity: 
## Q(df = 1) = 3.9200, p-val = 0.0477
## 
## Model Results:
## 
## estimate       se     zval     pval    ci.lb    ci.ub          
##   0.3100   0.2800   1.1071   0.2682  -0.2388   0.8588          
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4 – Look at the estimate from the random effects meta-analysis. In this case it is 0.31 (in standardized units), and its 95% CI is [-0.24, 0.86]. There is significant heterogeneity (Q = 3.92, p = .048), but who cares? In this case, it just means that the two estimates are pretty far apart.
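
Incidentally, if you store the model object instead of just printing it, these numbers can be pulled out programmatically. This is the same rma() call as in Step 3; b, ci.lb, and ci.ub are metafor’s names for the estimate and its confidence limits.

res <- rma(yi = c(study1$g, study2$g), vi = c(study1$var.g, study2$var.g))
round(c(res$b, res$ci.lb, res$ci.ub), 2)  # 0.31 -0.24 0.86, matching Step 4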

Step 5 – Run a post-hoc power analysis to see what the combined power of the first two studies was. The n is per cell: each study had 50 participants per cell, so the two studies combined give n = 100 per cell. The d is the estimate from the meta-analysis.

pwr.t.test(n=100,d=.31,sig.level=.05,power=NULL,type="two.sample",alternative="two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.31
##       sig.level = 0.05
##           power = 0.587637
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Step 6 – The post-hoc power is .59, given a true effect size of 0.31. In other words, if the true effect size is 0.31, a test on the combined sample (n = 100 per cell) would come out statistically significant 59% of the time. Now we’ll run an a priori power analysis to see how many participants a researcher needs to get 80% power based on d = 0.31.

pwr.t.test(n=NULL,d=.31,sig.level=.05,power=.80,type="two.sample",alternative="two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 164.3137
##               d = 0.31
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Conclusion: The analysis says my friend needs 165 participants per group (164.31, rounded up) to get 80% power for d = 0.31. Of course, if researchers want to be more efficient, they could also try out sequential analysis (sketched briefly below).
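
For the curious, here is a minimal sketch of what a two-look sequential design could look like, using the gsDesign package (not part of Step 0, so it needs its own one-time install). The specific choices (one interim look, a Lan-DeMets O’Brien-Fleming spending function) are illustrative assumptions on my part, not part of my friend’s problem.

#install.packages("gsDesign", repos='http://cran.us.r-project.org')
library(gsDesign)
# Two looks (one interim, one final), 80% power, overall two-sided
# alpha of .05 (gsDesign's alpha argument is one-sided, hence .025),
# with a Lan-DeMets O'Brien-Fleming spending function. n.fix is the
# fixed-design n per group from Step 6.
gsd <- gsDesign(k = 2, test.type = 2, alpha = 0.025, beta = 0.2,
                sfu = sfLDOF, n.fix = 165)
ceiling(gsd$n.I)  # n per group needed at the interim and final looks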

I hope this guide is useful for researchers looking for a practical “what to do” guide in situations involving conflicting study results. I’m also interested in feedback – what would you do in a similar situation? Drop me a line on Twitter (@katiecorker), or leave a comment here.

Post-Performance Activation: No Longer an Afterthought

by Jesse Bogacz, Kenyon ’18

Stereotypes and racism have plagued the United States since its founding. At the core of stereotypes is the desire of one group to marginalize and minimize those who are different from them, refusing to acknowledge these groups as “normal” and as capable as the ruling group, white men. Naturally, researchers have devoted a lot of time to studying the effects of stereotypes on the mental health of minorities. However, most of these studies have focused simply on the effects of stereotypes themselves; hardly any have examined post-performance activation. The researchers Thiem, Stuart, Barden, and Evans decided to focus on the effects of stereotypes after participants were given an intellectual test, with comparisons of the participants’ evaluations forming the basis of their research. The researchers also drew upon the self-validation hypothesis, which holds that individuals are more likely to change their opinions after seeing evidence against their stance.

Continue reading

Global Citizenship and the Ethics of Consumerism

by Brianna Levesque, Kenyon ’17

You are a citizen of your town, your state, and your country, but do you consider yourself a citizen of the world? Could whether or not you identify as a global citizen have an economic impact based upon the items you choose to buy? A recent study by researchers Gerhard Reese and Fabienne Kohlman set out to discover how defining citizenship in a global context affected the choices people made at the check-out stand. Their hypothesis was that those who more strongly identified as citizens of the world were consequently more likely to purchase fair-trade items as opposed to conventional ones.

Continue reading

How Agreeable and Open Are Religious People Compared to Others?

by Jia He, Kenyon ’17

Are religious people really more friendly, kind, and accepting than the average person? In the field of psychology, there has been much speculation about how religiousness may intersect with certain personality traits. People who report religion and spirituality as a huge aspect of their lives have consistently scored higher on the Agreeableness and Openness to Experience dimensions of the Big Five personality inventory. But seeing that such results are often obtained through self-reports, one is inclined to question whether the link between religion and these traits is real or exaggerated by religious people who value Agreeableness and Openness as traits they wish to possess.

Continue reading

Academic and Life Domains: How We Are Motivated

by Andrew Weinert, Kenyon ’17

According to Edward L. Deci and Richard M. Ryan at the University of Rochester, autonomous motivation is a major predictor of persistence and adherence and is very advantageous for deep-thinking tasks that involve a lot of processing (Deci & Ryan). In other words, autonomous motivation is directly related to goal progress. Like autonomous motivation, specific personality traits are also major predictors of goal setting and goal attainment. For instance, someone very low in conscientiousness may not set a goal to clean his or her room, because that goal does not complement their personality. Likewise, the goal of traveling the world would complement the personality of someone high in openness. Personality traits and autonomous motivation, however, do not necessarily need to be independent of one another.

Continue reading

Extraverts Categorize their Daily Experiences by Specific Social Relationships

by Kyra Baldwin, Kenyon ’18

Being extraverted or introverted defines the way an individual interacts with the world around them. The relationships a person seeks, the career they enter, and how they carry themselves at a party are all affected by being high or low in trait extraversion. William Tov and Kelly Koh, researchers in Singapore, carried out a 2014 study to examine how exactly extraverts and introverts differ in understanding and categorizing their daily experiences. The results showed that extraverts tended to categorize their lives based on whom they were socializing with more than how, when, or where they were socializing.

Continue reading